aboutsummaryrefslogtreecommitdiff
path: root/fs/nfs
AgeCommit message (Collapse)AuthorFilesLines
2022-02-25NFS: Use of mapping_set_error() results in spurious errorsTrond Myklebust1-1/+4
The use of mapping_set_error() in conjunction with calls to filemap_check_errors() is problematic because every error gets reported as either an EIO or an ENOSPC by filemap_check_errors() in functions such as filemap_write_and_wait() or filemap_write_and_wait_range(). In almost all cases, we prefer to use the more nuanced wb errors. Fixes: b8946d7bfb94 ("NFS: Revalidate the file mapping on all fatal writeback errors") Signed-off-by: Trond Myklebust <[email protected]>
2022-02-25NFS: Clean up NFSv4.2 xattrsTrond Myklebust3-12/+18
Add a helper for the xattr mask so that we can get rid of the inlined ifdefs. Signed-off-by: Trond Myklebust <[email protected]>
2022-02-25NFS: Remove unnecessary XATTR cache invalidation in nfs_fhget()Trond Myklebust1-2/+0
We should never expect the 'xattr_cache' to be non-null in that case, hence nfs_set_cache_invalid() is just going to optimise it away. Signed-off-by: Trond Myklebust <[email protected]>
2022-02-25NFS: NFSv2/v3 clients should never be setting NFS_CAP_XATTRTrond Myklebust2-0/+2
Ensure that we always initialise the 'xattr_support' field in struct nfs_fsinfo, so that nfs_server_set_fsinfo() doesn't declare our NFSv2/v3 client to be capable of supporting the NFSv4.2 xattr protocol by setting the NFS_CAP_XATTR capability. This configuration can cause nfs_do_access() to set access mode bits that are unsupported by the NFSv3 ACCESS call, which may confuse spec-compliant servers. Reported-by: Olga Kornievskaia <[email protected]> Fixes: b78ef845c35d ("NFSv4.2: query the server for extended attribute support") Cc: [email protected] Signed-off-by: Trond Myklebust <[email protected]>
2022-02-25NFS: Remove unused flag NFS_INO_REVAL_PAGECACHETrond Myklebust2-4/+2
Signed-off-by: Trond Myklebust <[email protected]>
2022-02-25NFS: Replace last uses of NFS_INO_REVAL_PAGECACHETrond Myklebust2-14/+12
Now that we have more fine grained attribute revalidation, let's just get rid of NFS_INO_REVAL_PAGECACHE. Signed-off-by: Trond Myklebust <[email protected]>
2022-02-25NFSv4: use unique client identifiers in network namespacesBenjamin Coddington1-0/+14
In order to differentiate client state, assign a random uuid to the uniquifing portion of the client identifier when a network namespace is created. Containers may still override this value if they wish to maintain stable client identifiers by writing to /sys/fs/nfs/net/client/identifier, either by udev rules or other means. Signed-off-by: Benjamin Coddington <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-02-25NFSv4.1 support for NFS4_RESULT_PRESERVER_UNLINKEDOlga Kornievskaia2-2/+10
In 4.1+, the server is allowed to set a flag NFS4_RESULT_PRESERVE_UNLINKED in reply to the OPEN, that tells the client that it does not need to do a silly rename of an opened file when it's being removed. Signed-off-by: Olga Kornievskaia <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-02-25NFSv4.2/copyoffload: Convert GFP_NOFS to GFP_KERNELTrond Myklebust3-8/+8
There doesn't seem to be any reason why the copy offload code can't use GFP_KERNEL. It can't get called by direct reclaim. Signed-off-by: Trond Myklebust <[email protected]>
2022-02-25NFSv4/flexfiles: Convert GFP_NOFS to GFP_KERNELTrond Myklebust2-9/+10
Assume that the higher layers will have set memalloc_nofs_save/restore as appropriate. Signed-off-by: Trond Myklebust <[email protected]>
2022-02-25NFS: Convert GFP_NOFS to GFP_KERNELTrond Myklebust4-14/+13
Assume that sections that should not re-enter the filesystem are already protected with memalloc_nofs_save/restore call, so relax those GFP_NOFS instances which might be used by other contexts. Signed-off-by: Trond Myklebust <[email protected]>
2022-02-25NFSv4.2: Fix up an invalid combination of memory allocation flagsTrond Myklebust1-4/+3
We should use either GFP_KERNEL or GFP_NOFS, but not both. Also strip GFP_KERNEL_ACCOUNT down to GFP_KERNEL. This memory is shrinkable, so does not need to be limited by kmemcg. Signed-off-by: Trond Myklebust <[email protected]>
2022-02-25NFSv4: Charge NFSv4 open state trackers to kmemcgTrond Myklebust2-4/+5
Allow kmemcg to limit the number of NFSv4 delegation, lock and open state trackers. Signed-off-by: Trond Myklebust <[email protected]>
2022-02-25NFS: Charge open/lock file contexts to kmemcgTrond Myklebust2-3/+3
Allow kmemcg to limit the number of open/lock file contexts, in the same way that it limits the parent file descriptors. Signed-off-by: Trond Myklebust <[email protected]>
2022-02-25NFSv4: Protect the state recovery thread against direct reclaimTrond Myklebust1-0/+12
If memory allocation triggers a direct reclaim from the state recovery thread, then we can deadlock. Use memalloc_nofs_save/restore to ensure that doesn't happen. Signed-off-by: Trond Myklebust <[email protected]>
2022-02-25NFSv4.2: fix reference count leaks in _nfs42_proc_copy_notify()Xin Xiong1-3/+6
[You don't often get email from [email protected]. Learn why this is important at http://aka.ms/LearnAboutSenderIdentification.] The reference counting issue happens in two error paths in the function _nfs42_proc_copy_notify(). In both error paths, the function simply returns the error code and forgets to balance the refcount of object `ctx`, bumped by get_nfs_open_context() earlier, which may cause refcount leaks. Fix it by balancing refcount of the `ctx` object before the function returns in both error paths. Signed-off-by: Xin Xiong <[email protected]> Signed-off-by: Xiyu Yang <[email protected]> Signed-off-by: Xin Tan <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-02-25Convert NFS from readpages to readaheadMatthew Wilcox (Oracle)3-12/+17
NFS is one of the last two users of the deprecated ->readpages aop. This conversion looks straightforward, but I have only compile-tested it. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-02-25NFS: simplify check for freeing cn_respTom Rix1-2/+2
nfs42_files_from_same_server() is called to check if freeing cn_resp is required, just do the free. Signed-off-by: Tom Rix <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-02-16NFS: Do not report writeback errors in nfs_getattr()Trond Myklebust1-6/+3
The result of the writeback, whether it is an ENOSPC or an EIO, or anything else, does not inhibit the NFS client from reporting the correct file timestamps. Fixes: 79566ef018f5 ("NFS: Getattr doesn't require data sync semantics") Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-02-14NFS: LOOKUP_DIRECTORY is also ok with symlinksTrond Myklebust1-2/+2
Commit ac795161c936 (NFSv4: Handle case where the lookup of a directory fails) [1], part of Linux since 5.17-rc2, introduced a regression, where a symbolic link on an NFS mount to a directory on another NFS does not resolve(?) the first time it is accessed: Reported-by: Paul Menzel <[email protected]> Fixes: ac795161c936 ("NFSv4: Handle case where the lookup of a directory fails") Signed-off-by: Trond Myklebust <[email protected]> Tested-by: Donald Buczek <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-02-14NFS: Remove an incorrect revalidation in nfs4_update_changeattr_locked()Trond Myklebust1-2/+1
In nfs4_update_changeattr_locked(), we don't need to set the NFS_INO_REVAL_PAGECACHE flag, because we already know the value of the change attribute, and we're already flagging the size. In fact, this forces us to revalidate the change attribute a second time for no good reason. This extra flag appears to have been introduced as part of the xattr feature, when update_changeattr_locked() was converted for use by the xattr code. Fixes: 1b523ca972ed ("nfs: modify update_changeattr to deal with regular files") Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-02-08NFS: Fix nfs4_proc_get_locations() kernel-doc commentYang Li1-1/+2
Add the description of @server and @fhandle, and remove the excess @inode in nfs4_proc_get_locations() kernel-doc comment to remove warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/nfs/nfs4proc.c:8219: warning: Function parameter or member 'server' not described in 'nfs4_proc_get_locations' fs/nfs/nfs4proc.c:8219: warning: Function parameter or member 'fhandle' not described in 'nfs4_proc_get_locations' fs/nfs/nfs4proc.c:8219: warning: Excess function parameter 'inode' description in 'nfs4_proc_get_locations' Reported-by: Abaci Robot <[email protected]> Signed-off-by: Yang Li <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-02-08NFS: Fix initialisation of nfs_client cl_flags fieldTrond Myklebust1-1/+1
For some long forgotten reason, the nfs_client cl_flags field is initialised in nfs_get_client() instead of being initialised at allocation time. This quirk was harmless until we moved the call to nfs_create_rpc_client(). Fixes: dd99e9f98fbf ("NFSv4: Initialise connection to the server in nfs4_alloc_client()") Cc: [email protected] # 4.8.x Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-02-02NFS: Avoid duplicate uncached readdir calls on eofTrond Myklebust1-5/+15
If we've reached the end of the directory, then cache that information in the context so that we don't need to do an uncached readdir in order to rediscover that fact. Fixes: 794092c57f89 ("NFS: Do uncached readdir when we're seeking a cookie in an empty page cache") Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-02-02NFS: Don't skip directory entries when doing uncached readdir[email protected]1-0/+1
Ensure that we initialise desc->cache_entry_index correctly in uncached_readdir(). Fixes: d1bacf9eb2fd ("NFS: add readdir cache array") Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-02-02NFS: Don't overfill uncached readdir pages[email protected]1-1/+2
If we're doing an uncached read of the directory, then we ideally want to read only the exact set of entries that will fit in the buffer supplied by the getdents() system call. So unlike the case where we're reading into the page cache, let's send only one READDIR call, before trying to fill up the buffer. Fixes: 35df59d3ef69 ("NFS: Reduce number of RPC calls when doing uncached readdir") Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-02-02block: pass a block_device and opf to bio_allocChristoph Hellwig1-3/+1
Pass the block_device and operation that we plan to use this bio for to bio_alloc to optimize the assignment. NULL/0 can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code. Also move the gfp_mask argument after the nr_vecs argument for a much more logical calling convention matching what most of the kernel does. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-02-02nfs/blocklayout: remove bl_alloc_init_bioChristoph Hellwig1-21/+5
bio_alloc will never fail when it can sleep. Remove the now simple bl_alloc_init_bio helper and open code it in the only caller. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-02-02block: remove genhd.hChristoph Hellwig1-1/+0
There is no good reason to keep genhd.h separate from the main blkdev.h header that includes it. So fold the contents of genhd.h into blkdev.h and remove genhd.h entirely. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-01-25Merge tag 'nfs-for-5.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds17-111/+382
Pull NFS client updates from Anna Schumaker: "New Features: - Basic handling for case insensitive filesystems - Initial support for fs_locations and server trunking Bugfixes and Cleanups: - Cleanups to how the "struct cred *" is handled for the nfs_access_entry - Ensure the server has an up to date ctimes before hardlinking or renaming - Update 'blocks used' after writeback, fallocate, and clone - nfs_atomic_open() fixes - Improvements to sunrpc tracing - Various null check & indenting related cleanups - Some improvements to the sunrpc sysfs code: - Use default_groups in kobj_type - Fix some potential races and reference leaks - A few tracepoint cleanups in xprtrdma" [ This should have gone in during the merge window, but didn't. The original pull request - sent during the merge window - had gotten marked as spam and discarded due missing DKIM headers in the email from Anna. - Linus ] * tag 'nfs-for-5.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (35 commits) SUNRPC: Don't dereference xprt->snd_task if it's a cookie xprtrdma: Remove definitions of RPCDBG_FACILITY xprtrdma: Remove final dprintk call sites from xprtrdma sunrpc: Fix potential race conditions in rpc_sysfs_xprt_state_change() net/sunrpc: fix reference count leaks in rpc_sysfs_xprt_state_change NFSv4.1 test and add 4.1 trunking transport SUNRPC allow for unspecified transport time in rpc_clnt_add_xprt NFSv4 handle port presence in fs_location server string NFSv4 expose nfs_parse_server_name function NFSv4.1 query for fs_location attr on a new file system NFSv4 store server support for fs_location attribute NFSv4 remove zero number of fs_locations entries error check NFSv4: nfs_atomic_open() can race when looking up a non-regular file NFSv4: Handle case where the lookup of a directory fails NFSv42: Fallocate and clone should also request 'blocks used' NFSv4: Allow writebacks to request 'blocks used' SUNRPC: use default_groups in kobj_type NFS: use default_groups in kobj_type NFS: Fix the verifier for case sensitive filesystem in nfs_atomic_open() NFS: Add a helper to remove case-insensitive aliases ...
2022-01-17Merge branch 'signal-for-v5.17' of ↵Linus Torvalds2-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull signal/exit/ptrace updates from Eric Biederman: "This set of changes deletes some dead code, makes a lot of cleanups which hopefully make the code easier to follow, and fixes bugs found along the way. The end-game which I have not yet reached yet is for fatal signals that generate coredumps to be short-circuit deliverable from complete_signal, for force_siginfo_to_task not to require changing userspace configured signal delivery state, and for the ptrace stops to always happen in locations where we can guarantee on all architectures that the all of the registers are saved and available on the stack. Removal of profile_task_ext, profile_munmap, and profile_handoff_task are the big successes for dead code removal this round. A bunch of small bug fixes are included, as most of the issues reported were small enough that they would not affect bisection so I simply added the fixes and did not fold the fixes into the changes they were fixing. There was a bug that broke coredumps piped to systemd-coredump. I dropped the change that caused that bug and replaced it entirely with something much more restrained. Unfortunately that required some rebasing. Some successes after this set of changes: There are few enough calls to do_exit to audit in a reasonable amount of time. The lifetime of struct kthread now matches the lifetime of struct task, and the pointer to struct kthread is no longer stored in set_child_tid. The flag SIGNAL_GROUP_COREDUMP is removed. The field group_exit_task is removed. Issues where task->exit_code was examined with signal->group_exit_code should been examined were fixed. There are several loosely related changes included because I am cleaning up and if I don't include them they will probably get lost. The original postings of these changes can be found at: https://lkml.kernel.org/r/[email protected] https://lkml.kernel.org/r/[email protected] https://lkml.kernel.org/r/[email protected] I trimmed back the last set of changes to only the obviously correct once. Simply because there was less time for review than I had hoped" * 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (44 commits) ptrace/m68k: Stop open coding ptrace_report_syscall ptrace: Remove unused regs argument from ptrace_report_syscall ptrace: Remove second setting of PT_SEIZED in ptrace_attach taskstats: Cleanup the use of task->exit_code exit: Use the correct exit_code in /proc/<pid>/stat exit: Fix the exit_code for wait_task_zombie exit: Coredumps reach do_group_exit exit: Remove profile_handoff_task exit: Remove profile_task_exit & profile_munmap signal: clean up kernel-doc comments signal: Remove the helper signal_group_exit signal: Rename group_exit_task group_exec_task coredump: Stop setting signal->group_exit_task signal: Remove SIGNAL_GROUP_COREDUMP signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process signal: Make coredump handling explicit in complete_signal signal: Have prepare_signal detect coredumps using signal->core_state signal: Have the oom killer detect coredumps using signal->core_state exit: Move force_uaccess back into do_exit exit: Guarantee make_task_dead leaks the tsk when calling do_task_exit ...
2022-01-16Merge tag 'nfsd-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds2-24/+10
Pull nfsd updates from Chuck Lever: "Bruce has announced he is leaving Red Hat at the end of the month and is stepping back from his role as NFSD co-maintainer. As a result, this includes a patch removing him from the MAINTAINERS file. There is one patch in here that Jeff Layton was carrying in the locks tree. Since he had only one for this cycle, he asked us to send it to you via the nfsd tree. There continues to be 0-day reports from Robert Morris @MIT. This time we include a fix for a crash in the COPY_NOTIFY operation. Highlights: - Bruce steps down as NFSD maintainer - Prepare for dynamic nfsd thread management - More work on supporting re-exporting NFS mounts - One fs/locks patch on behalf of Jeff Layton Notable bug fixes: - Fix zero-length NFSv3 WRITEs - Fix directory cinfo on FS's that do not support iversion - Fix WRITE verifiers for stable writes - Fix crash on COPY_NOTIFY with a special state ID" * tag 'nfsd-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (51 commits) SUNRPC: Fix sockaddr handling in svcsock_accept_class trace points SUNRPC: Fix sockaddr handling in the svc_xprt_create_error trace point fs/locks: fix fcntl_getlk64/fcntl_setlk64 stub prototypes nfsd: fix crash on COPY_NOTIFY with special stateid MAINTAINERS: remove bfields NFSD: Move fill_pre_wcc() and fill_post_wcc() Revert "nfsd: skip some unnecessary stats in the v4 case" NFSD: Trace boot verifier resets NFSD: Rename boot verifier functions NFSD: Clean up the nfsd_net::nfssvc_boot field NFSD: Write verifier might go backwards nfsd: Add a tracepoint for errors in nfsd4_clone_file_range() NFSD: De-duplicate net_generic(nf->nf_net, nfsd_net_id) NFSD: De-duplicate net_generic(SVC_NET(rqstp), nfsd_net_id) NFSD: Clean up nfsd_vfs_write() nfsd: Replace use of rwsem with errseq_t NFSD: Fix verifier returned in stable WRITEs nfsd: Retry once in nfsd_open on an -EOPENSTALE return nfsd: Add errno mapping for EREMOTEIO nfsd: map EBADF ...
2022-01-13NFSv4.1 test and add 4.1 trunking transportOlga Kornievskaia1-1/+55
For each location returned in FS_LOCATION query, establish a transport to the server, send EXCHANGE_ID and test for trunking, if successful, add the transport to the exiting client. Signed-off-by: Olga Kornievskaia <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-01-13NFSv4 handle port presence in fs_location server stringOlga Kornievskaia2-7/+12
An fs_location attribute returns a string that can be ipv4, ipv6, or DNS name. An ip location can have a port appended to it and if no port is present a default port needs to be set. If rpc_pton() fails to parse, try calling rpc_uaddr2socaddr() that can convert an universal address. Signed-off-by: Olga Kornievskaia <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-01-13NFSv4 expose nfs_parse_server_name functionOlga Kornievskaia2-3/+4
Make nfs_parse_server_name available outside of nfs4namespace.c. Signed-off-by: Olga Kornievskaia <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-01-13NFSv4.1 query for fs_location attr on a new file systemOlga Kornievskaia4-15/+80
Query the server for other possible trunkable locations for a given file system on a 4.1+ mount. v2: -- added missing static to nfs4_discover_trunking, reported by the kernel test robot Signed-off-by: Olga Kornievskaia <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-01-12Merge tag 'fscache-rewrite-20220111' of ↵Linus Torvalds12-642/+262
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull fscache rewrite from David Howells: "This is a set of patches that rewrites the fscache driver and the cachefiles driver, significantly simplifying the code compared to what's upstream, removing the complex operation scheduling and object state machine in favour of something much smaller and simpler. The series is structured such that the first few patches disable fscache use by the network filesystems using it, remove the cachefiles driver entirely and as much of the fscache driver as can be got away with without causing build failures in the network filesystems. The patches after that recreate fscache and then cachefiles, attempting to add the pieces in a logical order. Finally, the filesystems are reenabled and then the very last patch changes the documentation. [!] Note: I have dropped the cifs patch for the moment, leaving local caching in cifs disabled. I've been having trouble getting that working. I think I have it done, but it needs more testing (there seem to be some test failures occurring with v5.16 also from xfstests), so I propose deferring that patch to the end of the merge window. WHY REWRITE? ============ Fscache's operation scheduling API was intended to handle sequencing of cache operations, which were all required (where possible) to run asynchronously in parallel with the operations being done by the network filesystem, whilst allowing the cache to be brought online and offline and to interrupt service for invalidation. With the advent of the tmpfile capacity in the VFS, however, an opportunity arises to do invalidation much more simply, without having to wait for I/O that's actually in progress: Cachefiles can simply create a tmpfile, cut over the file pointer for the backing object attached to a cookie and abandon the in-progress I/O, dismissing it upon completion. Future work here would involve using Omar Sandoval's vfs_link() with AT_LINK_REPLACE[1] to allow an extant file to be displaced by a new hard link from a tmpfile as currently I have to unlink the old file first. These patches can also simplify the object state handling as I/O operations to the cache don't all have to be brought to a stop in order to invalidate a file. To that end, and with an eye on to writing a new backing cache model in the future, I've taken the opportunity to simplify the indexing structure. I've separated the index cookie concept from the file cookie concept by C type now. The former is now called a "volume cookie" (struct fscache_volume) and there is a container of file cookies. There are then just the two levels. All the index cookie levels are collapsed into a single volume cookie, and this has a single printable string as a key. For instance, an AFS volume would have a key of something like "afs,example.com,1000555", combining the filesystem name, cell name and volume ID. This is freeform, but must not have '/' chars in it. I've also eliminated all pointers back from fscache into the network filesystem. This required the duplication of a little bit of data in the cookie (cookie key, coherency data and file size), but it's not actually that much. This gets rid of problems with making sure we keep netfs data structures around so that the cache can access them. These patches mean that most of the code that was in the drivers before is simply gone and those drivers are now almost entirely new code. That being the case, there doesn't seem any particular reason to try and maintain bisectability across it. Further, there has to be a point in the middle where things are cut over as there's a single point everything has to go through (ie. /dev/cachefiles) and it can't be in use by two drivers at once. ISSUES YET OUTSTANDING ====================== There are some issues still outstanding, unaddressed by this patchset, that will need fixing in future patchsets, but that don't stop this series from being usable: (1) The cachefiles driver needs to stop using the backing filesystem's metadata to store information about what parts of the cache are populated. This is not reliable with modern extent-based filesystems. Fixing this is deferred to a separate patchset as it involves negotiation with the network filesystem and the VM as to how much data to download to fulfil a read - which brings me on to (2)... (2) NFS (and CIFS with the dropped patch) do not take account of how the cache would like I/O to be structured to meet its granularity requirements. Previously, the cache used page granularity, which was fine as the network filesystems also dealt in page granularity, and the backing filesystem (ext4, xfs or whatever) did whatever it did out of sight. However, we now have folios to deal with and the cache will now have to store its own metadata to track its contents. The change I'm looking at making for cachefiles is to store content bitmaps in one or more xattrs and making a bit in the map correspond to something like a 256KiB block. However, the size of an xattr and the fact that they have to be read/updated in one go means that I'm looking at covering 1GiB of data per 512-byte map and storing each map in an xattr. Cachefiles has the potential to grow into a fully fledged filesystem of its very own if I'm not careful. However, I'm also looking at changing things even more radically and going to a different model of how the cache is arranged and managed - one that's more akin to the way, say, openafs does things - which brings me on to (3)... (3) The way cachefilesd does culling is very inefficient for large caches and it would be better to move it into the kernel if I can as cachefilesd has to keep asking the kernel if it can cull a file. Changing the way the backend works would allow this to be addressed. BITS THAT MAY BE CONTROVERSIAL ============================== There are some bits I've added that may be controversial: (1) I've provided a flag, S_KERNEL_FILE, that cachefiles uses to check if a files is already being used by some other kernel service (e.g. a duplicate cachefiles cache in the same directory) and reject it if it is. This isn't entirely necessary, but it helps prevent accidental data corruption. I don't want to use S_SWAPFILE as that has other effects, but quite possibly swapon() should set S_KERNEL_FILE too. Note that it doesn't prevent userspace from interfering, though perhaps it should. (I have made it prevent a marked directory from being rmdir-able). (2) Cachefiles wants to keep the backing file for a cookie open whilst we might need to write to it from network filesystem writeback. The problem is that the network filesystem unuses its cookie when its file is closed, and so we have nothing pinning the cachefiles file open and it will get closed automatically after a short time to avoid EMFILE/ENFILE problems. Reopening the cache file, however, is a problem if this is being done due to writeback triggered by exit(). Some filesystems will oops if we try to open a file in that context because they want to access current->fs or suchlike. To get around this, I added the following: (A) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network filesystem inode to indicate that we have a usage count on the cookie caching that inode. (B) A flag in struct writeback_control, unpinned_fscache_wb, that is set when __writeback_single_inode() clears the last dirty page from i_pages - at which point it clears I_PINNING_FSCACHE_WB and sets this flag. This has to be done here so that clearing I_PINNING_FSCACHE_WB can be done atomically with the check of PAGECACHE_TAG_DIRTY that clears I_DIRTY_PAGES. (C) A function, fscache_set_page_dirty(), which if it is not set, sets I_PINNING_FSCACHE_WB and calls fscache_use_cookie() to pin the cache resources. (D) A function, fscache_unpin_writeback(), to be called by ->write_inode() to unuse the cookie. (E) A function, fscache_clear_inode_writeback(), to be called when the inode is evicted, before clear_inode() is called. This cleans up any lingering I_PINNING_FSCACHE_WB. The network filesystem can then use these tools to make sure that fscache_write_to_cache() can write locally modified data to the cache as well as to the server. For the future, I'm working on write helpers for netfs lib that should allow this facility to be removed by keeping track of the dirty regions separately - but that's incomplete at the moment and is also going to be affected by folios, one way or another, since it deals with pages" Link: https://lore.kernel.org/all/[email protected]/ Tested-by: Dominique Martinet <[email protected]> # 9p Tested-by: [email protected] # afs Tested-by: Jeff Layton <[email protected]> # ceph Tested-by: Dave Wysochanski <[email protected]> # nfs Tested-by: Daire Byrne <[email protected]> # nfs * tag 'fscache-rewrite-20220111' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (67 commits) 9p, afs, ceph, nfs: Use current_is_kswapd() rather than gfpflags_allow_blocking() fscache: Add a tracepoint for cookie use/unuse fscache: Rewrite documentation ceph: add fscache writeback support ceph: conversion to new fscache API nfs: Implement cache I/O by accessing the cache directly nfs: Convert to new fscache volume/cookie API 9p: Copy local writes to the cache when writing to the server 9p: Use fscache indexing rewrite and reenable caching afs: Skip truncation on the server of data we haven't written yet afs: Copy local writes to the cache when writing to the server afs: Convert afs to use the new fscache API fscache, cachefiles: Display stat of culling events fscache, cachefiles: Display stats of no-space events cachefiles: Allow cachefiles to actually function fscache, cachefiles: Store the volume coherency data cachefiles: Implement the I/O routines cachefiles: Implement cookie resize for truncate cachefiles: Implement begin and end I/O operation cachefiles: Implement backing file wrangling ...
2022-01-12NFSv4 store server support for fs_location attributeOlga Kornievskaia1-0/+2
Define and store if server returns it supports fs_locations attribute as a capability. Signed-off-by: Olga Kornievskaia <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-01-12NFSv4 remove zero number of fs_locations entries error checkOlga Kornievskaia2-2/+3
Remove the check for the zero length fs_locations reply in the xdr decoding, and instead check for that in the migration code. Signed-off-by: Olga Kornievskaia <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-01-119p, afs, ceph, nfs: Use current_is_kswapd() rather than ↵David Howells1-1/+2
gfpflags_allow_blocking() In 9p, afs ceph, and nfs, gfpflags_allow_blocking() (which wraps a test for __GFP_DIRECT_RECLAIM being set) is used to determine if ->releasepage() should wait for the completion of a DIO write to fscache with something like: if (folio_test_fscache(folio)) { if (!gfpflags_allow_blocking(gfp) || !(gfp & __GFP_FS)) return false; folio_wait_fscache(folio); } Instead, current_is_kswapd() should be used instead. Note that this is based on a patch originally by Zhaoyang Huang[1]. In addition to extending it to the other network filesystems and putting it on top of my fscache rewrite, it also needs to include linux/swap.h in a bunch of places. Can current_is_kswapd() be moved to linux/mm.h? Changes ======= ver #5: - Dropping the changes for cifs. Originally-signed-off-by: Zhaoyang Huang <[email protected]> Co-developed-by: David Howells <[email protected]> Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: Zhaoyang Huang <[email protected]> cc: Dominique Martinet <[email protected]> cc: Marc Dionne <[email protected]> cc: Steve French <[email protected]> cc: Trond Myklebust <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] Link: https://lore.kernel.org/r/[email protected]/ [1] Link: https://lore.kernel.org/r/164021590773.640689.16777975200823659231.stgit@warthog.procyon.org.uk/ # v4
2022-01-10nfs: Implement cache I/O by accessing the cache directlyDavid Howells4-83/+127
Move NFS to using fscache DIO API instead of the old upstream I/O API as that has been removed. This is a stopgap solution as the intention is that at sometime in the future, the cache will move to using larger blocks and won't be able to store individual pages in order to deal with the potential for data corruption due to the backing filesystem being able insert/remove bridging blocks of zeros into its extent list[1]. NFS then reads and writes cache pages synchronously and one page at a time. The preferred change would be to use the netfs lib, but the new I/O API can be used directly. It's just that as the cache now needs to track data for itself, caching blocks may exceed page size... This code is somewhat borrowed from my "fallback I/O" patchset[2]. Changes ======= ver #3: - Restore lost =n fallback for nfs_fscache_release_page()[2]. Signed-off-by: David Howells <[email protected]> Tested-by: Dave Wysochanski <[email protected]> Acked-by: Jeff Layton <[email protected]> cc: Trond Myklebust <[email protected]> cc: Anna Schumaker <[email protected]> cc: [email protected] cc: [email protected] Link: https://lore.kernel.org/r/[email protected] [1] Link: https://lore.kernel.org/r/[email protected]/ [2] Link: https://lore.kernel.org/r/163189108292.2509237.12615909591150927232.stgit@warthog.procyon.org.uk/ [2] Link: https://lore.kernel.org/r/163906981318.143852.17220018647843475985.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967184451.1823006.6450645559828329590.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021577632.640689.11069627070150063812.stgit@warthog.procyon.org.uk/ # v4
2022-01-10nfs: Convert to new fscache volume/cookie APIDave Wysochanski12-595/+170
Change the nfs filesystem to support fscache's indexing rewrite and reenable caching in nfs. The following changes have been made: (1) The fscache_netfs struct is no more, and there's no need to register the filesystem as a whole. (2) The session cookie is now an fscache_volume cookie, allocated with fscache_acquire_volume(). That takes three parameters: a string representing the "volume" in the index, a string naming the cache to use (or NULL) and a u64 that conveys coherency metadata for the volume. For nfs, I've made it render the volume name string as: "nfs,<ver>,<family>,<address>,<port>,<fsidH>,<fsidL>*<,param>[,<uniq>]" (3) The fscache_cookie_def is no more and needed information is passed directly to fscache_acquire_cookie(). The cache no longer calls back into the filesystem, but rather metadata changes are indicated at other times. fscache_acquire_cookie() is passed the same keying and coherency information as before. (4) fscache_enable/disable_cookie() have been removed. Call fscache_use_cookie() and fscache_unuse_cookie() when a file is opened or closed to prevent a cache file from being culled and to keep resources to hand that are needed to do I/O. If a file is opened for writing, we invalidate it with FSCACHE_INVAL_DIO_WRITE in lieu of doing writeback to the cache, thereby making it cease caching until all currently open files are closed. This should give the same behaviour as the uptream code. Making the cache store local modifications isn't straightforward for NFS, so that's left for future patches. (5) fscache_invalidate() now needs to be given uptodate auxiliary data and a file size. It also takes a flag to indicate if this was due to a DIO write. (6) Call nfs_fscache_invalidate() with FSCACHE_INVAL_DIO_WRITE on a file to which a DIO write is made. (7) Call fscache_note_page_release() from nfs_release_page(). (8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for PG_fscache to be cleared. (9) The functions to read and write data to/from the cache are stubbed out pending a conversion to use netfslib. Changes ======= ver #3: - Added missing =n fallback for nfs_fscache_release_file()[1][2]. ver #2: - Use gfpflags_allow_blocking() rather than using flag directly. - fscache_acquire_volume() now returns errors. - Remove NFS_INO_FSCACHE as it's no longer used. - Need to unuse a cookie on file-release, not inode-clear. Signed-off-by: Dave Wysochanski <[email protected]> Co-developed-by: David Howells <[email protected]> Signed-off-by: David Howells <[email protected]> Tested-by: Dave Wysochanski <[email protected]> Acked-by: Jeff Layton <[email protected]> cc: Trond Myklebust <[email protected]> cc: Anna Schumaker <[email protected]> cc: [email protected] cc: [email protected] Link: https://lore.kernel.org/r/[email protected]/ [1] Link: https://lore.kernel.org/r/[email protected]/ [2] Link: https://lore.kernel.org/r/163819668938.215744.14448852181937731615.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906979003.143852.2601189243864854724.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967182112.1823006.7791504655391213379.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021575950.640689.12069642327533368467.stgit@warthog.procyon.org.uk/ # v4
2022-01-08nfs: block notification on fs with its own ->lockJ. Bruce Fields1-1/+1
NFSv4.1 supports an optional lock notification feature which notifies the client when a lock comes available. (Normally NFSv4 clients just poll for locks if necessary.) To make that work, we need to request a blocking lock from the filesystem. We turned that off for NFS in commit f657f8eef3ff ("nfs: don't atempt blocking locks on nfs reexports") [sic] because it actually blocks the nfsd thread while waiting for the lock. Thanks to Vasily Averin for pointing out that NFS isn't the only filesystem with that problem. Any filesystem that leaves ->lock NULL will use posix_lock_file(), which does the right thing. Simplest is just to assume that any filesystem that defines its own ->lock is not safe to request a blocking lock from. So, this patch mostly reverts commit f657f8eef3ff ("nfs: don't atempt blocking locks on nfs reexports") [sic] and commit b840be2f00c0 ("lockd: don't attempt blocking locks on nfs reexports"), and instead uses a check of ->lock (Vasily's suggestion) to decide whether to support blocking lock notifications on a given filesystem. Also add a little documentation. Perhaps someday we could add back an export flag later to allow filesystems with "good" ->lock methods to support blocking lock notifications. Reported-by: Vasily Averin <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]> [ cel: Description rewritten to address checkpatch nits ] [ cel: Fixed warning when SUNRPC debugging is disabled ] [ cel: Fixed NULL check ] Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Vasily Averin <[email protected]>
2022-01-07NFSv4: nfs_atomic_open() can race when looking up a non-regular fileTrond Myklebust1-0/+5
If the file type changes back to being a regular file on the server between the failed OPEN and our LOOKUP, then we need to re-run the OPEN. Fixes: 0dd2b474d0b6 ("nfs: implement i_op->atomic_open()") Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-01-07NFSv4: Handle case where the lookup of a directory failsTrond Myklebust1-0/+13
If the application sets the O_DIRECTORY flag, and tries to open a regular file, nfs_atomic_open() will punt to doing a regular lookup. If the server then returns a regular file, we will happily return a file descriptor with uninitialised open state. The fix is to return the expected ENOTDIR error in these cases. Reported-by: Lyu Tao <[email protected]> Fixes: 0dd2b474d0b6 ("nfs: implement i_op->atomic_open()") Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-01-07fscache, cachefiles: Disable configurationDavid Howells1-1/+1
Disable fscache and cachefiles in Kconfig whilst it is rewritten. Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: [email protected] Link: https://lore.kernel.org/r/163819576672.215744.12444272479560406780.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906882835.143852.11073015983885872901.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967075113.1823006.277316290062782998.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021481179.640689.2004199594774033658.stgit@warthog.procyon.org.uk/ # v4
2022-01-06NFSv42: Fallocate and clone should also request 'blocks used'Trond Myklebust1-5/+8
Both fallocate and clone can end up updating the blocks used attribute. Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-01-06NFSv4: Allow writebacks to request 'blocks used'Trond Myklebust2-14/+9
When doing a non-pNFS write, allow the writeback code to specify that it also needs to update 'blocks used'. Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-01-06NFS: use default_groups in kobj_typeGreg Kroah-Hartman1-1/+2
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the NFS code to use default_groups field which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Cc: Trond Myklebust <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: [email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2022-01-06NFS: Fix the verifier for case sensitive filesystem in nfs_atomic_open()Trond Myklebust1-1/+6
Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>