diff options
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/autofs.rst | 2 | ||||
-rw-r--r-- | Documentation/filesystems/cifs/ksmbd.rst | 10 | ||||
-rw-r--r-- | Documentation/filesystems/erofs.rst | 8 | ||||
-rw-r--r-- | Documentation/filesystems/f2fs.rst | 19 | ||||
-rw-r--r-- | Documentation/filesystems/idmappings.rst | 72 | ||||
-rw-r--r-- | Documentation/filesystems/locking.rst | 5 | ||||
-rw-r--r-- | Documentation/filesystems/netfs_library.rst | 95 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/index.rst | 1 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/reexport.rst | 113 |
9 files changed, 203 insertions, 122 deletions
diff --git a/Documentation/filesystems/autofs.rst b/Documentation/filesystems/autofs.rst index 681c6a492bc0..4f490278d22f 100644 --- a/Documentation/filesystems/autofs.rst +++ b/Documentation/filesystems/autofs.rst @@ -35,7 +35,7 @@ This document describes only the kernel module and the interactions required with any user-space program. Subsequent text refers to this as the "automount daemon" or simply "the daemon". -"autofs" is a Linux kernel module with provides the "autofs" +"autofs" is a Linux kernel module which provides the "autofs" filesystem type. Several "autofs" filesystems can be mounted and they can each be managed separately, or all managed by the same daemon. diff --git a/Documentation/filesystems/cifs/ksmbd.rst b/Documentation/filesystems/cifs/ksmbd.rst index a1326157d53f..b0d354fd8066 100644 --- a/Documentation/filesystems/cifs/ksmbd.rst +++ b/Documentation/filesystems/cifs/ksmbd.rst @@ -50,11 +50,11 @@ ksmbd.mountd (user space daemon) -------------------------------- ksmbd.mountd is userspace process to, transfer user account and password that -are registered using ksmbd.adduser(part of utils for user space). Further it +are registered using ksmbd.adduser (part of utils for user space). Further it allows sharing information parameters that parsed from smb.conf to ksmbd in kernel. For the execution part it has a daemon which is continuously running and connected to the kernel interface using netlink socket, it waits for the -requests(dcerpc and share/user info). It handles RPC calls (at a minimum few +requests (dcerpc and share/user info). It handles RPC calls (at a minimum few dozen) that are most important for file server from NetShareEnum and NetServerGetInfo. Complete DCE/RPC response is prepared from the user space and passed over to the associated kernel thread for the client. @@ -154,11 +154,11 @@ Each layer 1. Enable all component prints # sudo ksmbd.control -d "all" -2. Enable one of components(smb, auth, vfs, oplock, ipc, conn, rdma) +2. Enable one of components (smb, auth, vfs, oplock, ipc, conn, rdma) # sudo ksmbd.control -d "smb" -3. Show what prints are enable. - # cat/sys/class/ksmbd-control/debug +3. Show what prints are enabled. + # cat /sys/class/ksmbd-control/debug [smb] auth vfs oplock ipc conn [rdma] 4. Disable prints: diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst index 01df283c7d04..7119aa213be7 100644 --- a/Documentation/filesystems/erofs.rst +++ b/Documentation/filesystems/erofs.rst @@ -93,6 +93,14 @@ dax A legacy option which is an alias for ``dax=always``. device=%s Specify a path to an extra device to be used together. =================== ========================================================= +Sysfs Entries +============= + +Information about mounted erofs file systems can be found in /sys/fs/erofs. +Each mounted filesystem will have a directory in /sys/fs/erofs based on its +device name (i.e., /sys/fs/erofs/sda). +(see also Documentation/ABI/testing/sysfs-fs-erofs) + On-disk details =============== diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst index 6f3c6e91346d..d7b84695f56a 100644 --- a/Documentation/filesystems/f2fs.rst +++ b/Documentation/filesystems/f2fs.rst @@ -197,10 +197,29 @@ fault_type=%d Support configuring fault injection type, should be FAULT_DISCARD 0x000002000 FAULT_WRITE_IO 0x000004000 FAULT_SLAB_ALLOC 0x000008000 + FAULT_DQUOT_INIT 0x000010000 =================== =========== mode=%s Control block allocation mode which supports "adaptive" and "lfs". In "lfs" mode, there should be no random writes towards main area. + "fragment:segment" and "fragment:block" are newly added here. + These are developer options for experiments to simulate filesystem + fragmentation/after-GC situation itself. The developers use these + modes to understand filesystem fragmentation/after-GC condition well, + and eventually get some insights to handle them better. + In "fragment:segment", f2fs allocates a new segment in ramdom + position. With this, we can simulate the after-GC condition. + In "fragment:block", we can scatter block allocation with + "max_fragment_chunk" and "max_fragment_hole" sysfs nodes. + We added some randomness to both chunk and hole size to make + it close to realistic IO pattern. So, in this mode, f2fs will allocate + 1..<max_fragment_chunk> blocks in a chunk and make a hole in the + length of 1..<max_fragment_hole> by turns. With this, the newly + allocated blocks will be scattered throughout the whole partition. + Note that "fragment:block" implicitly enables "fragment:segment" + option for more randomness. + Please, use these options for your experiments and we strongly + recommend to re-format the filesystem after using these options. io_bits=%u Set the bit size of write IO requests. It should be set with "mode=lfs". usrquota Enable plain user disk quota accounting. diff --git a/Documentation/filesystems/idmappings.rst b/Documentation/filesystems/idmappings.rst index 1229a75ec75d..7a879ec3b6bf 100644 --- a/Documentation/filesystems/idmappings.rst +++ b/Documentation/filesystems/idmappings.rst @@ -952,75 +952,3 @@ The raw userspace id that is put on disk is ``u1000`` so when the user takes their home directory back to their home computer where they are assigned ``u1000`` using the initial idmapping and mount the filesystem with the initial idmapping they will see all those files owned by ``u1000``. - -Shortcircuting --------------- - -Currently, the implementation of idmapped mounts enforces that the filesystem -is mounted with the initial idmapping. The reason is simply that none of the -filesystems that we targeted were mountable with a non-initial idmapping. But -that might change soon enough. As we've seen above, thanks to the properties of -idmappings the translation works for both filesystems mounted with the initial -idmapping and filesystem with non-initial idmappings. - -Based on this current restriction to filesystem mounted with the initial -idmapping two noticeable shortcuts have been taken: - -1. We always stash a reference to the initial user namespace in ``struct - vfsmount``. Idmapped mounts are thus mounts that have a non-initial user - namespace attached to them. - - In order to support idmapped mounts this needs to be changed. Instead of - stashing the initial user namespace the user namespace the filesystem was - mounted with must be stashed. An idmapped mount is then any mount that has - a different user namespace attached then the filesystem was mounted with. - This has no user-visible consequences. - -2. The translation algorithms in ``mapped_fs*id()`` and ``i_*id_into_mnt()`` - are simplified. - - Let's consider ``mapped_fs*id()`` first. This function translates the - caller's kernel id into a kernel id in the filesystem's idmapping via - a mount's idmapping. The full algorithm is:: - - mapped_fsuid(kid): - /* Map the kernel id up into a userspace id in the mount's idmapping. */ - from_kuid(mount-idmapping, kid) = uid - - /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ - make_kuid(filesystem-idmapping, uid) = kuid - - We know that the filesystem is always mounted with the initial idmapping as - we enforce this in ``mount_setattr()``. So this can be shortened to:: - - mapped_fsuid(kid): - /* Map the kernel id up into a userspace id in the mount's idmapping. */ - from_kuid(mount-idmapping, kid) = uid - - /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ - KUIDT_INIT(uid) = kuid - - Similarly, for ``i_*id_into_mnt()`` which translated the filesystem's kernel - id into a mount's kernel id:: - - i_uid_into_mnt(kid): - /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ - from_kuid(filesystem-idmapping, kid) = uid - - /* Map the userspace id down into a kernel id in the mounts's idmapping. */ - make_kuid(mount-idmapping, uid) = kuid - - Again, we know that the filesystem is always mounted with the initial - idmapping as we enforce this in ``mount_setattr()``. So this can be - shortened to:: - - i_uid_into_mnt(kid): - /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ - __kuid_val(kid) = uid - - /* Map the userspace id down into a kernel id in the mounts's idmapping. */ - make_kuid(mount-idmapping, uid) = kuid - -Handling filesystems mounted with non-initial idmappings requires that the -translation functions be converted to their full form. They can still be -shortcircuited on non-idmapped mounts. This has no user-visible consequences. diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index d36fe79167b3..3f9b1497ebb8 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -169,7 +169,6 @@ prototypes:: int (*show_options)(struct seq_file *, struct dentry *); ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); - int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t); locking rules: All may block [not true, see below] @@ -194,7 +193,6 @@ umount_begin: no show_options: no (namespace_sem) quota_read: no (see below) quota_write: no (see below) -bdev_try_to_free_page: no (see below) ====================== ============ ======================== ->statfs() has s_umount (shared) when called by ustat(2) (native or @@ -210,9 +208,6 @@ dqio_sem) (unless an admin really wants to screw up something and writes to quota files with quotas on). For other details about locking see also dquot_operations section. -->bdev_try_to_free_page is called from the ->releasepage handler of -the block device inode. See there for more details. - file_system_type ================ diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst index bb68d39f03b7..375baca7edcd 100644 --- a/Documentation/filesystems/netfs_library.rst +++ b/Documentation/filesystems/netfs_library.rst @@ -1,7 +1,7 @@ .. SPDX-License-Identifier: GPL-2.0 ================================= -NETWORK FILESYSTEM HELPER LIBRARY +Network Filesystem Helper Library ================================= .. Contents: @@ -37,22 +37,22 @@ into a common call framework. The following services are provided: - * Handles transparent huge pages (THPs). + * Handle folios that span multiple pages. - * Insulates the netfs from VM interface changes. + * Insulate the netfs from VM interface changes. - * Allows the netfs to arbitrarily split reads up into pieces, even ones that - don't match page sizes or page alignments and that may cross pages. + * Allow the netfs to arbitrarily split reads up into pieces, even ones that + don't match folio sizes or folio alignments and that may cross folios. - * Allows the netfs to expand a readahead request in both directions to meet - its needs. + * Allow the netfs to expand a readahead request in both directions to meet its + needs. - * Allows the netfs to partially fulfil a read, which will then be resubmitted. + * Allow the netfs to partially fulfil a read, which will then be resubmitted. - * Handles local caching, allowing cached data and server-read data to be + * Handle local caching, allowing cached data and server-read data to be interleaved for a single request. - * Handles clearing of bufferage that aren't on the server. + * Handle clearing of bufferage that aren't on the server. * Handle retrying of reads that failed, switching reads from the cache to the server as necessary. @@ -70,22 +70,22 @@ Read Helper Functions Three read helpers are provided:: - * void netfs_readahead(struct readahead_control *ractl, - const struct netfs_read_request_ops *ops, - void *netfs_priv);`` - * int netfs_readpage(struct file *file, - struct page *page, - const struct netfs_read_request_ops *ops, - void *netfs_priv); - * int netfs_write_begin(struct file *file, - struct address_space *mapping, - loff_t pos, - unsigned int len, - unsigned int flags, - struct page **_page, - void **_fsdata, - const struct netfs_read_request_ops *ops, - void *netfs_priv); + void netfs_readahead(struct readahead_control *ractl, + const struct netfs_read_request_ops *ops, + void *netfs_priv); + int netfs_readpage(struct file *file, + struct folio *folio, + const struct netfs_read_request_ops *ops, + void *netfs_priv); + int netfs_write_begin(struct file *file, + struct address_space *mapping, + loff_t pos, + unsigned int len, + unsigned int flags, + struct folio **_folio, + void **_fsdata, + const struct netfs_read_request_ops *ops, + void *netfs_priv); Each corresponds to a VM operation, with the addition of a couple of parameters for the use of the read helpers: @@ -103,8 +103,8 @@ Both of these values will be stored into the read request structure. For ->readahead() and ->readpage(), the network filesystem should just jump into the corresponding read helper; whereas for ->write_begin(), it may be a little more complicated as the network filesystem might want to flush -conflicting writes or track dirty data and needs to put the acquired page if an -error occurs after calling the helper. +conflicting writes or track dirty data and needs to put the acquired folio if +an error occurs after calling the helper. The helpers manage the read request, calling back into the network filesystem through the suppplied table of operations. Waits will be performed as @@ -253,7 +253,7 @@ through which it can issue requests and negotiate:: void (*issue_op)(struct netfs_read_subrequest *subreq); bool (*is_still_valid)(struct netfs_read_request *rreq); int (*check_write_begin)(struct file *file, loff_t pos, unsigned len, - struct page *page, void **_fsdata); + struct folio *folio, void **_fsdata); void (*done)(struct netfs_read_request *rreq); void (*cleanup)(struct address_space *mapping, void *netfs_priv); }; @@ -313,13 +313,14 @@ The operations are as follows: There is no return value; the netfs_subreq_terminated() function should be called to indicate whether or not the operation succeeded and how much data - it transferred. The filesystem also should not deal with setting pages + it transferred. The filesystem also should not deal with setting folios uptodate, unlocking them or dropping their refs - the helpers need to deal with this as they have to coordinate with copying to the local cache. - Note that the helpers have the pages locked, but not pinned. It is possible - to use the ITER_XARRAY iov iterator to refer to the range of the inode that - is being operated upon without the need to allocate large bvec tables. + Note that the helpers have the folios locked, but not pinned. It is + possible to use the ITER_XARRAY iov iterator to refer to the range of the + inode that is being operated upon without the need to allocate large bvec + tables. * ``is_still_valid()`` @@ -330,15 +331,15 @@ The operations are as follows: * ``check_write_begin()`` [Optional] This is called from the netfs_write_begin() helper once it has - allocated/grabbed the page to be modified to allow the filesystem to flush + allocated/grabbed the folio to be modified to allow the filesystem to flush conflicting state before allowing it to be modified. - It should return 0 if everything is now fine, -EAGAIN if the page should be + It should return 0 if everything is now fine, -EAGAIN if the folio should be regrabbed and any other error code to abort the operation. * ``done`` - [Optional] This is called after the pages in the request have all been + [Optional] This is called after the folios in the request have all been unlocked (and marked uptodate if applicable). * ``cleanup`` @@ -390,7 +391,7 @@ The read helpers work by the following general procedure: * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to the end of the slice instead of reissuing. - * Once the data is read, the pages that have been fully read/cleared: + * Once the data is read, the folios that have been fully read/cleared: * Will be marked uptodate. @@ -398,11 +399,11 @@ The read helpers work by the following general procedure: * Unlocked - * Any pages that need writing to the cache will then have DIO writes issued. + * Any folios that need writing to the cache will then have DIO writes issued. * Synchronous operations will wait for reading to be complete. - * Writes to the cache will proceed asynchronously and the pages will have the + * Writes to the cache will proceed asynchronously and the folios will have the PG_fscache mark removed when that completes. * The request structures will be cleaned up when everything has completed. @@ -452,6 +453,9 @@ operation table looks like the following:: netfs_io_terminated_t term_func, void *term_func_priv); + int (*prepare_write)(struct netfs_cache_resources *cres, + loff_t *_start, size_t *_len, loff_t i_size); + int (*write)(struct netfs_cache_resources *cres, loff_t start_pos, struct iov_iter *iter, @@ -509,6 +513,14 @@ The methods defined in the table are: indicating whether the termination is definitely happening in the caller's context. + * ``prepare_write()`` + + [Required] Called to adjust a write to the cache and check that there is + sufficient space in the cache. The start and length values indicate the + size of the write that netfslib is proposing, and this can be adjusted by + the cache to respect DIO boundaries. The file size is passed for + information. + * ``write()`` [Required] Called to write to the cache. The start file offset is given @@ -525,4 +537,9 @@ not the read request structure as they could be used in other situations where there isn't a read request structure as well, such as writing dirty data to the cache. + +API Function Reference +====================== + .. kernel-doc:: include/linux/netfs.h +.. kernel-doc:: fs/netfs/read_helper.c diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst index 65805624e39b..288d8ddb2bc6 100644 --- a/Documentation/filesystems/nfs/index.rst +++ b/Documentation/filesystems/nfs/index.rst @@ -11,3 +11,4 @@ NFS rpc-server-gss nfs41-server knfsd-stats + reexport diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst new file mode 100644 index 000000000000..ff9ae4a46530 --- /dev/null +++ b/Documentation/filesystems/nfs/reexport.rst @@ -0,0 +1,113 @@ +Reexporting NFS filesystems +=========================== + +Overview +-------- + +It is possible to reexport an NFS filesystem over NFS. However, this +feature comes with a number of limitations. Before trying it, we +recommend some careful research to determine whether it will work for +your purposes. + +A discussion of current known limitations follows. + +"fsid=" required, crossmnt broken +--------------------------------- + +We require the "fsid=" export option on any reexport of an NFS +filesystem. You can use "uuidgen -r" to generate a unique argument. + +The "crossmnt" export does not propagate "fsid=", so it will not allow +traversing into further nfs filesystems; if you wish to export nfs +filesystems mounted under the exported filesystem, you'll need to export +them explicitly, assigning each its own unique "fsid= option. + +Reboot recovery +--------------- + +The NFS protocol's normal reboot recovery mechanisms don't work for the +case when the reexport server reboots. Clients will lose any locks +they held before the reboot, and further IO will result in errors. +Closing and reopening files should clear the errors. + +Filehandle limits +----------------- + +If the original server uses an X byte filehandle for a given object, the +reexport server's filehandle for the reexported object will be X+22 +bytes, rounded up to the nearest multiple of four bytes. + +The result must fit into the RFC-mandated filehandle size limits: + ++-------+-----------+ +| NFSv2 | 32 bytes | ++-------+-----------+ +| NFSv3 | 64 bytes | ++-------+-----------+ +| NFSv4 | 128 bytes | ++-------+-----------+ + +So, for example, you will only be able to reexport a filesystem over +NFSv2 if the original server gives you filehandles that fit in 10 +bytes--which is unlikely. + +In general there's no way to know the maximum filehandle size given out +by an NFS server without asking the server vendor. + +But the following table gives a few examples. The first column is the +typical length of the filehandle from a Linux server exporting the given +filesystem, the second is the length after that nfs export is reexported +by another Linux host: + ++--------+-------------------+----------------+ +| | filehandle length | after reexport | ++========+===================+================+ +| ext4: | 28 bytes | 52 bytes | ++--------+-------------------+----------------+ +| xfs: | 32 bytes | 56 bytes | ++--------+-------------------+----------------+ +| btrfs: | 40 bytes | 64 bytes | ++--------+-------------------+----------------+ + +All will therefore fit in an NFSv3 or NFSv4 filehandle after reexport, +but none are reexportable over NFSv2. + +Linux server filehandles are a bit more complicated than this, though; +for example: + + - The (non-default) "subtreecheck" export option generally + requires another 4 to 8 bytes in the filehandle. + - If you export a subdirectory of a filesystem (instead of + exporting the filesystem root), that also usually adds 4 to 8 + bytes. + - If you export over NFSv2, knfsd usually uses a shorter + filesystem identifier that saves 8 bytes. + - The root directory of an export uses a filehandle that is + shorter. + +As you can see, the 128-byte NFSv4 filehandle is large enough that +you're unlikely to have trouble using NFSv4 to reexport any filesystem +exported from a Linux server. In general, if the original server is +something that also supports NFSv3, you're *probably* OK. Re-exporting +over NFSv3 may be dicier, and reexporting over NFSv2 will probably +never work. + +For more details of Linux filehandle structure, the best reference is +the source code and comments; see in particular: + + - include/linux/exportfs.h:enum fid_type + - include/uapi/linux/nfsd/nfsfh.h:struct nfs_fhbase_new + - fs/nfsd/nfsfh.c:set_version_and_fsid_type + - fs/nfs/export.c:nfs_encode_fh + +Open DENY bits ignored +---------------------- + +NFS since NFSv4 supports ALLOW and DENY bits taken from Windows, which +allow you, for example, to open a file in a mode which forbids other +read opens or write opens. The Linux client doesn't use them, and the +server's support has always been incomplete: they are enforced only +against other NFS users, not against processes accessing the exported +filesystem locally. A reexport server will also not pass them along to +the original server, so they will not be enforced between clients of +different reexport servers. |