diff options
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/9p.rst | 2 | ||||
-rw-r--r-- | Documentation/filesystems/autofs.rst | 4 | ||||
-rw-r--r-- | Documentation/filesystems/bcachefs/CodingStyle.rst | 2 | ||||
-rw-r--r-- | Documentation/filesystems/caching/fscache.rst | 8 | ||||
-rw-r--r-- | Documentation/filesystems/erofs.rst | 2 | ||||
-rw-r--r-- | Documentation/filesystems/fsverity.rst | 27 | ||||
-rw-r--r-- | Documentation/filesystems/idmappings.rst | 8 | ||||
-rw-r--r-- | Documentation/filesystems/iomap/design.rst | 10 | ||||
-rw-r--r-- | Documentation/filesystems/journalling.rst | 6 | ||||
-rw-r--r-- | Documentation/filesystems/locking.rst | 6 | ||||
-rw-r--r-- | Documentation/filesystems/netfs_library.rst | 2 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/index.rst | 1 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/localio.rst | 357 | ||||
-rw-r--r-- | Documentation/filesystems/overlayfs.rst | 7 | ||||
-rw-r--r-- | Documentation/filesystems/smb/ksmbd.rst | 26 | ||||
-rw-r--r-- | Documentation/filesystems/vfs.rst | 15 |
16 files changed, 433 insertions, 50 deletions
diff --git a/Documentation/filesystems/9p.rst b/Documentation/filesystems/9p.rst index 1e0e0bb6fdf9..d270a0aa8d55 100644 --- a/Documentation/filesystems/9p.rst +++ b/Documentation/filesystems/9p.rst @@ -31,7 +31,7 @@ Other applications are described in the following papers: * PROSE I/O: Using 9p to enable Application Partitions http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf * VirtFS: A Virtualization Aware File System pass-through - http://goo.gl/3WPDg + https://kernel.org/doc/ols/2010/ols2010-pages-109-120.pdf Usage ===== diff --git a/Documentation/filesystems/autofs.rst b/Documentation/filesystems/autofs.rst index 3b6e38e646cd..1ac576458c69 100644 --- a/Documentation/filesystems/autofs.rst +++ b/Documentation/filesystems/autofs.rst @@ -18,7 +18,7 @@ key advantages: 2. The names and locations of filesystems can be stored in a remote database and can change at any time. The content - in that data base at the time of access will be used to provide + in that database at the time of access will be used to provide a target for the access. The interpretation of names in the filesystem can even be programmatic rather than database-backed, allowing wildcards for example, and can vary based on the user who @@ -423,7 +423,7 @@ The available ioctl commands are: and objects are expired if the are not in use. **AUTOFS_EXP_FORCED** causes the in use status to be ignored - and objects are expired ieven if they are in use. This assumes + and objects are expired even if they are in use. This assumes that the daemon has requested this because it is capable of performing the umount. diff --git a/Documentation/filesystems/bcachefs/CodingStyle.rst b/Documentation/filesystems/bcachefs/CodingStyle.rst index 0c45829a4899..01de555e21d8 100644 --- a/Documentation/filesystems/bcachefs/CodingStyle.rst +++ b/Documentation/filesystems/bcachefs/CodingStyle.rst @@ -175,7 +175,7 @@ errors in our thinking by running our code and seeing what happens. If your time is being wasted because your tools are bad or too slow - don't accept it, fix it. -Put effort into your documentation, commmit messages, and code comments - but +Put effort into your documentation, commit messages, and code comments - but don't go overboard. A good commit message is wonderful - but if the information was important enough to go in a commit message, ask yourself if it would be even better as a code comment. diff --git a/Documentation/filesystems/caching/fscache.rst b/Documentation/filesystems/caching/fscache.rst index a74d7b052dc1..de1f32526cc1 100644 --- a/Documentation/filesystems/caching/fscache.rst +++ b/Documentation/filesystems/caching/fscache.rst @@ -318,10 +318,10 @@ where the columns are: Debugging ========= -If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime -debugging enabled by adjusting the value in:: +If CONFIG_NETFS_DEBUG is enabled, the FS-Cache facility and NETFS support can +have runtime debugging enabled by adjusting the value in:: - /sys/module/fscache/parameters/debug + /sys/module/netfs/parameters/debug This is a bitmask of debugging streams to enable: @@ -343,6 +343,6 @@ This is a bitmask of debugging streams to enable: The appropriate set of values should be OR'd together and the result written to the control file. For example:: - echo $((1|8|512)) >/sys/module/fscache/parameters/debug + echo $((1|8|512)) >/sys/module/netfs/parameters/debug will turn on all function entry debugging. diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst index cc4626d6ee4f..c293f8e37468 100644 --- a/Documentation/filesystems/erofs.rst +++ b/Documentation/filesystems/erofs.rst @@ -75,7 +75,7 @@ Here are the main features of EROFS: - Support merging tail-end data into a special inode as fragments. - - Support large folios for uncompressed files. + - Support large folios to make use of THPs (Transparent Hugepages); - Support direct I/O on uncompressed files to avoid double caching for loop devices; diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst index 13e4b18e5dbb..0e2fac7a16da 100644 --- a/Documentation/filesystems/fsverity.rst +++ b/Documentation/filesystems/fsverity.rst @@ -86,6 +86,16 @@ authenticating fs-verity file hashes include: signature in their "security.ima" extended attribute, as controlled by the IMA policy. For more information, see the IMA documentation. +- Integrity Policy Enforcement (IPE). IPE supports enforcing access + control decisions based on immutable security properties of files, + including those protected by fs-verity's built-in signatures. + "IPE policy" specifically allows for the authorization of fs-verity + files using properties ``fsverity_digest`` for identifying + files by their verity digest, and ``fsverity_signature`` to authorize + files with a verified fs-verity's built-in signature. For + details on configuring IPE policies and understanding its operational + modes, please refer to :doc:`IPE admin guide </admin-guide/LSM/ipe>`. + - Trusted userspace code in combination with `Built-in signature verification`_. This approach should be used only with great care. @@ -457,7 +467,11 @@ Enabling this option adds the following: On success, the ioctl persists the signature alongside the Merkle tree. Then, any time the file is opened, the kernel verifies the file's actual digest against this signature, using the certificates - in the ".fs-verity" keyring. + in the ".fs-verity" keyring. This verification happens as long as the + file's signature exists, regardless of the state of the sysctl variable + "fs.verity.require_signatures" described in the next item. The IPE LSM + relies on this behavior to recognize and label fsverity files + that contain a verified built-in fsverity signature. 3. A new sysctl "fs.verity.require_signatures" is made available. When set to 1, the kernel requires that all verity files have a @@ -481,7 +495,7 @@ be carefully considered before using them: - Builtin signature verification does *not* make the kernel enforce that any files actually have fs-verity enabled. Thus, it is not a - complete authentication policy. Currently, if it is used, the only + complete authentication policy. Currently, if it is used, one way to complete the authentication policy is for trusted userspace code to explicitly check whether files have fs-verity enabled with a signature before they are accessed. (With @@ -490,6 +504,15 @@ be carefully considered before using them: could just store the signature alongside the file and verify it itself using a cryptographic library, instead of using this feature. +- Another approach is to utilize fs-verity builtin signature + verification in conjunction with the IPE LSM, which supports defining + a kernel-enforced, system-wide authentication policy that allows only + files with a verified fs-verity builtin signature to perform certain + operations, such as execution. Note that IPE doesn't require + fs.verity.require_signatures=1. + Please refer to :doc:`IPE admin guide </admin-guide/LSM/ipe>` for + more details. + - A file's builtin signature can only be set at the same time that fs-verity is being enabled on the file. Changing or deleting the builtin signature later requires re-creating the file. diff --git a/Documentation/filesystems/idmappings.rst b/Documentation/filesystems/idmappings.rst index ac0af679e61e..77930c77fcfe 100644 --- a/Documentation/filesystems/idmappings.rst +++ b/Documentation/filesystems/idmappings.rst @@ -821,7 +821,7 @@ the same idmapping to the mount. We now perform three steps: /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ make_kuid(u0:k20000:r10000, u1000) = k21000 -2. Verify that the caller's kernel ids can be mapped to userspace ids in the +3. Verify that the caller's kernel ids can be mapped to userspace ids in the filesystem's idmapping:: from_kuid(u0:k20000:r10000, k21000) = u1000 @@ -854,10 +854,10 @@ The same translation algorithm works with the third example. /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ make_kuid(u0:k0:r4294967295, u1000) = k1000 -2. Verify that the caller's kernel ids can be mapped to userspace ids in the +3. Verify that the caller's kernel ids can be mapped to userspace ids in the filesystem's idmapping:: - from_kuid(u0:k0:r4294967295, k21000) = u1000 + from_kuid(u0:k0:r4294967295, k1000) = u1000 So the ownership that lands on disk will be ``u1000``. @@ -994,7 +994,7 @@ from above::: /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ make_kuid(u0:k0:r4294967295, u1000) = k1000 -2. Verify that the caller's filesystem ids can be mapped to userspace ids in the +3. Verify that the caller's filesystem ids can be mapped to userspace ids in the filesystem's idmapping:: from_kuid(u0:k0:r4294967295, k1000) = u1000 diff --git a/Documentation/filesystems/iomap/design.rst b/Documentation/filesystems/iomap/design.rst index f8ee3427bc1a..b0d0188a095e 100644 --- a/Documentation/filesystems/iomap/design.rst +++ b/Documentation/filesystems/iomap/design.rst @@ -142,9 +142,9 @@ Definitions * **pure overwrite**: A write operation that does not require any metadata or zeroing operations to perform during either submission or completion. - This implies that the fileystem must have already allocated space + This implies that the filesystem must have already allocated space on disk as ``IOMAP_MAPPED`` and the filesystem must not place any - constaints on IO alignment or size. + constraints on IO alignment or size. The only constraints on I/O alignment are device level (minimum I/O size and alignment, typically sector size). @@ -165,7 +165,7 @@ structure below: u16 flags; struct block_device *bdev; struct dax_device *dax_dev; - voidw *inline_data; + void *inline_data; void *private; const struct iomap_folio_ops *folio_ops; u64 validity_cookie; @@ -394,7 +394,7 @@ iomap is concerned: * The **upper** level primitive is provided by the filesystem to coordinate access to different iomap operations. - The exact primitive is specifc to the filesystem and operation, + The exact primitive is specific to the filesystem and operation, but is often a VFS inode, pagecache invalidation, or folio lock. For example, a filesystem might take ``i_rwsem`` before calling ``iomap_file_buffered_write`` and ``iomap_file_unshare`` to prevent @@ -426,7 +426,7 @@ iomap is concerned: The exact locking requirements are specific to the filesystem; for certain operations, some of these locks can be elided. -All further mention of locking are *recommendations*, not mandates. +All further mentions of locking are *recommendations*, not mandates. Each filesystem author must figure out the locking for themself. Bugs and Limitations diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst index e18f90ffc6fd..0254f7d57429 100644 --- a/Documentation/filesystems/journalling.rst +++ b/Documentation/filesystems/journalling.rst @@ -137,7 +137,7 @@ Fast commits JBD2 to also allows you to perform file-system specific delta commits known as fast commits. In order to use fast commits, you will need to set following -callbacks that perform correspodning work: +callbacks that perform corresponding work: `journal->j_fc_cleanup_cb`: Cleanup function called after every full commit and fast commit. @@ -149,7 +149,7 @@ File system is free to perform fast commits as and when it wants as long as it gets permission from JBD2 to do so by calling the function :c:func:`jbd2_fc_begin_commit()`. Once a fast commit is done, the client file system should tell JBD2 about it by calling -:c:func:`jbd2_fc_end_commit()`. If file system wants JBD2 to perform a full +:c:func:`jbd2_fc_end_commit()`. If the file system wants JBD2 to perform a full commit immediately after stopping the fast commit it can do so by calling :c:func:`jbd2_fc_end_commit_fallback()`. This is useful if fast commit operation fails for some reason and the only way to guarantee consistency is for JBD2 to @@ -199,7 +199,7 @@ Journal Level .. kernel-doc:: fs/jbd2/recovery.c :internal: -Transasction Level +Transaction Level ~~~~~~~~~~~~~~~~~~ .. kernel-doc:: fs/jbd2/transaction.c diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index e664061ed55d..f5e3676db954 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -251,10 +251,10 @@ prototypes:: void (*readahead)(struct readahead_control *); int (*write_begin)(struct file *, struct address_space *mapping, loff_t pos, unsigned len, - struct page **pagep, void **fsdata); + struct folio **foliop, void **fsdata); int (*write_end)(struct file *, struct address_space *mapping, loff_t pos, unsigned len, unsigned copied, - struct page *page, void *fsdata); + struct folio *folio, void *fsdata); sector_t (*bmap)(struct address_space *, sector_t); void (*invalidate_folio) (struct folio *, size_t start, size_t len); bool (*release_folio)(struct folio *, gfp_t); @@ -280,7 +280,7 @@ read_folio: yes, unlocks shared writepages: dirty_folio: maybe readahead: yes, unlocks shared -write_begin: locks the page exclusive +write_begin: locks the folio exclusive write_end: yes, unlocks exclusive bmap: invalidate_folio: yes exclusive diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst index 4cc657d743f7..f0d2cb257bb8 100644 --- a/Documentation/filesystems/netfs_library.rst +++ b/Documentation/filesystems/netfs_library.rst @@ -116,7 +116,7 @@ The following services are provided: * Handle local caching, allowing cached data and server-read data to be interleaved for a single request. - * Handle clearing of bufferage that aren't on the server. + * Handle clearing of bufferage that isn't on the server. * Handle retrying of reads that failed, switching reads from the cache to the server as necessary. diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst index 8536134f31fd..95c2c009874c 100644 --- a/Documentation/filesystems/nfs/index.rst +++ b/Documentation/filesystems/nfs/index.rst @@ -8,6 +8,7 @@ NFS client-identifier exporting + localio pnfs rpc-cache rpc-server-gss diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst new file mode 100644 index 000000000000..bd1967e2eab3 --- /dev/null +++ b/Documentation/filesystems/nfs/localio.rst @@ -0,0 +1,357 @@ +=========== +NFS LOCALIO +=========== + +Overview +======== + +The LOCALIO auxiliary RPC protocol allows the Linux NFS client and +server to reliably handshake to determine if they are on the same +host. Select "NFS client and server support for LOCALIO auxiliary +protocol" in menuconfig to enable CONFIG_NFS_LOCALIO in the kernel +config (both CONFIG_NFS_FS and CONFIG_NFSD must also be enabled). + +Once an NFS client and server handshake as "local", the client will +bypass the network RPC protocol for read, write and commit operations. +Due to this XDR and RPC bypass, these operations will operate faster. + +The LOCALIO auxiliary protocol's implementation, which uses the same +connection as NFS traffic, follows the pattern established by the NFS +ACL protocol extension. + +The LOCALIO auxiliary protocol is needed to allow robust discovery of +clients local to their servers. In a private implementation that +preceded use of this LOCALIO protocol, a fragile sockaddr network +address based match against all local network interfaces was attempted. +But unlike the LOCALIO protocol, the sockaddr-based matching didn't +handle use of iptables or containers. + +The robust handshake between local client and server is just the +beginning, the ultimate use case this locality makes possible is the +client is able to open files and issue reads, writes and commits +directly to the server without having to go over the network. The +requirement is to perform these loopback NFS operations as efficiently +as possible, this is particularly useful for container use cases +(e.g. kubernetes) where it is possible to run an IO job local to the +server. + +The performance advantage realized from LOCALIO's ability to bypass +using XDR and RPC for reads, writes and commits can be extreme, e.g.: + +fio for 20 secs with directio, qd of 8, 16 libaio threads: + - With LOCALIO: + 4K read: IOPS=979k, BW=3825MiB/s (4011MB/s)(74.7GiB/20002msec) + 4K write: IOPS=165k, BW=646MiB/s (678MB/s)(12.6GiB/20002msec) + 128K read: IOPS=402k, BW=49.1GiB/s (52.7GB/s)(982GiB/20002msec) + 128K write: IOPS=11.5k, BW=1433MiB/s (1503MB/s)(28.0GiB/20004msec) + + - Without LOCALIO: + 4K read: IOPS=79.2k, BW=309MiB/s (324MB/s)(6188MiB/20003msec) + 4K write: IOPS=59.8k, BW=234MiB/s (245MB/s)(4671MiB/20002msec) + 128K read: IOPS=33.9k, BW=4234MiB/s (4440MB/s)(82.7GiB/20004msec) + 128K write: IOPS=11.5k, BW=1434MiB/s (1504MB/s)(28.0GiB/20011msec) + +fio for 20 secs with directio, qd of 8, 1 libaio thread: + - With LOCALIO: + 4K read: IOPS=230k, BW=898MiB/s (941MB/s)(17.5GiB/20001msec) + 4K write: IOPS=22.6k, BW=88.3MiB/s (92.6MB/s)(1766MiB/20001msec) + 128K read: IOPS=38.8k, BW=4855MiB/s (5091MB/s)(94.8GiB/20001msec) + 128K write: IOPS=11.4k, BW=1428MiB/s (1497MB/s)(27.9GiB/20001msec) + + - Without LOCALIO: + 4K read: IOPS=77.1k, BW=301MiB/s (316MB/s)(6022MiB/20001msec) + 4K write: IOPS=32.8k, BW=128MiB/s (135MB/s)(2566MiB/20001msec) + 128K read: IOPS=24.4k, BW=3050MiB/s (3198MB/s)(59.6GiB/20001msec) + 128K write: IOPS=11.4k, BW=1430MiB/s (1500MB/s)(27.9GiB/20001msec) + +FAQ +=== + +1. What are the use cases for LOCALIO? + + a. Workloads where the NFS client and server are on the same host + realize improved IO performance. In particular, it is common when + running containerised workloads for jobs to find themselves + running on the same host as the knfsd server being used for + storage. + +2. What are the requirements for LOCALIO? + + a. Bypass use of the network RPC protocol as much as possible. This + includes bypassing XDR and RPC for open, read, write and commit + operations. + b. Allow client and server to autonomously discover if they are + running local to each other without making any assumptions about + the local network topology. + c. Support the use of containers by being compatible with relevant + namespaces (e.g. network, user, mount). + d. Support all versions of NFS. NFSv3 is of particular importance + because it has wide enterprise usage and pNFS flexfiles makes use + of it for the data path. + +3. Why doesn’t LOCALIO just compare IP addresses or hostnames when + deciding if the NFS client and server are co-located on the same + host? + + Since one of the main use cases is containerised workloads, we cannot + assume that IP addresses will be shared between the client and + server. This sets up a requirement for a handshake protocol that + needs to go over the same connection as the NFS traffic in order to + identify that the client and the server really are running on the + same host. The handshake uses a secret that is sent over the wire, + and can be verified by both parties by comparing with a value stored + in shared kernel memory if they are truly co-located. + +4. Does LOCALIO improve pNFS flexfiles? + + Yes, LOCALIO complements pNFS flexfiles by allowing it to take + advantage of NFS client and server locality. Policy that initiates + client IO as closely to the server where the data is stored naturally + benefits from the data path optimization LOCALIO provides. + +5. Why not develop a new pNFS layout to enable LOCALIO? + + A new pNFS layout could be developed, but doing so would put the + onus on the server to somehow discover that the client is co-located + when deciding to hand out the layout. + There is value in a simpler approach (as provided by LOCALIO) that + allows the NFS client to negotiate and leverage locality without + requiring more elaborate modeling and discovery of such locality in a + more centralized manner. + +6. Why is having the client perform a server-side file OPEN, without + using RPC, beneficial? Is the benefit pNFS specific? + + Avoiding the use of XDR and RPC for file opens is beneficial to + performance regardless of whether pNFS is used. Especially when + dealing with small files its best to avoid going over the wire + whenever possible, otherwise it could reduce or even negate the + benefits of avoiding the wire for doing the small file I/O itself. + Given LOCALIO's requirements the current approach of having the + client perform a server-side file open, without using RPC, is ideal. + If in the future requirements change then we can adapt accordingly. + +7. Why is LOCALIO only supported with UNIX Authentication (AUTH_UNIX)? + + Strong authentication is usually tied to the connection itself. It + works by establishing a context that is cached by the server, and + that acts as the key for discovering the authorisation token, which + can then be passed to rpc.mountd to complete the authentication + process. On the other hand, in the case of AUTH_UNIX, the credential + that was passed over the wire is used directly as the key in the + upcall to rpc.mountd. This simplifies the authentication process, and + so makes AUTH_UNIX easier to support. + +8. How do export options that translate RPC user IDs behave for LOCALIO + operations (eg. root_squash, all_squash)? + + Export options that translate user IDs are managed by nfsd_setuser() + which is called by nfsd_setuser_and_check_port() which is called by + __fh_verify(). So they get handled exactly the same way for LOCALIO + as they do for non-LOCALIO. + +9. How does LOCALIO make certain that object lifetimes are managed + properly given NFSD and NFS operate in different contexts? + + See the detailed "NFS Client and Server Interlock" section below. + +RPC +=== + +The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL" +RPC method that allows the Linux NFS client to verify the local Linux +NFS server can see the nonce (single-use UUID) the client generated and +made available in nfs_common. This protocol isn't part of an IETF +standard, nor does it need to be considering it is Linux-to-Linux +auxiliary RPC protocol that amounts to an implementation detail. + +The UUID_IS_LOCAL method encodes the client generated uuid_t in terms of +the fixed UUID_SIZE (16 bytes). The fixed size opaque encode and decode +XDR methods are used instead of the less efficient variable sized +methods. + +The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned +by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ ): +Linux Kernel Organization 400122 nfslocalio + +The LOCALIO protocol spec in rpcgen syntax is:: + + /* raw RFC 9562 UUID */ + #define UUID_SIZE 16 + typedef u8 uuid_t<UUID_SIZE>; + + program NFS_LOCALIO_PROGRAM { + version LOCALIO_V1 { + void + NULL(void) = 0; + + void + UUID_IS_LOCAL(uuid_t) = 1; + } = 1; + } = 400122; + +LOCALIO uses the same transport connection as NFS traffic. As such, +LOCALIO is not registered with rpcbind. + +NFS Common and Client/Server Handshake +====================================== + +fs/nfs_common/nfslocalio.c provides interfaces that enable an NFS client +to generate a nonce (single-use UUID) and associated short-lived +nfs_uuid_t struct, register it with nfs_common for subsequent lookup and +verification by the NFS server and if matched the NFS server populates +members in the nfs_uuid_t struct. The NFS client then uses nfs_common to +transfer the nfs_uuid_t from its nfs_uuids to the nn->nfsd_serv +clients_list from the nfs_common's uuids_list. See: +fs/nfs/localio.c:nfs_local_probe() + +nfs_common's nfs_uuids list is the basis for LOCALIO enablement, as such +it has members that point to nfsd memory for direct use by the client +(e.g. 'net' is the server's network namespace, through it the client can +access nn->nfsd_serv with proper rcu read access). It is this client +and server synchronization that enables advanced usage and lifetime of +objects to span from the host kernel's nfsd to per-container knfsd +instances that are connected to nfs client's running on the same local +host. + +NFS Client and Server Interlock +=============================== + +LOCALIO provides the nfs_uuid_t object and associated interfaces to +allow proper network namespace (net-ns) and NFSD object refcounting: + + We don't want to keep a long-term counted reference on each NFSD's + net-ns in the client because that prevents a server container from + completely shutting down. + + So we avoid taking a reference at all and rely on the per-cpu + reference to the server (detailed below) being sufficient to keep + the net-ns active. This involves allowing the NFSD's net-ns exit + code to iterate all active clients and clear their ->net pointers + (which are needed to find the per-cpu-refcount for the nfsd_serv). + + Details: + + - Embed nfs_uuid_t in nfs_client. nfs_uuid_t provides a list_head + that can be used to find the client. It does add the 16-byte + uuid_t to nfs_client so it is bigger than needed (given that + uuid_t is only used during the initial NFS client and server + LOCALIO handshake to determine if they are local to each other). + If that is really a problem we can find a fix. + + - When the nfs server confirms that the uuid_t is local, it moves + the nfs_uuid_t onto a per-net-ns list in NFSD's nfsd_net. + + - When each server's net-ns is shutting down - in a "pre_exit" + handler, all these nfs_uuid_t have their ->net cleared. There is + an rcu_synchronize() call between pre_exit() handlers and exit() + handlers so any caller that sees nfs_uuid_t ->net as not NULL can + safely manage the per-cpu-refcount for nfsd_serv. + + - The client's nfs_uuid_t is passed to nfsd_open_local_fh() so it + can safely dereference ->net in a private rcu_read_lock() section + to allow safe access to the associated nfsd_net and nfsd_serv. + +So LOCALIO required the introduction and use of NFSD's percpu_ref to +interlock nfsd_destroy_serv() and nfsd_open_local_fh(), to ensure each +nn->nfsd_serv is not destroyed while in use by nfsd_open_local_fh(), and +warrants a more detailed explanation: + + nfsd_open_local_fh() uses nfsd_serv_try_get() before opening its + nfsd_file handle and then the caller (NFS client) must drop the + reference for the nfsd_file and associated nn->nfsd_serv using + nfs_file_put_local() once it has completed its IO. + + This interlock working relies heavily on nfsd_open_local_fh() being + afforded the ability to safely deal with the possibility that the + NFSD's net-ns (and nfsd_net by association) may have been destroyed + by nfsd_destroy_serv() via nfsd_shutdown_net() -- which is only + possible given the nfs_uuid_t ->net pointer managemenet detailed + above. + +All told, this elaborate interlock of the NFS client and server has been +verified to fix an easy to hit crash that would occur if an NFSD +instance running in a container, with a LOCALIO client mounted, is +shutdown. Upon restart of the container and associated NFSD the client +would go on to crash due to NULL pointer dereference that occurred due +to the LOCALIO client's attempting to nfsd_open_local_fh(), using +nn->nfsd_serv, without having a proper reference on nn->nfsd_serv. + +NFS Client issues IO instead of Server +====================================== + +Because LOCALIO is focused on protocol bypass to achieve improved IO +performance, alternatives to the traditional NFS wire protocol (SUNRPC +with XDR) must be provided to access the backing filesystem. + +See fs/nfs/localio.c:nfs_local_open_fh() and +fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes +focused use of select nfs server objects to allow a client local to a +server to open a file pointer without needing to go over the network. + +The client's fs/nfs/localio.c:nfs_local_open_fh() will call into the +server's fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access +both the associated nfsd network namespace and nn->nfsd_serv in terms of +RCU. If nfsd_open_local_fh() finds that the client no longer sees valid +nfsd objects (be it struct net or nn->nfsd_serv) it returns -ENXIO +to nfs_local_open_fh() and the client will try to reestablish the +LOCALIO resources needed by calling nfs_local_probe() again. This +recovery is needed if/when an nfsd instance running in a container were +to reboot while a LOCALIO client is connected to it. + +Once the client has an open nfsd_file pointer it will issue reads, +writes and commits directly to the underlying local filesystem (normally +done by the nfs server). As such, for these operations, the NFS client +is issuing IO to the underlying local filesystem that it is sharing with +the NFS server. See: fs/nfs/localio.c:nfs_local_doio() and +fs/nfs/localio.c:nfs_local_commit(). + +Security +======== + +Localio is only supported when UNIX-style authentication (AUTH_UNIX, aka +AUTH_SYS) is used. + +Care is taken to ensure the same NFS security mechanisms are used +(authentication, etc) regardless of whether LOCALIO or regular NFS +access is used. The auth_domain established as part of the traditional +NFS client access to the NFS server is also used for LOCALIO. + +Relative to containers, LOCALIO gives the client access to the network +namespace the server has. This is required to allow the client to access +the server's per-namespace nfsd_net struct. With traditional NFS, the +client is afforded this same level of access (albeit in terms of the NFS +protocol via SUNRPC). No other namespaces (user, mount, etc) have been +altered or purposely extended from the server to the client. + +Testing +======= + +The LOCALIO auxiliary protocol and associated NFS LOCALIO read, write +and commit access have proven stable against various test scenarios: + +- Client and server both on the same host. + +- All permutations of client and server support enablement for both + local and remote client and server. + +- Testing against NFS storage products that don't support the LOCALIO + protocol was also performed. + +- Client on host, server within a container (for both v3 and v4.2). + The container testing was in terms of podman managed containers and + includes successful container stop/restart scenario. + +- Formalizing these test scenarios in terms of existing test + infrastructure is on-going. Initial regular coverage is provided in + terms of ktest running xfstests against a LOCALIO-enabled NFS loopback + mount configuration, and includes lockdep and KASAN coverage, see: + https://evilpiepirate.org/~testdashboard/ci?user=snitzer&branch=snitm-nfs-next + https://github.com/koverstreet/ktest + +- Various kdevops testing (in terms of "Chuck's BuildBot") has been + performed to regularly verify the LOCALIO changes haven't caused any + regressions to non-LOCALIO NFS use cases. + +- All of Hammerspace's various sanity tests pass with LOCALIO enabled + (this includes numerous pNFS and flexfiles tests). diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst index 165514401441..343644712340 100644 --- a/Documentation/filesystems/overlayfs.rst +++ b/Documentation/filesystems/overlayfs.rst @@ -367,8 +367,11 @@ Metadata only copy up When the "metacopy" feature is enabled, overlayfs will only copy up metadata (as opposed to whole file), when a metadata specific operation -like chown/chmod is performed. Full file will be copied up later when -file is opened for WRITE operation. +like chown/chmod is performed. An upper file in this state is marked with +"trusted.overlayfs.metacopy" xattr which indicates that the upper file +contains no data. The data will be copied up later when file is opened for +WRITE operation. After the lower file's data is copied up, +the "trusted.overlayfs.metacopy" xattr is removed from the upper file. In other words, this is delayed data copy up operation and data is copied up when there is a need to actually modify data. diff --git a/Documentation/filesystems/smb/ksmbd.rst b/Documentation/filesystems/smb/ksmbd.rst index 6b30e43a0d11..67cb68ea6e68 100644 --- a/Documentation/filesystems/smb/ksmbd.rst +++ b/Documentation/filesystems/smb/ksmbd.rst @@ -13,7 +13,7 @@ KSMBD architecture The subset of performance related operations belong in kernelspace and the other subset which belong to operations which are not really related with performance in userspace. So, DCE/RPC management that has historically resulted -into number of buffer overflow issues and dangerous security bugs and user +into a number of buffer overflow issues and dangerous security bugs and user account management are implemented in user space as ksmbd.mountd. File operations that are related with performance (open/read/write/close etc.) in kernel space (ksmbd). This also allows for easier integration with VFS @@ -24,8 +24,8 @@ ksmbd (kernel daemon) When the server daemon is started, It starts up a forker thread (ksmbd/interface name) at initialization time and open a dedicated port 445 -for listening to SMB requests. Whenever new clients make request, Forker -thread will accept the client connection and fork a new thread for dedicated +for listening to SMB requests. Whenever new clients make a request, the Forker +thread will accept the client connection and fork a new thread for a dedicated communication channel between the client and the server. It allows for parallel processing of SMB requests(commands) from clients as well as allowing for new clients to make new connections. Each instance is named ksmbd/1~n(port number) @@ -34,12 +34,12 @@ thread can decide to pass through the commands to the user space (ksmbd.mountd), currently DCE/RPC commands are identified to be handled through the user space. To further utilize the linux kernel, it has been chosen to process the commands as workitems and to be executed in the handlers of the ksmbd-io kworker threads. -It allows for multiplexing of the handlers as the kernel take care of initiating +It allows for multiplexing of the handlers as the kernel takes care of initiating extra worker threads if the load is increased and vice versa, if the load is -decreased it destroys the extra worker threads. So, after connection is -established with client. Dedicated ksmbd/1..n(port number) takes complete +decreased it destroys the extra worker threads. So, after the connection is +established with the client. Dedicated ksmbd/1..n(port number) takes complete ownership of receiving/parsing of SMB commands. Each received command is worked -in parallel i.e., There can be multiple clients commands which are worked in +in parallel i.e., there can be multiple client commands which are worked in parallel. After receiving each command a separated kernel workitem is prepared for each command which is further queued to be handled by ksmbd-io kworkers. So, each SMB workitem is queued to the kworkers. This allows the benefit of load @@ -49,9 +49,9 @@ performance by handling client commands in parallel. ksmbd.mountd (user space daemon) -------------------------------- -ksmbd.mountd is userspace process to, transfer user account and password that +ksmbd.mountd is a userspace process to, transfer the user account and password that are registered using ksmbd.adduser (part of utils for user space). Further it -allows sharing information parameters that parsed from smb.conf to ksmbd in +allows sharing information parameters that are parsed from smb.conf to ksmbd in kernel. For the execution part it has a daemon which is continuously running and connected to the kernel interface using netlink socket, it waits for the requests (dcerpc and share/user info). It handles RPC calls (at a minimum few @@ -124,7 +124,7 @@ How to run 1. Download ksmbd-tools(https://github.com/cifsd-team/ksmbd-tools/releases) and compile them. - - Refer README(https://github.com/cifsd-team/ksmbd-tools/blob/master/README.md) + - Refer to README(https://github.com/cifsd-team/ksmbd-tools/blob/master/README.md) to know how to use ksmbd.mountd/adduser/addshare/control utils $ ./autogen.sh @@ -133,7 +133,7 @@ How to run 2. Create /usr/local/etc/ksmbd/ksmbd.conf file, add SMB share in ksmbd.conf file. - - Refer ksmbd.conf.example in ksmbd-utils, See ksmbd.conf manpage + - Refer to ksmbd.conf.example in ksmbd-utils, See ksmbd.conf manpage for details to configure shares. $ man ksmbd.conf @@ -145,7 +145,7 @@ How to run $ man ksmbd.adduser $ sudo ksmbd.adduser -a <Enter USERNAME for SMB share access> -4. Insert ksmbd.ko module after build your kernel. No need to load module +4. Insert the ksmbd.ko module after you build your kernel. No need to load the module if ksmbd is built into the kernel. - Set ksmbd in menuconfig(e.g. $ make menuconfig) @@ -175,7 +175,7 @@ Each layer 1. Enable all component prints # sudo ksmbd.control -d "all" -2. Enable one of components (smb, auth, vfs, oplock, ipc, conn, rdma) +2. Enable one of the components (smb, auth, vfs, oplock, ipc, conn, rdma) # sudo ksmbd.control -d "smb" 3. Show what prints are enabled. diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index 6e903a903f8f..0b18af3f954e 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -810,7 +810,7 @@ cache in your filesystem. The following members are defined: struct page **pagep, void **fsdata); int (*write_end)(struct file *, struct address_space *mapping, loff_t pos, unsigned len, unsigned copied, - struct page *page, void *fsdata); + struct folio *folio, void *fsdata); sector_t (*bmap)(struct address_space *, sector_t); void (*invalidate_folio) (struct folio *, size_t start, size_t len); bool (*release_folio)(struct folio *, gfp_t); @@ -913,8 +913,7 @@ cache in your filesystem. The following members are defined: stop attempting I/O, it can simply return. The caller will remove the remaining pages from the address space, unlock them and decrement the page refcount. Set PageUptodate if the I/O - completes successfully. Setting PageError on any page will be - ignored; simply unlock the page if an I/O error occurs. + completes successfully. ``write_begin`` Called by the generic buffered write code to ask the filesystem @@ -926,12 +925,12 @@ cache in your filesystem. The following members are defined: (if they haven't been read already) so that the updated blocks can be written out properly. - The filesystem must return the locked pagecache page for the - specified offset, in ``*pagep``, for the caller to write into. + The filesystem must return the locked pagecache folio for the + specified offset, in ``*foliop``, for the caller to write into. It must be able to cope with short writes (where the length passed to write_begin is greater than the number of bytes copied - into the page). + into the folio). A void * may be returned in fsdata, which then gets passed into write_end. @@ -944,8 +943,8 @@ cache in your filesystem. The following members are defined: called. len is the original len passed to write_begin, and copied is the amount that was able to be copied. - The filesystem must take care of unlocking the page and - releasing it refcount, and updating i_size. + The filesystem must take care of unlocking the folio, + decrementing its refcount, and updating i_size. Returns < 0 on failure, otherwise the number of bytes (<= 'copied') that were able to be copied into pagecache. |