aboutsummaryrefslogtreecommitdiff
path: root/Documentation/filesystems
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--Documentation/filesystems/bcachefs/errorcodes.rst30
-rw-r--r--Documentation/filesystems/bcachefs/index.rst11
-rw-r--r--Documentation/filesystems/directory-locking.rst349
-rw-r--r--Documentation/filesystems/efivarfs.rst2
-rw-r--r--Documentation/filesystems/erofs.rst4
-rw-r--r--Documentation/filesystems/f2fs.rst54
-rw-r--r--Documentation/filesystems/files.rst2
-rw-r--r--Documentation/filesystems/fscrypt.rst48
-rw-r--r--Documentation/filesystems/fuse-io.rst3
-rw-r--r--Documentation/filesystems/index.rst7
-rw-r--r--Documentation/filesystems/locking.rst11
-rw-r--r--Documentation/filesystems/netfs_library.rst23
-rw-r--r--Documentation/filesystems/nfs/exporting.rst7
-rw-r--r--Documentation/filesystems/ntfs.rst466
-rw-r--r--Documentation/filesystems/overlayfs.rst152
-rw-r--r--Documentation/filesystems/porting.rst86
-rw-r--r--Documentation/filesystems/proc.rst10
-rw-r--r--Documentation/filesystems/smb/ksmbd.rst9
-rw-r--r--Documentation/filesystems/squashfs.rst60
-rw-r--r--Documentation/filesystems/vfs.rst24
-rw-r--r--Documentation/filesystems/xfs/index.rst14
-rw-r--r--Documentation/filesystems/xfs/xfs-delayed-logging-design.rst (renamed from Documentation/filesystems/xfs-delayed-logging-design.rst)0
-rw-r--r--Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst (renamed from Documentation/filesystems/xfs-maintainer-entry-profile.rst)0
-rw-r--r--Documentation/filesystems/xfs/xfs-online-fsck-design.rst (renamed from Documentation/filesystems/xfs-online-fsck-design.rst)32
-rw-r--r--Documentation/filesystems/xfs/xfs-self-describing-metadata.rst (renamed from Documentation/filesystems/xfs-self-describing-metadata.rst)0
25 files changed, 652 insertions, 752 deletions
diff --git a/Documentation/filesystems/bcachefs/errorcodes.rst b/Documentation/filesystems/bcachefs/errorcodes.rst
new file mode 100644
index 000000000000..2cccaa0ba7cd
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/errorcodes.rst
@@ -0,0 +1,30 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+bcachefs private error codes
+----------------------------
+
+In bcachefs, as a hard rule we do not throw or directly use standard error
+codes (-EINVAL, -EBUSY, etc.). Instead, we define private error codes as needed
+in fs/bcachefs/errcode.h.
+
+This gives us much better error messages and makes debugging much easier. Any
+direct uses of standard error codes you see in the source code are simply old
+code that has yet to be converted - feel free to clean it up!
+
+Private error codes may subtype another error code, this allows for grouping of
+related errors that should be handled similarly (e.g. transaction restart
+errors), as well as specifying which standard error code should be returned at
+the bcachefs module boundary.
+
+At the module boundary, we use bch2_err_class() to convert to a standard error
+code; this also emits a trace event so that the original error code be
+recovered even if it wasn't logged.
+
+Do not reuse error codes! Generally speaking, a private error code should only
+be thrown in one place. That means that when we see it in a log message we can
+see, unambiguously, exactly which file and line number it was returned from.
+
+Try to give error codes names that are as reasonably descriptive of the error
+as possible. Frequently, the error will be logged at a place far removed from
+where the error was generated; good names for error codes mean much more
+descriptive and useful error messages.
diff --git a/Documentation/filesystems/bcachefs/index.rst b/Documentation/filesystems/bcachefs/index.rst
new file mode 100644
index 000000000000..e2bd61ccd96f
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/index.rst
@@ -0,0 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+bcachefs Documentation
+======================
+
+.. toctree::
+ :maxdepth: 2
+ :numbered:
+
+ errorcodes
diff --git a/Documentation/filesystems/directory-locking.rst b/Documentation/filesystems/directory-locking.rst
index dccd61c7c5c3..6fdf0b02df43 100644
--- a/Documentation/filesystems/directory-locking.rst
+++ b/Documentation/filesystems/directory-locking.rst
@@ -11,129 +11,268 @@ When taking the i_rwsem on multiple non-directory objects, we
always acquire the locks in order by increasing address. We'll call
that "inode pointer" order in the following.
-For our purposes all operations fall in 5 classes:
-1) read access. Locking rules: caller locks directory we are accessing.
-The lock is taken shared.
+Primitives
+==========
-2) object creation. Locking rules: same as above, but the lock is taken
-exclusive.
+For our purposes all operations fall in 6 classes:
-3) object removal. Locking rules: caller locks parent, finds victim,
-locks victim and calls the method. Locks are exclusive.
+1. read access. Locking rules:
-4) rename() that is _not_ cross-directory. Locking rules: caller locks the
-parent and finds source and target. We lock both (provided they exist). If we
-need to lock two inodes of different type (dir vs non-dir), we lock directory
-first. If we need to lock two inodes of the same type, lock them in inode
-pointer order. Then call the method. All locks are exclusive.
-NB: we might get away with locking the source (and target in exchange
-case) shared.
+ * lock the directory we are accessing (shared)
-5) link creation. Locking rules:
+2. object creation. Locking rules:
- * lock parent
- * check that source is not a directory
- * lock source
- * call the method.
+ * lock the directory we are accessing (exclusive)
-All locks are exclusive.
+3. object removal. Locking rules:
-6) cross-directory rename. The trickiest in the whole bunch. Locking
-rules:
+ * lock the parent (exclusive)
+ * find the victim
+ * lock the victim (exclusive)
- * lock the filesystem
- * lock parents in "ancestors first" order. If one is not ancestor of
- the other, lock them in inode pointer order.
- * find source and target.
- * if old parent is equal to or is a descendent of target
- fail with -ENOTEMPTY
- * if new parent is equal to or is a descendent of source
- fail with -ELOOP
- * Lock both the source and the target provided they exist. If we
- need to lock two inodes of different type (dir vs non-dir), we lock
- the directory first. If we need to lock two inodes of the same type,
- lock them in inode pointer order.
- * call the method.
-
-All ->i_rwsem are taken exclusive. Again, we might get away with locking
-the source (and target in exchange case) shared.
-
-The rules above obviously guarantee that all directories that are going to be
-read, modified or removed by method will be locked by caller.
+4. link creation. Locking rules:
+
+ * lock the parent (exclusive)
+ * check that the source is not a directory
+ * lock the source (exclusive; probably could be weakened to shared)
+
+5. rename that is _not_ cross-directory. Locking rules:
+
+ * lock the parent (exclusive)
+ * find the source and target
+ * decide which of the source and target need to be locked.
+ The source needs to be locked if it's a non-directory, target - if it's
+ a non-directory or about to be removed.
+ * take the locks that need to be taken (exclusive), in inode pointer order
+ if need to take both (that can happen only when both source and target
+ are non-directories - the source because it wouldn't need to be locked
+ otherwise and the target because mixing directory and non-directory is
+ allowed only with RENAME_EXCHANGE, and that won't be removing the target).
+6. cross-directory rename. The trickiest in the whole bunch. Locking rules:
+
+ * lock the filesystem
+ * if the parents don't have a common ancestor, fail the operation.
+ * lock the parents in "ancestors first" order (exclusive). If neither is an
+ ancestor of the other, lock the parent of source first.
+ * find the source and target.
+ * verify that the source is not a descendent of the target and
+ target is not a descendent of source; fail the operation otherwise.
+ * lock the subdirectories involved (exclusive), source before target.
+ * lock the non-directories involved (exclusive), in inode pointer order.
+
+The rules above obviously guarantee that all directories that are going
+to be read, modified or removed by method will be locked by the caller.
+
+
+Splicing
+========
+
+There is one more thing to consider - splicing. It's not an operation
+in its own right; it may happen as part of lookup. We speak of the
+operations on directory trees, but we obviously do not have the full
+picture of those - especially for network filesystems. What we have
+is a bunch of subtrees visible in dcache and locking happens on those.
+Trees grow as we do operations; memory pressure prunes them. Normally
+that's not a problem, but there is a nasty twist - what should we do
+when one growing tree reaches the root of another? That can happen in
+several scenarios, starting from "somebody mounted two nested subtrees
+from the same NFS4 server and doing lookups in one of them has reached
+the root of another"; there's also open-by-fhandle stuff, and there's a
+possibility that directory we see in one place gets moved by the server
+to another and we run into it when we do a lookup.
+
+For a lot of reasons we want to have the same directory present in dcache
+only once. Multiple aliases are not allowed. So when lookup runs into
+a subdirectory that already has an alias, something needs to be done with
+dcache trees. Lookup is already holding the parent locked. If alias is
+a root of separate tree, it gets attached to the directory we are doing a
+lookup in, under the name we'd been looking for. If the alias is already
+a child of the directory we are looking in, it changes name to the one
+we'd been looking for. No extra locking is involved in these two cases.
+However, if it's a child of some other directory, the things get trickier.
+First of all, we verify that it is *not* an ancestor of our directory
+and fail the lookup if it is. Then we try to lock the filesystem and the
+current parent of the alias. If either trylock fails, we fail the lookup.
+If trylocks succeed, we detach the alias from its current parent and
+attach to our directory, under the name we are looking for.
+
+Note that splicing does *not* involve any modification of the filesystem;
+all we change is the view in dcache. Moreover, holding a directory locked
+exclusive prevents such changes involving its children and holding the
+filesystem lock prevents any changes of tree topology, other than having a
+root of one tree becoming a child of directory in another. In particular,
+if two dentries have been found to have a common ancestor after taking
+the filesystem lock, their relationship will remain unchanged until
+the lock is dropped. So from the directory operations' point of view
+splicing is almost irrelevant - the only place where it matters is one
+step in cross-directory renames; we need to be careful when checking if
+parents have a common ancestor.
+
+
+Multiple-filesystem stuff
+=========================
+
+For some filesystems a method can involve a directory operation on
+another filesystem; it may be ecryptfs doing operation in the underlying
+filesystem, overlayfs doing something to the layers, network filesystem
+using a local one as a cache, etc. In all such cases the operations
+on other filesystems must follow the same locking rules. Moreover, "a
+directory operation on this filesystem might involve directory operations
+on that filesystem" should be an asymmetric relation (or, if you will,
+it should be possible to rank the filesystems so that directory operation
+on a filesystem could trigger directory operations only on higher-ranked
+ones - in these terms overlayfs ranks lower than its layers, network
+filesystem ranks lower than whatever it caches on, etc.)
+
+
+Deadlock avoidance
+==================
If no directory is its own ancestor, the scheme above is deadlock-free.
Proof:
- First of all, at any moment we have a linear ordering of the
- objects - A < B iff (A is an ancestor of B) or (B is not an ancestor
- of A and ptr(A) < ptr(B)).
-
- That ordering can change. However, the following is true:
-
-(1) if object removal or non-cross-directory rename holds lock on A and
- attempts to acquire lock on B, A will remain the parent of B until we
- acquire the lock on B. (Proof: only cross-directory rename can change
- the parent of object and it would have to lock the parent).
-
-(2) if cross-directory rename holds the lock on filesystem, order will not
- change until rename acquires all locks. (Proof: other cross-directory
- renames will be blocked on filesystem lock and we don't start changing
- the order until we had acquired all locks).
-
-(3) locks on non-directory objects are acquired only after locks on
- directory objects, and are acquired in inode pointer order.
- (Proof: all operations but renames take lock on at most one
- non-directory object, except renames, which take locks on source and
- target in inode pointer order in the case they are not directories.)
-
-Now consider the minimal deadlock. Each process is blocked on
-attempt to acquire some lock and already holds at least one lock. Let's
-consider the set of contended locks. First of all, filesystem lock is
-not contended, since any process blocked on it is not holding any locks.
-Thus all processes are blocked on ->i_rwsem.
-
-By (3), any process holding a non-directory lock can only be
-waiting on another non-directory lock with a larger address. Therefore
-the process holding the "largest" such lock can always make progress, and
-non-directory objects are not included in the set of contended locks.
-
-Thus link creation can't be a part of deadlock - it can't be
-blocked on source and it means that it doesn't hold any locks.
-
-Any contended object is either held by cross-directory rename or
-has a child that is also contended. Indeed, suppose that it is held by
-operation other than cross-directory rename. Then the lock this operation
-is blocked on belongs to child of that object due to (1).
-
-It means that one of the operations is cross-directory rename.
-Otherwise the set of contended objects would be infinite - each of them
-would have a contended child and we had assumed that no object is its
-own descendent. Moreover, there is exactly one cross-directory rename
-(see above).
-
-Consider the object blocking the cross-directory rename. One
-of its descendents is locked by cross-directory rename (otherwise we
-would again have an infinite set of contended objects). But that
-means that cross-directory rename is taking locks out of order. Due
-to (2) the order hadn't changed since we had acquired filesystem lock.
-But locking rules for cross-directory rename guarantee that we do not
-try to acquire lock on descendent before the lock on ancestor.
-Contradiction. I.e. deadlock is impossible. Q.E.D.
-
+There is a ranking on the locks, such that all primitives take
+them in order of non-decreasing rank. Namely,
+
+ * rank ->i_rwsem of non-directories on given filesystem in inode pointer
+ order.
+ * put ->i_rwsem of all directories on a filesystem at the same rank,
+ lower than ->i_rwsem of any non-directory on the same filesystem.
+ * put ->s_vfs_rename_mutex at rank lower than that of any ->i_rwsem
+ on the same filesystem.
+ * among the locks on different filesystems use the relative
+ rank of those filesystems.
+
+For example, if we have NFS filesystem caching on a local one, we have
+
+ 1. ->s_vfs_rename_mutex of NFS filesystem
+ 2. ->i_rwsem of directories on that NFS filesystem, same rank for all
+ 3. ->i_rwsem of non-directories on that filesystem, in order of
+ increasing address of inode
+ 4. ->s_vfs_rename_mutex of local filesystem
+ 5. ->i_rwsem of directories on the local filesystem, same rank for all
+ 6. ->i_rwsem of non-directories on local filesystem, in order of
+ increasing address of inode.
+
+It's easy to verify that operations never take a lock with rank
+lower than that of an already held lock.
+
+Suppose deadlocks are possible. Consider the minimal deadlocked
+set of threads. It is a cycle of several threads, each blocked on a lock
+held by the next thread in the cycle.
+
+Since the locking order is consistent with the ranking, all
+contended locks in the minimal deadlock will be of the same rank,
+i.e. they all will be ->i_rwsem of directories on the same filesystem.
+Moreover, without loss of generality we can assume that all operations
+are done directly to that filesystem and none of them has actually
+reached the method call.
+
+In other words, we have a cycle of threads, T1,..., Tn,
+and the same number of directories (D1,...,Dn) such that
+
+ T1 is blocked on D1 which is held by T2
+
+ T2 is blocked on D2 which is held by T3
+
+ ...
+
+ Tn is blocked on Dn which is held by T1.
+
+Each operation in the minimal cycle must have locked at least
+one directory and blocked on attempt to lock another. That leaves
+only 3 possible operations: directory removal (locks parent, then
+child), same-directory rename killing a subdirectory (ditto) and
+cross-directory rename of some sort.
+
+There must be a cross-directory rename in the set; indeed,
+if all operations had been of the "lock parent, then child" sort
+we would have Dn a parent of D1, which is a parent of D2, which is
+a parent of D3, ..., which is a parent of Dn. Relationships couldn't
+have changed since the moment directory locks had been acquired,
+so they would all hold simultaneously at the deadlock time and
+we would have a loop.
+
+Since all operations are on the same filesystem, there can't be
+more than one cross-directory rename among them. Without loss of
+generality we can assume that T1 is the one doing a cross-directory
+rename and everything else is of the "lock parent, then child" sort.
+
+In other words, we have a cross-directory rename that locked
+Dn and blocked on attempt to lock D1, which is a parent of D2, which is
+a parent of D3, ..., which is a parent of Dn. Relationships between
+D1,...,Dn all hold simultaneously at the deadlock time. Moreover,
+cross-directory rename does not get to locking any directories until it
+has acquired filesystem lock and verified that directories involved have
+a common ancestor, which guarantees that ancestry relationships between
+all of them had been stable.
+
+Consider the order in which directories are locked by the
+cross-directory rename; parents first, then possibly their children.
+Dn and D1 would have to be among those, with Dn locked before D1.
+Which pair could it be?
+
+It can't be the parents - indeed, since D1 is an ancestor of Dn,
+it would be the first parent to be locked. Therefore at least one of the
+children must be involved and thus neither of them could be a descendent
+of another - otherwise the operation would not have progressed past
+locking the parents.
+
+It can't be a parent and its child; otherwise we would've had
+a loop, since the parents are locked before the children, so the parent
+would have to be a descendent of its child.
+
+It can't be a parent and a child of another parent either.
+Otherwise the child of the parent in question would've been a descendent
+of another child.
+
+That leaves only one possibility - namely, both Dn and D1 are
+among the children, in some order. But that is also impossible, since
+neither of the children is a descendent of another.
+
+That concludes the proof, since the set of operations with the
+properties required for a minimal deadlock can not exist.
+
+Note that the check for having a common ancestor in cross-directory
+rename is crucial - without it a deadlock would be possible. Indeed,
+suppose the parents are initially in different trees; we would lock the
+parent of source, then try to lock the parent of target, only to have
+an unrelated lookup splice a distant ancestor of source to some distant
+descendent of the parent of target. At that point we have cross-directory
+rename holding the lock on parent of source and trying to lock its
+distant ancestor. Add a bunch of rmdir() attempts on all directories
+in between (all of those would fail with -ENOTEMPTY, had they ever gotten
+the locks) and voila - we have a deadlock.
+
+Loop avoidance
+==============
These operations are guaranteed to avoid loop creation. Indeed,
the only operation that could introduce loops is cross-directory rename.
-Since the only new (parent, child) pair added by rename() is (new parent,
-source), such loop would have to contain these objects and the rest of it
-would have to exist before rename(). I.e. at the moment of loop creation
-rename() responsible for that would be holding filesystem lock and new parent
-would have to be equal to or a descendent of source. But that means that
-new parent had been equal to or a descendent of source since the moment when
-we had acquired filesystem lock and rename() would fail with -ELOOP in that
-case.
+Suppose after the operation there is a loop; since there hadn't been such
+loops before the operation, at least on of the nodes in that loop must've
+had its parent changed. In other words, the loop must be passing through
+the source or, in case of exchange, possibly the target.
+
+Since the operation has succeeded, neither source nor target could have
+been ancestors of each other. Therefore the chain of ancestors starting
+in the parent of source could not have passed through the target and
+vice versa. On the other hand, the chain of ancestors of any node could
+not have passed through the node itself, or we would've had a loop before
+the operation. But everything other than source and target has kept
+the parent after the operation, so the operation does not change the
+chains of ancestors of (ex-)parents of source and target. In particular,
+those chains must end after a finite number of steps.
+
+Now consider the loop created by the operation. It passes through either
+source or target; the next node in the loop would be the ex-parent of
+target or source resp. After that the loop would follow the chain of
+ancestors of that parent. But as we have just shown, that chain must
+end after a finite number of steps, which means that it can't be a part
+of any loop. Q.E.D.
While this locking scheme works for arbitrary DAGs, it relies on
ability to check that directory is a descendent of another object. Current
diff --git a/Documentation/filesystems/efivarfs.rst b/Documentation/filesystems/efivarfs.rst
index 0551985821b8..f646c3f0980f 100644
--- a/Documentation/filesystems/efivarfs.rst
+++ b/Documentation/filesystems/efivarfs.rst
@@ -40,4 +40,4 @@ accidentally.
*See also:*
- Documentation/admin-guide/acpi/ssdt-overlays.rst
-- Documentation/ABI/stable/sysfs-firmware-efi-vars
+- Documentation/ABI/removed/sysfs-firmware-efi-vars
diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst
index 57c6ae23b3fc..cc4626d6ee4f 100644
--- a/Documentation/filesystems/erofs.rst
+++ b/Documentation/filesystems/erofs.rst
@@ -91,6 +91,10 @@ compatibility checking tool (fsck.erofs), and a debugging tool (dump.erofs):
- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git
+For more information, please also refer to the documentation site:
+
+- https://erofs.docs.kernel.org
+
Bugs and patches are welcome, please kindly help us and send to the following
linux-erofs mailing list:
diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
index d32c6209685d..efc3493fd6f8 100644
--- a/Documentation/filesystems/f2fs.rst
+++ b/Documentation/filesystems/f2fs.rst
@@ -126,9 +126,7 @@ norecovery Disable the roll-forward recovery routine, mounted read-
discard/nodiscard Enable/disable real-time discard in f2fs, if discard is
enabled, f2fs will issue discard/TRIM commands when a
segment is cleaned.
-no_heap Disable heap-style segment allocation which finds free
- segments for data from the beginning of main area, while
- for node from the end of main area.
+heap/no_heap Deprecated.
nouser_xattr Disable Extended User Attributes. Note: xattr is enabled
by default if CONFIG_F2FS_FS_XATTR is selected.
noacl Disable POSIX Access Control List. Note: acl is enabled
@@ -184,29 +182,31 @@ fault_type=%d Support configuring fault injection type, should be
enabled with fault_injection option, fault type value
is shown below, it supports single or combined type.
- =================== ===========
- Type_Name Type_Value
- =================== ===========
- FAULT_KMALLOC 0x000000001
- FAULT_KVMALLOC 0x000000002
- FAULT_PAGE_ALLOC 0x000000004
- FAULT_PAGE_GET 0x000000008
- FAULT_ALLOC_BIO 0x000000010 (obsolete)
- FAULT_ALLOC_NID 0x000000020
- FAULT_ORPHAN 0x000000040
- FAULT_BLOCK 0x000000080
- FAULT_DIR_DEPTH 0x000000100
- FAULT_EVICT_INODE 0x000000200
- FAULT_TRUNCATE 0x000000400
- FAULT_READ_IO 0x000000800
- FAULT_CHECKPOINT 0x000001000
- FAULT_DISCARD 0x000002000
- FAULT_WRITE_IO 0x000004000
- FAULT_SLAB_ALLOC 0x000008000
- FAULT_DQUOT_INIT 0x000010000
- FAULT_LOCK_OP 0x000020000
- FAULT_BLKADDR 0x000040000
- =================== ===========
+ =========================== ===========
+ Type_Name Type_Value
+ =========================== ===========
+ FAULT_KMALLOC 0x000000001
+ FAULT_KVMALLOC 0x000000002
+ FAULT_PAGE_ALLOC 0x000000004
+ FAULT_PAGE_GET 0x000000008
+ FAULT_ALLOC_BIO 0x000000010 (obsolete)
+ FAULT_ALLOC_NID 0x000000020
+ FAULT_ORPHAN 0x000000040
+ FAULT_BLOCK 0x000000080
+ FAULT_DIR_DEPTH 0x000000100
+ FAULT_EVICT_INODE 0x000000200
+ FAULT_TRUNCATE 0x000000400
+ FAULT_READ_IO 0x000000800
+ FAULT_CHECKPOINT 0x000001000
+ FAULT_DISCARD 0x000002000
+ FAULT_WRITE_IO 0x000004000
+ FAULT_SLAB_ALLOC 0x000008000
+ FAULT_DQUOT_INIT 0x000010000
+ FAULT_LOCK_OP 0x000020000
+ FAULT_BLKADDR_VALIDITY 0x000040000
+ FAULT_BLKADDR_CONSISTENCE 0x000080000
+ FAULT_NO_SEGMENT 0x000100000
+ =========================== ===========
mode=%s Control block allocation mode which supports "adaptive"
and "lfs". In "lfs" mode, there should be no random
writes towards main area.
@@ -228,8 +228,6 @@ mode=%s Control block allocation mode which supports "adaptive"
option for more randomness.
Please, use these options for your experiments and we strongly
recommend to re-format the filesystem after using these options.
-io_bits=%u Set the bit size of write IO requests. It should be set
- with "mode=lfs".
usrquota Enable plain user disk quota accounting.
grpquota Enable plain group disk quota accounting.
prjquota Enable plain project quota accounting.
diff --git a/Documentation/filesystems/files.rst b/Documentation/filesystems/files.rst
index 9e38e4c221ca..eb770f891b27 100644
--- a/Documentation/filesystems/files.rst
+++ b/Documentation/filesystems/files.rst
@@ -116,7 +116,7 @@ before and after the reference count increment. This pattern can be seen
in get_file_rcu() and __files_get_rcu().
In addition, it isn't possible to access or check fields in struct file
-without first aqcuiring a reference on it under rcu lookup. Not doing
+without first acquiring a reference on it under rcu lookup. Not doing
that was always very dodgy and it was only usable for non-pointer data
in struct file. With SLAB_TYPESAFE_BY_RCU it is necessary that callers
either first acquire a reference or they must hold the files_lock of the
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index 1b84f818e574..04eaab01314b 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -31,15 +31,15 @@ However, except for filenames, fscrypt does not encrypt filesystem
metadata.
Unlike eCryptfs, which is a stacked filesystem, fscrypt is integrated
-directly into supported filesystems --- currently ext4, F2FS, and
-UBIFS. This allows encrypted files to be read and written without
-caching both the decrypted and encrypted pages in the pagecache,
-thereby nearly halving the memory used and bringing it in line with
-unencrypted files. Similarly, half as many dentries and inodes are
-needed. eCryptfs also limits encrypted filenames to 143 bytes,
-causing application compatibility issues; fscrypt allows the full 255
-bytes (NAME_MAX). Finally, unlike eCryptfs, the fscrypt API can be
-used by unprivileged users, with no need to mount anything.
+directly into supported filesystems --- currently ext4, F2FS, UBIFS,
+and CephFS. This allows encrypted files to be read and written
+without caching both the decrypted and encrypted pages in the
+pagecache, thereby nearly halving the memory used and bringing it in
+line with unencrypted files. Similarly, half as many dentries and
+inodes are needed. eCryptfs also limits encrypted filenames to 143
+bytes, causing application compatibility issues; fscrypt allows the
+full 255 bytes (NAME_MAX). Finally, unlike eCryptfs, the fscrypt API
+can be used by unprivileged users, with no need to mount anything.
fscrypt does not support encrypting files in-place. Instead, it
supports marking an empty directory as encrypted. Then, after
@@ -338,11 +338,14 @@ Supported modes
Currently, the following pairs of encryption modes are supported:
-- AES-256-XTS for contents and AES-256-CTS-CBC for filenames
+- AES-256-XTS for contents and AES-256-CBC-CTS for filenames
- AES-256-XTS for contents and AES-256-HCTR2 for filenames
- Adiantum for both contents and filenames
-- AES-128-CBC-ESSIV for contents and AES-128-CTS-CBC for filenames
-- SM4-XTS for contents and SM4-CTS-CBC for filenames
+- AES-128-CBC-ESSIV for contents and AES-128-CBC-CTS for filenames
+- SM4-XTS for contents and SM4-CBC-CTS for filenames
+
+Note: in the API, "CBC" means CBC-ESSIV, and "CTS" means CBC-CTS.
+So, for example, FSCRYPT_MODE_AES_256_CTS means AES-256-CBC-CTS.
Authenticated encryption modes are not currently supported because of
the difficulty of dealing with ciphertext expansion. Therefore,
@@ -351,11 +354,11 @@ contents encryption uses a block cipher in `XTS mode
`CBC-ESSIV mode
<https://en.wikipedia.org/wiki/Disk_encryption_theory#Encrypted_salt-sector_initialization_vector_(ESSIV)>`_,
or a wide-block cipher. Filenames encryption uses a
-block cipher in `CTS-CBC mode
+block cipher in `CBC-CTS mode
<https://en.wikipedia.org/wiki/Ciphertext_stealing>`_ or a wide-block
cipher.
-The (AES-256-XTS, AES-256-CTS-CBC) pair is the recommended default.
+The (AES-256-XTS, AES-256-CBC-CTS) pair is the recommended default.
It is also the only option that is *guaranteed* to always be supported
if the kernel supports fscrypt at all; see `Kernel config options`_.
@@ -364,7 +367,7 @@ upgrades the filenames encryption to use a wide-block cipher. (A
*wide-block cipher*, also called a tweakable super-pseudorandom
permutation, has the property that changing one bit scrambles the
entire result.) As described in `Filenames encryption`_, a wide-block
-cipher is the ideal mode for the problem domain, though CTS-CBC is the
+cipher is the ideal mode for the problem domain, though CBC-CTS is the
"least bad" choice among the alternatives. For more information about
HCTR2, see `the HCTR2 paper <https://eprint.iacr.org/2021/1441.pdf>`_.
@@ -375,13 +378,13 @@ the work is done by XChaCha12, which is much faster than AES when AES
acceleration is unavailable. For more information about Adiantum, see
`the Adiantum paper <https://eprint.iacr.org/2018/720.pdf>`_.
-The (AES-128-CBC-ESSIV, AES-128-CTS-CBC) pair exists only to support
+The (AES-128-CBC-ESSIV, AES-128-CBC-CTS) pair exists only to support
systems whose only form of AES acceleration is an off-CPU crypto
accelerator such as CAAM or CESA that does not support XTS.
The remaining mode pairs are the "national pride ciphers":
-- (SM4-XTS, SM4-CTS-CBC)
+- (SM4-XTS, SM4-CBC-CTS)
Generally speaking, these ciphers aren't "bad" per se, but they
receive limited security review compared to the usual choices such as
@@ -393,7 +396,7 @@ Kernel config options
Enabling fscrypt support (CONFIG_FS_ENCRYPTION) automatically pulls in
only the basic support from the crypto API needed to use AES-256-XTS
-and AES-256-CTS-CBC encryption. For optimal performance, it is
+and AES-256-CBC-CTS encryption. For optimal performance, it is
strongly recommended to also enable any available platform-specific
kconfig options that provide acceleration for the algorithm(s) you
wish to use. Support for any "non-default" encryption modes typically
@@ -407,7 +410,7 @@ kernel crypto API (see `Inline encryption support`_); in that case,
the file contents mode doesn't need to supported in the kernel crypto
API, but the filenames mode still does.
-- AES-256-XTS and AES-256-CTS-CBC
+- AES-256-XTS and AES-256-CBC-CTS
- Recommended:
- arm64: CONFIG_CRYPTO_AES_ARM64_CE_BLK
- x86: CONFIG_CRYPTO_AES_NI_INTEL
@@ -433,7 +436,7 @@ API, but the filenames mode still does.
- x86: CONFIG_CRYPTO_NHPOLY1305_SSE2
- x86: CONFIG_CRYPTO_NHPOLY1305_AVX2
-- AES-128-CBC-ESSIV and AES-128-CTS-CBC:
+- AES-128-CBC-ESSIV and AES-128-CBC-CTS:
- Mandatory:
- CONFIG_CRYPTO_ESSIV
- CONFIG_CRYPTO_SHA256 or another SHA-256 implementation
@@ -521,7 +524,7 @@ alternatively has the file's nonce (for `DIRECT_KEY policies`_) or
inode number (for `IV_INO_LBLK_64 policies`_) included in the IVs.
Thus, IV reuse is limited to within a single directory.
-With CTS-CBC, the IV reuse means that when the plaintext filenames share a
+With CBC-CTS, the IV reuse means that when the plaintext filenames share a
common prefix at least as long as the cipher block size (16 bytes for AES), the
corresponding encrypted filenames will also share a common prefix. This is
undesirable. Adiantum and HCTR2 do not have this weakness, as they are
@@ -1382,7 +1385,8 @@ directory.) These structs are defined as follows::
u8 contents_encryption_mode;
u8 filenames_encryption_mode;
u8 flags;
- u8 __reserved[4];
+ u8 log2_data_unit_size;
+ u8 __reserved[3];
u8 master_key_identifier[FSCRYPT_KEY_IDENTIFIER_SIZE];
u8 nonce[FSCRYPT_FILE_NONCE_SIZE];
};
diff --git a/Documentation/filesystems/fuse-io.rst b/Documentation/filesystems/fuse-io.rst
index 255a368fe534..6464de4266ad 100644
--- a/Documentation/filesystems/fuse-io.rst
+++ b/Documentation/filesystems/fuse-io.rst
@@ -15,7 +15,8 @@ The direct-io mode can be selected with the FOPEN_DIRECT_IO flag in the
FUSE_OPEN reply.
In direct-io mode the page cache is completely bypassed for reads and writes.
-No read-ahead takes place. Shared mmap is disabled.
+No read-ahead takes place. Shared mmap is disabled by default. To allow shared
+mmap, the FUSE_DIRECT_IO_ALLOW_MMAP flag may be enabled in the FUSE_INIT reply.
In cached mode reads may be satisfied from the page cache, and data may be
read-ahead by the kernel to fill the cache. The cache is always kept consistent
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 09cade7eaefc..1f9b4c905a6a 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -69,6 +69,7 @@ Documentation for filesystem implementations.
afs
autofs
autofs-mount-control
+ bcachefs/index
befs
bfs
btrfs
@@ -98,7 +99,6 @@ Documentation for filesystem implementations.
isofs
nilfs2
nfs/index
- ntfs
ntfs3
ocfs2
ocfs2-online-filecheck
@@ -121,8 +121,5 @@ Documentation for filesystem implementations.
udf
virtiofs
vfat
- xfs-delayed-logging-design
- xfs-maintainer-entry-profile
- xfs-self-describing-metadata
- xfs-online-fsck-design
+ xfs/index
zonefs
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 7be2900806c8..e664061ed55d 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -29,7 +29,7 @@ prototypes::
char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);
struct vfsmount *(*d_automount)(struct path *path);
int (*d_manage)(const struct path *, bool);
- struct dentry *(*d_real)(struct dentry *, const struct inode *);
+ struct dentry *(*d_real)(struct dentry *, enum d_real_type type);
locking rules:
@@ -101,7 +101,7 @@ symlink: exclusive
mkdir: exclusive
unlink: exclusive (both)
rmdir: exclusive (both)(see below)
-rename: exclusive (all) (see below)
+rename: exclusive (both parents, some children) (see below)
readlink: no
get_link: no
setattr: exclusive
@@ -123,6 +123,9 @@ get_offset_ctx no
Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
exclusive on victim.
cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
+ ->unlink() and ->rename() have ->i_rwsem exclusive on all non-directories
+ involved.
+ ->rename() has ->i_rwsem exclusive on any subdirectory that changes parent.
See Documentation/filesystems/directory-locking.rst for more detailed discussion
of the locking scheme for directory operations.
@@ -261,7 +264,7 @@ prototypes::
struct folio *src, enum migrate_mode);
int (*launder_folio)(struct folio *);
bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count);
- int (*error_remove_page)(struct address_space *, struct page *);
+ int (*error_remove_folio)(struct address_space *, struct folio *);
int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
int (*swap_deactivate)(struct file *);
int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
@@ -287,7 +290,7 @@ direct_IO:
migrate_folio: yes (both)
launder_folio: yes
is_partially_uptodate: yes
-error_remove_page: yes
+error_remove_folio: yes
swap_activate: no
swap_deactivate: no
swap_rw: yes, unlocks
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
index 48b95d04f72d..4cc657d743f7 100644
--- a/Documentation/filesystems/netfs_library.rst
+++ b/Documentation/filesystems/netfs_library.rst
@@ -295,7 +295,6 @@ through which it can issue requests and negotiate::
struct netfs_request_ops {
void (*init_request)(struct netfs_io_request *rreq, struct file *file);
void (*free_request)(struct netfs_io_request *rreq);
- int (*begin_cache_operation)(struct netfs_io_request *rreq);
void (*expand_readahead)(struct netfs_io_request *rreq);
bool (*clamp_length)(struct netfs_io_subrequest *subreq);
void (*issue_read)(struct netfs_io_subrequest *subreq);
@@ -317,20 +316,6 @@ The operations are as follows:
[Optional] This is called as the request is being deallocated so that the
filesystem can clean up any state it has attached there.
- * ``begin_cache_operation()``
-
- [Optional] This is called to ask the network filesystem to call into the
- cache (if present) to initialise the caching state for this read. The netfs
- library module cannot access the cache directly, so the cache should call
- something like fscache_begin_read_operation() to do this.
-
- The cache gets to store its state in ->cache_resources and must set a table
- of operations of its own there (though of a different type).
-
- This should return 0 on success and an error code otherwise. If an error is
- reported, the operation may proceed anyway, just without local caching (only
- out of memory and interruption errors cause failure here).
-
* ``expand_readahead()``
[Optional] This is called to allow the filesystem to expand the size of a
@@ -460,14 +445,14 @@ When implementing a local cache to be used by the read helpers, two things are
required: some way for the network filesystem to initialise the caching for a
read request and a table of operations for the helpers to call.
-The network filesystem's ->begin_cache_operation() method is called to set up a
-cache and this must call into the cache to do the work. If using fscache, for
-example, the cache would call::
+To begin a cache operation on an fscache object, the following function is
+called::
int fscache_begin_read_operation(struct netfs_io_request *rreq,
struct fscache_cookie *cookie);
-passing in the request pointer and the cookie corresponding to the file.
+passing in the request pointer and the cookie corresponding to the file. This
+fills in the cache resources mentioned below.
The netfs_io_request object contains a place for the cache to hang its
state::
diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
index 198d805d611c..f04ce1215a03 100644
--- a/Documentation/filesystems/nfs/exporting.rst
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -122,12 +122,9 @@ are exportable by setting the s_export_op field in the struct
super_block. This field must point to a "struct export_operations"
struct which has the following members:
- encode_fh (optional)
+ encode_fh (mandatory)
Takes a dentry and creates a filehandle fragment which may later be used
- to find or create a dentry for the same object. The default
- implementation creates a filehandle fragment that encodes a 32bit inode
- and generation number for the inode encoded, and if necessary the
- same information for the parent.
+ to find or create a dentry for the same object.
fh_to_dentry (mandatory)
Given a filehandle fragment, this should find the implied object and
diff --git a/Documentation/filesystems/ntfs.rst b/Documentation/filesystems/ntfs.rst
deleted file mode 100644
index 5bb093a26485..000000000000
--- a/Documentation/filesystems/ntfs.rst
+++ /dev/null
@@ -1,466 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-================================
-The Linux NTFS filesystem driver
-================================
-
-
-.. Table of contents
-
- - Overview
- - Web site
- - Features
- - Supported mount options
- - Known bugs and (mis-)features
- - Using NTFS volume and stripe sets
- - The Device-Mapper driver
- - The Software RAID / MD driver
- - Limitations when using the MD driver
-
-
-Overview
-========
-
-Linux-NTFS comes with a number of user-space programs known as ntfsprogs.
-These include mkntfs, a full-featured ntfs filesystem format utility,
-ntfsundelete used for recovering files that were unintentionally deleted
-from an NTFS volume and ntfsresize which is used to resize an NTFS partition.
-See the web site for more information.
-
-To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file
-system type 'ntfs'. The driver currently supports read-only mode (with no
-fault-tolerance, encryption or journalling) and very limited, but safe, write
-support.
-
-For fault tolerance and raid support (i.e. volume and stripe sets), you can
-use the kernel's Software RAID / MD driver. See section "Using Software RAID
-with NTFS" for details.
-
-
-Web site
-========
-
-There is plenty of additional information on the linux-ntfs web site
-at http://www.linux-ntfs.org/
-
-The web site has a lot of additional information, such as a comprehensive
-FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS
-userspace utilities, etc.
-
-
-Features
-========
-
-- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and
- earlier kernels. This new driver implements NTFS read support and is
- functionally equivalent to the old ntfs driver and it also implements limited
- write support. The biggest limitation at present is that files/directories
- cannot be created or deleted. See below for the list of write features that
- are so far supported. Another limitation is that writing to compressed files
- is not implemented at all. Also, neither read nor write access to encrypted
- files is so far implemented.
-- The new driver has full support for sparse files on NTFS 3.x volumes which
- the old driver isn't happy with.
-- The new driver supports execution of binaries due to mmap() now being
- supported.
-- The new driver supports loopback mounting of files on NTFS which is used by
- some Linux distributions to enable the user to run Linux from an NTFS
- partition by creating a large file while in Windows and then loopback
- mounting the file while in Linux and creating a Linux filesystem on it that
- is used to install Linux on it.
-- A comparison of the two drivers using::
-
- time find . -type f -exec md5sum "{}" \;
-
- run three times in sequence with each driver (after a reboot) on a 1.4GiB
- NTFS partition, showed the new driver to be 20% faster in total time elapsed
- (from 9:43 minutes on average down to 7:53). The time spent in user space
- was unchanged but the time spent in the kernel was decreased by a factor of
- 2.5 (from 85 CPU seconds down to 33).
-- The driver does not support short file names in general. For backwards
- compatibility, we implement access to files using their short file names if
- they exist. The driver will not create short file names however, and a
- rename will discard any existing short file name.
-- The new driver supports exporting of mounted NTFS volumes via NFS.
-- The new driver supports async io (aio).
-- The new driver supports fsync(2), fdatasync(2), and msync(2).
-- The new driver supports readv(2) and writev(2).
-- The new driver supports access time updates (including mtime and ctime).
-- The new driver supports truncate(2) and open(2) with O_TRUNC. But at present
- only very limited support for highly fragmented files, i.e. ones which have
- their data attribute split across multiple extents, is included. Another
- limitation is that at present truncate(2) will never create sparse files,
- since to mark a file sparse we need to modify the directory entry for the
- file and we do not implement directory modifications yet.
-- The new driver supports write(2) which can both overwrite existing data and
- extend the file size so that you can write beyond the existing data. Also,
- writing into sparse regions is supported and the holes are filled in with
- clusters. But at present only limited support for highly fragmented files,
- i.e. ones which have their data attribute split across multiple extents, is
- included. Another limitation is that write(2) will never create sparse
- files, since to mark a file sparse we need to modify the directory entry for
- the file and we do not implement directory modifications yet.
-
-Supported mount options
-=======================
-
-In addition to the generic mount options described by the manual page for the
-mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the
-following mount options:
-
-======================= =======================================================
-iocharset=name Deprecated option. Still supported but please use
- nls=name in the future. See description for nls=name.
-
-nls=name Character set to use when returning file names.
- Unlike VFAT, NTFS suppresses names that contain
- unconvertible characters. Note that most character
- sets contain insufficient characters to represent all
- possible Unicode characters that can exist on NTFS.
- To be sure you are not missing any files, you are
- advised to use nls=utf8 which is capable of
- representing all Unicode characters.
-
-utf8=<bool> Option no longer supported. Currently mapped to
- nls=utf8 but please use nls=utf8 in the future and
- make sure utf8 is compiled either as module or into
- the kernel. See description for nls=name.
-
-uid=
-gid=
-umask= Provide default owner, group, and access mode mask.
- These options work as documented in mount(8). By
- default, the files/directories are owned by root and
- he/she has read and write permissions, as well as
- browse permission for directories. No one else has any
- access permissions. I.e. the mode on all files is by
- default rw------- and for directories rwx------, a
- consequence of the default fmask=0177 and dmask=0077.
- Using a umask of zero will grant all permissions to
- everyone, i.e. all files and directories will have mode
- rwxrwxrwx.
-
-fmask=
-dmask= Instead of specifying umask which applies both to
- files and directories, fmask applies only to files and
- dmask only to directories.
-
-sloppy=<BOOL> If sloppy is specified, ignore unknown mount options.
- Otherwise the default behaviour is to abort mount if
- any unknown options are found.
-
-show_sys_files=<BOOL> If show_sys_files is specified, show the system files
- in directory listings. Otherwise the default behaviour
- is to hide the system files.
- Note that even when show_sys_files is specified, "$MFT"
- will not be visible due to bugs/mis-features in glibc.
- Further, note that irrespective of show_sys_files, all
- files are accessible by name, i.e. you can always do
- "ls -l \$UpCase" for example to specifically show the
- system file containing the Unicode upcase table.
-
-case_sensitive=<BOOL> If case_sensitive is specified, treat all file names as
- case sensitive and create file names in the POSIX
- namespace. Otherwise the default behaviour is to treat
- file names as case insensitive and to create file names
- in the WIN32/LONG name space. Note, the Linux NTFS
- driver will never create short file names and will
- remove them on rename/delete of the corresponding long
- file name.
- Note that files remain accessible via their short file
- name, if it exists. If case_sensitive, you will need
- to provide the correct case of the short file name.
-
-disable_sparse=<BOOL> If disable_sparse is specified, creation of sparse
- regions, i.e. holes, inside files is disabled for the
- volume (for the duration of this mount only). By
- default, creation of sparse regions is enabled, which
- is consistent with the behaviour of traditional Unix
- filesystems.
-
-errors=opt What to do when critical filesystem errors are found.
- Following values can be used for "opt":
-
- ======== =========================================
- continue DEFAULT, try to clean-up as much as
- possible, e.g. marking a corrupt inode as
- bad so it is no longer accessed, and then
- continue.
- recover At present only supported is recovery of
- the boot sector from the backup copy.
- If read-only mount, the recovery is done
- in memory only and not written to disk.
- ======== =========================================
-
- Note that the options are additive, i.e. specifying::
-
- errors=continue,errors=recover
-
- means the driver will attempt to recover and if that
- fails it will clean-up as much as possible and
- continue.
-
-mft_zone_multiplier= Set the MFT zone multiplier for the volume (this
- setting is not persistent across mounts and can be
- changed from mount to mount but cannot be changed on
- remount). Values of 1 to 4 are allowed, 1 being the
- default. The MFT zone multiplier determines how much
- space is reserved for the MFT on the volume. If all
- other space is used up, then the MFT zone will be
- shrunk dynamically, so this has no impact on the
- amount of free space. However, it can have an impact
- on performance by affecting fragmentation of the MFT.
- In general use the default. If you have a lot of small
- files then use a higher value. The values have the
- following meaning:
-
- ===== =================================
- Value MFT zone size (% of volume size)
- ===== =================================
- 1 12.5%
- 2 25%
- 3 37.5%
- 4 50%
- ===== =================================
-
- Note this option is irrelevant for read-only mounts.
-======================= =======================================================
-
-
-Known bugs and (mis-)features
-=============================
-
-- The link count on each directory inode entry is set to 1, due to Linux not
- supporting directory hard links. This may well confuse some user space
- applications, since the directory names will have the same inode numbers.
- This also speeds up ntfs_read_inode() immensely. And we haven't found any
- problems with this approach so far. If you find a problem with this, please
- let us know.
-
-
-Please send bug reports/comments/feedback/abuse to the Linux-NTFS development
-list at sourceforge: [email protected]
-
-
-Using NTFS volume and stripe sets
-=================================
-
-For support of volume and stripe sets, you can either use the kernel's
-Device-Mapper driver or the kernel's Software RAID / MD driver. The former is
-the recommended one to use for linear raid. But the latter is required for
-raid level 5. For striping and mirroring, either driver should work fine.
-
-
-The Device-Mapper driver
-------------------------
-
-You will need to create a table of the components of the volume/stripe set and
-how they fit together and load this into the kernel using the dmsetup utility
-(see man 8 dmsetup).
-
-Linear volume sets, i.e. linear raid, has been tested and works fine. Even
-though untested, there is no reason why stripe sets, i.e. raid level 0, and
-mirrors, i.e. raid level 1 should not work, too. Stripes with parity, i.e.
-raid level 5, unfortunately cannot work yet because the current version of the
-Device-Mapper driver does not support raid level 5. You may be able to use the
-Software RAID / MD driver for raid level 5, see the next section for details.
-
-To create the table describing your volume you will need to know each of its
-components and their sizes in sectors, i.e. multiples of 512-byte blocks.
-
-For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for
-example if one of your partitions is /dev/hda2 you would do::
-
- $ fdisk -ul /dev/hda
-
- Disk /dev/hda: 81.9 GB, 81964302336 bytes
- 255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
- Units = sectors of 1 * 512 = 512 bytes
-
- Device Boot Start End Blocks Id System
- /dev/hda1 * 63 4209029 2104483+ 83 Linux
- /dev/hda2 4209030 37768814 16779892+ 86 NTFS
- /dev/hda3 37768815 46170809 4200997+ 83 Linux
-
-And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 =
-33559785 sectors.
-
-For Win2k and later dynamic disks, you can for example use the ldminfo utility
-which is part of the Linux LDM tools (the latest version at the time of
-writing is linux-ldm-0.0.8.tar.bz2). You can download it from:
-
- http://www.linux-ntfs.org/
-
-Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go
-into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You
-will find the precompiled (i386) ldminfo utility there. NOTE: You will not be
-able to compile this yourself easily so use the binary version!
-
-Then you would use ldminfo in dump mode to obtain the necessary information::
-
- $ ./ldminfo --dump /dev/hda
-
-This would dump the LDM database found on /dev/hda which describes all of your
-dynamic disks and all the volumes on them. At the bottom you will see the
-VOLUME DEFINITIONS section which is all you really need. You may need to look
-further above to determine which of the disks in the volume definitions is
-which device in Linux. Hint: Run ldminfo on each of your dynamic disks and
-look at the Disk Id close to the top of the output for each (the PRIVATE HEADER
-section). You can then find these Disk Ids in the VBLK DATABASE section in the
-<Disk> components where you will get the LDM Name for the disk that is found in
-the VOLUME DEFINITIONS section.
-
-Note you will also need to enable the LDM driver in the Linux kernel. If your
-distribution did not enable it, you will need to recompile the kernel with it
-enabled. This will create the LDM partitions on each device at boot time. You
-would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc)
-in the Device-Mapper table.
-
-You can also bypass using the LDM driver by using the main device (e.g.
-/dev/hda) and then using the offsets of the LDM partitions into this device as
-the "Start sector of device" when creating the table. Once again ldminfo would
-give you the correct information to do this.
-
-Assuming you know all your devices and their sizes things are easy.
-
-For a linear raid the table would look like this (note all values are in
-512-byte sectors)::
-
- # Offset into Size of this Raid type Device Start sector
- # volume device of device
- 0 1028161 linear /dev/hda1 0
- 1028161 3903762 linear /dev/hdb2 0
- 4931923 2103211 linear /dev/hdc1 0
-
-For a striped volume, i.e. raid level 0, you will need to know the chunk size
-you used when creating the volume. Windows uses 64kiB as the default, so it
-will probably be this unless you changes the defaults when creating the array.
-
-For a raid level 0 the table would look like this (note all values are in
-512-byte sectors)::
-
- # Offset Size Raid Number Chunk 1st Start 2nd Start
- # into of the type of size Device in Device in
- # volume volume stripes device device
- 0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0
-
-If there are more than two devices, just add each of them to the end of the
-line.
-
-Finally, for a mirrored volume, i.e. raid level 1, the table would look like
-this (note all values are in 512-byte sectors)::
-
- # Ofs Size Raid Log Number Region Should Number Source Start Target Start
- # in of the type type of log size sync? of Device in Device in
- # vol volume params mirrors Device Device
- 0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0
-
-If you are mirroring to multiple devices you can specify further targets at the
-end of the line.
-
-Note the "Should sync?" parameter "nosync" means that the two mirrors are
-already in sync which will be the case on a clean shutdown of Windows. If the
-mirrors are not clean, you can specify the "sync" option instead of "nosync"
-and the Device-Mapper driver will then copy the entirety of the "Source Device"
-to the "Target Device" or if you specified multiple target devices to all of
-them.
-
-Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1),
-and hand it over to dmsetup to work with, like so::
-
- $ dmsetup create myvolume1 /etc/ntfsvolume1
-
-You can obviously replace "myvolume1" with whatever name you like.
-
-If it all worked, you will now have the device /dev/device-mapper/myvolume1
-which you can then just use as an argument to the mount command as usual to
-mount the ntfs volume. For example::
-
- $ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1
-
-(You need to create the directory /mnt/myvol1 first and of course you can use
-anything you like instead of /mnt/myvol1 as long as it is an existing
-directory.)
-
-It is advisable to do the mount read-only to see if the volume has been setup
-correctly to avoid the possibility of causing damage to the data on the ntfs
-volume.
-
-
-The Software RAID / MD driver
------------------------------
-
-An alternative to using the Device-Mapper driver is to use the kernel's
-Software RAID / MD driver. For which you need to set up your /etc/raidtab
-appropriately (see man 5 raidtab).
-
-Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level
-0, have been tested and work fine (though see section "Limitations when using
-the MD driver with NTFS volumes" especially if you want to use linear raid).
-Even though untested, there is no reason why mirrors, i.e. raid level 1, and
-stripes with parity, i.e. raid level 5, should not work, too.
-
-You have to use the "persistent-superblock 0" option for each raid-disk in the
-NTFS volume/stripe you are configuring in /etc/raidtab as the persistent
-superblock used by the MD driver would damage the NTFS volume.
-
-Windows by default uses a stripe chunk size of 64k, so you probably want the
-"chunk-size 64k" option for each raid-disk, too.
-
-For example, if you have a stripe set consisting of two partitions /dev/hda5
-and /dev/hdb1 your /etc/raidtab would look like this::
-
- raiddev /dev/md0
- raid-level 0
- nr-raid-disks 2
- nr-spare-disks 0
- persistent-superblock 0
- chunk-size 64k
- device /dev/hda5
- raid-disk 0
- device /dev/hdb1
- raid-disk 1
-
-For linear raid, just change the raid-level above to "raid-level linear", for
-mirrors, change it to "raid-level 1", and for stripe sets with parity, change
-it to "raid-level 5".
-
-Note for stripe sets with parity you will also need to tell the MD driver
-which parity algorithm to use by specifying the option "parity-algorithm
-which", where you need to replace "which" with the name of the algorithm to
-use (see man 5 raidtab for available algorithms) and you will have to try the
-different available algorithms until you find one that works. Make sure you
-are working read-only when playing with this as you may damage your data
-otherwise. If you find which algorithm works please let us know (email the
-linux-ntfs developers list [email protected] or drop in on
-IRC in channel #ntfs on the irc.freenode.net network) so we can update this
-documentation.
-
-Once the raidtab is setup, run for example raid0run -a to start all devices or
-raid0run /dev/md0 to start a particular md device, in this case /dev/md0.
-
-Then just use the mount command as usual to mount the ntfs volume using for
-example::
-
- mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume
-
-It is advisable to do the mount read-only to see if the md volume has been
-setup correctly to avoid the possibility of causing damage to the data on the
-ntfs volume.
-
-
-Limitations when using the Software RAID / MD driver
------------------------------------------------------
-
-Using the md driver will not work properly if any of your NTFS partitions have
-an odd number of sectors. This is especially important for linear raid as all
-data after the first partition with an odd number of sectors will be offset by
-one or more sectors so if you mount such a partition with write support you
-will cause massive damage to the data on the volume which will only become
-apparent when you try to use the volume again under Windows.
-
-So when using linear raid, make sure that all your partitions have an even
-number of sectors BEFORE attempting to use it. You have been warned!
-
-Even better is to simply use the Device-Mapper for linear raid and then you do
-not have this problem with odd numbers of sectors.
diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index 5b93268e400f..165514401441 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -39,7 +39,7 @@ objects in the original filesystem.
On 64bit systems, even if all overlay layers are not on the same
underlying filesystem, the same compliant behavior could be achieved
with the "xino" feature. The "xino" feature composes a unique object
-identifier from the real object st_ino and an underlying fsid index.
+identifier from the real object st_ino and an underlying fsid number.
The "xino" feature uses the high inode number bits for fsid, because the
underlying filesystems rarely use the high inode number bits. In case
the underlying inode number does overflow into the high xino bits, overlay
@@ -118,7 +118,7 @@ Where both upper and lower objects are directories, a merged directory
is formed.
At mount time, the two directories given as mount options "lowerdir" and
-"upperdir" are combined into a merged directory:
+"upperdir" are combined into a merged directory::
mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,\
workdir=/work /merged
@@ -145,7 +145,9 @@ filesystem, an overlay filesystem needs to record in the upper filesystem
that files have been removed. This is done using whiteouts and opaque
directories (non-directories are always opaque).
-A whiteout is created as a character device with 0/0 device number.
+A whiteout is created as a character device with 0/0 device number or
+as a zero-size regular file with the xattr "trusted.overlay.whiteout".
+
When a whiteout is found in the upper level of a merged directory, any
matching name in the lower level is ignored, and the whiteout itself
is also hidden.
@@ -154,6 +156,13 @@ A directory is made opaque by setting the xattr "trusted.overlay.opaque"
to "y". Where the upper filesystem contains an opaque directory, any
directory in the lower filesystem with the same name is ignored.
+An opaque directory should not conntain any whiteouts, because they do not
+serve any purpose. A merge directory containing regular files with the xattr
+"trusted.overlay.whiteout", should be additionally marked by setting the xattr
+"trusted.overlay.opaque" to "x" on the merge directory itself.
+This is needed to avoid the overhead of checking the "trusted.overlay.whiteout"
+on all entries during readdir in the common case.
+
readdir
-------
@@ -172,12 +181,12 @@ directory is being read. This is unlikely to be noticed by many
programs.
seek offsets are assigned sequentially when the directories are read.
-Thus if
+Thus if:
- - read part of a directory
- - remember an offset, and close the directory
- - re-open the directory some time later
- - seek to the remembered offset
+ - read part of a directory
+ - remember an offset, and close the directory
+ - re-open the directory some time later
+ - seek to the remembered offset
there may be little correlation between the old and new locations in
the list of filenames, particularly if anything has changed in the
@@ -290,9 +299,9 @@ Permission checking in the overlay filesystem follows these principles:
2) task creating the overlay mount MUST NOT gain additional privileges
3) non-mounting task MAY gain additional privileges through the overlay,
- compared to direct access on underlying lower or upper filesystems
+ compared to direct access on underlying lower or upper filesystems
-This is achieved by performing two permission checks on each access
+This is achieved by performing two permission checks on each access:
a) check if current task is allowed access based on local DAC (owner,
group, mode and posix acl), as well as MAC checks
@@ -311,11 +320,11 @@ to create setups where the consistency rule (1) does not hold; normally,
however, the mounting task will have sufficient privileges to perform all
operations.
-Another way to demonstrate this model is drawing parallels between
+Another way to demonstrate this model is drawing parallels between::
mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,... /merged
-and
+and::
cp -a /lower /upper
mount --bind /upper /merged
@@ -328,7 +337,7 @@ Multiple lower layers
---------------------
Multiple lower layers can now be given using the colon (":") as a
-separator character between the directory names. For example:
+separator character between the directory names. For example::
mount -t overlay overlay -olowerdir=/lower1:/lower2:/lower3 /merged
@@ -340,14 +349,15 @@ rightmost one and going left. In the above example lower1 will be the
top, lower2 the middle and lower3 the bottom layer.
Note: directory names containing colons can be provided as lower layer by
-escaping the colons with a single backslash. For example:
+escaping the colons with a single backslash. For example::
mount -t overlay overlay -olowerdir=/a\:lower\:\:dir /merged
-Since kernel version v6.5, directory names containing colons can also
-be provided as lower layer using the fsconfig syscall from new mount api:
+Since kernel version v6.8, directory names containing colons can also
+be configured as lower layer using the "lowerdir+" mount options and the
+fsconfig syscall from new mount api. For example::
- fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir", "/a:lower::dir", 0);
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/a:lower::dir", 0);
In the latter case, colons in lower layer directory names will be escaped
as an octal characters (\072) when displayed in /proc/self/mountinfo.
@@ -355,7 +365,7 @@ as an octal characters (\072) when displayed in /proc/self/mountinfo.
Metadata only copy up
---------------------
-When metadata only copy up feature is enabled, overlayfs will only copy
+When the "metacopy" feature is enabled, overlayfs will only copy
up metadata (as opposed to whole file), when a metadata specific operation
like chown/chmod is performed. Full file will be copied up later when
file is opened for WRITE operation.
@@ -404,7 +414,7 @@ A normal lower layer is not allowed to be below a data-only layer, so single
colon separators are not allowed to the right of double colon ("::") separators.
-For example:
+For example::
mount -t overlay overlay -olowerdir=/l1:/l2:/l3::/do1::/do2 /merged
@@ -416,9 +426,19 @@ Only the data of the files in the "data-only" lower layers may be visible
when a "metacopy" file in one of the lower layers above it, has a "redirect"
to the absolute path of the "lower data" file in the "data-only" lower layer.
+Since kernel version v6.8, "data-only" lower layers can also be added using
+the "datadir+" mount options and the fsconfig syscall from new mount api.
+For example::
+
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/l1", 0);
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/l2", 0);
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/l3", 0);
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "datadir+", "/do1", 0);
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "datadir+", "/do2", 0);
+
fs-verity support
-----------------------
+-----------------
During metadata copy up of a lower file, if the source file has
fs-verity enabled and overlay verity support is enabled, then the
@@ -481,29 +501,53 @@ though it will not result in a crash or deadlock.
Mounting an overlay using an upper layer path, where the upper layer path
was previously used by another mounted overlay in combination with a
-different lower layer path, is allowed, unless the "inodes index" feature
-or "metadata only copy up" feature is enabled.
+different lower layer path, is allowed, unless the "index" or "metacopy"
+features are enabled.
-With the "inodes index" feature, on the first time mount, an NFS file
+With the "index" feature, on the first time mount, an NFS file
handle of the lower layer root directory, along with the UUID of the lower
filesystem, are encoded and stored in the "trusted.overlay.origin" extended
attribute on the upper layer root directory. On subsequent mount attempts,
the lower root directory file handle and lower filesystem UUID are compared
to the stored origin in upper root directory. On failure to verify the
lower root origin, mount will fail with ESTALE. An overlayfs mount with
-"inodes index" enabled will fail with EOPNOTSUPP if the lower filesystem
+"index" enabled will fail with EOPNOTSUPP if the lower filesystem
does not support NFS export, lower filesystem does not have a valid UUID or
if the upper filesystem does not support extended attributes.
-For "metadata only copy up" feature there is no verification mechanism at
+For the "metacopy" feature, there is no verification mechanism at
mount time. So if same upper is mounted with different set of lower, mount
probably will succeed but expect the unexpected later on. So don't do it.
It is quite a common practice to copy overlay layers to a different
directory tree on the same or different underlying filesystem, and even
-to a different machine. With the "inodes index" feature, trying to mount
+to a different machine. With the "index" feature, trying to mount
the copied layers will fail the verification of the lower root file handle.
+Nesting overlayfs mounts
+------------------------
+
+It is possible to use a lower directory that is stored on an overlayfs
+mount. For regular files this does not need any special care. However, files
+that have overlayfs attributes, such as whiteouts or "overlay.*" xattrs will be
+interpreted by the underlying overlayfs mount and stripped out. In order to
+allow the second overlayfs mount to see the attributes they must be escaped.
+
+Overlayfs specific xattrs are escaped by using a special prefix of
+"overlay.overlay.". So, a file with a "trusted.overlay.overlay.metacopy" xattr
+in the lower dir will be exposed as a regular file with a
+"trusted.overlay.metacopy" xattr in the overlayfs mount. This can be nested by
+repeating the prefix multiple time, as each instance only removes one prefix.
+
+A lower dir with a regular whiteout will always be handled by the overlayfs
+mount, so to support storing an effective whiteout file in an overlayfs mount an
+alternative form of whiteout is supported. This form is a regular, zero-size
+file with the "overlay.whiteout" xattr set, inside a directory with the
+"overlay.opaque" xattr set to "x" (see `whiteouts and opaque directories`_).
+These alternative whiteouts are never created by overlayfs, but can be used by
+userspace tools (like containers) that generate lower layers.
+These alternative whiteouts can be escaped using the standard xattr escape
+mechanism in order to properly nest to any depth.
Non-standard behavior
---------------------
@@ -513,20 +557,21 @@ filesystem.
This is the list of cases that overlayfs doesn't currently handle:
-a) POSIX mandates updating st_atime for reads. This is currently not
-done in the case when the file resides on a lower layer.
+ a) POSIX mandates updating st_atime for reads. This is currently not
+ done in the case when the file resides on a lower layer.
-b) If a file residing on a lower layer is opened for read-only and then
-memory mapped with MAP_SHARED, then subsequent changes to the file are not
-reflected in the memory mapping.
+ b) If a file residing on a lower layer is opened for read-only and then
+ memory mapped with MAP_SHARED, then subsequent changes to the file are not
+ reflected in the memory mapping.
-c) If a file residing on a lower layer is being executed, then opening that
-file for write or truncating the file will not be denied with ETXTBSY.
+ c) If a file residing on a lower layer is being executed, then opening that
+ file for write or truncating the file will not be denied with ETXTBSY.
The following options allow overlayfs to act more like a standards
compliant filesystem:
-1) "redirect_dir"
+redirect_dir
+````````````
Enabled with the mount option or module option: "redirect_dir=on" or with
the kernel config option CONFIG_OVERLAY_FS_REDIRECT_DIR=y.
@@ -534,7 +579,8 @@ the kernel config option CONFIG_OVERLAY_FS_REDIRECT_DIR=y.
If this feature is disabled, then rename(2) on a lower or merged directory
will fail with EXDEV ("Invalid cross-device link").
-2) "inode index"
+index
+`````
Enabled with the mount option or module option "index=on" or with the
kernel config option CONFIG_OVERLAY_FS_INDEX=y.
@@ -543,7 +589,8 @@ If this feature is disabled and a file with multiple hard links is copied
up, then this will "break" the link. Changes will not be propagated to
other names referring to the same inode.
-3) "xino"
+xino
+````
Enabled with the mount option "xino=auto" or "xino=on", with the module
option "xino_auto=on" or with the kernel config option
@@ -570,7 +617,7 @@ a crash or deadlock.
Offline changes, when the overlay is not mounted, are allowed to the
upper tree. Offline changes to the lower tree are only allowed if the
-"metadata only copy up", "inode index", "xino" and "redirect_dir" features
+"metacopy", "index", "xino" and "redirect_dir" features
have not been used. If the lower tree is modified and any of these
features has been used, the behavior of the overlay is undefined,
though it will not result in a crash or deadlock.
@@ -610,12 +657,13 @@ directory inode.
When encoding a file handle from an overlay filesystem object, the
following rules apply:
-1. For a non-upper object, encode a lower file handle from lower inode
-2. For an indexed object, encode a lower file handle from copy_up origin
-3. For a pure-upper object and for an existing non-indexed upper object,
- encode an upper file handle from upper inode
+ 1. For a non-upper object, encode a lower file handle from lower inode
+ 2. For an indexed object, encode a lower file handle from copy_up origin
+ 3. For a pure-upper object and for an existing non-indexed upper object,
+ encode an upper file handle from upper inode
The encoded overlay file handle includes:
+
- Header including path type information (e.g. lower/upper)
- UUID of the underlying filesystem
- Underlying filesystem encoding of underlying inode
@@ -625,15 +673,15 @@ are stored in extended attribute "trusted.overlay.origin".
When decoding an overlay file handle, the following steps are followed:
-1. Find underlying layer by UUID and path type information.
-2. Decode the underlying filesystem file handle to underlying dentry.
-3. For a lower file handle, lookup the handle in index directory by name.
-4. If a whiteout is found in index, return ESTALE. This represents an
- overlay object that was deleted after its file handle was encoded.
-5. For a non-directory, instantiate a disconnected overlay dentry from the
- decoded underlying dentry, the path type and index inode, if found.
-6. For a directory, use the connected underlying decoded dentry, path type
- and index, to lookup a connected overlay dentry.
+ 1. Find underlying layer by UUID and path type information.
+ 2. Decode the underlying filesystem file handle to underlying dentry.
+ 3. For a lower file handle, lookup the handle in index directory by name.
+ 4. If a whiteout is found in index, return ESTALE. This represents an
+ overlay object that was deleted after its file handle was encoded.
+ 5. For a non-directory, instantiate a disconnected overlay dentry from the
+ decoded underlying dentry, the path type and index inode, if found.
+ 6. For a directory, use the connected underlying decoded dentry, path type
+ and index, to lookup a connected overlay dentry.
Decoding a non-directory file handle may return a disconnected dentry.
copy_up of that disconnected dentry will create an upper index entry with
@@ -736,9 +784,9 @@ Testsuite
There's a testsuite originally developed by David Howells and currently
maintained by Amir Goldstein at:
- https://github.com/amir73il/unionmount-testsuite.git
+https://github.com/amir73il/unionmount-testsuite.git
-Run as root:
+Run as root::
# cd unionmount-testsuite
# ./run --ov --verify
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index d69f59700a23..f2b44c2400c6 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -858,7 +858,7 @@ be misspelled d_alloc_anon().
**mandatory**
-[should've been added in 2016] stale comment in finish_open() nonwithstanding,
+[should've been added in 2016] stale comment in finish_open() notwithstanding,
failure exits in ->atomic_open() instances should *NOT* fput() the file,
no matter what. Everything is handled by the caller.
@@ -989,7 +989,7 @@ This mechanism would only work for a single device so the block layer couldn't
find the owning superblock of any additional devices.
In the old mechanism reusing or creating a superblock for a racing mount(2) and
-umount(2) relied on the file_system_type as the holder. This was severly
+umount(2) relied on the file_system_type as the holder. This was severely
underdocumented however:
(1) Any concurrent mounter that managed to grab an active reference on an
@@ -1052,3 +1052,85 @@ kill_anon_super(), or kill_block_super() helpers.
Lock ordering has been changed so that s_umount ranks above open_mutex again.
All places where s_umount was taken under open_mutex have been fixed up.
+
+---
+
+**mandatory**
+
+export_operations ->encode_fh() no longer has a default implementation to
+encode FILEID_INO32_GEN* file handles.
+Filesystems that used the default implementation may use the generic helper
+generic_encode_ino32_fh() explicitly.
+
+---
+
+**mandatory**
+
+If ->rename() update of .. on cross-directory move needs an exclusion with
+directory modifications, do *not* lock the subdirectory in question in your
+->rename() - it's done by the caller now [that item should've been added in
+28eceeda130f "fs: Lock moved directories"].
+
+---
+
+**mandatory**
+
+On same-directory ->rename() the (tautological) update of .. is not protected
+by any locks; just don't do it if the old parent is the same as the new one.
+We really can't lock two subdirectories in same-directory rename - not without
+deadlocks.
+
+---
+
+**mandatory**
+
+lock_rename() and lock_rename_child() may fail in cross-directory case, if
+their arguments do not have a common ancestor. In that case ERR_PTR(-EXDEV)
+is returned, with no locks taken. In-tree users updated; out-of-tree ones
+would need to do so.
+
+---
+
+**mandatory**
+
+The list of children anchored in parent dentry got turned into hlist now.
+Field names got changed (->d_children/->d_sib instead of ->d_subdirs/->d_child
+for anchor/entries resp.), so any affected places will be immediately caught
+by compiler.
+
+---
+
+**mandatory**
+
+->d_delete() instances are now called for dentries with ->d_lock held
+and refcount equal to 0. They are not permitted to drop/regain ->d_lock.
+None of in-tree instances did anything of that sort. Make sure yours do not...
+
+---
+
+**mandatory**
+
+->d_prune() instances are now called without ->d_lock held on the parent.
+->d_lock on dentry itself is still held; if you need per-parent exclusions (none
+of the in-tree instances did), use your own spinlock.
+
+->d_iput() and ->d_release() are called with victim dentry still in the
+list of parent's children. It is still unhashed, marked killed, etc., just not
+removed from parent's ->d_children yet.
+
+Anyone iterating through the list of children needs to be aware of the
+half-killed dentries that might be seen there; taking ->d_lock on those will
+see them negative, unhashed and with negative refcount, which means that most
+of the in-kernel users would've done the right thing anyway without any adjustment.
+
+---
+
+**recommended**
+
+Block device freezing and thawing have been moved to holder operations.
+
+Before this change, get_active_super() would only be able to find the
+superblock of the main block device, i.e., the one stored in sb->s_bdev. Block
+device freezing now works for any block device owned by a given superblock, not
+just the main block device. The get_active_super() helper and bd_fsfreeze_sb
+pointer are gone.
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 49ef12df631b..c6a6b9df2104 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -528,9 +528,9 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
does not take into account swapped out page of underlying shmem objects.
"Locked" indicates whether the mapping is locked in memory or not.
-"THPeligible" indicates whether the mapping is eligible for allocating THP
-pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise.
-It just shows the current status.
+"THPeligible" indicates whether the mapping is eligible for allocating
+naturally aligned THP pages of any currently enabled size. 1 if true, 0
+otherwise.
"VmFlags" field deserves a separate description. This member represents the
kernel flags associated with the particular virtual memory area in two letter
@@ -1899,8 +1899,8 @@ For more information on mount propagation see:
These files provide a method to access a task's comm value. It also allows for
a task to set its own or one of its thread siblings comm value. The comm value
is limited in size compared to the cmdline value, so writing anything longer
-then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated
-comm value.
+then the kernel's TASK_COMM_LEN (currently 16 chars, including the NUL
+terminator) will result in a truncated comm value.
3.7 /proc/<pid>/task/<tid>/children - Information about task children
diff --git a/Documentation/filesystems/smb/ksmbd.rst b/Documentation/filesystems/smb/ksmbd.rst
index 7bed96d794fc..6b30e43a0d11 100644
--- a/Documentation/filesystems/smb/ksmbd.rst
+++ b/Documentation/filesystems/smb/ksmbd.rst
@@ -73,15 +73,14 @@ Auto Negotiation Supported.
Compound Request Supported.
Oplock Cache Mechanism Supported.
SMB2 leases(v1 lease) Supported.
-Directory leases(v2 lease) Planned for future.
+Directory leases(v2 lease) Supported.
Multi-credits Supported.
NTLM/NTLMv2 Supported.
HMAC-SHA256 Signing Supported.
Secure negotiate Supported.
Signing Update Supported.
Pre-authentication integrity Supported.
-SMB3 encryption(CCM, GCM) Supported. (CCM and GCM128 supported, GCM256 in
- progress)
+SMB3 encryption(CCM, GCM) Supported. (CCM/GCM128 and CCM/GCM256 supported)
SMB direct(RDMA) Supported.
SMB3 Multi-channel Partially Supported. Planned to implement
replay/retry mechanisms for future.
@@ -112,6 +111,10 @@ DCE/RPC support Partially Supported. a few calls(NetShareEnumAll,
for Witness protocol e.g.)
ksmbd/nfsd interoperability Planned for future. The features that ksmbd
support are Leases, Notify, ACLs and Share modes.
+SMB3.1.1 Compression Planned for future.
+SMB3.1.1 over QUIC Planned for future.
+Signing/Encryption over RDMA Planned for future.
+SMB3.1.1 GMAC signing support Planned for future.
============================== =================================================
diff --git a/Documentation/filesystems/squashfs.rst b/Documentation/filesystems/squashfs.rst
index df42106bae71..4af8d6207509 100644
--- a/Documentation/filesystems/squashfs.rst
+++ b/Documentation/filesystems/squashfs.rst
@@ -64,6 +64,66 @@ obtained from this site also.
The squashfs-tools development tree is now located on kernel.org
git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git
+2.1 Mount options
+-----------------
+=================== =========================================================
+errors=%s Specify whether squashfs errors trigger a kernel panic
+ or not
+
+ ========== =============================================
+ continue errors don't trigger a panic (default)
+ panic trigger a panic when errors are encountered,
+ similar to several other filesystems (e.g.
+ btrfs, ext4, f2fs, GFS2, jfs, ntfs, ubifs)
+
+ This allows a kernel dump to be saved,
+ useful for analyzing and debugging the
+ corruption.
+ ========== =============================================
+threads=%s Select the decompression mode or the number of threads
+
+ If SQUASHFS_CHOICE_DECOMP_BY_MOUNT is set:
+
+ ========== =============================================
+ single use single-threaded decompression (default)
+
+ Only one block (data or metadata) can be
+ decompressed at any one time. This limits
+ CPU and memory usage to a minimum, but it
+ also gives poor performance on parallel I/O
+ workloads when using multiple CPU machines
+ due to waiting on decompressor availability.
+ multi use up to two parallel decompressors per core
+
+ If you have a parallel I/O workload and your
+ system has enough memory, using this option
+ may improve overall I/O performance. It
+ dynamically allocates decompressors on a
+ demand basis.
+ percpu use a maximum of one decompressor per core
+
+ It uses percpu variables to ensure
+ decompression is load-balanced across the
+ cores.
+ 1|2|3|... configure the number of threads used for
+ decompression
+
+ The upper limit is num_online_cpus() * 2.
+ ========== =============================================
+
+ If SQUASHFS_CHOICE_DECOMP_BY_MOUNT is **not** set and
+ SQUASHFS_DECOMP_MULTI, SQUASHFS_MOUNT_DECOMP_THREADS are
+ both set:
+
+ ========== =============================================
+ 2|3|... configure the number of threads used for
+ decompression
+
+ The upper limit is num_online_cpus() * 2.
+ ========== =============================================
+
+=================== =========================================================
+
3. Squashfs Filesystem Design
-----------------------------
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 99acc2e98673..6e903a903f8f 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -437,7 +437,7 @@ field. This is a pointer to a "struct inode_operations" which describes
the methods that can be performed on individual inodes.
-struct xattr_handlers
+struct xattr_handler
---------------------
On filesystems that support extended attributes (xattrs), the s_xattr
@@ -823,7 +823,7 @@ cache in your filesystem. The following members are defined:
bool (*is_partially_uptodate) (struct folio *, size_t from,
size_t count);
void (*is_dirty_writeback)(struct folio *, bool *, bool *);
- int (*error_remove_page) (struct mapping *mapping, struct page *page);
+ int (*error_remove_folio)(struct mapping *mapping, struct folio *);
int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
int (*swap_deactivate)(struct file *);
int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
@@ -1034,8 +1034,8 @@ cache in your filesystem. The following members are defined:
VM if a folio should be treated as dirty or writeback for the
purposes of stalling.
-``error_remove_page``
- normally set to generic_error_remove_page if truncation is ok
+``error_remove_folio``
+ normally set to generic_error_remove_folio if truncation is ok
for this address space. Used for memory failure handling.
Setting this implies you deal with pages going away under you,
unless you have them locked or reference counts increased.
@@ -1264,7 +1264,7 @@ defined:
char *(*d_dname)(struct dentry *, char *, int);
struct vfsmount *(*d_automount)(struct path *);
int (*d_manage)(const struct path *, bool);
- struct dentry *(*d_real)(struct dentry *, const struct inode *);
+ struct dentry *(*d_real)(struct dentry *, enum d_real_type type);
};
``d_revalidate``
@@ -1419,16 +1419,14 @@ defined:
the dentry being transited from.
``d_real``
- overlay/union type filesystems implement this method to return
- one of the underlying dentries hidden by the overlay. It is
- used in two different modes:
+ overlay/union type filesystems implement this method to return one
+ of the underlying dentries of a regular file hidden by the overlay.
- Called from file_dentry() it returns the real dentry matching
- the inode argument. The real dentry may be from a lower layer
- already copied up, but still referenced from the file. This
- mode is selected with a non-NULL inode argument.
+ The 'type' argument takes the values D_REAL_DATA or D_REAL_METADATA
+ for returning the real underlying dentry that refers to the inode
+ hosting the file's data or metadata respectively.
- With NULL inode the topmost real underlying dentry is returned.
+ For non-regular files, the 'dentry' argument is returned.
Each dentry has a pointer to its parent dentry, as well as a hash list
of child dentries. Child dentries are basically like files in a
diff --git a/Documentation/filesystems/xfs/index.rst b/Documentation/filesystems/xfs/index.rst
new file mode 100644
index 000000000000..ab66c57a5d18
--- /dev/null
+++ b/Documentation/filesystems/xfs/index.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+XFS Filesystem Documentation
+============================
+
+.. toctree::
+ :maxdepth: 2
+ :numbered:
+
+ xfs-delayed-logging-design
+ xfs-maintainer-entry-profile
+ xfs-self-describing-metadata
+ xfs-online-fsck-design
diff --git a/Documentation/filesystems/xfs-delayed-logging-design.rst b/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst
index 6402ab8e370c..6402ab8e370c 100644
--- a/Documentation/filesystems/xfs-delayed-logging-design.rst
+++ b/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst
diff --git a/Documentation/filesystems/xfs-maintainer-entry-profile.rst b/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst
index 32b6ac4ca9d6..32b6ac4ca9d6 100644
--- a/Documentation/filesystems/xfs-maintainer-entry-profile.rst
+++ b/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst
diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
index a0678101a7d0..6333697ba3e8 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
@@ -962,7 +962,7 @@ disk, but these buffer verifiers cannot provide any consistency checking
between metadata structures.
For more information, please see the documentation for
-Documentation/filesystems/xfs-self-describing-metadata.rst
+Documentation/filesystems/xfs/xfs-self-describing-metadata.rst
Reverse Mapping
---------------
@@ -1915,19 +1915,13 @@ four of those five higher level data structures.
The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case
study.
-The most general storage interface supported by the xfile enables the reading
-and writing of arbitrary quantities of data at arbitrary offsets in the xfile.
-This capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions,
-which behave similarly to their userspace counterparts.
XFS is very record-based, which suggests that the ability to load and store
complete records is important.
-To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store``
-functions are provided to read and persist objects into an xfile.
-They are internally the same as pread and pwrite, except that they treat any
-error as an out of memory error.
-For online repair, squashing error conditions in this manner is an acceptable
-behavior because the only reaction is to abort the operation back to userspace.
-All five xfile usecases can be serviced by these four functions.
+To support these cases, a pair of ``xfile_load`` and ``xfile_store``
+functions are provided to read and persist objects into an xfile that treat any
+error as an out of memory error. For online repair, squashing error conditions
+in this manner is an acceptable behavior because the only reaction is to abort
+the operation back to userspace.
However, no discussion of file access idioms is complete without answering the
question, "But what about mmap?"
@@ -1939,15 +1933,14 @@ tmpfs can only push a pagecache folio to the swap cache if the folio is neither
pinned nor locked, which means the xfile must not pin too many folios.
Short term direct access to xfile contents is done by locking the pagecache
-folio and mapping it into kernel address space.
-Programmatic access (e.g. pread and pwrite) uses this mechanism.
-Folio locks are not supposed to be held for long periods of time, so long
-term direct access to xfile contents is done by bumping the folio refcount,
+folio and mapping it into kernel address space. Object load and store uses this
+mechanism. Folio locks are not supposed to be held for long periods of time, so
+long term direct access to xfile contents is done by bumping the folio refcount,
mapping it into kernel address space, and dropping the folio lock.
These long term users *must* be responsive to memory reclaim by hooking into
the shrinker infrastructure to know when to release folios.
-The ``xfile_get_page`` and ``xfile_put_page`` functions are provided to
+The ``xfile_get_folio`` and ``xfile_put_folio`` functions are provided to
retrieve the (locked) folio that backs part of an xfile and to release it.
The only code to use these folio lease functions are the xfarray
:ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
@@ -2277,13 +2270,12 @@ follows:
pointing to the xfile.
3. Pass the buffer cache target, buffer ops, and other information to
- ``xfbtree_create`` to write an initial tree header and root block to the
- xfile.
+ ``xfbtree_init`` to initialize the passed in ``struct xfbtree`` and write an
+ initial root block to the xfile.
Each btree type should define a wrapper that passes necessary arguments to
the creation function.
For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
all the necessary details for callers.
- A ``struct xfbtree`` object will be returned.
4. Pass the xfbtree object to the btree cursor creation function for the
btree type.
diff --git a/Documentation/filesystems/xfs-self-describing-metadata.rst b/Documentation/filesystems/xfs/xfs-self-describing-metadata.rst
index a10c4ae6955e..a10c4ae6955e 100644
--- a/Documentation/filesystems/xfs-self-describing-metadata.rst
+++ b/Documentation/filesystems/xfs/xfs-self-describing-metadata.rst