From 07d241fd66ba99111d43a0a4c4abeeb972468d1d Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:47 +0100 Subject: docs: filesystems: convert 9p.txt to ReST - Add a SPDX header; - Add a document title; - Adjust section titles; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/96a060b7b5c0c3838ab1751addfe4d6d3bc37bd6.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/9p.rst | 185 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/9p.txt | 161 ------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 186 insertions(+), 161 deletions(-) create mode 100644 Documentation/filesystems/9p.rst delete mode 100644 Documentation/filesystems/9p.txt diff --git a/Documentation/filesystems/9p.rst b/Documentation/filesystems/9p.rst new file mode 100644 index 000000000000..f054d1c45e86 --- /dev/null +++ b/Documentation/filesystems/9p.rst @@ -0,0 +1,185 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================= +v9fs: Plan 9 Resource Sharing for Linux +======================================= + +About +===== + +v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol. + +This software was originally developed by Ron Minnich +and Maya Gokhale. Additional development by Greg Watson + and most recently Eric Van Hensbergen +, Latchesar Ionkov and Russ Cox +. + +The best detailed explanation of the Linux implementation and applications of +the 9p client is available in the form of a USENIX paper: + + http://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html + +Other applications are described in the following papers: + + * XCPU & Clustering + http://xcpu.org/papers/xcpu-talk.pdf + * KVMFS: control file system for KVM + http://xcpu.org/papers/kvmfs.pdf + * CellFS: A New Programming Model for the Cell BE + http://xcpu.org/papers/cellfs-talk.pdf + * PROSE I/O: Using 9p to enable Application Partitions + http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf + * VirtFS: A Virtualization Aware File System pass-through + http://goo.gl/3WPDg + +Usage +===== + +For remote file server:: + + mount -t 9p 10.10.1.2 /mnt/9 + +For Plan 9 From User Space applications (http://swtch.com/plan9):: + + mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER + +For server running on QEMU host with virtio transport:: + + mount -t 9p -o trans=virtio /mnt/9 + +where mount_tag is the tag associated by the server to each of the exported +mount points. Each 9P export is seen by the client as a virtio device with an +associated "mount_tag" property. Available mount tags can be +seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio/mount_tag files. + +Options +======= + + ============= =============================================================== + trans=name select an alternative transport. Valid options are + currently: + + ======== ============================================ + unix specifying a named pipe mount point + tcp specifying a normal TCP/IP connection + fd used passed file descriptors for connection + (see rfdno and wfdno) + virtio connect to the next virtio channel available + (from QEMU with trans_virtio module) + rdma connect to a specified RDMA channel + ======== ============================================ + + uname=name user name to attempt mount as on the remote server. The + server may override or ignore this value. Certain user + names may require authentication. + + aname=name aname specifies the file tree to access when the server is + offering several exported file systems. + + cache=mode specifies a caching policy. By default, no caches are used. + + none + default no cache policy, metadata and data + alike are synchronous. + loose + no attempts are made at consistency, + intended for exclusive, read-only mounts + fscache + use FS-Cache for a persistent, read-only + cache backend. + mmap + minimal cache that is only used for read-write + mmap. Northing else is cached, like cache=none + + debug=n specifies debug level. The debug level is a bitmask. + + ===== ================================ + 0x01 display verbose error messages + 0x02 developer debug (DEBUG_CURRENT) + 0x04 display 9p trace + 0x08 display VFS trace + 0x10 display Marshalling debug + 0x20 display RPC debug + 0x40 display transport debug + 0x80 display allocation debug + 0x100 display protocol message debug + 0x200 display Fid debug + 0x400 display packet debug + 0x800 display fscache tracing debug + ===== ================================ + + rfdno=n the file descriptor for reading with trans=fd + + wfdno=n the file descriptor for writing with trans=fd + + msize=n the number of bytes to use for 9p packet payload + + port=n port to connect to on the remote server + + noextend force legacy mode (no 9p2000.u or 9p2000.L semantics) + + version=name Select 9P protocol version. Valid options are: + + ======== ============================== + 9p2000 Legacy mode (same as noextend) + 9p2000.u Use 9P2000.u protocol + 9p2000.L Use 9P2000.L protocol + ======== ============================== + + dfltuid attempt to mount as a particular uid + + dfltgid attempt to mount with a particular gid + + afid security channel - used by Plan 9 authentication protocols + + nodevmap do not map special files - represent them as normal files. + This can be used to share devices/named pipes/sockets between + hosts. This functionality will be expanded in later versions. + + access there are four access modes. + user + if a user tries to access a file on v9fs + filesystem for the first time, v9fs sends an + attach command (Tattach) for that user. + This is the default mode. + + allows only user with uid= to access + the files on the mounted filesystem + any + v9fs does single attach and performs all + operations as one user + clien + ACL based access check on the 9p client + side for access validation + + cachetag cache tag to use the specified persistent cache. + cache tags for existing cache sessions can be listed at + /sys/fs/9p/caches. (applies only to cache=fscache) + ============= =============================================================== + +Resources +========= + +Protocol specifications are maintained on github: +http://ericvh.github.com/9p-rfc/ + +9p client and server implementations are listed on +http://9p.cat-v.org/implementations + +A 9p2000.L server is being developed by LLNL and can be found +at http://code.google.com/p/diod/ + +There are user and developer mailing lists available through the v9fs project +on sourceforge (http://sourceforge.net/projects/v9fs). + +News and other information is maintained on a Wiki. +(http://sf.net/apps/mediawiki/v9fs/index.php). + +Bug reports are best issued via the mailing list. + +For more information on the Plan 9 Operating System check out +http://plan9.bell-labs.com/plan9 + +For information on Plan 9 from User Space (Plan 9 applications and libraries +ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9 diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.txt deleted file mode 100644 index fec7144e817c..000000000000 --- a/Documentation/filesystems/9p.txt +++ /dev/null @@ -1,161 +0,0 @@ - v9fs: Plan 9 Resource Sharing for Linux - ======================================= - -ABOUT -===== - -v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol. - -This software was originally developed by Ron Minnich -and Maya Gokhale. Additional development by Greg Watson - and most recently Eric Van Hensbergen -, Latchesar Ionkov and Russ Cox -. - -The best detailed explanation of the Linux implementation and applications of -the 9p client is available in the form of a USENIX paper: - http://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html - -Other applications are described in the following papers: - * XCPU & Clustering - http://xcpu.org/papers/xcpu-talk.pdf - * KVMFS: control file system for KVM - http://xcpu.org/papers/kvmfs.pdf - * CellFS: A New Programming Model for the Cell BE - http://xcpu.org/papers/cellfs-talk.pdf - * PROSE I/O: Using 9p to enable Application Partitions - http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf - * VirtFS: A Virtualization Aware File System pass-through - http://goo.gl/3WPDg - -USAGE -===== - -For remote file server: - - mount -t 9p 10.10.1.2 /mnt/9 - -For Plan 9 From User Space applications (http://swtch.com/plan9) - - mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER - -For server running on QEMU host with virtio transport: - - mount -t 9p -o trans=virtio /mnt/9 - -where mount_tag is the tag associated by the server to each of the exported -mount points. Each 9P export is seen by the client as a virtio device with an -associated "mount_tag" property. Available mount tags can be -seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio/mount_tag files. - -OPTIONS -======= - - trans=name select an alternative transport. Valid options are - currently: - unix - specifying a named pipe mount point - tcp - specifying a normal TCP/IP connection - fd - used passed file descriptors for connection - (see rfdno and wfdno) - virtio - connect to the next virtio channel available - (from QEMU with trans_virtio module) - rdma - connect to a specified RDMA channel - - uname=name user name to attempt mount as on the remote server. The - server may override or ignore this value. Certain user - names may require authentication. - - aname=name aname specifies the file tree to access when the server is - offering several exported file systems. - - cache=mode specifies a caching policy. By default, no caches are used. - none = default no cache policy, metadata and data - alike are synchronous. - loose = no attempts are made at consistency, - intended for exclusive, read-only mounts - fscache = use FS-Cache for a persistent, read-only - cache backend. - mmap = minimal cache that is only used for read-write - mmap. Northing else is cached, like cache=none - - debug=n specifies debug level. The debug level is a bitmask. - 0x01 = display verbose error messages - 0x02 = developer debug (DEBUG_CURRENT) - 0x04 = display 9p trace - 0x08 = display VFS trace - 0x10 = display Marshalling debug - 0x20 = display RPC debug - 0x40 = display transport debug - 0x80 = display allocation debug - 0x100 = display protocol message debug - 0x200 = display Fid debug - 0x400 = display packet debug - 0x800 = display fscache tracing debug - - rfdno=n the file descriptor for reading with trans=fd - - wfdno=n the file descriptor for writing with trans=fd - - msize=n the number of bytes to use for 9p packet payload - - port=n port to connect to on the remote server - - noextend force legacy mode (no 9p2000.u or 9p2000.L semantics) - - version=name Select 9P protocol version. Valid options are: - 9p2000 - Legacy mode (same as noextend) - 9p2000.u - Use 9P2000.u protocol - 9p2000.L - Use 9P2000.L protocol - - dfltuid attempt to mount as a particular uid - - dfltgid attempt to mount with a particular gid - - afid security channel - used by Plan 9 authentication protocols - - nodevmap do not map special files - represent them as normal files. - This can be used to share devices/named pipes/sockets between - hosts. This functionality will be expanded in later versions. - - access there are four access modes. - user = if a user tries to access a file on v9fs - filesystem for the first time, v9fs sends an - attach command (Tattach) for that user. - This is the default mode. - = allows only user with uid= to access - the files on the mounted filesystem - any = v9fs does single attach and performs all - operations as one user - client = ACL based access check on the 9p client - side for access validation - - cachetag cache tag to use the specified persistent cache. - cache tags for existing cache sessions can be listed at - /sys/fs/9p/caches. (applies only to cache=fscache) - -RESOURCES -========= - -Protocol specifications are maintained on github: -http://ericvh.github.com/9p-rfc/ - -9p client and server implementations are listed on -http://9p.cat-v.org/implementations - -A 9p2000.L server is being developed by LLNL and can be found -at http://code.google.com/p/diod/ - -There are user and developer mailing lists available through the v9fs project -on sourceforge (http://sourceforge.net/projects/v9fs). - -News and other information is maintained on a Wiki. -(http://sf.net/apps/mediawiki/v9fs/index.php). - -Bug reports are best issued via the mailing list. - -For more information on the Plan 9 Operating System check out -http://plan9.bell-labs.com/plan9 - -For information on Plan 9 from User Space (Plan 9 applications and libraries -ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9 - diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 45d791905e91..a9330c3f8c2e 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -46,6 +46,7 @@ Documentation for filesystem implementations. .. toctree:: :maxdepth: 2 + 9p autofs fuse overlayfs -- cgit From 348739003d4f7e777ef935a44a91e7494f8ab786 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:48 +0100 Subject: docs: filesystems: convert adfs.txt to ReST - Add a SPDX header; - Add a document title; - Adjust section titles; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/15ee92f03ec917e5d26bd7b863565dec88c843f6.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/adfs.rst | 108 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/adfs.txt | 99 --------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 109 insertions(+), 99 deletions(-) create mode 100644 Documentation/filesystems/adfs.rst delete mode 100644 Documentation/filesystems/adfs.txt diff --git a/Documentation/filesystems/adfs.rst b/Documentation/filesystems/adfs.rst new file mode 100644 index 000000000000..5b22cae38e5e --- /dev/null +++ b/Documentation/filesystems/adfs.rst @@ -0,0 +1,108 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================== +Acorn Disc Filing System - ADFS +=============================== + +Filesystems supported by ADFS +----------------------------- + +The ADFS module supports the following Filecore formats which have: + +- new maps +- new directories or big directories + +In terms of the named formats, this means we support: + +- E and E+, with or without boot block +- F and F+ + +We fully support reading files from these filesystems, and writing to +existing files within their existing allocation. Essentially, we do +not support changing any of the filesystem metadata. + +This is intended to support loopback mounted Linux native filesystems +on a RISC OS Filecore filesystem, but will allow the data within files +to be changed. + +If write support (ADFS_FS_RW) is configured, we allow rudimentary +directory updates, specifically updating the access mode and timestamp. + +Mount options for ADFS +---------------------- + + ============ ====================================================== + uid=nnn All files in the partition will be owned by + user id nnn. Default 0 (root). + gid=nnn All files in the partition will be in group + nnn. Default 0 (root). + ownmask=nnn The permission mask for ADFS 'owner' permissions + will be nnn. Default 0700. + othmask=nnn The permission mask for ADFS 'other' permissions + will be nnn. Default 0077. + ftsuffix=n When ftsuffix=0, no file type suffix will be applied. + When ftsuffix=1, a hexadecimal suffix corresponding to + the RISC OS file type will be added. Default 0. + ============ ====================================================== + +Mapping of ADFS permissions to Linux permissions +------------------------------------------------ + + ADFS permissions consist of the following: + + - Owner read + - Owner write + - Other read + - Other write + + (In older versions, an 'execute' permission did exist, but this + does not hold the same meaning as the Linux 'execute' permission + and is now obsolete). + + The mapping is performed as follows:: + + Owner read -> -r--r--r-- + Owner write -> --w--w---w + Owner read and filetype UnixExec -> ---x--x--x + These are then masked by ownmask, eg 700 -> -rwx------ + Possible owner mode permissions -> -rwx------ + + Other read -> -r--r--r-- + Other write -> --w--w--w- + Other read and filetype UnixExec -> ---x--x--x + These are then masked by othmask, eg 077 -> ----rwxrwx + Possible other mode permissions -> ----rwxrwx + + Hence, with the default masks, if a file is owner read/write, and + not a UnixExec filetype, then the permissions will be:: + + -rw------- + + However, if the masks were ownmask=0770,othmask=0007, then this would + be modified to:: + + -rw-rw---- + + There is no restriction on what you can do with these masks. You may + wish that either read bits give read access to the file for all, but + keep the default write protection (ownmask=0755,othmask=0577):: + + -rw-r--r-- + + You can therefore tailor the permission translation to whatever you + desire the permissions should be under Linux. + +RISC OS file type suffix +------------------------ + + RISC OS file types are stored in bits 19..8 of the file load address. + + To enable non-RISC OS systems to be used to store files without losing + file type information, a file naming convention was devised (initially + for use with NFS) such that a hexadecimal suffix of the form ,xyz + denoted the file type: e.g. BasicFile,ffb is a BASIC (0xffb) file. This + naming convention is now also used by RISC OS emulators such as RPCEmu. + + Mounting an ADFS disc with option ftsuffix=1 will cause appropriate file + type suffixes to be appended to file names read from a directory. If the + ftsuffix option is zero or omitted, no file type suffixes will be added. diff --git a/Documentation/filesystems/adfs.txt b/Documentation/filesystems/adfs.txt deleted file mode 100644 index 0baa8e8c1fc1..000000000000 --- a/Documentation/filesystems/adfs.txt +++ /dev/null @@ -1,99 +0,0 @@ -Filesystems supported by ADFS ------------------------------ - -The ADFS module supports the following Filecore formats which have: - -- new maps -- new directories or big directories - -In terms of the named formats, this means we support: - -- E and E+, with or without boot block -- F and F+ - -We fully support reading files from these filesystems, and writing to -existing files within their existing allocation. Essentially, we do -not support changing any of the filesystem metadata. - -This is intended to support loopback mounted Linux native filesystems -on a RISC OS Filecore filesystem, but will allow the data within files -to be changed. - -If write support (ADFS_FS_RW) is configured, we allow rudimentary -directory updates, specifically updating the access mode and timestamp. - -Mount options for ADFS ----------------------- - - uid=nnn All files in the partition will be owned by - user id nnn. Default 0 (root). - gid=nnn All files in the partition will be in group - nnn. Default 0 (root). - ownmask=nnn The permission mask for ADFS 'owner' permissions - will be nnn. Default 0700. - othmask=nnn The permission mask for ADFS 'other' permissions - will be nnn. Default 0077. - ftsuffix=n When ftsuffix=0, no file type suffix will be applied. - When ftsuffix=1, a hexadecimal suffix corresponding to - the RISC OS file type will be added. Default 0. - -Mapping of ADFS permissions to Linux permissions ------------------------------------------------- - - ADFS permissions consist of the following: - - Owner read - Owner write - Other read - Other write - - (In older versions, an 'execute' permission did exist, but this - does not hold the same meaning as the Linux 'execute' permission - and is now obsolete). - - The mapping is performed as follows: - - Owner read -> -r--r--r-- - Owner write -> --w--w---w - Owner read and filetype UnixExec -> ---x--x--x - These are then masked by ownmask, eg 700 -> -rwx------ - Possible owner mode permissions -> -rwx------ - - Other read -> -r--r--r-- - Other write -> --w--w--w- - Other read and filetype UnixExec -> ---x--x--x - These are then masked by othmask, eg 077 -> ----rwxrwx - Possible other mode permissions -> ----rwxrwx - - Hence, with the default masks, if a file is owner read/write, and - not a UnixExec filetype, then the permissions will be: - - -rw------- - - However, if the masks were ownmask=0770,othmask=0007, then this would - be modified to: - -rw-rw---- - - There is no restriction on what you can do with these masks. You may - wish that either read bits give read access to the file for all, but - keep the default write protection (ownmask=0755,othmask=0577): - - -rw-r--r-- - - You can therefore tailor the permission translation to whatever you - desire the permissions should be under Linux. - -RISC OS file type suffix ------------------------- - - RISC OS file types are stored in bits 19..8 of the file load address. - - To enable non-RISC OS systems to be used to store files without losing - file type information, a file naming convention was devised (initially - for use with NFS) such that a hexadecimal suffix of the form ,xyz - denoted the file type: e.g. BasicFile,ffb is a BASIC (0xffb) file. This - naming convention is now also used by RISC OS emulators such as RPCEmu. - - Mounting an ADFS disc with option ftsuffix=1 will cause appropriate file - type suffixes to be appended to file names read from a directory. If the - ftsuffix option is zero or omitted, no file type suffixes will be added. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index a9330c3f8c2e..14dc89c94822 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -47,6 +47,7 @@ Documentation for filesystem implementations. :maxdepth: 2 9p + adfs autofs fuse overlayfs -- cgit From 7627216830d808572fff8225964e9209249ba196 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:49 +0100 Subject: docs: filesystems: convert affs.txt to ReST - Add a SPDX header; - Adjust document title; - Add table markups; - Mark literal blocks as such; - Some whitespace fixes and new line breaks; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: David Sterba Link: https://lore.kernel.org/r/b44c56befe0e28cbc0eb1b3e281ad7d99737ff16.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/affs.rst | 246 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/affs.txt | 222 -------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 247 insertions(+), 222 deletions(-) create mode 100644 Documentation/filesystems/affs.rst delete mode 100644 Documentation/filesystems/affs.txt diff --git a/Documentation/filesystems/affs.rst b/Documentation/filesystems/affs.rst new file mode 100644 index 000000000000..7f1a40dce6d3 --- /dev/null +++ b/Documentation/filesystems/affs.rst @@ -0,0 +1,246 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================= +Overview of Amiga Filesystems +============================= + +Not all varieties of the Amiga filesystems are supported for reading and +writing. The Amiga currently knows six different filesystems: + +============== =============================================================== +DOS\0 The old or original filesystem, not really suited for + hard disks and normally not used on them, either. + Supported read/write. + +DOS\1 The original Fast File System. Supported read/write. + +DOS\2 The old "international" filesystem. International means that + a bug has been fixed so that accented ("international") letters + in file names are case-insensitive, as they ought to be. + Supported read/write. + +DOS\3 The "international" Fast File System. Supported read/write. + +DOS\4 The original filesystem with directory cache. The directory + cache speeds up directory accesses on floppies considerably, + but slows down file creation/deletion. Doesn't make much + sense on hard disks. Supported read only. + +DOS\5 The Fast File System with directory cache. Supported read only. +============== =============================================================== + +All of the above filesystems allow block sizes from 512 to 32K bytes. +Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks +speed up almost everything at the expense of wasted disk space. The speed +gain above 4K seems not really worth the price, so you don't lose too +much here, either. + +The muFS (multi user File System) equivalents of the above file systems +are supported, too. + +Mount options for the AFFS +========================== + +protect + If this option is set, the protection bits cannot be altered. + +setuid[=uid] + This sets the owner of all files and directories in the file + system to uid or the uid of the current user, respectively. + +setgid[=gid] + Same as above, but for gid. + +mode=mode + Sets the mode flags to the given (octal) value, regardless + of the original permissions. Directories will get an x + permission if the corresponding r bit is set. + This is useful since most of the plain AmigaOS files + will map to 600. + +nofilenametruncate + The file system will return an error when filename exceeds + standard maximum filename length (30 characters). + +reserved=num + Sets the number of reserved blocks at the start of the + partition to num. You should never need this option. + Default is 2. + +root=block + Sets the block number of the root block. This should never + be necessary. + +bs=blksize + Sets the blocksize to blksize. Valid block sizes are 512, + 1024, 2048 and 4096. Like the root option, this should + never be necessary, as the affs can figure it out itself. + +quiet + The file system will not return an error for disallowed + mode changes. + +verbose + The volume name, file system type and block size will + be written to the syslog when the filesystem is mounted. + +mufs + The filesystem is really a muFS, also it doesn't + identify itself as one. This option is necessary if + the filesystem wasn't formatted as muFS, but is used + as one. + +prefix=path + Path will be prefixed to every absolute path name of + symbolic links on an AFFS partition. Default = "/". + (See below.) + +volume=name + When symbolic links with an absolute path are created + on an AFFS partition, name will be prepended as the + volume name. Default = "" (empty string). + (See below.) + +Handling of the Users/Groups and protection flags +================================================= + +Amiga -> Linux: + +The Amiga protection flags RWEDRWEDHSPARWED are handled as follows: + + - R maps to r for user, group and others. On directories, R implies x. + + - If both W and D are allowed, w will be set. + + - E maps to x. + + - H and P are always retained and ignored under Linux. + + - A is always reset when a file is written to. + +User id and group id will be used unless set[gu]id are given as mount +options. Since most of the Amiga file systems are single user systems +they will be owned by root. The root directory (the mount point) of the +Amiga filesystem will be owned by the user who actually mounts the +filesystem (the root directory doesn't have uid/gid fields). + +Linux -> Amiga: + +The Linux rwxrwxrwx file mode is handled as follows: + + - r permission will set R for user, group and others. + + - w permission will set W and D for user, group and others. + + - x permission of the user will set E for plain files. + + - All other flags (suid, sgid, ...) are ignored and will + not be retained. + +Newly created files and directories will get the user and group ID +of the current user and a mode according to the umask. + +Symbolic links +============== + +Although the Amiga and Linux file systems resemble each other, there +are some, not always subtle, differences. One of them becomes apparent +with symbolic links. While Linux has a file system with exactly one +root directory, the Amiga has a separate root directory for each +file system (for example, partition, floppy disk, ...). With the Amiga, +these entities are called "volumes". They have symbolic names which +can be used to access them. Thus, symbolic links can point to a +different volume. AFFS turns the volume name into a directory name +and prepends the prefix path (see prefix option) to it. + +Example: +You mount all your Amiga partitions under /amiga/ (where + is the name of the volume), and you give the option +"prefix=/amiga/" when mounting all your AFFS partitions. (They +might be "User", "WB" and "Graphics", the mount points /amiga/User, +/amiga/WB and /amiga/Graphics). A symbolic link referring to +"User:sc/include/dos/dos.h" will be followed to +"/amiga/User/sc/include/dos/dos.h". + +Examples +======== + +Command line:: + + mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose + mount /dev/sda3 /Amiga -t affs + +/etc/fstab entry:: + + /dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0 + +IMPORTANT NOTE +============== + +If you boot Windows 95 (don't know about 3.x, 98 and NT) while you +have an Amiga harddisk connected to your PC, it will overwrite +the bytes 0x00dc..0x00df of block 0 with garbage, thus invalidating +the Rigid Disk Block. Sheer luck has it that this is an unused +area of the RDB, so only the checksum doesn't match anymore. +Linux will ignore this garbage and recognize the RDB anyway, but +before you connect that drive to your Amiga again, you must +restore or repair your RDB. So please do make a backup copy of it +before booting Windows! + +If the damage is already done, the following should fix the RDB +(where is the device name). + +DO AT YOUR OWN RISK:: + + dd if=/dev/ of=rdb.tmp count=1 + cp rdb.tmp rdb.fixed + dd if=/dev/zero of=rdb.fixed bs=1 seek=220 count=4 + dd if=rdb.fixed of=/dev/ + +Bugs, Restrictions, Caveats +=========================== + +Quite a few things may not work as advertised. Not everything is +tested, though several hundred MB have been read and written using +this fs. For a most up-to-date list of bugs please consult +fs/affs/Changes. + +By default, filenames are truncated to 30 characters without warning. +'nofilenametruncate' mount option can change that behavior. + +Case is ignored by the affs in filename matching, but Linux shells +do care about the case. Example (with /wb being an affs mounted fs):: + + rm /wb/WRONGCASE + +will remove /mnt/wrongcase, but:: + + rm /wb/WR* + +will not since the names are matched by the shell. + +The block allocation is designed for hard disk partitions. If more +than 1 process writes to a (small) diskette, the blocks are allocated +in an ugly way (but the real AFFS doesn't do much better). This +is also true when space gets tight. + +You cannot execute programs on an OFS (Old File System), since the +program files cannot be memory mapped due to the 488 byte blocks. +For the same reason you cannot mount an image on such a filesystem +via the loopback device. + +The bitmap valid flag in the root block may not be accurate when the +system crashes while an affs partition is mounted. There's currently +no way to fix a garbled filesystem without an Amiga (disk validator) +or manually (who would do this?). Maybe later. + +If you mount affs partitions on system startup, you may want to tell +fsck that the fs should not be checked (place a '0' in the sixth field +of /etc/fstab). + +It's not possible to read floppy disks with a normal PC or workstation +due to an incompatibility with the Amiga floppy controller. + +If you are interested in an Amiga Emulator for Linux, look at + +http://web.archive.org/web/%2E/http://www.freiburg.linux.de/~uae/ diff --git a/Documentation/filesystems/affs.txt b/Documentation/filesystems/affs.txt deleted file mode 100644 index 71b63c2b9841..000000000000 --- a/Documentation/filesystems/affs.txt +++ /dev/null @@ -1,222 +0,0 @@ -Overview of Amiga Filesystems -============================= - -Not all varieties of the Amiga filesystems are supported for reading and -writing. The Amiga currently knows six different filesystems: - -DOS\0 The old or original filesystem, not really suited for - hard disks and normally not used on them, either. - Supported read/write. - -DOS\1 The original Fast File System. Supported read/write. - -DOS\2 The old "international" filesystem. International means that - a bug has been fixed so that accented ("international") letters - in file names are case-insensitive, as they ought to be. - Supported read/write. - -DOS\3 The "international" Fast File System. Supported read/write. - -DOS\4 The original filesystem with directory cache. The directory - cache speeds up directory accesses on floppies considerably, - but slows down file creation/deletion. Doesn't make much - sense on hard disks. Supported read only. - -DOS\5 The Fast File System with directory cache. Supported read only. - -All of the above filesystems allow block sizes from 512 to 32K bytes. -Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks -speed up almost everything at the expense of wasted disk space. The speed -gain above 4K seems not really worth the price, so you don't lose too -much here, either. - -The muFS (multi user File System) equivalents of the above file systems -are supported, too. - -Mount options for the AFFS -========================== - -protect If this option is set, the protection bits cannot be altered. - -setuid[=uid] This sets the owner of all files and directories in the file - system to uid or the uid of the current user, respectively. - -setgid[=gid] Same as above, but for gid. - -mode=mode Sets the mode flags to the given (octal) value, regardless - of the original permissions. Directories will get an x - permission if the corresponding r bit is set. - This is useful since most of the plain AmigaOS files - will map to 600. - -nofilenametruncate - The file system will return an error when filename exceeds - standard maximum filename length (30 characters). - -reserved=num Sets the number of reserved blocks at the start of the - partition to num. You should never need this option. - Default is 2. - -root=block Sets the block number of the root block. This should never - be necessary. - -bs=blksize Sets the blocksize to blksize. Valid block sizes are 512, - 1024, 2048 and 4096. Like the root option, this should - never be necessary, as the affs can figure it out itself. - -quiet The file system will not return an error for disallowed - mode changes. - -verbose The volume name, file system type and block size will - be written to the syslog when the filesystem is mounted. - -mufs The filesystem is really a muFS, also it doesn't - identify itself as one. This option is necessary if - the filesystem wasn't formatted as muFS, but is used - as one. - -prefix=path Path will be prefixed to every absolute path name of - symbolic links on an AFFS partition. Default = "/". - (See below.) - -volume=name When symbolic links with an absolute path are created - on an AFFS partition, name will be prepended as the - volume name. Default = "" (empty string). - (See below.) - -Handling of the Users/Groups and protection flags -================================================= - -Amiga -> Linux: - -The Amiga protection flags RWEDRWEDHSPARWED are handled as follows: - - - R maps to r for user, group and others. On directories, R implies x. - - - If both W and D are allowed, w will be set. - - - E maps to x. - - - H and P are always retained and ignored under Linux. - - - A is always reset when a file is written to. - -User id and group id will be used unless set[gu]id are given as mount -options. Since most of the Amiga file systems are single user systems -they will be owned by root. The root directory (the mount point) of the -Amiga filesystem will be owned by the user who actually mounts the -filesystem (the root directory doesn't have uid/gid fields). - -Linux -> Amiga: - -The Linux rwxrwxrwx file mode is handled as follows: - - - r permission will set R for user, group and others. - - - w permission will set W and D for user, group and others. - - - x permission of the user will set E for plain files. - - - All other flags (suid, sgid, ...) are ignored and will - not be retained. - -Newly created files and directories will get the user and group ID -of the current user and a mode according to the umask. - -Symbolic links -============== - -Although the Amiga and Linux file systems resemble each other, there -are some, not always subtle, differences. One of them becomes apparent -with symbolic links. While Linux has a file system with exactly one -root directory, the Amiga has a separate root directory for each -file system (for example, partition, floppy disk, ...). With the Amiga, -these entities are called "volumes". They have symbolic names which -can be used to access them. Thus, symbolic links can point to a -different volume. AFFS turns the volume name into a directory name -and prepends the prefix path (see prefix option) to it. - -Example: -You mount all your Amiga partitions under /amiga/ (where - is the name of the volume), and you give the option -"prefix=/amiga/" when mounting all your AFFS partitions. (They -might be "User", "WB" and "Graphics", the mount points /amiga/User, -/amiga/WB and /amiga/Graphics). A symbolic link referring to -"User:sc/include/dos/dos.h" will be followed to -"/amiga/User/sc/include/dos/dos.h". - -Examples -======== - -Command line: - mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose - mount /dev/sda3 /Amiga -t affs - -/etc/fstab entry: - /dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0 - -IMPORTANT NOTE -============== - -If you boot Windows 95 (don't know about 3.x, 98 and NT) while you -have an Amiga harddisk connected to your PC, it will overwrite -the bytes 0x00dc..0x00df of block 0 with garbage, thus invalidating -the Rigid Disk Block. Sheer luck has it that this is an unused -area of the RDB, so only the checksum doesn't match anymore. -Linux will ignore this garbage and recognize the RDB anyway, but -before you connect that drive to your Amiga again, you must -restore or repair your RDB. So please do make a backup copy of it -before booting Windows! - -If the damage is already done, the following should fix the RDB -(where is the device name). -DO AT YOUR OWN RISK: - - dd if=/dev/ of=rdb.tmp count=1 - cp rdb.tmp rdb.fixed - dd if=/dev/zero of=rdb.fixed bs=1 seek=220 count=4 - dd if=rdb.fixed of=/dev/ - -Bugs, Restrictions, Caveats -=========================== - -Quite a few things may not work as advertised. Not everything is -tested, though several hundred MB have been read and written using -this fs. For a most up-to-date list of bugs please consult -fs/affs/Changes. - -By default, filenames are truncated to 30 characters without warning. -'nofilenametruncate' mount option can change that behavior. - -Case is ignored by the affs in filename matching, but Linux shells -do care about the case. Example (with /wb being an affs mounted fs): - rm /wb/WRONGCASE -will remove /mnt/wrongcase, but - rm /wb/WR* -will not since the names are matched by the shell. - -The block allocation is designed for hard disk partitions. If more -than 1 process writes to a (small) diskette, the blocks are allocated -in an ugly way (but the real AFFS doesn't do much better). This -is also true when space gets tight. - -You cannot execute programs on an OFS (Old File System), since the -program files cannot be memory mapped due to the 488 byte blocks. -For the same reason you cannot mount an image on such a filesystem -via the loopback device. - -The bitmap valid flag in the root block may not be accurate when the -system crashes while an affs partition is mounted. There's currently -no way to fix a garbled filesystem without an Amiga (disk validator) -or manually (who would do this?). Maybe later. - -If you mount affs partitions on system startup, you may want to tell -fsck that the fs should not be checked (place a '0' in the sixth field -of /etc/fstab). - -It's not possible to read floppy disks with a normal PC or workstation -due to an incompatibility with the Amiga floppy controller. - -If you are interested in an Amiga Emulator for Linux, look at - -http://web.archive.org/web/*/http://www.freiburg.linux.de/~uae/ diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 14dc89c94822..273d802ad5fb 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -48,6 +48,7 @@ Documentation for filesystem implementations. 9p adfs + affs autofs fuse overlayfs -- cgit From ca6e9049a0934fe72ffea6990c889205aff0a2cf Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:50 +0100 Subject: docs: filesystems: convert afs.txt to ReST - Add a SPDX header; - Adjust document and section titles; - Comment out text-only ToC; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/d77f5afdb5da0f8b0ec3dbe720aef23f1ce73bb5.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/afs.rst | 251 +++++++++++++++++++++++++++++++++++ Documentation/filesystems/afs.txt | 258 ------------------------------------ Documentation/filesystems/index.rst | 1 + 3 files changed, 252 insertions(+), 258 deletions(-) create mode 100644 Documentation/filesystems/afs.rst delete mode 100644 Documentation/filesystems/afs.txt diff --git a/Documentation/filesystems/afs.rst b/Documentation/filesystems/afs.rst new file mode 100644 index 000000000000..c4ec39a5966e --- /dev/null +++ b/Documentation/filesystems/afs.rst @@ -0,0 +1,251 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +kAFS: AFS FILESYSTEM +==================== + +.. Contents: + + - Overview. + - Usage. + - Mountpoints. + - Dynamic root. + - Proc filesystem. + - The cell database. + - Security. + - The @sys substitution. + + +Overview +======== + +This filesystem provides a fairly simple secure AFS filesystem driver. It is +under development and does not yet provide the full feature set. The features +it does support include: + + (*) Security (currently only AFS kaserver and KerberosIV tickets). + + (*) File reading and writing. + + (*) Automounting. + + (*) Local caching (via fscache). + +It does not yet support the following AFS features: + + (*) pioctl() system call. + + +Compilation +=========== + +The filesystem should be enabled by turning on the kernel configuration +options:: + + CONFIG_AF_RXRPC - The RxRPC protocol transport + CONFIG_RXKAD - The RxRPC Kerberos security handler + CONFIG_AFS - The AFS filesystem + +Additionally, the following can be turned on to aid debugging:: + + CONFIG_AF_RXRPC_DEBUG - Permit AF_RXRPC debugging to be enabled + CONFIG_AFS_DEBUG - Permit AFS debugging to be enabled + +They permit the debugging messages to be turned on dynamically by manipulating +the masks in the following files:: + + /sys/module/af_rxrpc/parameters/debug + /sys/module/kafs/parameters/debug + + +Usage +===== + +When inserting the driver modules the root cell must be specified along with a +list of volume location server IP addresses:: + + modprobe rxrpc + modprobe kafs rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 + +The first module is the AF_RXRPC network protocol driver. This provides the +RxRPC remote operation protocol and may also be accessed from userspace. See: + + Documentation/networking/rxrpc.txt + +The second module is the kerberos RxRPC security driver, and the third module +is the actual filesystem driver for the AFS filesystem. + +Once the module has been loaded, more modules can be added by the following +procedure:: + + echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells + +Where the parameters to the "add" command are the name of a cell and a list of +volume location servers within that cell, with the latter separated by colons. + +Filesystems can be mounted anywhere by commands similar to the following:: + + mount -t afs "%cambridge.redhat.com:root.afs." /afs + mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge + mount -t afs "#root.afs." /afs + mount -t afs "#root.cell." /afs/cambridge + +Where the initial character is either a hash or a percent symbol depending on +whether you definitely want a R/W volume (percent) or whether you'd prefer a +R/O volume, but are willing to use a R/W volume instead (hash). + +The name of the volume can be suffixes with ".backup" or ".readonly" to +specify connection to only volumes of those types. + +The name of the cell is optional, and if not given during a mount, then the +named volume will be looked up in the cell specified during modprobe. + +Additional cells can be added through /proc (see later section). + + +Mountpoints +=========== + +AFS has a concept of mountpoints. In AFS terms, these are specially formatted +symbolic links (of the same form as the "device name" passed to mount). kAFS +presents these to the user as directories that have a follow-link capability +(ie: symbolic link semantics). If anyone attempts to access them, they will +automatically cause the target volume to be mounted (if possible) on that site. + +Automatically mounted filesystems will be automatically unmounted approximately +twenty minutes after they were last used. Alternatively they can be unmounted +directly with the umount() system call. + +Manually unmounting an AFS volume will cause any idle submounts upon it to be +culled first. If all are culled, then the requested volume will also be +unmounted, otherwise error EBUSY will be returned. + +This can be used by the administrator to attempt to unmount the whole AFS tree +mounted on /afs in one go by doing:: + + umount /afs + + +Dynamic Root +============ + +A mount option is available to create a serverless mount that is only usable +for dynamic lookup. Creating such a mount can be done by, for example:: + + mount -t afs none /afs -o dyn + +This creates a mount that just has an empty directory at the root. Attempting +to look up a name in this directory will cause a mountpoint to be created that +looks up a cell of the same name, for example:: + + ls /afs/grand.central.org/ + + +Proc Filesystem +=============== + +The AFS modules creates a "/proc/fs/afs/" directory and populates it: + + (*) A "cells" file that lists cells currently known to the afs module and + their usage counts:: + + [root@andromeda ~]# cat /proc/fs/afs/cells + USE NAME + 3 cambridge.redhat.com + + (*) A directory per cell that contains files that list volume location + servers, volumes, and active servers known within that cell:: + + [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/servers + USE ADDR STATE + 4 172.16.18.91 0 + [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/vlservers + ADDRESS + 172.16.18.91 + [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/volumes + USE STT VLID[0] VLID[1] VLID[2] NAME + 1 Val 20000000 20000001 20000002 root.afs + + +The Cell Database +================= + +The filesystem maintains an internal database of all the cells it knows and the +IP addresses of the volume location servers for those cells. The cell to which +the system belongs is added to the database when modprobe is performed by the +"rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on +the kernel command line. + +Further cells can be added by commands similar to the following:: + + echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells + echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells + +No other cell database operations are available at this time. + + +Security +======== + +Secure operations are initiated by acquiring a key using the klog program. A +very primitive klog program is available at: + + http://people.redhat.com/~dhowells/rxrpc/klog.c + +This should be compiled by:: + + make klog LDLIBS="-lcrypto -lcrypt -lkrb4 -lkeyutils" + +And then run as:: + + ./klog + +Assuming it's successful, this adds a key of type RxRPC, named for the service +and cell, eg: "afs@". This can be viewed with the keyctl program or +by cat'ing /proc/keys:: + + [root@andromeda ~]# keyctl show + Session Keyring + -3 --alswrv 0 0 keyring: _ses.3268 + 2 --alswrv 0 0 \_ keyring: _uid.0 + 111416553 --als--v 0 0 \_ rxrpc: afs@CAMBRIDGE.REDHAT.COM + +Currently the username, realm, password and proposed ticket lifetime are +compiled in to the program. + +It is not required to acquire a key before using AFS facilities, but if one is +not acquired then all operations will be governed by the anonymous user parts +of the ACLs. + +If a key is acquired, then all AFS operations, including mounts and automounts, +made by a possessor of that key will be secured with that key. + +If a file is opened with a particular key and then the file descriptor is +passed to a process that doesn't have that key (perhaps over an AF_UNIX +socket), then the operations on the file will be made with key that was used to +open the file. + + +The @sys Substitution +===================== + +The list of up to 16 @sys substitutions for the current network namespace can +be configured by writing a list to /proc/fs/afs/sysname:: + + [root@andromeda ~]# echo foo amd64_linux_26 >/proc/fs/afs/sysname + +or cleared entirely by writing an empty list:: + + [root@andromeda ~]# echo >/proc/fs/afs/sysname + +The current list for current network namespace can be retrieved by:: + + [root@andromeda ~]# cat /proc/fs/afs/sysname + foo + amd64_linux_26 + +When @sys is being substituted for, each element of the list is tried in the +order given. + +By default, the list will contain one item that conforms to the pattern +"_linux_26", amd64 being the name for x86_64. diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.txt deleted file mode 100644 index 8c6ea7b41048..000000000000 --- a/Documentation/filesystems/afs.txt +++ /dev/null @@ -1,258 +0,0 @@ - ==================== - kAFS: AFS FILESYSTEM - ==================== - -Contents: - - - Overview. - - Usage. - - Mountpoints. - - Dynamic root. - - Proc filesystem. - - The cell database. - - Security. - - The @sys substitution. - - -======== -OVERVIEW -======== - -This filesystem provides a fairly simple secure AFS filesystem driver. It is -under development and does not yet provide the full feature set. The features -it does support include: - - (*) Security (currently only AFS kaserver and KerberosIV tickets). - - (*) File reading and writing. - - (*) Automounting. - - (*) Local caching (via fscache). - -It does not yet support the following AFS features: - - (*) pioctl() system call. - - -=========== -COMPILATION -=========== - -The filesystem should be enabled by turning on the kernel configuration -options: - - CONFIG_AF_RXRPC - The RxRPC protocol transport - CONFIG_RXKAD - The RxRPC Kerberos security handler - CONFIG_AFS - The AFS filesystem - -Additionally, the following can be turned on to aid debugging: - - CONFIG_AF_RXRPC_DEBUG - Permit AF_RXRPC debugging to be enabled - CONFIG_AFS_DEBUG - Permit AFS debugging to be enabled - -They permit the debugging messages to be turned on dynamically by manipulating -the masks in the following files: - - /sys/module/af_rxrpc/parameters/debug - /sys/module/kafs/parameters/debug - - -===== -USAGE -===== - -When inserting the driver modules the root cell must be specified along with a -list of volume location server IP addresses: - - modprobe rxrpc - modprobe kafs rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 - -The first module is the AF_RXRPC network protocol driver. This provides the -RxRPC remote operation protocol and may also be accessed from userspace. See: - - Documentation/networking/rxrpc.txt - -The second module is the kerberos RxRPC security driver, and the third module -is the actual filesystem driver for the AFS filesystem. - -Once the module has been loaded, more modules can be added by the following -procedure: - - echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells - -Where the parameters to the "add" command are the name of a cell and a list of -volume location servers within that cell, with the latter separated by colons. - -Filesystems can be mounted anywhere by commands similar to the following: - - mount -t afs "%cambridge.redhat.com:root.afs." /afs - mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge - mount -t afs "#root.afs." /afs - mount -t afs "#root.cell." /afs/cambridge - -Where the initial character is either a hash or a percent symbol depending on -whether you definitely want a R/W volume (percent) or whether you'd prefer a -R/O volume, but are willing to use a R/W volume instead (hash). - -The name of the volume can be suffixes with ".backup" or ".readonly" to -specify connection to only volumes of those types. - -The name of the cell is optional, and if not given during a mount, then the -named volume will be looked up in the cell specified during modprobe. - -Additional cells can be added through /proc (see later section). - - -=========== -MOUNTPOINTS -=========== - -AFS has a concept of mountpoints. In AFS terms, these are specially formatted -symbolic links (of the same form as the "device name" passed to mount). kAFS -presents these to the user as directories that have a follow-link capability -(ie: symbolic link semantics). If anyone attempts to access them, they will -automatically cause the target volume to be mounted (if possible) on that site. - -Automatically mounted filesystems will be automatically unmounted approximately -twenty minutes after they were last used. Alternatively they can be unmounted -directly with the umount() system call. - -Manually unmounting an AFS volume will cause any idle submounts upon it to be -culled first. If all are culled, then the requested volume will also be -unmounted, otherwise error EBUSY will be returned. - -This can be used by the administrator to attempt to unmount the whole AFS tree -mounted on /afs in one go by doing: - - umount /afs - - -============ -DYNAMIC ROOT -============ - -A mount option is available to create a serverless mount that is only usable -for dynamic lookup. Creating such a mount can be done by, for example: - - mount -t afs none /afs -o dyn - -This creates a mount that just has an empty directory at the root. Attempting -to look up a name in this directory will cause a mountpoint to be created that -looks up a cell of the same name, for example: - - ls /afs/grand.central.org/ - - -=============== -PROC FILESYSTEM -=============== - -The AFS modules creates a "/proc/fs/afs/" directory and populates it: - - (*) A "cells" file that lists cells currently known to the afs module and - their usage counts: - - [root@andromeda ~]# cat /proc/fs/afs/cells - USE NAME - 3 cambridge.redhat.com - - (*) A directory per cell that contains files that list volume location - servers, volumes, and active servers known within that cell. - - [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/servers - USE ADDR STATE - 4 172.16.18.91 0 - [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/vlservers - ADDRESS - 172.16.18.91 - [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/volumes - USE STT VLID[0] VLID[1] VLID[2] NAME - 1 Val 20000000 20000001 20000002 root.afs - - -================= -THE CELL DATABASE -================= - -The filesystem maintains an internal database of all the cells it knows and the -IP addresses of the volume location servers for those cells. The cell to which -the system belongs is added to the database when modprobe is performed by the -"rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on -the kernel command line. - -Further cells can be added by commands similar to the following: - - echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells - echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells - -No other cell database operations are available at this time. - - -======== -SECURITY -======== - -Secure operations are initiated by acquiring a key using the klog program. A -very primitive klog program is available at: - - http://people.redhat.com/~dhowells/rxrpc/klog.c - -This should be compiled by: - - make klog LDLIBS="-lcrypto -lcrypt -lkrb4 -lkeyutils" - -And then run as: - - ./klog - -Assuming it's successful, this adds a key of type RxRPC, named for the service -and cell, eg: "afs@". This can be viewed with the keyctl program or -by cat'ing /proc/keys: - - [root@andromeda ~]# keyctl show - Session Keyring - -3 --alswrv 0 0 keyring: _ses.3268 - 2 --alswrv 0 0 \_ keyring: _uid.0 - 111416553 --als--v 0 0 \_ rxrpc: afs@CAMBRIDGE.REDHAT.COM - -Currently the username, realm, password and proposed ticket lifetime are -compiled in to the program. - -It is not required to acquire a key before using AFS facilities, but if one is -not acquired then all operations will be governed by the anonymous user parts -of the ACLs. - -If a key is acquired, then all AFS operations, including mounts and automounts, -made by a possessor of that key will be secured with that key. - -If a file is opened with a particular key and then the file descriptor is -passed to a process that doesn't have that key (perhaps over an AF_UNIX -socket), then the operations on the file will be made with key that was used to -open the file. - - -===================== -THE @SYS SUBSTITUTION -===================== - -The list of up to 16 @sys substitutions for the current network namespace can -be configured by writing a list to /proc/fs/afs/sysname: - - [root@andromeda ~]# echo foo amd64_linux_26 >/proc/fs/afs/sysname - -or cleared entirely by writing an empty list: - - [root@andromeda ~]# echo >/proc/fs/afs/sysname - -The current list for current network namespace can be retrieved by: - - [root@andromeda ~]# cat /proc/fs/afs/sysname - foo - amd64_linux_26 - -When @sys is being substituted for, each element of the list is tried in the -order given. - -By default, the list will contain one item that conforms to the pattern -"_linux_26", amd64 being the name for x86_64. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 273d802ad5fb..0598bc52abdc 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -49,6 +49,7 @@ Documentation for filesystem implementations. 9p adfs affs + afs autofs fuse overlayfs -- cgit From c64d3dc69f38a08d082813f1c0425d7a108ef950 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:51 +0100 Subject: docs: filesystems: convert autofs-mount-control.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/8cae057ae244d0f5b58d3c209bcdae5ed82bc52c.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/autofs-mount-control.rst | 410 +++++++++++++++++++++ Documentation/filesystems/autofs-mount-control.txt | 408 -------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 411 insertions(+), 408 deletions(-) create mode 100644 Documentation/filesystems/autofs-mount-control.rst delete mode 100644 Documentation/filesystems/autofs-mount-control.txt diff --git a/Documentation/filesystems/autofs-mount-control.rst b/Documentation/filesystems/autofs-mount-control.rst new file mode 100644 index 000000000000..2903aed92316 --- /dev/null +++ b/Documentation/filesystems/autofs-mount-control.rst @@ -0,0 +1,410 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================================================================== +Miscellaneous Device control operations for the autofs kernel module +==================================================================== + +The problem +=========== + +There is a problem with active restarts in autofs (that is to say +restarting autofs when there are busy mounts). + +During normal operation autofs uses a file descriptor opened on the +directory that is being managed in order to be able to issue control +operations. Using a file descriptor gives ioctl operations access to +autofs specific information stored in the super block. The operations +are things such as setting an autofs mount catatonic, setting the +expire timeout and requesting expire checks. As is explained below, +certain types of autofs triggered mounts can end up covering an autofs +mount itself which prevents us being able to use open(2) to obtain a +file descriptor for these operations if we don't already have one open. + +Currently autofs uses "umount -l" (lazy umount) to clear active mounts +at restart. While using lazy umount works for most cases, anything that +needs to walk back up the mount tree to construct a path, such as +getcwd(2) and the proc file system /proc//cwd, no longer works +because the point from which the path is constructed has been detached +from the mount tree. + +The actual problem with autofs is that it can't reconnect to existing +mounts. Immediately one thinks of just adding the ability to remount +autofs file systems would solve it, but alas, that can't work. This is +because autofs direct mounts and the implementation of "on demand mount +and expire" of nested mount trees have the file system mounted directly +on top of the mount trigger directory dentry. + +For example, there are two types of automount maps, direct (in the kernel +module source you will see a third type called an offset, which is just +a direct mount in disguise) and indirect. + +Here is a master map with direct and indirect map entries:: + + /- /etc/auto.direct + /test /etc/auto.indirect + +and the corresponding map files:: + + /etc/auto.direct: + + /automount/dparse/g6 budgie:/autofs/export1 + /automount/dparse/g1 shark:/autofs/export1 + and so on. + +/etc/auto.indirect:: + + g1 shark:/autofs/export1 + g6 budgie:/autofs/export1 + and so on. + +For the above indirect map an autofs file system is mounted on /test and +mounts are triggered for each sub-directory key by the inode lookup +operation. So we see a mount of shark:/autofs/export1 on /test/g1, for +example. + +The way that direct mounts are handled is by making an autofs mount on +each full path, such as /automount/dparse/g1, and using it as a mount +trigger. So when we walk on the path we mount shark:/autofs/export1 "on +top of this mount point". Since these are always directories we can +use the follow_link inode operation to trigger the mount. + +But, each entry in direct and indirect maps can have offsets (making +them multi-mount map entries). + +For example, an indirect mount map entry could also be:: + + g1 \ + / shark:/autofs/export5/testing/test \ + /s1 shark:/autofs/export/testing/test/s1 \ + /s2 shark:/autofs/export5/testing/test/s2 \ + /s1/ss1 shark:/autofs/export1 \ + /s2/ss2 shark:/autofs/export2 + +and a similarly a direct mount map entry could also be:: + + /automount/dparse/g1 \ + / shark:/autofs/export5/testing/test \ + /s1 shark:/autofs/export/testing/test/s1 \ + /s2 shark:/autofs/export5/testing/test/s2 \ + /s1/ss1 shark:/autofs/export2 \ + /s2/ss2 shark:/autofs/export2 + +One of the issues with version 4 of autofs was that, when mounting an +entry with a large number of offsets, possibly with nesting, we needed +to mount and umount all of the offsets as a single unit. Not really a +problem, except for people with a large number of offsets in map entries. +This mechanism is used for the well known "hosts" map and we have seen +cases (in 2.4) where the available number of mounts are exhausted or +where the number of privileged ports available is exhausted. + +In version 5 we mount only as we go down the tree of offsets and +similarly for expiring them which resolves the above problem. There is +somewhat more detail to the implementation but it isn't needed for the +sake of the problem explanation. The one important detail is that these +offsets are implemented using the same mechanism as the direct mounts +above and so the mount points can be covered by a mount. + +The current autofs implementation uses an ioctl file descriptor opened +on the mount point for control operations. The references held by the +descriptor are accounted for in checks made to determine if a mount is +in use and is also used to access autofs file system information held +in the mount super block. So the use of a file handle needs to be +retained. + + +The Solution +============ + +To be able to restart autofs leaving existing direct, indirect and +offset mounts in place we need to be able to obtain a file handle +for these potentially covered autofs mount points. Rather than just +implement an isolated operation it was decided to re-implement the +existing ioctl interface and add new operations to provide this +functionality. + +In addition, to be able to reconstruct a mount tree that has busy mounts, +the uid and gid of the last user that triggered the mount needs to be +available because these can be used as macro substitution variables in +autofs maps. They are recorded at mount request time and an operation +has been added to retrieve them. + +Since we're re-implementing the control interface, a couple of other +problems with the existing interface have been addressed. First, when +a mount or expire operation completes a status is returned to the +kernel by either a "send ready" or a "send fail" operation. The +"send fail" operation of the ioctl interface could only ever send +ENOENT so the re-implementation allows user space to send an actual +status. Another expensive operation in user space, for those using +very large maps, is discovering if a mount is present. Usually this +involves scanning /proc/mounts and since it needs to be done quite +often it can introduce significant overhead when there are many entries +in the mount table. An operation to lookup the mount status of a mount +point dentry (covered or not) has also been added. + +Current kernel development policy recommends avoiding the use of the +ioctl mechanism in favor of systems such as Netlink. An implementation +using this system was attempted to evaluate its suitability and it was +found to be inadequate, in this case. The Generic Netlink system was +used for this as raw Netlink would lead to a significant increase in +complexity. There's no question that the Generic Netlink system is an +elegant solution for common case ioctl functions but it's not a complete +replacement probably because its primary purpose in life is to be a +message bus implementation rather than specifically an ioctl replacement. +While it would be possible to work around this there is one concern +that lead to the decision to not use it. This is that the autofs +expire in the daemon has become far to complex because umount +candidates are enumerated, almost for no other reason than to "count" +the number of times to call the expire ioctl. This involves scanning +the mount table which has proved to be a big overhead for users with +large maps. The best way to improve this is try and get back to the +way the expire was done long ago. That is, when an expire request is +issued for a mount (file handle) we should continually call back to +the daemon until we can't umount any more mounts, then return the +appropriate status to the daemon. At the moment we just expire one +mount at a time. A Generic Netlink implementation would exclude this +possibility for future development due to the requirements of the +message bus architecture. + + +autofs Miscellaneous Device mount control interface +==================================================== + +The control interface is opening a device node, typically /dev/autofs. + +All the ioctls use a common structure to pass the needed parameter +information and return operation results:: + + struct autofs_dev_ioctl { + __u32 ver_major; + __u32 ver_minor; + __u32 size; /* total size of data passed in + * including this struct */ + __s32 ioctlfd; /* automount command fd */ + + /* Command parameters */ + union { + struct args_protover protover; + struct args_protosubver protosubver; + struct args_openmount openmount; + struct args_ready ready; + struct args_fail fail; + struct args_setpipefd setpipefd; + struct args_timeout timeout; + struct args_requester requester; + struct args_expire expire; + struct args_askumount askumount; + struct args_ismountpoint ismountpoint; + }; + + char path[0]; + }; + +The ioctlfd field is a mount point file descriptor of an autofs mount +point. It is returned by the open call and is used by all calls except +the check for whether a given path is a mount point, where it may +optionally be used to check a specific mount corresponding to a given +mount point file descriptor, and when requesting the uid and gid of the +last successful mount on a directory within the autofs file system. + +The union is used to communicate parameters and results of calls made +as described below. + +The path field is used to pass a path where it is needed and the size field +is used account for the increased structure length when translating the +structure sent from user space. + +This structure can be initialized before setting specific fields by using +the void function call init_autofs_dev_ioctl(``struct autofs_dev_ioctl *``). + +All of the ioctls perform a copy of this structure from user space to +kernel space and return -EINVAL if the size parameter is smaller than +the structure size itself, -ENOMEM if the kernel memory allocation fails +or -EFAULT if the copy itself fails. Other checks include a version check +of the compiled in user space version against the module version and a +mismatch results in a -EINVAL return. If the size field is greater than +the structure size then a path is assumed to be present and is checked to +ensure it begins with a "/" and is NULL terminated, otherwise -EINVAL is +returned. Following these checks, for all ioctl commands except +AUTOFS_DEV_IOCTL_VERSION_CMD, AUTOFS_DEV_IOCTL_OPENMOUNT_CMD and +AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD the ioctlfd is validated and if it is +not a valid descriptor or doesn't correspond to an autofs mount point +an error of -EBADF, -ENOTTY or -EINVAL (not an autofs descriptor) is +returned. + + +The ioctls +========== + +An example of an implementation which uses this interface can be seen +in autofs version 5.0.4 and later in file lib/dev-ioctl-lib.c of the +distribution tar available for download from kernel.org in directory +/pub/linux/daemons/autofs/v5. + +The device node ioctl operations implemented by this interface are: + + +AUTOFS_DEV_IOCTL_VERSION +------------------------ + +Get the major and minor version of the autofs device ioctl kernel module +implementation. It requires an initialized struct autofs_dev_ioctl as an +input parameter and sets the version information in the passed in structure. +It returns 0 on success or the error -EINVAL if a version mismatch is +detected. + + +AUTOFS_DEV_IOCTL_PROTOVER_CMD and AUTOFS_DEV_IOCTL_PROTOSUBVER_CMD +------------------------------------------------------------------ + +Get the major and minor version of the autofs protocol version understood +by loaded module. This call requires an initialized struct autofs_dev_ioctl +with the ioctlfd field set to a valid autofs mount point descriptor +and sets the requested version number in version field of struct args_protover +or sub_version field of struct args_protosubver. These commands return +0 on success or one of the negative error codes if validation fails. + + +AUTOFS_DEV_IOCTL_OPENMOUNT and AUTOFS_DEV_IOCTL_CLOSEMOUNT +---------------------------------------------------------- + +Obtain and release a file descriptor for an autofs managed mount point +path. The open call requires an initialized struct autofs_dev_ioctl with +the path field set and the size field adjusted appropriately as well +as the devid field of struct args_openmount set to the device number of +the autofs mount. The device number can be obtained from the mount options +shown in /proc/mounts. The close call requires an initialized struct +autofs_dev_ioct with the ioctlfd field set to the descriptor obtained +from the open call. The release of the file descriptor can also be done +with close(2) so any open descriptors will also be closed at process exit. +The close call is included in the implemented operations largely for +completeness and to provide for a consistent user space implementation. + + +AUTOFS_DEV_IOCTL_READY_CMD and AUTOFS_DEV_IOCTL_FAIL_CMD +-------------------------------------------------------- + +Return mount and expire result status from user space to the kernel. +Both of these calls require an initialized struct autofs_dev_ioctl +with the ioctlfd field set to the descriptor obtained from the open +call and the token field of struct args_ready or struct args_fail set +to the wait queue token number, received by user space in the foregoing +mount or expire request. The status field of struct args_fail is set to +the errno of the operation. It is set to 0 on success. + + +AUTOFS_DEV_IOCTL_SETPIPEFD_CMD +------------------------------ + +Set the pipe file descriptor used for kernel communication to the daemon. +Normally this is set at mount time using an option but when reconnecting +to a existing mount we need to use this to tell the autofs mount about +the new kernel pipe descriptor. In order to protect mounts against +incorrectly setting the pipe descriptor we also require that the autofs +mount be catatonic (see next call). + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call and +the pipefd field of struct args_setpipefd set to descriptor of the pipe. +On success the call also sets the process group id used to identify the +controlling process (eg. the owning automount(8) daemon) to the process +group of the caller. + + +AUTOFS_DEV_IOCTL_CATATONIC_CMD +------------------------------ + +Make the autofs mount point catatonic. The autofs mount will no longer +issue mount requests, the kernel communication pipe descriptor is released +and any remaining waits in the queue released. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call. + + +AUTOFS_DEV_IOCTL_TIMEOUT_CMD +---------------------------- + +Set the expire timeout for mounts within an autofs mount point. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call. + + +AUTOFS_DEV_IOCTL_REQUESTER_CMD +------------------------------ + +Return the uid and gid of the last process to successfully trigger a the +mount on the given path dentry. + +The call requires an initialized struct autofs_dev_ioctl with the path +field set to the mount point in question and the size field adjusted +appropriately. Upon return the uid field of struct args_requester contains +the uid and gid field the gid. + +When reconstructing an autofs mount tree with active mounts we need to +re-connect to mounts that may have used the original process uid and +gid (or string variations of them) for mount lookups within the map entry. +This call provides the ability to obtain this uid and gid so they may be +used by user space for the mount map lookups. + + +AUTOFS_DEV_IOCTL_EXPIRE_CMD +--------------------------- + +Issue an expire request to the kernel for an autofs mount. Typically +this ioctl is called until no further expire candidates are found. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call. In +addition an immediate expire that's independent of the mount timeout, +and a forced expire that's independent of whether the mount is busy, +can be requested by setting the how field of struct args_expire to +AUTOFS_EXP_IMMEDIATE or AUTOFS_EXP_FORCED, respectively . If no +expire candidates can be found the ioctl returns -1 with errno set to +EAGAIN. + +This call causes the kernel module to check the mount corresponding +to the given ioctlfd for mounts that can be expired, issues an expire +request back to the daemon and waits for completion. + +AUTOFS_DEV_IOCTL_ASKUMOUNT_CMD +------------------------------ + +Checks if an autofs mount point is in use. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call and +it returns the result in the may_umount field of struct args_askumount, +1 for busy and 0 otherwise. + + +AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD +--------------------------------- + +Check if the given path is a mountpoint. + +The call requires an initialized struct autofs_dev_ioctl. There are two +possible variations. Both use the path field set to the path of the mount +point to check and the size field adjusted appropriately. One uses the +ioctlfd field to identify a specific mount point to check while the other +variation uses the path and optionally in.type field of struct args_ismountpoint +set to an autofs mount type. The call returns 1 if this is a mount point +and sets out.devid field to the device number of the mount and out.magic +field to the relevant super block magic number (described below) or 0 if +it isn't a mountpoint. In both cases the the device number (as returned +by new_encode_dev()) is returned in out.devid field. + +If supplied with a file descriptor we're looking for a specific mount, +not necessarily at the top of the mounted stack. In this case the path +the descriptor corresponds to is considered a mountpoint if it is itself +a mountpoint or contains a mount, such as a multi-mount without a root +mount. In this case we return 1 if the descriptor corresponds to a mount +point and and also returns the super magic of the covering mount if there +is one or 0 if it isn't a mountpoint. + +If a path is supplied (and the ioctlfd field is set to -1) then the path +is looked up and is checked to see if it is the root of a mount. If a +type is also given we are looking for a particular autofs mount and if +a match isn't found a fail is returned. If the the located path is the +root of a mount 1 is returned along with the super magic of the mount +or 0 otherwise. diff --git a/Documentation/filesystems/autofs-mount-control.txt b/Documentation/filesystems/autofs-mount-control.txt deleted file mode 100644 index acc02fc57993..000000000000 --- a/Documentation/filesystems/autofs-mount-control.txt +++ /dev/null @@ -1,408 +0,0 @@ - -Miscellaneous Device control operations for the autofs kernel module -==================================================================== - -The problem -=========== - -There is a problem with active restarts in autofs (that is to say -restarting autofs when there are busy mounts). - -During normal operation autofs uses a file descriptor opened on the -directory that is being managed in order to be able to issue control -operations. Using a file descriptor gives ioctl operations access to -autofs specific information stored in the super block. The operations -are things such as setting an autofs mount catatonic, setting the -expire timeout and requesting expire checks. As is explained below, -certain types of autofs triggered mounts can end up covering an autofs -mount itself which prevents us being able to use open(2) to obtain a -file descriptor for these operations if we don't already have one open. - -Currently autofs uses "umount -l" (lazy umount) to clear active mounts -at restart. While using lazy umount works for most cases, anything that -needs to walk back up the mount tree to construct a path, such as -getcwd(2) and the proc file system /proc//cwd, no longer works -because the point from which the path is constructed has been detached -from the mount tree. - -The actual problem with autofs is that it can't reconnect to existing -mounts. Immediately one thinks of just adding the ability to remount -autofs file systems would solve it, but alas, that can't work. This is -because autofs direct mounts and the implementation of "on demand mount -and expire" of nested mount trees have the file system mounted directly -on top of the mount trigger directory dentry. - -For example, there are two types of automount maps, direct (in the kernel -module source you will see a third type called an offset, which is just -a direct mount in disguise) and indirect. - -Here is a master map with direct and indirect map entries: - -/- /etc/auto.direct -/test /etc/auto.indirect - -and the corresponding map files: - -/etc/auto.direct: - -/automount/dparse/g6 budgie:/autofs/export1 -/automount/dparse/g1 shark:/autofs/export1 -and so on. - -/etc/auto.indirect: - -g1 shark:/autofs/export1 -g6 budgie:/autofs/export1 -and so on. - -For the above indirect map an autofs file system is mounted on /test and -mounts are triggered for each sub-directory key by the inode lookup -operation. So we see a mount of shark:/autofs/export1 on /test/g1, for -example. - -The way that direct mounts are handled is by making an autofs mount on -each full path, such as /automount/dparse/g1, and using it as a mount -trigger. So when we walk on the path we mount shark:/autofs/export1 "on -top of this mount point". Since these are always directories we can -use the follow_link inode operation to trigger the mount. - -But, each entry in direct and indirect maps can have offsets (making -them multi-mount map entries). - -For example, an indirect mount map entry could also be: - -g1 \ - / shark:/autofs/export5/testing/test \ - /s1 shark:/autofs/export/testing/test/s1 \ - /s2 shark:/autofs/export5/testing/test/s2 \ - /s1/ss1 shark:/autofs/export1 \ - /s2/ss2 shark:/autofs/export2 - -and a similarly a direct mount map entry could also be: - -/automount/dparse/g1 \ - / shark:/autofs/export5/testing/test \ - /s1 shark:/autofs/export/testing/test/s1 \ - /s2 shark:/autofs/export5/testing/test/s2 \ - /s1/ss1 shark:/autofs/export2 \ - /s2/ss2 shark:/autofs/export2 - -One of the issues with version 4 of autofs was that, when mounting an -entry with a large number of offsets, possibly with nesting, we needed -to mount and umount all of the offsets as a single unit. Not really a -problem, except for people with a large number of offsets in map entries. -This mechanism is used for the well known "hosts" map and we have seen -cases (in 2.4) where the available number of mounts are exhausted or -where the number of privileged ports available is exhausted. - -In version 5 we mount only as we go down the tree of offsets and -similarly for expiring them which resolves the above problem. There is -somewhat more detail to the implementation but it isn't needed for the -sake of the problem explanation. The one important detail is that these -offsets are implemented using the same mechanism as the direct mounts -above and so the mount points can be covered by a mount. - -The current autofs implementation uses an ioctl file descriptor opened -on the mount point for control operations. The references held by the -descriptor are accounted for in checks made to determine if a mount is -in use and is also used to access autofs file system information held -in the mount super block. So the use of a file handle needs to be -retained. - - -The Solution -============ - -To be able to restart autofs leaving existing direct, indirect and -offset mounts in place we need to be able to obtain a file handle -for these potentially covered autofs mount points. Rather than just -implement an isolated operation it was decided to re-implement the -existing ioctl interface and add new operations to provide this -functionality. - -In addition, to be able to reconstruct a mount tree that has busy mounts, -the uid and gid of the last user that triggered the mount needs to be -available because these can be used as macro substitution variables in -autofs maps. They are recorded at mount request time and an operation -has been added to retrieve them. - -Since we're re-implementing the control interface, a couple of other -problems with the existing interface have been addressed. First, when -a mount or expire operation completes a status is returned to the -kernel by either a "send ready" or a "send fail" operation. The -"send fail" operation of the ioctl interface could only ever send -ENOENT so the re-implementation allows user space to send an actual -status. Another expensive operation in user space, for those using -very large maps, is discovering if a mount is present. Usually this -involves scanning /proc/mounts and since it needs to be done quite -often it can introduce significant overhead when there are many entries -in the mount table. An operation to lookup the mount status of a mount -point dentry (covered or not) has also been added. - -Current kernel development policy recommends avoiding the use of the -ioctl mechanism in favor of systems such as Netlink. An implementation -using this system was attempted to evaluate its suitability and it was -found to be inadequate, in this case. The Generic Netlink system was -used for this as raw Netlink would lead to a significant increase in -complexity. There's no question that the Generic Netlink system is an -elegant solution for common case ioctl functions but it's not a complete -replacement probably because its primary purpose in life is to be a -message bus implementation rather than specifically an ioctl replacement. -While it would be possible to work around this there is one concern -that lead to the decision to not use it. This is that the autofs -expire in the daemon has become far to complex because umount -candidates are enumerated, almost for no other reason than to "count" -the number of times to call the expire ioctl. This involves scanning -the mount table which has proved to be a big overhead for users with -large maps. The best way to improve this is try and get back to the -way the expire was done long ago. That is, when an expire request is -issued for a mount (file handle) we should continually call back to -the daemon until we can't umount any more mounts, then return the -appropriate status to the daemon. At the moment we just expire one -mount at a time. A Generic Netlink implementation would exclude this -possibility for future development due to the requirements of the -message bus architecture. - - -autofs Miscellaneous Device mount control interface -==================================================== - -The control interface is opening a device node, typically /dev/autofs. - -All the ioctls use a common structure to pass the needed parameter -information and return operation results: - -struct autofs_dev_ioctl { - __u32 ver_major; - __u32 ver_minor; - __u32 size; /* total size of data passed in - * including this struct */ - __s32 ioctlfd; /* automount command fd */ - - /* Command parameters */ - union { - struct args_protover protover; - struct args_protosubver protosubver; - struct args_openmount openmount; - struct args_ready ready; - struct args_fail fail; - struct args_setpipefd setpipefd; - struct args_timeout timeout; - struct args_requester requester; - struct args_expire expire; - struct args_askumount askumount; - struct args_ismountpoint ismountpoint; - }; - - char path[0]; -}; - -The ioctlfd field is a mount point file descriptor of an autofs mount -point. It is returned by the open call and is used by all calls except -the check for whether a given path is a mount point, where it may -optionally be used to check a specific mount corresponding to a given -mount point file descriptor, and when requesting the uid and gid of the -last successful mount on a directory within the autofs file system. - -The union is used to communicate parameters and results of calls made -as described below. - -The path field is used to pass a path where it is needed and the size field -is used account for the increased structure length when translating the -structure sent from user space. - -This structure can be initialized before setting specific fields by using -the void function call init_autofs_dev_ioctl(struct autofs_dev_ioctl *). - -All of the ioctls perform a copy of this structure from user space to -kernel space and return -EINVAL if the size parameter is smaller than -the structure size itself, -ENOMEM if the kernel memory allocation fails -or -EFAULT if the copy itself fails. Other checks include a version check -of the compiled in user space version against the module version and a -mismatch results in a -EINVAL return. If the size field is greater than -the structure size then a path is assumed to be present and is checked to -ensure it begins with a "/" and is NULL terminated, otherwise -EINVAL is -returned. Following these checks, for all ioctl commands except -AUTOFS_DEV_IOCTL_VERSION_CMD, AUTOFS_DEV_IOCTL_OPENMOUNT_CMD and -AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD the ioctlfd is validated and if it is -not a valid descriptor or doesn't correspond to an autofs mount point -an error of -EBADF, -ENOTTY or -EINVAL (not an autofs descriptor) is -returned. - - -The ioctls -========== - -An example of an implementation which uses this interface can be seen -in autofs version 5.0.4 and later in file lib/dev-ioctl-lib.c of the -distribution tar available for download from kernel.org in directory -/pub/linux/daemons/autofs/v5. - -The device node ioctl operations implemented by this interface are: - - -AUTOFS_DEV_IOCTL_VERSION ------------------------- - -Get the major and minor version of the autofs device ioctl kernel module -implementation. It requires an initialized struct autofs_dev_ioctl as an -input parameter and sets the version information in the passed in structure. -It returns 0 on success or the error -EINVAL if a version mismatch is -detected. - - -AUTOFS_DEV_IOCTL_PROTOVER_CMD and AUTOFS_DEV_IOCTL_PROTOSUBVER_CMD ------------------------------------------------------------------- - -Get the major and minor version of the autofs protocol version understood -by loaded module. This call requires an initialized struct autofs_dev_ioctl -with the ioctlfd field set to a valid autofs mount point descriptor -and sets the requested version number in version field of struct args_protover -or sub_version field of struct args_protosubver. These commands return -0 on success or one of the negative error codes if validation fails. - - -AUTOFS_DEV_IOCTL_OPENMOUNT and AUTOFS_DEV_IOCTL_CLOSEMOUNT ----------------------------------------------------------- - -Obtain and release a file descriptor for an autofs managed mount point -path. The open call requires an initialized struct autofs_dev_ioctl with -the path field set and the size field adjusted appropriately as well -as the devid field of struct args_openmount set to the device number of -the autofs mount. The device number can be obtained from the mount options -shown in /proc/mounts. The close call requires an initialized struct -autofs_dev_ioct with the ioctlfd field set to the descriptor obtained -from the open call. The release of the file descriptor can also be done -with close(2) so any open descriptors will also be closed at process exit. -The close call is included in the implemented operations largely for -completeness and to provide for a consistent user space implementation. - - -AUTOFS_DEV_IOCTL_READY_CMD and AUTOFS_DEV_IOCTL_FAIL_CMD --------------------------------------------------------- - -Return mount and expire result status from user space to the kernel. -Both of these calls require an initialized struct autofs_dev_ioctl -with the ioctlfd field set to the descriptor obtained from the open -call and the token field of struct args_ready or struct args_fail set -to the wait queue token number, received by user space in the foregoing -mount or expire request. The status field of struct args_fail is set to -the errno of the operation. It is set to 0 on success. - - -AUTOFS_DEV_IOCTL_SETPIPEFD_CMD ------------------------------- - -Set the pipe file descriptor used for kernel communication to the daemon. -Normally this is set at mount time using an option but when reconnecting -to a existing mount we need to use this to tell the autofs mount about -the new kernel pipe descriptor. In order to protect mounts against -incorrectly setting the pipe descriptor we also require that the autofs -mount be catatonic (see next call). - -The call requires an initialized struct autofs_dev_ioctl with the -ioctlfd field set to the descriptor obtained from the open call and -the pipefd field of struct args_setpipefd set to descriptor of the pipe. -On success the call also sets the process group id used to identify the -controlling process (eg. the owning automount(8) daemon) to the process -group of the caller. - - -AUTOFS_DEV_IOCTL_CATATONIC_CMD ------------------------------- - -Make the autofs mount point catatonic. The autofs mount will no longer -issue mount requests, the kernel communication pipe descriptor is released -and any remaining waits in the queue released. - -The call requires an initialized struct autofs_dev_ioctl with the -ioctlfd field set to the descriptor obtained from the open call. - - -AUTOFS_DEV_IOCTL_TIMEOUT_CMD ----------------------------- - -Set the expire timeout for mounts within an autofs mount point. - -The call requires an initialized struct autofs_dev_ioctl with the -ioctlfd field set to the descriptor obtained from the open call. - - -AUTOFS_DEV_IOCTL_REQUESTER_CMD ------------------------------- - -Return the uid and gid of the last process to successfully trigger a the -mount on the given path dentry. - -The call requires an initialized struct autofs_dev_ioctl with the path -field set to the mount point in question and the size field adjusted -appropriately. Upon return the uid field of struct args_requester contains -the uid and gid field the gid. - -When reconstructing an autofs mount tree with active mounts we need to -re-connect to mounts that may have used the original process uid and -gid (or string variations of them) for mount lookups within the map entry. -This call provides the ability to obtain this uid and gid so they may be -used by user space for the mount map lookups. - - -AUTOFS_DEV_IOCTL_EXPIRE_CMD ---------------------------- - -Issue an expire request to the kernel for an autofs mount. Typically -this ioctl is called until no further expire candidates are found. - -The call requires an initialized struct autofs_dev_ioctl with the -ioctlfd field set to the descriptor obtained from the open call. In -addition an immediate expire that's independent of the mount timeout, -and a forced expire that's independent of whether the mount is busy, -can be requested by setting the how field of struct args_expire to -AUTOFS_EXP_IMMEDIATE or AUTOFS_EXP_FORCED, respectively . If no -expire candidates can be found the ioctl returns -1 with errno set to -EAGAIN. - -This call causes the kernel module to check the mount corresponding -to the given ioctlfd for mounts that can be expired, issues an expire -request back to the daemon and waits for completion. - -AUTOFS_DEV_IOCTL_ASKUMOUNT_CMD ------------------------------- - -Checks if an autofs mount point is in use. - -The call requires an initialized struct autofs_dev_ioctl with the -ioctlfd field set to the descriptor obtained from the open call and -it returns the result in the may_umount field of struct args_askumount, -1 for busy and 0 otherwise. - - -AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD ---------------------------------- - -Check if the given path is a mountpoint. - -The call requires an initialized struct autofs_dev_ioctl. There are two -possible variations. Both use the path field set to the path of the mount -point to check and the size field adjusted appropriately. One uses the -ioctlfd field to identify a specific mount point to check while the other -variation uses the path and optionally in.type field of struct args_ismountpoint -set to an autofs mount type. The call returns 1 if this is a mount point -and sets out.devid field to the device number of the mount and out.magic -field to the relevant super block magic number (described below) or 0 if -it isn't a mountpoint. In both cases the the device number (as returned -by new_encode_dev()) is returned in out.devid field. - -If supplied with a file descriptor we're looking for a specific mount, -not necessarily at the top of the mounted stack. In this case the path -the descriptor corresponds to is considered a mountpoint if it is itself -a mountpoint or contains a mount, such as a multi-mount without a root -mount. In this case we return 1 if the descriptor corresponds to a mount -point and and also returns the super magic of the covering mount if there -is one or 0 if it isn't a mountpoint. - -If a path is supplied (and the ioctlfd field is set to -1) then the path -is looked up and is checked to see if it is the root of a mount. If a -type is also given we are looking for a particular autofs mount and if -a match isn't found a fail is returned. If the the located path is the -root of a mount 1 is returned along with the super magic of the mount -or 0 otherwise. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 0598bc52abdc..c9480138d47e 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -51,6 +51,7 @@ Documentation for filesystem implementations. affs afs autofs + autofs-mount-control fuse overlayfs virtiofs -- cgit From c54ad9a4e8faf080e6b395cc4b8298dfc5170255 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:52 +0100 Subject: docs: filesystems: convert befs.txt to ReST - Add a SPDX header; - Adjust document and section titles; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/3e29ea6df6cd569021cfa953ccb8ed7dfc146f3d.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/befs.rst | 128 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/befs.txt | 117 -------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 129 insertions(+), 117 deletions(-) create mode 100644 Documentation/filesystems/befs.rst delete mode 100644 Documentation/filesystems/befs.txt diff --git a/Documentation/filesystems/befs.rst b/Documentation/filesystems/befs.rst new file mode 100644 index 000000000000..79f9740d76ff --- /dev/null +++ b/Documentation/filesystems/befs.rst @@ -0,0 +1,128 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================= +BeOS filesystem for Linux +========================= + +Document last updated: Dec 6, 2001 + +Warning +======= +Make sure you understand that this is alpha software. This means that the +implementation is neither complete nor well-tested. + +I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE! + +License +======= +This software is covered by the GNU General Public License. +See the file COPYING for the complete text of the license. +Or the GNU website: + +Author +====== +The largest part of the code written by Will Dyson +He has been working on the code since Aug 13, 2001. See the changelog for +details. + +Original Author: Makoto Kato + +His original code can still be found at: + + +Does anyone know of a more current email address for Makoto? He doesn't +respond to the address given above... + +This filesystem doesn't have a maintainer. + +What is this Driver? +==================== +This module implements the native filesystem of BeOS http://www.beincorporated.com/ +for the linux 2.4.1 and later kernels. Currently it is a read-only +implementation. + +Which is it, BFS or BEFS? +========================= +Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS". +But Unixware Boot Filesystem is called bfs, too. And they are already in +the kernel. Because of this naming conflict, on Linux the BeOS +filesystem is called befs. + +How to Install +============== +step 1. Install the BeFS patch into the source code tree of linux. + +Apply the patchfile to your kernel source tree. +Assuming that your kernel source is in /foo/bar/linux and the patchfile +is called patch-befs-xxx, you would do the following: + + cd /foo/bar/linux + patch -p1 < /path/to/patch-befs-xxx + +if the patching step fails (i.e. there are rejected hunks), you can try to +figure it out yourself (it shouldn't be hard), or mail the maintainer +(Will Dyson ) for help. + +step 2. Configuration & make kernel + +The linux kernel has many compile-time options. Most of them are beyond the +scope of this document. I suggest the Kernel-HOWTO document as a good general +reference on this topic. http://www.linuxdocs.org/HOWTOs/Kernel-HOWTO-4.html + +However, to use the BeFS module, you must enable it at configure time:: + + cd /foo/bar/linux + make menuconfig (or xconfig) + +The BeFS module is not a standard part of the linux kernel, so you must first +enable support for experimental code under the "Code maturity level" menu. + +Then, under the "Filesystems" menu will be an option called "BeFS +filesystem (experimental)", or something like that. Enable that option +(it is fine to make it a module). + +Save your kernel configuration and then build your kernel. + +step 3. Install + +See the kernel howto for +instructions on this critical step. + +Using BFS +========= +To use the BeOS filesystem, use filesystem type 'befs'. + +ex:: + + mount -t befs /dev/fd0 /beos + +Mount Options +============= + +============= =========================================================== +uid=nnn All files in the partition will be owned by user id nnn. +gid=nnn All files in the partition will be in group nnn. +iocharset=xxx Use xxx as the name of the NLS translation table. +debug The driver will output debugging information to the syslog. +============= =========================================================== + +How to Get Lastest Version +========================== + +The latest version is currently available at: + + +Any Known Bugs? +=============== +As of Jan 20, 2002: + + None + +Special Thanks +============== +Dominic Giampalo ... Writing "Practical file system design with Be filesystem" + +Hiroyuki Yamada ... Testing LinuxPPC. + + + diff --git a/Documentation/filesystems/befs.txt b/Documentation/filesystems/befs.txt deleted file mode 100644 index da45e6c842b8..000000000000 --- a/Documentation/filesystems/befs.txt +++ /dev/null @@ -1,117 +0,0 @@ -BeOS filesystem for Linux - -Document last updated: Dec 6, 2001 - -WARNING -======= -Make sure you understand that this is alpha software. This means that the -implementation is neither complete nor well-tested. - -I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE! - -LICENSE -===== -This software is covered by the GNU General Public License. -See the file COPYING for the complete text of the license. -Or the GNU website: - -AUTHOR -===== -The largest part of the code written by Will Dyson -He has been working on the code since Aug 13, 2001. See the changelog for -details. - -Original Author: Makoto Kato -His original code can still be found at: - -Does anyone know of a more current email address for Makoto? He doesn't -respond to the address given above... - -This filesystem doesn't have a maintainer. - -WHAT IS THIS DRIVER? -================== -This module implements the native filesystem of BeOS http://www.beincorporated.com/ -for the linux 2.4.1 and later kernels. Currently it is a read-only -implementation. - -Which is it, BFS or BEFS? -================ -Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS". -But Unixware Boot Filesystem is called bfs, too. And they are already in -the kernel. Because of this naming conflict, on Linux the BeOS -filesystem is called befs. - -HOW TO INSTALL -============== -step 1. Install the BeFS patch into the source code tree of linux. - -Apply the patchfile to your kernel source tree. -Assuming that your kernel source is in /foo/bar/linux and the patchfile -is called patch-befs-xxx, you would do the following: - - cd /foo/bar/linux - patch -p1 < /path/to/patch-befs-xxx - -if the patching step fails (i.e. there are rejected hunks), you can try to -figure it out yourself (it shouldn't be hard), or mail the maintainer -(Will Dyson ) for help. - -step 2. Configuration & make kernel - -The linux kernel has many compile-time options. Most of them are beyond the -scope of this document. I suggest the Kernel-HOWTO document as a good general -reference on this topic. http://www.linuxdocs.org/HOWTOs/Kernel-HOWTO-4.html - -However, to use the BeFS module, you must enable it at configure time. - - cd /foo/bar/linux - make menuconfig (or xconfig) - -The BeFS module is not a standard part of the linux kernel, so you must first -enable support for experimental code under the "Code maturity level" menu. - -Then, under the "Filesystems" menu will be an option called "BeFS -filesystem (experimental)", or something like that. Enable that option -(it is fine to make it a module). - -Save your kernel configuration and then build your kernel. - -step 3. Install - -See the kernel howto for -instructions on this critical step. - -USING BFS -========= -To use the BeOS filesystem, use filesystem type 'befs'. - -ex) - mount -t befs /dev/fd0 /beos - -MOUNT OPTIONS -============= -uid=nnn All files in the partition will be owned by user id nnn. -gid=nnn All files in the partition will be in group nnn. -iocharset=xxx Use xxx as the name of the NLS translation table. -debug The driver will output debugging information to the syslog. - -HOW TO GET LASTEST VERSION -========================== - -The latest version is currently available at: - - -ANY KNOWN BUGS? -=========== -As of Jan 20, 2002: - - None - -SPECIAL THANKS -============== -Dominic Giampalo ... Writing "Practical file system design with Be filesystem" -Hiroyuki Yamada ... Testing LinuxPPC. - - - diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index c9480138d47e..98de437f5500 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -52,6 +52,7 @@ Documentation for filesystem implementations. afs autofs autofs-mount-control + befs fuse overlayfs virtiofs -- cgit From ee68f34d7e7e553ffb74f09df0f3764fbfcf5d4b Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:53 +0100 Subject: docs: filesystems: convert bfs.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/93991bcc05e419368ee1e585c81057fb2c7c8d2b.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/bfs.rst | 60 +++++++++++++++++++++++++++++++++++++ Documentation/filesystems/bfs.txt | 57 ----------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 61 insertions(+), 57 deletions(-) create mode 100644 Documentation/filesystems/bfs.rst delete mode 100644 Documentation/filesystems/bfs.txt diff --git a/Documentation/filesystems/bfs.rst b/Documentation/filesystems/bfs.rst new file mode 100644 index 000000000000..ce14b9018807 --- /dev/null +++ b/Documentation/filesystems/bfs.rst @@ -0,0 +1,60 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================== +BFS Filesystem for Linux +======================== + +The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which +usually contains the kernel image and a few other files required for the +boot process. + +In order to access /stand partition under Linux you obviously need to +know the partition number and the kernel must support UnixWare disk slices +(CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not +depend on having UnixWare disklabel support because one can also mount +BFS filesystem via loopback:: + + # losetup /dev/loop0 stand.img + # mount -t bfs /dev/loop0 /mnt/stand + +where stand.img is a file containing the image of BFS filesystem. +When you have finished using it and umounted you need to also deallocate +/dev/loop0 device by:: + + # losetup -d /dev/loop0 + +You can simplify mounting by just typing:: + + # mount -t bfs -o loop stand.img /mnt/stand + +this will allocate the first available loopback device (and load loop.o +kernel module if necessary) automatically. If the loopback driver is not +loaded automatically, make sure that you have compiled the module and +that modprobe is functioning. Beware that umount will not deallocate +/dev/loopN device if /etc/mtab file on your system is a symbolic link to +/proc/mounts. You will need to do it manually using "-d" switch of +losetup(8). Read losetup(8) manpage for more info. + +To create the BFS image under UnixWare you need to find out first which +slice contains it. The command prtvtoc(1M) is your friend:: + + # prtvtoc /dev/rdsk/c0b0t0d0s0 + +(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you +look for the slice with tag "STAND", which is usually slice 10. With this +information you can use dd(1) to create the BFS image:: + + # umount /stand + # dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512 + +Just in case, you can verify that you have done the right thing by checking +the magic number:: + + # od -Ad -tx4 stand.img | more + +The first 4 bytes should be 0x1badface. + +If you have any patches, questions or suggestions regarding this BFS +implementation please contact the author: + +Tigran Aivazian diff --git a/Documentation/filesystems/bfs.txt b/Documentation/filesystems/bfs.txt deleted file mode 100644 index 843ce91a2e40..000000000000 --- a/Documentation/filesystems/bfs.txt +++ /dev/null @@ -1,57 +0,0 @@ -BFS FILESYSTEM FOR LINUX -======================== - -The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which -usually contains the kernel image and a few other files required for the -boot process. - -In order to access /stand partition under Linux you obviously need to -know the partition number and the kernel must support UnixWare disk slices -(CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not -depend on having UnixWare disklabel support because one can also mount -BFS filesystem via loopback: - -# losetup /dev/loop0 stand.img -# mount -t bfs /dev/loop0 /mnt/stand - -where stand.img is a file containing the image of BFS filesystem. -When you have finished using it and umounted you need to also deallocate -/dev/loop0 device by: - -# losetup -d /dev/loop0 - -You can simplify mounting by just typing: - -# mount -t bfs -o loop stand.img /mnt/stand - -this will allocate the first available loopback device (and load loop.o -kernel module if necessary) automatically. If the loopback driver is not -loaded automatically, make sure that you have compiled the module and -that modprobe is functioning. Beware that umount will not deallocate -/dev/loopN device if /etc/mtab file on your system is a symbolic link to -/proc/mounts. You will need to do it manually using "-d" switch of -losetup(8). Read losetup(8) manpage for more info. - -To create the BFS image under UnixWare you need to find out first which -slice contains it. The command prtvtoc(1M) is your friend: - -# prtvtoc /dev/rdsk/c0b0t0d0s0 - -(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you -look for the slice with tag "STAND", which is usually slice 10. With this -information you can use dd(1) to create the BFS image: - -# umount /stand -# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512 - -Just in case, you can verify that you have done the right thing by checking -the magic number: - -# od -Ad -tx4 stand.img | more - -The first 4 bytes should be 0x1badface. - -If you have any patches, questions or suggestions regarding this BFS -implementation please contact the author: - -Tigran Aivazian diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 98de437f5500..f74e6b273d9f 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -53,6 +53,7 @@ Documentation for filesystem implementations. autofs autofs-mount-control befs + bfs fuse overlayfs virtiofs -- cgit From 5d43e1bc2dfccbb07ea662fa4536544f1b6efd43 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:54 +0100 Subject: docs: filesystems: convert btrfs.txt to ReST Just trivial changes: - Add a SPDX header; - Add it to filesystems/index.rst. While here, adjust document title, just to make it use the same style of the other docs. Signed-off-by: Mauro Carvalho Chehab Acked-by: David Sterba Link: https://lore.kernel.org/r/1ef76da4ac24a9a6f6187723554733c702ea19ae.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/btrfs.rst | 34 ++++++++++++++++++++++++++++++++++ Documentation/filesystems/btrfs.txt | 31 ------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 35 insertions(+), 31 deletions(-) create mode 100644 Documentation/filesystems/btrfs.rst delete mode 100644 Documentation/filesystems/btrfs.txt diff --git a/Documentation/filesystems/btrfs.rst b/Documentation/filesystems/btrfs.rst new file mode 100644 index 000000000000..d0904f602819 --- /dev/null +++ b/Documentation/filesystems/btrfs.rst @@ -0,0 +1,34 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===== +BTRFS +===== + +Btrfs is a copy on write filesystem for Linux aimed at implementing advanced +features while focusing on fault tolerance, repair and easy administration. +Jointly developed by several companies, licensed under the GPL and open for +contribution from anyone. + +The main Btrfs features include: + + * Extent based file storage (2^64 max file size) + * Space efficient packing of small files + * Space efficient indexed directories + * Dynamic inode allocation + * Writable snapshots + * Subvolumes (separate internal filesystem roots) + * Object level mirroring and striping + * Checksums on data and metadata (multiple algorithms available) + * Compression + * Integrated multiple device support, with several raid algorithms + * Offline filesystem check + * Efficient incremental backup and FS mirroring + * Online filesystem defragmentation + +For more information please refer to the wiki + + https://btrfs.wiki.kernel.org + +that maintains information about administration tasks, frequently asked +questions, use cases, mount options, comprehensible changelogs, features, +manual pages, source code repositories, contacts etc. diff --git a/Documentation/filesystems/btrfs.txt b/Documentation/filesystems/btrfs.txt deleted file mode 100644 index f9dad22d95ce..000000000000 --- a/Documentation/filesystems/btrfs.txt +++ /dev/null @@ -1,31 +0,0 @@ -BTRFS -===== - -Btrfs is a copy on write filesystem for Linux aimed at implementing advanced -features while focusing on fault tolerance, repair and easy administration. -Jointly developed by several companies, licensed under the GPL and open for -contribution from anyone. - -The main Btrfs features include: - - * Extent based file storage (2^64 max file size) - * Space efficient packing of small files - * Space efficient indexed directories - * Dynamic inode allocation - * Writable snapshots - * Subvolumes (separate internal filesystem roots) - * Object level mirroring and striping - * Checksums on data and metadata (multiple algorithms available) - * Compression - * Integrated multiple device support, with several raid algorithms - * Offline filesystem check - * Efficient incremental backup and FS mirroring - * Online filesystem defragmentation - -For more information please refer to the wiki - - https://btrfs.wiki.kernel.org - -that maintains information about administration tasks, frequently asked -questions, use cases, mount options, comprehensible changelogs, features, -manual pages, source code repositories, contacts etc. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index f74e6b273d9f..dae862cf167e 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -54,6 +54,7 @@ Documentation for filesystem implementations. autofs-mount-control befs bfs + btrfs fuse overlayfs virtiofs -- cgit From 471379a174aa444b326d1b74e9f96a8b4b766b79 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:55 +0100 Subject: docs: filesystems: convert ceph.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Jeff Layton Link: https://lore.kernel.org/r/df2f142b5ca5842e030d8209482dfd62dcbe020f.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/ceph.rst | 190 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/ceph.txt | 186 ----------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 191 insertions(+), 186 deletions(-) create mode 100644 Documentation/filesystems/ceph.rst delete mode 100644 Documentation/filesystems/ceph.txt diff --git a/Documentation/filesystems/ceph.rst b/Documentation/filesystems/ceph.rst new file mode 100644 index 000000000000..b46a7218248f --- /dev/null +++ b/Documentation/filesystems/ceph.rst @@ -0,0 +1,190 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================ +Ceph Distributed File System +============================ + +Ceph is a distributed network file system designed to provide good +performance, reliability, and scalability. + +Basic features include: + + * POSIX semantics + * Seamless scaling from 1 to many thousands of nodes + * High availability and reliability. No single point of failure. + * N-way replication of data across storage nodes + * Fast recovery from node failures + * Automatic rebalancing of data on node addition/removal + * Easy deployment: most FS components are userspace daemons + +Also, + + * Flexible snapshots (on any directory) + * Recursive accounting (nested files, directories, bytes) + +In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely +on symmetric access by all clients to shared block devices, Ceph +separates data and metadata management into independent server +clusters, similar to Lustre. Unlike Lustre, however, metadata and +storage nodes run entirely as user space daemons. File data is striped +across storage nodes in large chunks to distribute workload and +facilitate high throughputs. When storage nodes fail, data is +re-replicated in a distributed fashion by the storage nodes themselves +(with some minimal coordination from a cluster monitor), making the +system extremely efficient and scalable. + +Metadata servers effectively form a large, consistent, distributed +in-memory cache above the file namespace that is extremely scalable, +dynamically redistributes metadata in response to workload changes, +and can tolerate arbitrary (well, non-Byzantine) node failures. The +metadata server takes a somewhat unconventional approach to metadata +storage to significantly improve performance for common workloads. In +particular, inodes with only a single link are embedded in +directories, allowing entire directories of dentries and inodes to be +loaded into its cache with a single I/O operation. The contents of +extremely large directories can be fragmented and managed by +independent metadata servers, allowing scalable concurrent access. + +The system offers automatic data rebalancing/migration when scaling +from a small cluster of just a few nodes to many hundreds, without +requiring an administrator carve the data set into static volumes or +go through the tedious process of migrating data between servers. +When the file system approaches full, new nodes can be easily added +and things will "just work." + +Ceph includes flexible snapshot mechanism that allows a user to create +a snapshot on any subdirectory (and its nested contents) in the +system. Snapshot creation and deletion are as simple as 'mkdir +.snap/foo' and 'rmdir .snap/foo'. + +Ceph also provides some recursive accounting on directories for nested +files and bytes. That is, a 'getfattr -d foo' on any directory in the +system will reveal the total number of nested regular files and +subdirectories, and a summation of all nested file sizes. This makes +the identification of large disk space consumers relatively quick, as +no 'du' or similar recursive scan of the file system is required. + +Finally, Ceph also allows quotas to be set on any directory in the system. +The quota can restrict the number of bytes or the number of files stored +beneath that point in the directory hierarchy. Quotas can be set using +extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', eg:: + + setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir + getfattr -n ceph.quota.max_bytes /some/dir + +A limitation of the current quotas implementation is that it relies on the +cooperation of the client mounting the file system to stop writers when a +limit is reached. A modified or adversarial client cannot be prevented +from writing as much data as it needs. + +Mount Syntax +============ + +The basic mount syntax is:: + + # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt + +You only need to specify a single monitor, as the client will get the +full list when it connects. (However, if the monitor you specify +happens to be down, the mount won't succeed.) The port can be left +off if the monitor is using the default. So if the monitor is at +1.2.3.4:: + + # mount -t ceph 1.2.3.4:/ /mnt/ceph + +is sufficient. If /sbin/mount.ceph is installed, a hostname can be +used instead of an IP address. + + + +Mount Options +============= + + ip=A.B.C.D[:N] + Specify the IP and/or port the client should bind to locally. + There is normally not much reason to do this. If the IP is not + specified, the client's IP address is determined by looking at the + address its connection to the monitor originates from. + + wsize=X + Specify the maximum write size in bytes. Default: 16 MB. + + rsize=X + Specify the maximum read size in bytes. Default: 16 MB. + + rasize=X + Specify the maximum readahead size in bytes. Default: 8 MB. + + mount_timeout=X + Specify the timeout value for mount (in seconds), in the case + of a non-responsive Ceph file system. The default is 30 + seconds. + + caps_max=X + Specify the maximum number of caps to hold. Unused caps are released + when number of caps exceeds the limit. The default is 0 (no limit) + + rbytes + When stat() is called on a directory, set st_size to 'rbytes', + the summation of file sizes over all files nested beneath that + directory. This is the default. + + norbytes + When stat() is called on a directory, set st_size to the + number of entries in that directory. + + nocrc + Disable CRC32C calculation for data writes. If set, the storage node + must rely on TCP's error correction to detect data corruption + in the data payload. + + dcache + Use the dcache contents to perform negative lookups and + readdir when the client has the entire directory contents in + its cache. (This does not change correctness; the client uses + cached metadata only when a lease or capability ensures it is + valid.) + + nodcache + Do not use the dcache as above. This avoids a significant amount of + complex code, sacrificing performance without affecting correctness, + and is useful for tracking down bugs. + + noasyncreaddir + Do not use the dcache as above for readdir. + + noquotadf + Report overall filesystem usage in statfs instead of using the root + directory quota. + + nocopyfrom + Don't use the RADOS 'copy-from' operation to perform remote object + copies. Currently, it's only used in copy_file_range, which will revert + to the default VFS implementation if this option is used. + + recover_session= + Set auto reconnect mode in the case where the client is blacklisted. The + available modes are "no" and "clean". The default is "no". + + * no: never attempt to reconnect when client detects that it has been + blacklisted. Operations will generally fail after being blacklisted. + + * clean: client reconnects to the ceph cluster automatically when it + detects that it has been blacklisted. During reconnect, client drops + dirty data/metadata, invalidates page caches and writable file handles. + After reconnect, file locks become stale because the MDS loses track + of them. If an inode contains any stale file locks, read/write on the + inode is not allowed until applications release all stale file locks. + +More Information +================ + +For more information on Ceph, see the home page at + https://ceph.com/ + +The Linux kernel client source tree is available at + - https://github.com/ceph/ceph-client.git + - git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git + +and the source for the full system is at + https://github.com/ceph/ceph.git diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.txt deleted file mode 100644 index b19b6a03f91c..000000000000 --- a/Documentation/filesystems/ceph.txt +++ /dev/null @@ -1,186 +0,0 @@ -Ceph Distributed File System -============================ - -Ceph is a distributed network file system designed to provide good -performance, reliability, and scalability. - -Basic features include: - - * POSIX semantics - * Seamless scaling from 1 to many thousands of nodes - * High availability and reliability. No single point of failure. - * N-way replication of data across storage nodes - * Fast recovery from node failures - * Automatic rebalancing of data on node addition/removal - * Easy deployment: most FS components are userspace daemons - -Also, - * Flexible snapshots (on any directory) - * Recursive accounting (nested files, directories, bytes) - -In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely -on symmetric access by all clients to shared block devices, Ceph -separates data and metadata management into independent server -clusters, similar to Lustre. Unlike Lustre, however, metadata and -storage nodes run entirely as user space daemons. File data is striped -across storage nodes in large chunks to distribute workload and -facilitate high throughputs. When storage nodes fail, data is -re-replicated in a distributed fashion by the storage nodes themselves -(with some minimal coordination from a cluster monitor), making the -system extremely efficient and scalable. - -Metadata servers effectively form a large, consistent, distributed -in-memory cache above the file namespace that is extremely scalable, -dynamically redistributes metadata in response to workload changes, -and can tolerate arbitrary (well, non-Byzantine) node failures. The -metadata server takes a somewhat unconventional approach to metadata -storage to significantly improve performance for common workloads. In -particular, inodes with only a single link are embedded in -directories, allowing entire directories of dentries and inodes to be -loaded into its cache with a single I/O operation. The contents of -extremely large directories can be fragmented and managed by -independent metadata servers, allowing scalable concurrent access. - -The system offers automatic data rebalancing/migration when scaling -from a small cluster of just a few nodes to many hundreds, without -requiring an administrator carve the data set into static volumes or -go through the tedious process of migrating data between servers. -When the file system approaches full, new nodes can be easily added -and things will "just work." - -Ceph includes flexible snapshot mechanism that allows a user to create -a snapshot on any subdirectory (and its nested contents) in the -system. Snapshot creation and deletion are as simple as 'mkdir -.snap/foo' and 'rmdir .snap/foo'. - -Ceph also provides some recursive accounting on directories for nested -files and bytes. That is, a 'getfattr -d foo' on any directory in the -system will reveal the total number of nested regular files and -subdirectories, and a summation of all nested file sizes. This makes -the identification of large disk space consumers relatively quick, as -no 'du' or similar recursive scan of the file system is required. - -Finally, Ceph also allows quotas to be set on any directory in the system. -The quota can restrict the number of bytes or the number of files stored -beneath that point in the directory hierarchy. Quotas can be set using -extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', eg: - - setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir - getfattr -n ceph.quota.max_bytes /some/dir - -A limitation of the current quotas implementation is that it relies on the -cooperation of the client mounting the file system to stop writers when a -limit is reached. A modified or adversarial client cannot be prevented -from writing as much data as it needs. - -Mount Syntax -============ - -The basic mount syntax is: - - # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt - -You only need to specify a single monitor, as the client will get the -full list when it connects. (However, if the monitor you specify -happens to be down, the mount won't succeed.) The port can be left -off if the monitor is using the default. So if the monitor is at -1.2.3.4, - - # mount -t ceph 1.2.3.4:/ /mnt/ceph - -is sufficient. If /sbin/mount.ceph is installed, a hostname can be -used instead of an IP address. - - - -Mount Options -============= - - ip=A.B.C.D[:N] - Specify the IP and/or port the client should bind to locally. - There is normally not much reason to do this. If the IP is not - specified, the client's IP address is determined by looking at the - address its connection to the monitor originates from. - - wsize=X - Specify the maximum write size in bytes. Default: 16 MB. - - rsize=X - Specify the maximum read size in bytes. Default: 16 MB. - - rasize=X - Specify the maximum readahead size in bytes. Default: 8 MB. - - mount_timeout=X - Specify the timeout value for mount (in seconds), in the case - of a non-responsive Ceph file system. The default is 30 - seconds. - - caps_max=X - Specify the maximum number of caps to hold. Unused caps are released - when number of caps exceeds the limit. The default is 0 (no limit) - - rbytes - When stat() is called on a directory, set st_size to 'rbytes', - the summation of file sizes over all files nested beneath that - directory. This is the default. - - norbytes - When stat() is called on a directory, set st_size to the - number of entries in that directory. - - nocrc - Disable CRC32C calculation for data writes. If set, the storage node - must rely on TCP's error correction to detect data corruption - in the data payload. - - dcache - Use the dcache contents to perform negative lookups and - readdir when the client has the entire directory contents in - its cache. (This does not change correctness; the client uses - cached metadata only when a lease or capability ensures it is - valid.) - - nodcache - Do not use the dcache as above. This avoids a significant amount of - complex code, sacrificing performance without affecting correctness, - and is useful for tracking down bugs. - - noasyncreaddir - Do not use the dcache as above for readdir. - - noquotadf - Report overall filesystem usage in statfs instead of using the root - directory quota. - - nocopyfrom - Don't use the RADOS 'copy-from' operation to perform remote object - copies. Currently, it's only used in copy_file_range, which will revert - to the default VFS implementation if this option is used. - - recover_session= - Set auto reconnect mode in the case where the client is blacklisted. The - available modes are "no" and "clean". The default is "no". - - * no: never attempt to reconnect when client detects that it has been - blacklisted. Operations will generally fail after being blacklisted. - - * clean: client reconnects to the ceph cluster automatically when it - detects that it has been blacklisted. During reconnect, client drops - dirty data/metadata, invalidates page caches and writable file handles. - After reconnect, file locks become stale because the MDS loses track - of them. If an inode contains any stale file locks, read/write on the - inode is not allowed until applications release all stale file locks. - -More Information -================ - -For more information on Ceph, see the home page at - https://ceph.com/ - -The Linux kernel client source tree is available at - https://github.com/ceph/ceph-client.git - git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git - -and the source for the full system is at - https://github.com/ceph/ceph.git diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index dae862cf167e..ddd8f7b2bb25 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -55,6 +55,7 @@ Documentation for filesystem implementations. befs bfs btrfs + ceph fuse overlayfs virtiofs -- cgit From f1fa0e6028d395c5f0d1a0929a795b8dc0d43295 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:56 +0100 Subject: docs: filesystems: convert cramfs.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Nicolas Pitre Link: https://lore.kernel.org/r/e87b267e71f99974b7bb3fc0a4a08454ff58165e.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/cramfs.rst | 123 +++++++++++++++++++++++++++++++++++ Documentation/filesystems/cramfs.txt | 118 --------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 124 insertions(+), 118 deletions(-) create mode 100644 Documentation/filesystems/cramfs.rst delete mode 100644 Documentation/filesystems/cramfs.txt diff --git a/Documentation/filesystems/cramfs.rst b/Documentation/filesystems/cramfs.rst new file mode 100644 index 000000000000..afbdbde98bd2 --- /dev/null +++ b/Documentation/filesystems/cramfs.rst @@ -0,0 +1,123 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================================== +Cramfs - cram a filesystem onto a small ROM +=========================================== + +cramfs is designed to be simple and small, and to compress things well. + +It uses the zlib routines to compress a file one page at a time, and +allows random page access. The meta-data is not compressed, but is +expressed in a very terse representation to make it use much less +diskspace than traditional filesystems. + +You can't write to a cramfs filesystem (making it compressible and +compact also makes it _very_ hard to update on-the-fly), so you have to +create the disk image with the "mkcramfs" utility. + + +Usage Notes +----------- + +File sizes are limited to less than 16MB. + +Maximum filesystem size is a little over 256MB. (The last file on the +filesystem is allowed to extend past 256MB.) + +Only the low 8 bits of gid are stored. The current version of +mkcramfs simply truncates to 8 bits, which is a potential security +issue. + +Hard links are supported, but hard linked files +will still have a link count of 1 in the cramfs image. + +Cramfs directories have no ``.`` or ``..`` entries. Directories (like +every other file on cramfs) always have a link count of 1. (There's +no need to use -noleaf in ``find``, btw.) + +No timestamps are stored in a cramfs, so these default to the epoch +(1970 GMT). Recently-accessed files may have updated timestamps, but +the update lasts only as long as the inode is cached in memory, after +which the timestamp reverts to 1970, i.e. moves backwards in time. + +Currently, cramfs must be written and read with architectures of the +same endianness, and can be read only by kernels with PAGE_SIZE +== 4096. At least the latter of these is a bug, but it hasn't been +decided what the best fix is. For the moment if you have larger pages +you can just change the #define in mkcramfs.c, so long as you don't +mind the filesystem becoming unreadable to future kernels. + + +Memory Mapped cramfs image +-------------------------- + +The CRAMFS_MTD Kconfig option adds support for loading data directly from +a physical linear memory range (usually non volatile memory like Flash) +instead of going through the block device layer. This saves some memory +since no intermediate buffering is necessary to hold the data before +decompressing. + +And when data blocks are kept uncompressed and properly aligned, they will +automatically be mapped directly into user space whenever possible providing +eXecute-In-Place (XIP) from ROM of read-only segments. Data segments mapped +read-write (hence they have to be copied to RAM) may still be compressed in +the cramfs image in the same file along with non compressed read-only +segments. Both MMU and no-MMU systems are supported. This is particularly +handy for tiny embedded systems with very tight memory constraints. + +The location of the cramfs image in memory is system dependent. You must +know the proper physical address where the cramfs image is located and +configure an MTD device for it. Also, that MTD device must be supported +by a map driver that implements the "point" method. Examples of such +MTD drivers are cfi_cmdset_0001 (Intel/Sharp CFI flash) or physmap +(Flash device in physical memory map). MTD partitions based on such devices +are fine too. Then that device should be specified with the "mtd:" prefix +as the mount device argument. For example, to mount the MTD device named +"fs_partition" on the /mnt directory:: + + $ mount -t cramfs mtd:fs_partition /mnt + +To boot a kernel with this as root filesystem, suffice to specify +something like "root=mtd:fs_partition" on the kernel command line. + + +Tools +----- + +A version of mkcramfs that can take advantage of the latest capabilities +described above can be found here: + +https://github.com/npitre/cramfs-tools + + +For /usr/share/magic +-------------------- + +===== ======================= ======================= +0 ulelong 0x28cd3d45 Linux cramfs offset 0 +>4 ulelong x size %d +>8 ulelong x flags 0x%x +>12 ulelong x future 0x%x +>16 string >\0 signature "%.16s" +>32 ulelong x fsid.crc 0x%x +>36 ulelong x fsid.edition %d +>40 ulelong x fsid.blocks %d +>44 ulelong x fsid.files %d +>48 string >\0 name "%.16s" +512 ulelong 0x28cd3d45 Linux cramfs offset 512 +>516 ulelong x size %d +>520 ulelong x flags 0x%x +>524 ulelong x future 0x%x +>528 string >\0 signature "%.16s" +>544 ulelong x fsid.crc 0x%x +>548 ulelong x fsid.edition %d +>552 ulelong x fsid.blocks %d +>556 ulelong x fsid.files %d +>560 string >\0 name "%.16s" +===== ======================= ======================= + + +Hacker Notes +------------ + +See fs/cramfs/README for filesystem layout and implementation notes. diff --git a/Documentation/filesystems/cramfs.txt b/Documentation/filesystems/cramfs.txt deleted file mode 100644 index 8e19a53d648b..000000000000 --- a/Documentation/filesystems/cramfs.txt +++ /dev/null @@ -1,118 +0,0 @@ - - Cramfs - cram a filesystem onto a small ROM - -cramfs is designed to be simple and small, and to compress things well. - -It uses the zlib routines to compress a file one page at a time, and -allows random page access. The meta-data is not compressed, but is -expressed in a very terse representation to make it use much less -diskspace than traditional filesystems. - -You can't write to a cramfs filesystem (making it compressible and -compact also makes it _very_ hard to update on-the-fly), so you have to -create the disk image with the "mkcramfs" utility. - - -Usage Notes ------------ - -File sizes are limited to less than 16MB. - -Maximum filesystem size is a little over 256MB. (The last file on the -filesystem is allowed to extend past 256MB.) - -Only the low 8 bits of gid are stored. The current version of -mkcramfs simply truncates to 8 bits, which is a potential security -issue. - -Hard links are supported, but hard linked files -will still have a link count of 1 in the cramfs image. - -Cramfs directories have no `.' or `..' entries. Directories (like -every other file on cramfs) always have a link count of 1. (There's -no need to use -noleaf in `find', btw.) - -No timestamps are stored in a cramfs, so these default to the epoch -(1970 GMT). Recently-accessed files may have updated timestamps, but -the update lasts only as long as the inode is cached in memory, after -which the timestamp reverts to 1970, i.e. moves backwards in time. - -Currently, cramfs must be written and read with architectures of the -same endianness, and can be read only by kernels with PAGE_SIZE -== 4096. At least the latter of these is a bug, but it hasn't been -decided what the best fix is. For the moment if you have larger pages -you can just change the #define in mkcramfs.c, so long as you don't -mind the filesystem becoming unreadable to future kernels. - - -Memory Mapped cramfs image --------------------------- - -The CRAMFS_MTD Kconfig option adds support for loading data directly from -a physical linear memory range (usually non volatile memory like Flash) -instead of going through the block device layer. This saves some memory -since no intermediate buffering is necessary to hold the data before -decompressing. - -And when data blocks are kept uncompressed and properly aligned, they will -automatically be mapped directly into user space whenever possible providing -eXecute-In-Place (XIP) from ROM of read-only segments. Data segments mapped -read-write (hence they have to be copied to RAM) may still be compressed in -the cramfs image in the same file along with non compressed read-only -segments. Both MMU and no-MMU systems are supported. This is particularly -handy for tiny embedded systems with very tight memory constraints. - -The location of the cramfs image in memory is system dependent. You must -know the proper physical address where the cramfs image is located and -configure an MTD device for it. Also, that MTD device must be supported -by a map driver that implements the "point" method. Examples of such -MTD drivers are cfi_cmdset_0001 (Intel/Sharp CFI flash) or physmap -(Flash device in physical memory map). MTD partitions based on such devices -are fine too. Then that device should be specified with the "mtd:" prefix -as the mount device argument. For example, to mount the MTD device named -"fs_partition" on the /mnt directory: - -$ mount -t cramfs mtd:fs_partition /mnt - -To boot a kernel with this as root filesystem, suffice to specify -something like "root=mtd:fs_partition" on the kernel command line. - - -Tools ------ - -A version of mkcramfs that can take advantage of the latest capabilities -described above can be found here: - -https://github.com/npitre/cramfs-tools - - -For /usr/share/magic --------------------- - -0 ulelong 0x28cd3d45 Linux cramfs offset 0 ->4 ulelong x size %d ->8 ulelong x flags 0x%x ->12 ulelong x future 0x%x ->16 string >\0 signature "%.16s" ->32 ulelong x fsid.crc 0x%x ->36 ulelong x fsid.edition %d ->40 ulelong x fsid.blocks %d ->44 ulelong x fsid.files %d ->48 string >\0 name "%.16s" -512 ulelong 0x28cd3d45 Linux cramfs offset 512 ->516 ulelong x size %d ->520 ulelong x flags 0x%x ->524 ulelong x future 0x%x ->528 string >\0 signature "%.16s" ->544 ulelong x fsid.crc 0x%x ->548 ulelong x fsid.edition %d ->552 ulelong x fsid.blocks %d ->556 ulelong x fsid.files %d ->560 string >\0 name "%.16s" - - -Hacker Notes ------------- - -See fs/cramfs/README for filesystem layout and implementation notes. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index ddd8f7b2bb25..8fe848ea04af 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -56,6 +56,7 @@ Documentation for filesystem implementations. bfs btrfs ceph + cramfs fuse overlayfs virtiofs -- cgit From 57443789849cd79e66488301a01f01c6340942ce Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:57 +0100 Subject: docs: filesystems: convert debugfs.txt to ReST - Add a SPDX header; - Use copyright symbol; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Use footnoote markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/42db8f9db17a5d8b619130815ae63d1615951d50.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/debugfs.rst | 247 ++++++++++++++++++++++++++++++++++ Documentation/filesystems/debugfs.txt | 241 --------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 248 insertions(+), 241 deletions(-) create mode 100644 Documentation/filesystems/debugfs.rst delete mode 100644 Documentation/filesystems/debugfs.txt diff --git a/Documentation/filesystems/debugfs.rst b/Documentation/filesystems/debugfs.rst new file mode 100644 index 000000000000..c89d2d335dfb --- /dev/null +++ b/Documentation/filesystems/debugfs.rst @@ -0,0 +1,247 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +======= +DebugFS +======= + +Copyright |copy| 2009 Jonathan Corbet + +Debugfs exists as a simple way for kernel developers to make information +available to user space. Unlike /proc, which is only meant for information +about a process, or sysfs, which has strict one-value-per-file rules, +debugfs has no rules at all. Developers can put any information they want +there. The debugfs filesystem is also intended to not serve as a stable +ABI to user space; in theory, there are no stability constraints placed on +files exported there. The real world is not always so simple, though [1]_; +even debugfs interfaces are best designed with the idea that they will need +to be maintained forever. + +Debugfs is typically mounted with a command like:: + + mount -t debugfs none /sys/kernel/debug + +(Or an equivalent /etc/fstab line). +The debugfs root directory is accessible only to the root user by +default. To change access to the tree the "uid", "gid" and "mode" mount +options can be used. + +Note that the debugfs API is exported GPL-only to modules. + +Code using debugfs should include . Then, the first order +of business will be to create at least one directory to hold a set of +debugfs files:: + + struct dentry *debugfs_create_dir(const char *name, struct dentry *parent); + +This call, if successful, will make a directory called name underneath the +indicated parent directory. If parent is NULL, the directory will be +created in the debugfs root. On success, the return value is a struct +dentry pointer which can be used to create files in the directory (and to +clean it up at the end). An ERR_PTR(-ERROR) return value indicates that +something went wrong. If ERR_PTR(-ENODEV) is returned, that is an +indication that the kernel has been built without debugfs support and none +of the functions described below will work. + +The most general way to create a file within a debugfs directory is with:: + + struct dentry *debugfs_create_file(const char *name, umode_t mode, + struct dentry *parent, void *data, + const struct file_operations *fops); + +Here, name is the name of the file to create, mode describes the access +permissions the file should have, parent indicates the directory which +should hold the file, data will be stored in the i_private field of the +resulting inode structure, and fops is a set of file operations which +implement the file's behavior. At a minimum, the read() and/or write() +operations should be provided; others can be included as needed. Again, +the return value will be a dentry pointer to the created file, +ERR_PTR(-ERROR) on error, or ERR_PTR(-ENODEV) if debugfs support is +missing. + +Create a file with an initial size, the following function can be used +instead:: + + struct dentry *debugfs_create_file_size(const char *name, umode_t mode, + struct dentry *parent, void *data, + const struct file_operations *fops, + loff_t file_size); + +file_size is the initial file size. The other parameters are the same +as the function debugfs_create_file. + +In a number of cases, the creation of a set of file operations is not +actually necessary; the debugfs code provides a number of helper functions +for simple situations. Files containing a single integer value can be +created with any of:: + + void debugfs_create_u8(const char *name, umode_t mode, + struct dentry *parent, u8 *value); + void debugfs_create_u16(const char *name, umode_t mode, + struct dentry *parent, u16 *value); + struct dentry *debugfs_create_u32(const char *name, umode_t mode, + struct dentry *parent, u32 *value); + void debugfs_create_u64(const char *name, umode_t mode, + struct dentry *parent, u64 *value); + +These files support both reading and writing the given value; if a specific +file should not be written to, simply set the mode bits accordingly. The +values in these files are in decimal; if hexadecimal is more appropriate, +the following functions can be used instead:: + + void debugfs_create_x8(const char *name, umode_t mode, + struct dentry *parent, u8 *value); + void debugfs_create_x16(const char *name, umode_t mode, + struct dentry *parent, u16 *value); + void debugfs_create_x32(const char *name, umode_t mode, + struct dentry *parent, u32 *value); + void debugfs_create_x64(const char *name, umode_t mode, + struct dentry *parent, u64 *value); + +These functions are useful as long as the developer knows the size of the +value to be exported. Some types can have different widths on different +architectures, though, complicating the situation somewhat. There are +functions meant to help out in such special cases:: + + void debugfs_create_size_t(const char *name, umode_t mode, + struct dentry *parent, size_t *value); + +As might be expected, this function will create a debugfs file to represent +a variable of type size_t. + +Similarly, there are helpers for variables of type unsigned long, in decimal +and hexadecimal:: + + struct dentry *debugfs_create_ulong(const char *name, umode_t mode, + struct dentry *parent, + unsigned long *value); + void debugfs_create_xul(const char *name, umode_t mode, + struct dentry *parent, unsigned long *value); + +Boolean values can be placed in debugfs with:: + + struct dentry *debugfs_create_bool(const char *name, umode_t mode, + struct dentry *parent, bool *value); + +A read on the resulting file will yield either Y (for non-zero values) or +N, followed by a newline. If written to, it will accept either upper- or +lower-case values, or 1 or 0. Any other input will be silently ignored. + +Also, atomic_t values can be placed in debugfs with:: + + void debugfs_create_atomic_t(const char *name, umode_t mode, + struct dentry *parent, atomic_t *value) + +A read of this file will get atomic_t values, and a write of this file +will set atomic_t values. + +Another option is exporting a block of arbitrary binary data, with +this structure and function:: + + struct debugfs_blob_wrapper { + void *data; + unsigned long size; + }; + + struct dentry *debugfs_create_blob(const char *name, umode_t mode, + struct dentry *parent, + struct debugfs_blob_wrapper *blob); + +A read of this file will return the data pointed to by the +debugfs_blob_wrapper structure. Some drivers use "blobs" as a simple way +to return several lines of (static) formatted text output. This function +can be used to export binary information, but there does not appear to be +any code which does so in the mainline. Note that all files created with +debugfs_create_blob() are read-only. + +If you want to dump a block of registers (something that happens quite +often during development, even if little such code reaches mainline. +Debugfs offers two functions: one to make a registers-only file, and +another to insert a register block in the middle of another sequential +file:: + + struct debugfs_reg32 { + char *name; + unsigned long offset; + }; + + struct debugfs_regset32 { + struct debugfs_reg32 *regs; + int nregs; + void __iomem *base; + }; + + struct dentry *debugfs_create_regset32(const char *name, umode_t mode, + struct dentry *parent, + struct debugfs_regset32 *regset); + + void debugfs_print_regs32(struct seq_file *s, struct debugfs_reg32 *regs, + int nregs, void __iomem *base, char *prefix); + +The "base" argument may be 0, but you may want to build the reg32 array +using __stringify, and a number of register names (macros) are actually +byte offsets over a base for the register block. + +If you want to dump an u32 array in debugfs, you can create file with:: + + void debugfs_create_u32_array(const char *name, umode_t mode, + struct dentry *parent, + u32 *array, u32 elements); + +The "array" argument provides data, and the "elements" argument is +the number of elements in the array. Note: Once array is created its +size can not be changed. + +There is a helper function to create device related seq_file:: + + struct dentry *debugfs_create_devm_seqfile(struct device *dev, + const char *name, + struct dentry *parent, + int (*read_fn)(struct seq_file *s, + void *data)); + +The "dev" argument is the device related to this debugfs file, and +the "read_fn" is a function pointer which to be called to print the +seq_file content. + +There are a couple of other directory-oriented helper functions:: + + struct dentry *debugfs_rename(struct dentry *old_dir, + struct dentry *old_dentry, + struct dentry *new_dir, + const char *new_name); + + struct dentry *debugfs_create_symlink(const char *name, + struct dentry *parent, + const char *target); + +A call to debugfs_rename() will give a new name to an existing debugfs +file, possibly in a different directory. The new_name must not exist prior +to the call; the return value is old_dentry with updated information. +Symbolic links can be created with debugfs_create_symlink(). + +There is one important thing that all debugfs users must take into account: +there is no automatic cleanup of any directories created in debugfs. If a +module is unloaded without explicitly removing debugfs entries, the result +will be a lot of stale pointers and no end of highly antisocial behavior. +So all debugfs users - at least those which can be built as modules - must +be prepared to remove all files and directories they create there. A file +can be removed with:: + + void debugfs_remove(struct dentry *dentry); + +The dentry value can be NULL or an error value, in which case nothing will +be removed. + +Once upon a time, debugfs users were required to remember the dentry +pointer for every debugfs file they created so that all files could be +cleaned up. We live in more civilized times now, though, and debugfs users +can call:: + + void debugfs_remove_recursive(struct dentry *dentry); + +If this function is passed a pointer for the dentry corresponding to the +top-level directory, the entire hierarchy below that directory will be +removed. + +.. [1] http://lwn.net/Articles/309298/ diff --git a/Documentation/filesystems/debugfs.txt b/Documentation/filesystems/debugfs.txt deleted file mode 100644 index dc497b96fa4f..000000000000 --- a/Documentation/filesystems/debugfs.txt +++ /dev/null @@ -1,241 +0,0 @@ -Copyright 2009 Jonathan Corbet - -Debugfs exists as a simple way for kernel developers to make information -available to user space. Unlike /proc, which is only meant for information -about a process, or sysfs, which has strict one-value-per-file rules, -debugfs has no rules at all. Developers can put any information they want -there. The debugfs filesystem is also intended to not serve as a stable -ABI to user space; in theory, there are no stability constraints placed on -files exported there. The real world is not always so simple, though [1]; -even debugfs interfaces are best designed with the idea that they will need -to be maintained forever. - -Debugfs is typically mounted with a command like: - - mount -t debugfs none /sys/kernel/debug - -(Or an equivalent /etc/fstab line). -The debugfs root directory is accessible only to the root user by -default. To change access to the tree the "uid", "gid" and "mode" mount -options can be used. - -Note that the debugfs API is exported GPL-only to modules. - -Code using debugfs should include . Then, the first order -of business will be to create at least one directory to hold a set of -debugfs files: - - struct dentry *debugfs_create_dir(const char *name, struct dentry *parent); - -This call, if successful, will make a directory called name underneath the -indicated parent directory. If parent is NULL, the directory will be -created in the debugfs root. On success, the return value is a struct -dentry pointer which can be used to create files in the directory (and to -clean it up at the end). An ERR_PTR(-ERROR) return value indicates that -something went wrong. If ERR_PTR(-ENODEV) is returned, that is an -indication that the kernel has been built without debugfs support and none -of the functions described below will work. - -The most general way to create a file within a debugfs directory is with: - - struct dentry *debugfs_create_file(const char *name, umode_t mode, - struct dentry *parent, void *data, - const struct file_operations *fops); - -Here, name is the name of the file to create, mode describes the access -permissions the file should have, parent indicates the directory which -should hold the file, data will be stored in the i_private field of the -resulting inode structure, and fops is a set of file operations which -implement the file's behavior. At a minimum, the read() and/or write() -operations should be provided; others can be included as needed. Again, -the return value will be a dentry pointer to the created file, -ERR_PTR(-ERROR) on error, or ERR_PTR(-ENODEV) if debugfs support is -missing. - -Create a file with an initial size, the following function can be used -instead: - - struct dentry *debugfs_create_file_size(const char *name, umode_t mode, - struct dentry *parent, void *data, - const struct file_operations *fops, - loff_t file_size); - -file_size is the initial file size. The other parameters are the same -as the function debugfs_create_file. - -In a number of cases, the creation of a set of file operations is not -actually necessary; the debugfs code provides a number of helper functions -for simple situations. Files containing a single integer value can be -created with any of: - - void debugfs_create_u8(const char *name, umode_t mode, - struct dentry *parent, u8 *value); - void debugfs_create_u16(const char *name, umode_t mode, - struct dentry *parent, u16 *value); - struct dentry *debugfs_create_u32(const char *name, umode_t mode, - struct dentry *parent, u32 *value); - void debugfs_create_u64(const char *name, umode_t mode, - struct dentry *parent, u64 *value); - -These files support both reading and writing the given value; if a specific -file should not be written to, simply set the mode bits accordingly. The -values in these files are in decimal; if hexadecimal is more appropriate, -the following functions can be used instead: - - void debugfs_create_x8(const char *name, umode_t mode, - struct dentry *parent, u8 *value); - void debugfs_create_x16(const char *name, umode_t mode, - struct dentry *parent, u16 *value); - void debugfs_create_x32(const char *name, umode_t mode, - struct dentry *parent, u32 *value); - void debugfs_create_x64(const char *name, umode_t mode, - struct dentry *parent, u64 *value); - -These functions are useful as long as the developer knows the size of the -value to be exported. Some types can have different widths on different -architectures, though, complicating the situation somewhat. There are -functions meant to help out in such special cases: - - void debugfs_create_size_t(const char *name, umode_t mode, - struct dentry *parent, size_t *value); - -As might be expected, this function will create a debugfs file to represent -a variable of type size_t. - -Similarly, there are helpers for variables of type unsigned long, in decimal -and hexadecimal: - - struct dentry *debugfs_create_ulong(const char *name, umode_t mode, - struct dentry *parent, - unsigned long *value); - void debugfs_create_xul(const char *name, umode_t mode, - struct dentry *parent, unsigned long *value); - -Boolean values can be placed in debugfs with: - - struct dentry *debugfs_create_bool(const char *name, umode_t mode, - struct dentry *parent, bool *value); - -A read on the resulting file will yield either Y (for non-zero values) or -N, followed by a newline. If written to, it will accept either upper- or -lower-case values, or 1 or 0. Any other input will be silently ignored. - -Also, atomic_t values can be placed in debugfs with: - - void debugfs_create_atomic_t(const char *name, umode_t mode, - struct dentry *parent, atomic_t *value) - -A read of this file will get atomic_t values, and a write of this file -will set atomic_t values. - -Another option is exporting a block of arbitrary binary data, with -this structure and function: - - struct debugfs_blob_wrapper { - void *data; - unsigned long size; - }; - - struct dentry *debugfs_create_blob(const char *name, umode_t mode, - struct dentry *parent, - struct debugfs_blob_wrapper *blob); - -A read of this file will return the data pointed to by the -debugfs_blob_wrapper structure. Some drivers use "blobs" as a simple way -to return several lines of (static) formatted text output. This function -can be used to export binary information, but there does not appear to be -any code which does so in the mainline. Note that all files created with -debugfs_create_blob() are read-only. - -If you want to dump a block of registers (something that happens quite -often during development, even if little such code reaches mainline. -Debugfs offers two functions: one to make a registers-only file, and -another to insert a register block in the middle of another sequential -file. - - struct debugfs_reg32 { - char *name; - unsigned long offset; - }; - - struct debugfs_regset32 { - struct debugfs_reg32 *regs; - int nregs; - void __iomem *base; - }; - - struct dentry *debugfs_create_regset32(const char *name, umode_t mode, - struct dentry *parent, - struct debugfs_regset32 *regset); - - void debugfs_print_regs32(struct seq_file *s, struct debugfs_reg32 *regs, - int nregs, void __iomem *base, char *prefix); - -The "base" argument may be 0, but you may want to build the reg32 array -using __stringify, and a number of register names (macros) are actually -byte offsets over a base for the register block. - -If you want to dump an u32 array in debugfs, you can create file with: - - void debugfs_create_u32_array(const char *name, umode_t mode, - struct dentry *parent, - u32 *array, u32 elements); - -The "array" argument provides data, and the "elements" argument is -the number of elements in the array. Note: Once array is created its -size can not be changed. - -There is a helper function to create device related seq_file: - - struct dentry *debugfs_create_devm_seqfile(struct device *dev, - const char *name, - struct dentry *parent, - int (*read_fn)(struct seq_file *s, - void *data)); - -The "dev" argument is the device related to this debugfs file, and -the "read_fn" is a function pointer which to be called to print the -seq_file content. - -There are a couple of other directory-oriented helper functions: - - struct dentry *debugfs_rename(struct dentry *old_dir, - struct dentry *old_dentry, - struct dentry *new_dir, - const char *new_name); - - struct dentry *debugfs_create_symlink(const char *name, - struct dentry *parent, - const char *target); - -A call to debugfs_rename() will give a new name to an existing debugfs -file, possibly in a different directory. The new_name must not exist prior -to the call; the return value is old_dentry with updated information. -Symbolic links can be created with debugfs_create_symlink(). - -There is one important thing that all debugfs users must take into account: -there is no automatic cleanup of any directories created in debugfs. If a -module is unloaded without explicitly removing debugfs entries, the result -will be a lot of stale pointers and no end of highly antisocial behavior. -So all debugfs users - at least those which can be built as modules - must -be prepared to remove all files and directories they create there. A file -can be removed with: - - void debugfs_remove(struct dentry *dentry); - -The dentry value can be NULL or an error value, in which case nothing will -be removed. - -Once upon a time, debugfs users were required to remember the dentry -pointer for every debugfs file they created so that all files could be -cleaned up. We live in more civilized times now, though, and debugfs users -can call: - - void debugfs_remove_recursive(struct dentry *dentry); - -If this function is passed a pointer for the dentry corresponding to the -top-level directory, the entire hierarchy below that directory will be -removed. - -Notes: - [1] http://lwn.net/Articles/309298/ diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 8fe848ea04af..ab3b656bbe60 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -57,6 +57,7 @@ Documentation for filesystem implementations. btrfs ceph cramfs + debugfs fuse overlayfs virtiofs -- cgit From 14a19fa5cf759ea18bc7d692cd8fe326af3c4d0a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:58 +0100 Subject: docs: filesystems: convert dlmfs.txt to ReST - Add a SPDX header; - Use copyright symbol; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/efc9e59925723e17d1a4741b11049616c221463e.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/dlmfs.rst | 140 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/dlmfs.txt | 130 --------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 141 insertions(+), 130 deletions(-) create mode 100644 Documentation/filesystems/dlmfs.rst delete mode 100644 Documentation/filesystems/dlmfs.txt diff --git a/Documentation/filesystems/dlmfs.rst b/Documentation/filesystems/dlmfs.rst new file mode 100644 index 000000000000..68daaa7facf9 --- /dev/null +++ b/Documentation/filesystems/dlmfs.rst @@ -0,0 +1,140 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +===== +DLMFS +===== + +A minimal DLM userspace interface implemented via a virtual file +system. + +dlmfs is built with OCFS2 as it requires most of its infrastructure. + +:Project web page: http://ocfs2.wiki.kernel.org +:Tools web page: https://github.com/markfasheh/ocfs2-tools +:OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ + +All code copyright 2005 Oracle except when otherwise noted. + +Credits +======= + +Some code taken from ramfs which is Copyright |copy| 2000 Linus Torvalds +and Transmeta Corp. + +Mark Fasheh + +Caveats +======= +- Right now it only works with the OCFS2 DLM, though support for other + DLM implementations should not be a major issue. + +Mount options +============= +None + +Usage +===== + +If you're just interested in OCFS2, then please see ocfs2.txt. The +rest of this document will be geared towards those who want to use +dlmfs for easy to setup and easy to use clustered locking in +userspace. + +Setup +===== + +dlmfs requires that the OCFS2 cluster infrastructure be in +place. Please download ocfs2-tools from the above url and configure a +cluster. + +You'll want to start heartbeating on a volume which all the nodes in +your lockspace can access. The easiest way to do this is via +ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires +that an OCFS2 file system be in place so that it can automatically +find its heartbeat area, though it will eventually support heartbeat +against raw disks. + +Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed +with ocfs2-tools. + +Once you're heartbeating, DLM lock 'domains' can be easily created / +destroyed and locks within them accessed. + +Locking +======= + +Users may access dlmfs via standard file system calls, or they can use +'libo2dlm' (distributed with ocfs2-tools) which abstracts the file +system calls and presents a more traditional locking api. + +dlmfs handles lock caching automatically for the user, so a lock +request for an already acquired lock will not generate another DLM +call. Userspace programs are assumed to handle their own local +locking. + +Two levels of locks are supported - Shared Read, and Exclusive. +Also supported is a Trylock operation. + +For information on the libo2dlm interface, please see o2dlm.h, +distributed with ocfs2-tools. + +Lock value blocks can be read and written to a resource via read(2) +and write(2) against the fd obtained via your open(2) call. The +maximum currently supported LVB length is 64 bytes (though that is an +OCFS2 DLM limitation). Through this mechanism, users of dlmfs can share +small amounts of data amongst their nodes. + +mkdir(2) signals dlmfs to join a domain (which will have the same name +as the resulting directory) + +rmdir(2) signals dlmfs to leave the domain + +Locks for a given domain are represented by regular inodes inside the +domain directory. Locking against them is done via the open(2) system +call. + +The open(2) call will not return until your lock has been granted or +an error has occurred, unless it has been instructed to do a trylock +operation. If the lock succeeds, you'll get an fd. + +open(2) with O_CREAT to ensure the resource inode is created - dlmfs does +not automatically create inodes for existing lock resources. + +============ =========================== +Open Flag Lock Request Type +============ =========================== +O_RDONLY Shared Read +O_RDWR Exclusive +============ =========================== + + +============ =========================== +Open Flag Resulting Locking Behavior +============ =========================== +O_NONBLOCK Trylock operation +============ =========================== + +You must provide exactly one of O_RDONLY or O_RDWR. + +If O_NONBLOCK is also provided and the trylock operation was valid but +could not lock the resource then open(2) will return ETXTBUSY. + +close(2) drops the lock associated with your fd. + +Modes passed to mkdir(2) or open(2) are adhered to locally. Chown is +supported locally as well. This means you can use them to restrict +access to the resources via dlmfs on your local node only. + +The resource LVB may be read from the fd in either Shared Read or +Exclusive modes via the read(2) system call. It can be written via +write(2) only when open in Exclusive mode. + +Once written, an LVB will be visible to other nodes who obtain Read +Only or higher level locks on the resource. + +See Also +======== +http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf + +For more information on the VMS distributed locking API. diff --git a/Documentation/filesystems/dlmfs.txt b/Documentation/filesystems/dlmfs.txt deleted file mode 100644 index fcf4d509d118..000000000000 --- a/Documentation/filesystems/dlmfs.txt +++ /dev/null @@ -1,130 +0,0 @@ -dlmfs -================== -A minimal DLM userspace interface implemented via a virtual file -system. - -dlmfs is built with OCFS2 as it requires most of its infrastructure. - -Project web page: http://ocfs2.wiki.kernel.org -Tools web page: https://github.com/markfasheh/ocfs2-tools -OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ - -All code copyright 2005 Oracle except when otherwise noted. - -CREDITS -======= - -Some code taken from ramfs which is Copyright (C) 2000 Linus Torvalds -and Transmeta Corp. - -Mark Fasheh - -Caveats -======= -- Right now it only works with the OCFS2 DLM, though support for other - DLM implementations should not be a major issue. - -Mount options -============= -None - -Usage -===== - -If you're just interested in OCFS2, then please see ocfs2.txt. The -rest of this document will be geared towards those who want to use -dlmfs for easy to setup and easy to use clustered locking in -userspace. - -Setup -===== - -dlmfs requires that the OCFS2 cluster infrastructure be in -place. Please download ocfs2-tools from the above url and configure a -cluster. - -You'll want to start heartbeating on a volume which all the nodes in -your lockspace can access. The easiest way to do this is via -ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires -that an OCFS2 file system be in place so that it can automatically -find its heartbeat area, though it will eventually support heartbeat -against raw disks. - -Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed -with ocfs2-tools. - -Once you're heartbeating, DLM lock 'domains' can be easily created / -destroyed and locks within them accessed. - -Locking -======= - -Users may access dlmfs via standard file system calls, or they can use -'libo2dlm' (distributed with ocfs2-tools) which abstracts the file -system calls and presents a more traditional locking api. - -dlmfs handles lock caching automatically for the user, so a lock -request for an already acquired lock will not generate another DLM -call. Userspace programs are assumed to handle their own local -locking. - -Two levels of locks are supported - Shared Read, and Exclusive. -Also supported is a Trylock operation. - -For information on the libo2dlm interface, please see o2dlm.h, -distributed with ocfs2-tools. - -Lock value blocks can be read and written to a resource via read(2) -and write(2) against the fd obtained via your open(2) call. The -maximum currently supported LVB length is 64 bytes (though that is an -OCFS2 DLM limitation). Through this mechanism, users of dlmfs can share -small amounts of data amongst their nodes. - -mkdir(2) signals dlmfs to join a domain (which will have the same name -as the resulting directory) - -rmdir(2) signals dlmfs to leave the domain - -Locks for a given domain are represented by regular inodes inside the -domain directory. Locking against them is done via the open(2) system -call. - -The open(2) call will not return until your lock has been granted or -an error has occurred, unless it has been instructed to do a trylock -operation. If the lock succeeds, you'll get an fd. - -open(2) with O_CREAT to ensure the resource inode is created - dlmfs does -not automatically create inodes for existing lock resources. - -Open Flag Lock Request Type ---------- ----------------- -O_RDONLY Shared Read -O_RDWR Exclusive - -Open Flag Resulting Locking Behavior ---------- -------------------------- -O_NONBLOCK Trylock operation - -You must provide exactly one of O_RDONLY or O_RDWR. - -If O_NONBLOCK is also provided and the trylock operation was valid but -could not lock the resource then open(2) will return ETXTBUSY. - -close(2) drops the lock associated with your fd. - -Modes passed to mkdir(2) or open(2) are adhered to locally. Chown is -supported locally as well. This means you can use them to restrict -access to the resources via dlmfs on your local node only. - -The resource LVB may be read from the fd in either Shared Read or -Exclusive modes via the read(2) system call. It can be written via -write(2) only when open in Exclusive mode. - -Once written, an LVB will be visible to other nodes who obtain Read -Only or higher level locks on the resource. - -See Also -======== -http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf - -For more information on the VMS distributed locking API. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index ab3b656bbe60..c6885c7ef781 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -58,6 +58,7 @@ Documentation for filesystem implementations. ceph cramfs debugfs + dlmfs fuse overlayfs virtiofs -- cgit From b02a17cb8ae23479c9bf306e96d2dd71422de63f Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:59 +0100 Subject: docs: filesystems: convert ecryptfs.txt to ReST - Add a SPDX header; - Add a document title; - use :field: markup; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Tyler Hicks Link: https://lore.kernel.org/r/6e13841ebd00c8d988027115c75c58821bb41a0c.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/ecryptfs.rst | 87 ++++++++++++++++++++++++++++++++++ Documentation/filesystems/ecryptfs.txt | 77 ------------------------------ Documentation/filesystems/index.rst | 1 + 3 files changed, 88 insertions(+), 77 deletions(-) create mode 100644 Documentation/filesystems/ecryptfs.rst delete mode 100644 Documentation/filesystems/ecryptfs.txt diff --git a/Documentation/filesystems/ecryptfs.rst b/Documentation/filesystems/ecryptfs.rst new file mode 100644 index 000000000000..7236172300ef --- /dev/null +++ b/Documentation/filesystems/ecryptfs.rst @@ -0,0 +1,87 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================================================== +eCryptfs: A stacked cryptographic filesystem for Linux +====================================================== + +eCryptfs is free software. Please see the file COPYING for details. +For documentation, please see the files in the doc/ subdirectory. For +building and installation instructions please see the INSTALL file. + +:Maintainer: Phillip Hellewell +:Lead developer: Michael A. Halcrow +:Developers: Michael C. Thompson + Kent Yoder +:Web Site: http://ecryptfs.sf.net + +This software is currently undergoing development. Make sure to +maintain a backup copy of any data you write into eCryptfs. + +eCryptfs requires the userspace tools downloadable from the +SourceForge site: + +http://sourceforge.net/projects/ecryptfs/ + +Userspace requirements include: + +- David Howells' userspace keyring headers and libraries (version + 1.0 or higher), obtainable from + http://people.redhat.com/~dhowells/keyutils/ +- Libgcrypt + + +Notes +===== + +In the beta/experimental releases of eCryptfs, when you upgrade +eCryptfs, you should copy the files to an unencrypted location and +then copy the files back into the new eCryptfs mount to migrate the +files. + + +Mount-wide Passphrase +===================== + +Create a new directory into which eCryptfs will write its encrypted +files (i.e., /root/crypt). Then, create the mount point directory +(i.e., /mnt/crypt). Now it's time to mount eCryptfs:: + + mount -t ecryptfs /root/crypt /mnt/crypt + +You should be prompted for a passphrase and a salt (the salt may be +blank). + +Try writing a new file:: + + echo "Hello, World" > /mnt/crypt/hello.txt + +The operation will complete. Notice that there is a new file in +/root/crypt that is at least 12288 bytes in size (depending on your +host page size). This is the encrypted underlying file for what you +just wrote. To test reading, from start to finish, you need to clear +the user session keyring: + +keyctl clear @u + +Then umount /mnt/crypt and mount again per the instructions given +above. + +:: + + cat /mnt/crypt/hello.txt + + +Notes +===== + +eCryptfs version 0.1 should only be mounted on (1) empty directories +or (2) directories containing files only created by eCryptfs. If you +mount a directory that has pre-existing files not created by eCryptfs, +then behavior is undefined. Do not run eCryptfs in higher verbosity +levels unless you are doing so for the sole purpose of debugging or +development, since secret values will be written out to the system log +in that case. + + +Mike Halcrow +mhalcrow@us.ibm.com diff --git a/Documentation/filesystems/ecryptfs.txt b/Documentation/filesystems/ecryptfs.txt deleted file mode 100644 index 01d8a08351ac..000000000000 --- a/Documentation/filesystems/ecryptfs.txt +++ /dev/null @@ -1,77 +0,0 @@ -eCryptfs: A stacked cryptographic filesystem for Linux - -eCryptfs is free software. Please see the file COPYING for details. -For documentation, please see the files in the doc/ subdirectory. For -building and installation instructions please see the INSTALL file. - -Maintainer: Phillip Hellewell -Lead developer: Michael A. Halcrow -Developers: Michael C. Thompson - Kent Yoder -Web Site: http://ecryptfs.sf.net - -This software is currently undergoing development. Make sure to -maintain a backup copy of any data you write into eCryptfs. - -eCryptfs requires the userspace tools downloadable from the -SourceForge site: - -http://sourceforge.net/projects/ecryptfs/ - -Userspace requirements include: - - David Howells' userspace keyring headers and libraries (version - 1.0 or higher), obtainable from - http://people.redhat.com/~dhowells/keyutils/ - - Libgcrypt - - -NOTES - -In the beta/experimental releases of eCryptfs, when you upgrade -eCryptfs, you should copy the files to an unencrypted location and -then copy the files back into the new eCryptfs mount to migrate the -files. - - -MOUNT-WIDE PASSPHRASE - -Create a new directory into which eCryptfs will write its encrypted -files (i.e., /root/crypt). Then, create the mount point directory -(i.e., /mnt/crypt). Now it's time to mount eCryptfs: - -mount -t ecryptfs /root/crypt /mnt/crypt - -You should be prompted for a passphrase and a salt (the salt may be -blank). - -Try writing a new file: - -echo "Hello, World" > /mnt/crypt/hello.txt - -The operation will complete. Notice that there is a new file in -/root/crypt that is at least 12288 bytes in size (depending on your -host page size). This is the encrypted underlying file for what you -just wrote. To test reading, from start to finish, you need to clear -the user session keyring: - -keyctl clear @u - -Then umount /mnt/crypt and mount again per the instructions given -above. - -cat /mnt/crypt/hello.txt - - -NOTES - -eCryptfs version 0.1 should only be mounted on (1) empty directories -or (2) directories containing files only created by eCryptfs. If you -mount a directory that has pre-existing files not created by eCryptfs, -then behavior is undefined. Do not run eCryptfs in higher verbosity -levels unless you are doing so for the sole purpose of debugging or -development, since secret values will be written out to the system log -in that case. - - -Mike Halcrow -mhalcrow@us.ibm.com diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index c6885c7ef781..d6d69f1c9287 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -59,6 +59,7 @@ Documentation for filesystem implementations. cramfs debugfs dlmfs + ecryptfs fuse overlayfs virtiofs -- cgit From 06dedb45b79c6550b878244879f33b6e614126bd Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:00 +0100 Subject: docs: filesystems: convert efivarfs.txt to ReST Trivial changes: - Add a SPDX header; - Adjust document title; - Mark a literal block as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/215691d747055c4ccb038ec7d78d8d1fe87fe2c0.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/efivarfs.rst | 26 ++++++++++++++++++++++++++ Documentation/filesystems/efivarfs.txt | 23 ----------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 27 insertions(+), 23 deletions(-) create mode 100644 Documentation/filesystems/efivarfs.rst delete mode 100644 Documentation/filesystems/efivarfs.txt diff --git a/Documentation/filesystems/efivarfs.rst b/Documentation/filesystems/efivarfs.rst new file mode 100644 index 000000000000..90ac65683e7e --- /dev/null +++ b/Documentation/filesystems/efivarfs.rst @@ -0,0 +1,26 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================= +efivarfs - a (U)EFI variable filesystem +======================================= + +The efivarfs filesystem was created to address the shortcomings of +using entries in sysfs to maintain EFI variables. The old sysfs EFI +variables code only supported variables of up to 1024 bytes. This +limitation existed in version 0.99 of the EFI specification, but was +removed before any full releases. Since variables can now be larger +than a single page, sysfs isn't the best interface for this. + +Variables can be created, deleted and modified with the efivarfs +filesystem. + +efivarfs is typically mounted like this:: + + mount -t efivarfs none /sys/firmware/efi/efivars + +Due to the presence of numerous firmware bugs where removing non-standard +UEFI variables causes the system firmware to fail to POST, efivarfs +files that are not well-known standardized variables are created +as immutable files. This doesn't prevent removal - "chattr -i" will work - +but it does prevent this kind of failure from being accomplished +accidentally. diff --git a/Documentation/filesystems/efivarfs.txt b/Documentation/filesystems/efivarfs.txt deleted file mode 100644 index 686a64bba775..000000000000 --- a/Documentation/filesystems/efivarfs.txt +++ /dev/null @@ -1,23 +0,0 @@ - -efivarfs - a (U)EFI variable filesystem - -The efivarfs filesystem was created to address the shortcomings of -using entries in sysfs to maintain EFI variables. The old sysfs EFI -variables code only supported variables of up to 1024 bytes. This -limitation existed in version 0.99 of the EFI specification, but was -removed before any full releases. Since variables can now be larger -than a single page, sysfs isn't the best interface for this. - -Variables can be created, deleted and modified with the efivarfs -filesystem. - -efivarfs is typically mounted like this, - - mount -t efivarfs none /sys/firmware/efi/efivars - -Due to the presence of numerous firmware bugs where removing non-standard -UEFI variables causes the system firmware to fail to POST, efivarfs -files that are not well-known standardized variables are created -as immutable files. This doesn't prevent removal - "chattr -i" will work - -but it does prevent this kind of failure from being accomplished -accidentally. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index d6d69f1c9287..4230f49d2732 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -60,6 +60,7 @@ Documentation for filesystem implementations. debugfs dlmfs ecryptfs + efivarfs fuse overlayfs virtiofs -- cgit From e66d8631ddb3306bd9f463324c2d9a5d9dc559f7 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:01 +0100 Subject: docs: filesystems: convert erofs.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add lists markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/402d1d2f7252b8a683f7a9c6867bc5428da64026.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/erofs.rst | 240 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/erofs.txt | 211 ------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 241 insertions(+), 211 deletions(-) create mode 100644 Documentation/filesystems/erofs.rst delete mode 100644 Documentation/filesystems/erofs.txt diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst new file mode 100644 index 000000000000..bf145171c2bf --- /dev/null +++ b/Documentation/filesystems/erofs.rst @@ -0,0 +1,240 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================================== +Enhanced Read-Only File System - EROFS +====================================== + +Overview +======== + +EROFS file-system stands for Enhanced Read-Only File System. Different +from other read-only file systems, it aims to be designed for flexibility, +scalability, but be kept simple and high performance. + +It is designed as a better filesystem solution for the following scenarios: + + - read-only storage media or + + - part of a fully trusted read-only solution, which means it needs to be + immutable and bit-for-bit identical to the official golden image for + their releases due to security and other considerations and + + - hope to save some extra storage space with guaranteed end-to-end performance + by using reduced metadata and transparent file compression, especially + for those embedded devices with limited memory (ex, smartphone); + +Here is the main features of EROFS: + + - Little endian on-disk design; + + - Currently 4KB block size (nobh) and therefore maximum 16TB address space; + + - Metadata & data could be mixed by design; + + - 2 inode versions for different requirements: + + ===================== ============ ===================================== + compact (v1) extended (v2) + ===================== ============ ===================================== + Inode metadata size 32 bytes 64 bytes + Max file size 4 GB 16 EB (also limited by max. vol size) + Max uids/gids 65536 4294967296 + File change time no yes (64 + 32-bit timestamp) + Max hardlinks 65536 4294967296 + Metadata reserved 4 bytes 14 bytes + ===================== ============ ===================================== + + - Support extended attributes (xattrs) as an option; + + - Support xattr inline and tail-end data inline for all files; + + - Support POSIX.1e ACLs by using xattrs; + + - Support transparent file compression as an option: + LZ4 algorithm with 4 KB fixed-sized output compression for high performance. + +The following git tree provides the file system user-space tools under +development (ex, formatting tool mkfs.erofs): + +- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git + +Bugs and patches are welcome, please kindly help us and send to the following +linux-erofs mailing list: + +- linux-erofs mailing list + +Mount options +============= + +=================== ========================================================= +(no)user_xattr Setup Extended User Attributes. Note: xattr is enabled + by default if CONFIG_EROFS_FS_XATTR is selected. +(no)acl Setup POSIX Access Control List. Note: acl is enabled + by default if CONFIG_EROFS_FS_POSIX_ACL is selected. +cache_strategy=%s Select a strategy for cached decompression from now on: + + ========== ============================================= + disabled In-place I/O decompression only; + readahead Cache the last incomplete compressed physical + cluster for further reading. It still does + in-place I/O decompression for the rest + compressed physical clusters; + readaround Cache the both ends of incomplete compressed + physical clusters for further reading. + It still does in-place I/O decompression + for the rest compressed physical clusters. + ========== ============================================= +=================== ========================================================= + +On-disk details +=============== + +Summary +------- +Different from other read-only file systems, an EROFS volume is designed +to be as simple as possible:: + + |-> aligned with the block size + ____________________________________________________________ + | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data | + |_|__|_|_____|__________|_____|______|__________|_____|______| + 0 +1K + +All data areas should be aligned with the block size, but metadata areas +may not. All metadatas can be now observed in two different spaces (views): + + 1. Inode metadata space + + Each valid inode should be aligned with an inode slot, which is a fixed + value (32 bytes) and designed to be kept in line with compact inode size. + + Each inode can be directly found with the following formula: + inode offset = meta_blkaddr * block_size + 32 * nid + + :: + + |-> aligned with 8B + |-> followed closely + + meta_blkaddr blocks |-> another slot + _____________________________________________________________________ + | ... | inode | xattrs | extents | data inline | ... | inode ... + |________|_______|(optional)|(optional)|__(optional)_|_____|__________ + |-> aligned with the inode slot size + . . + . . + . . + . . + . . + . . + .____________________________________________________|-> aligned with 4B + | xattr_ibody_header | shared xattrs | inline xattrs | + |____________________|_______________|_______________| + |-> 12 bytes <-|->x * 4 bytes<-| . + . . . + . . . + . . . + ._______________________________.______________________. + | id | id | id | id | ... | id | ent | ... | ent| ... | + |____|____|____|____|______|____|_____|_____|____|_____| + |-> aligned with 4B + |-> aligned with 4B + + Inode could be 32 or 64 bytes, which can be distinguished from a common + field which all inode versions have -- i_format:: + + __________________ __________________ + | i_format | | i_format | + |__________________| |__________________| + | ... | | ... | + | | | | + |__________________| 32 bytes | | + | | + |__________________| 64 bytes + + Xattrs, extents, data inline are followed by the corresponding inode with + proper alignment, and they could be optional for different data mappings. + _currently_ total 4 valid data mappings are supported: + + == ==================================================================== + 0 flat file data without data inline (no extent); + 1 fixed-sized output data compression (with non-compacted indexes); + 2 flat file data with tail packing data inline (no extent); + 3 fixed-sized output data compression (with compacted indexes, v5.3+). + == ==================================================================== + + The size of the optional xattrs is indicated by i_xattr_count in inode + header. Large xattrs or xattrs shared by many different files can be + stored in shared xattrs metadata rather than inlined right after inode. + + 2. Shared xattrs metadata space + + Shared xattrs space is similar to the above inode space, started with + a specific block indicated by xattr_blkaddr, organized one by one with + proper align. + + Each share xattr can also be directly found by the following formula: + xattr offset = xattr_blkaddr * block_size + 4 * xattr_id + + :: + + |-> aligned by 4 bytes + + xattr_blkaddr blocks |-> aligned with 4 bytes + _________________________________________________________________________ + | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ... + |________|_____________|_____________|_____|______________|_______________ + +Directories +----------- +All directories are now organized in a compact on-disk format. Note that +each directory block is divided into index and name areas in order to support +random file lookup, and all directory entries are _strictly_ recorded in +alphabetical order in order to support improved prefix binary search +algorithm (could refer to the related source code). + +:: + + ___________________________ + / | + / ______________|________________ + / / | nameoff1 | nameoffN-1 + ____________.______________._______________v________________v__________ + | dirent | dirent | ... | dirent | filename | filename | ... | filename | + |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____| + \ ^ + \ | * could have + \ | trailing '\0' + \________________________| nameoff0 + + Directory block + +Note that apart from the offset of the first filename, nameoff0 also indicates +the total number of directory entries in this block since it is no need to +introduce another on-disk field at all. + +Compression +----------- +Currently, EROFS supports 4KB fixed-sized output transparent file compression, +as illustrated below:: + + |---- Variant-Length Extent ----|-------- VLE --------|----- VLE ----- + clusterofs clusterofs clusterofs + | | | logical data + _________v_______________________________v_____________________v_______________ + ... | . | | . | | . | ... + ____|____.________|_____________|________.____|_____________|__.__________|____ + |-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-| + size size size size size + . . . . + . . . . + . . . . + _______._____________._____________._____________._____________________ + ... | | | | ... physical data + _______|_____________|_____________|_____________|_____________________ + |-> cluster <-|-> cluster <-|-> cluster <-| + size size size + +Currently each on-disk physical cluster can contain 4KB (un)compressed data +at most. For each logical cluster, there is a corresponding on-disk index to +describe its cluster type, physical cluster address, etc. + +See "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details. diff --git a/Documentation/filesystems/erofs.txt b/Documentation/filesystems/erofs.txt deleted file mode 100644 index db6d39c3ae71..000000000000 --- a/Documentation/filesystems/erofs.txt +++ /dev/null @@ -1,211 +0,0 @@ -Overview -======== - -EROFS file-system stands for Enhanced Read-Only File System. Different -from other read-only file systems, it aims to be designed for flexibility, -scalability, but be kept simple and high performance. - -It is designed as a better filesystem solution for the following scenarios: - - read-only storage media or - - - part of a fully trusted read-only solution, which means it needs to be - immutable and bit-for-bit identical to the official golden image for - their releases due to security and other considerations and - - - hope to save some extra storage space with guaranteed end-to-end performance - by using reduced metadata and transparent file compression, especially - for those embedded devices with limited memory (ex, smartphone); - -Here is the main features of EROFS: - - Little endian on-disk design; - - - Currently 4KB block size (nobh) and therefore maximum 16TB address space; - - - Metadata & data could be mixed by design; - - - 2 inode versions for different requirements: - compact (v1) extended (v2) - Inode metadata size: 32 bytes 64 bytes - Max file size: 4 GB 16 EB (also limited by max. vol size) - Max uids/gids: 65536 4294967296 - File change time: no yes (64 + 32-bit timestamp) - Max hardlinks: 65536 4294967296 - Metadata reserved: 4 bytes 14 bytes - - - Support extended attributes (xattrs) as an option; - - - Support xattr inline and tail-end data inline for all files; - - - Support POSIX.1e ACLs by using xattrs; - - - Support transparent file compression as an option: - LZ4 algorithm with 4 KB fixed-sized output compression for high performance. - -The following git tree provides the file system user-space tools under -development (ex, formatting tool mkfs.erofs): ->> git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git - -Bugs and patches are welcome, please kindly help us and send to the following -linux-erofs mailing list: ->> linux-erofs mailing list - -Mount options -============= - -(no)user_xattr Setup Extended User Attributes. Note: xattr is enabled - by default if CONFIG_EROFS_FS_XATTR is selected. -(no)acl Setup POSIX Access Control List. Note: acl is enabled - by default if CONFIG_EROFS_FS_POSIX_ACL is selected. -cache_strategy=%s Select a strategy for cached decompression from now on: - disabled: In-place I/O decompression only; - readahead: Cache the last incomplete compressed physical - cluster for further reading. It still does - in-place I/O decompression for the rest - compressed physical clusters; - readaround: Cache the both ends of incomplete compressed - physical clusters for further reading. - It still does in-place I/O decompression - for the rest compressed physical clusters. - -On-disk details -=============== - -Summary -------- -Different from other read-only file systems, an EROFS volume is designed -to be as simple as possible: - - |-> aligned with the block size - ____________________________________________________________ - | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data | - |_|__|_|_____|__________|_____|______|__________|_____|______| - 0 +1K - -All data areas should be aligned with the block size, but metadata areas -may not. All metadatas can be now observed in two different spaces (views): - 1. Inode metadata space - Each valid inode should be aligned with an inode slot, which is a fixed - value (32 bytes) and designed to be kept in line with compact inode size. - - Each inode can be directly found with the following formula: - inode offset = meta_blkaddr * block_size + 32 * nid - - |-> aligned with 8B - |-> followed closely - + meta_blkaddr blocks |-> another slot - _____________________________________________________________________ - | ... | inode | xattrs | extents | data inline | ... | inode ... - |________|_______|(optional)|(optional)|__(optional)_|_____|__________ - |-> aligned with the inode slot size - . . - . . - . . - . . - . . - . . - .____________________________________________________|-> aligned with 4B - | xattr_ibody_header | shared xattrs | inline xattrs | - |____________________|_______________|_______________| - |-> 12 bytes <-|->x * 4 bytes<-| . - . . . - . . . - . . . - ._______________________________.______________________. - | id | id | id | id | ... | id | ent | ... | ent| ... | - |____|____|____|____|______|____|_____|_____|____|_____| - |-> aligned with 4B - |-> aligned with 4B - - Inode could be 32 or 64 bytes, which can be distinguished from a common - field which all inode versions have -- i_format: - - __________________ __________________ - | i_format | | i_format | - |__________________| |__________________| - | ... | | ... | - | | | | - |__________________| 32 bytes | | - | | - |__________________| 64 bytes - - Xattrs, extents, data inline are followed by the corresponding inode with - proper alignment, and they could be optional for different data mappings. - _currently_ total 4 valid data mappings are supported: - - 0 flat file data without data inline (no extent); - 1 fixed-sized output data compression (with non-compacted indexes); - 2 flat file data with tail packing data inline (no extent); - 3 fixed-sized output data compression (with compacted indexes, v5.3+). - - The size of the optional xattrs is indicated by i_xattr_count in inode - header. Large xattrs or xattrs shared by many different files can be - stored in shared xattrs metadata rather than inlined right after inode. - - 2. Shared xattrs metadata space - Shared xattrs space is similar to the above inode space, started with - a specific block indicated by xattr_blkaddr, organized one by one with - proper align. - - Each share xattr can also be directly found by the following formula: - xattr offset = xattr_blkaddr * block_size + 4 * xattr_id - - |-> aligned by 4 bytes - + xattr_blkaddr blocks |-> aligned with 4 bytes - _________________________________________________________________________ - | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ... - |________|_____________|_____________|_____|______________|_______________ - -Directories ------------ -All directories are now organized in a compact on-disk format. Note that -each directory block is divided into index and name areas in order to support -random file lookup, and all directory entries are _strictly_ recorded in -alphabetical order in order to support improved prefix binary search -algorithm (could refer to the related source code). - - ___________________________ - / | - / ______________|________________ - / / | nameoff1 | nameoffN-1 - ____________.______________._______________v________________v__________ -| dirent | dirent | ... | dirent | filename | filename | ... | filename | -|___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____| - \ ^ - \ | * could have - \ | trailing '\0' - \________________________| nameoff0 - - Directory block - -Note that apart from the offset of the first filename, nameoff0 also indicates -the total number of directory entries in this block since it is no need to -introduce another on-disk field at all. - -Compression ------------ -Currently, EROFS supports 4KB fixed-sized output transparent file compression, -as illustrated below: - - |---- Variant-Length Extent ----|-------- VLE --------|----- VLE ----- - clusterofs clusterofs clusterofs - | | | logical data -_________v_______________________________v_____________________v_______________ -... | . | | . | | . | ... -____|____.________|_____________|________.____|_____________|__.__________|____ - |-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-| - size size size size size - . . . . - . . . . - . . . . - _______._____________._____________._____________._____________________ - ... | | | | ... physical data - _______|_____________|_____________|_____________|_____________________ - |-> cluster <-|-> cluster <-|-> cluster <-| - size size size - -Currently each on-disk physical cluster can contain 4KB (un)compressed data -at most. For each logical cluster, there is a corresponding on-disk index to -describe its cluster type, physical cluster address, etc. - -See "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details. - diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 4230f49d2732..03a493b27920 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -61,6 +61,7 @@ Documentation for filesystem implementations. dlmfs ecryptfs efivarfs + erofs fuse overlayfs virtiofs -- cgit From 6e29ad2ea34f63f2b959807370672af569861378 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:02 +0100 Subject: docs: filesystems: convert ext2.txt to ReST - Add a SPDX header; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Use footnoote markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Jan Kara Link: https://lore.kernel.org/r/fde6721f0303259d830391e351dbde48f67f3ec7.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/ext2.rst | 399 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/ext2.txt | 388 ----------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 400 insertions(+), 388 deletions(-) create mode 100644 Documentation/filesystems/ext2.rst delete mode 100644 Documentation/filesystems/ext2.txt diff --git a/Documentation/filesystems/ext2.rst b/Documentation/filesystems/ext2.rst new file mode 100644 index 000000000000..d83dbbb162e2 --- /dev/null +++ b/Documentation/filesystems/ext2.rst @@ -0,0 +1,399 @@ +.. SPDX-License-Identifier: GPL-2.0 + + +The Second Extended Filesystem +============================== + +ext2 was originally released in January 1993. Written by R\'emy Card, +Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the +Extended Filesystem. It is currently still (April 2001) the predominant +filesystem in use by Linux. There are also implementations available +for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS. + +Options +======= + +Most defaults are determined by the filesystem superblock, and can be +set using tune2fs(8). Kernel-determined defaults are indicated by (*). + +==================== === ================================================ +bsddf (*) Makes ``df`` act like BSD. +minixdf Makes ``df`` act like Minix. + +check=none, nocheck (*) Don't do extra checking of bitmaps on mount + (check=normal and check=strict options removed) + +dax Use direct access (no page cache). See + Documentation/filesystems/dax.txt. + +debug Extra debugging information is sent to the + kernel syslog. Useful for developers. + +errors=continue Keep going on a filesystem error. +errors=remount-ro Remount the filesystem read-only on an error. +errors=panic Panic and halt the machine if an error occurs. + +grpid, bsdgroups Give objects the same group ID as their parent. +nogrpid, sysvgroups New objects have the group ID of their creator. + +nouid32 Use 16-bit UIDs and GIDs. + +oldalloc Enable the old block allocator. Orlov should + have better performance, we'd like to get some + feedback if it's the contrary for you. +orlov (*) Use the Orlov block allocator. + (See http://lwn.net/Articles/14633/ and + http://lwn.net/Articles/14446/.) + +resuid=n The user ID which may use the reserved blocks. +resgid=n The group ID which may use the reserved blocks. + +sb=n Use alternate superblock at this location. + +user_xattr Enable "user." POSIX Extended Attributes + (requires CONFIG_EXT2_FS_XATTR). +nouser_xattr Don't support "user." extended attributes. + +acl Enable POSIX Access Control Lists support + (requires CONFIG_EXT2_FS_POSIX_ACL). +noacl Don't support POSIX ACLs. + +nobh Do not attach buffer_heads to file pagecache. + +quota, usrquota Enable user disk quota support + (requires CONFIG_QUOTA). + +grpquota Enable group disk quota support + (requires CONFIG_QUOTA). +==================== === ================================================ + +noquota option ls silently ignored by ext2. + + +Specification +============= + +ext2 shares many properties with traditional Unix filesystems. It has +the concepts of blocks, inodes and directories. It has space in the +specification for Access Control Lists (ACLs), fragments, undeletion and +compression though these are not yet implemented (some are available as +separate patches). There is also a versioning mechanism to allow new +features (such as journalling) to be added in a maximally compatible +manner. + +Blocks +------ + +The space in the device or file is split up into blocks. These are +a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems), +which is decided when the filesystem is created. Smaller blocks mean +less wasted space per file, but require slightly more accounting overhead, +and also impose other limits on the size of files and the filesystem. + +Block Groups +------------ + +Blocks are clustered into block groups in order to reduce fragmentation +and minimise the amount of head seeking when reading a large amount +of consecutive data. Information about each block group is kept in a +descriptor table stored in the block(s) immediately after the superblock. +Two blocks near the start of each group are reserved for the block usage +bitmap and the inode usage bitmap which show which blocks and inodes +are in use. Since each bitmap is limited to a single block, this means +that the maximum size of a block group is 8 times the size of a block. + +The block(s) following the bitmaps in each block group are designated +as the inode table for that block group and the remainder are the data +blocks. The block allocation algorithm attempts to allocate data blocks +in the same block group as the inode which contains them. + +The Superblock +-------------- + +The superblock contains all the information about the configuration of +the filing system. The primary copy of the superblock is stored at an +offset of 1024 bytes from the start of the device, and it is essential +to mounting the filesystem. Since it is so important, backup copies of +the superblock are stored in block groups throughout the filesystem. +The first version of ext2 (revision 0) stores a copy at the start of +every block group, along with backups of the group descriptor block(s). +Because this can consume a considerable amount of space for large +filesystems, later revisions can optionally reduce the number of backup +copies by only putting backups in specific groups (this is the sparse +superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7. + +The information in the superblock contains fields such as the total +number of inodes and blocks in the filesystem and how many are free, +how many inodes and blocks are in each block group, when the filesystem +was mounted (and if it was cleanly unmounted), when it was modified, +what version of the filesystem it is (see the Revisions section below) +and which OS created it. + +If the filesystem is revision 1 or higher, then there are extra fields, +such as a volume name, a unique identification number, the inode size, +and space for optional filesystem features to store configuration info. + +All fields in the superblock (as in all other ext2 structures) are stored +on the disc in little endian format, so a filesystem is portable between +machines without having to know what machine it was created on. + +Inodes +------ + +The inode (index node) is a fundamental concept in the ext2 filesystem. +Each object in the filesystem is represented by an inode. The inode +structure contains pointers to the filesystem blocks which contain the +data held in the object and all of the metadata about an object except +its name. The metadata about an object includes the permissions, owner, +group, flags, size, number of blocks used, access time, change time, +modification time, deletion time, number of links, fragments, version +(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs). + +There are some reserved fields which are currently unused in the inode +structure and several which are overloaded. One field is reserved for the +directory ACL if the inode is a directory and alternately for the top 32 +bits of the file size if the inode is a regular file (allowing file sizes +larger than 2GB). The translator field is unused under Linux, but is used +by the HURD to reference the inode of a program which will be used to +interpret this object. Most of the remaining reserved fields have been +used up for both Linux and the HURD for larger owner and group fields, +The HURD also has a larger mode field so it uses another of the remaining +fields to store the extra more bits. + +There are pointers to the first 12 blocks which contain the file's data +in the inode. There is a pointer to an indirect block (which contains +pointers to the next set of blocks), a pointer to a doubly-indirect +block (which contains pointers to indirect blocks) and a pointer to a +trebly-indirect block (which contains pointers to doubly-indirect blocks). + +The flags field contains some ext2-specific flags which aren't catered +for by the standard chmod flags. These flags can be listed with lsattr +and changed with the chattr command, and allow specific filesystem +behaviour on a per-file basis. There are flags for secure deletion, +undeletable, compression, synchronous updates, immutability, append-only, +dumpable, no-atime, indexed directories, and data-journaling. Not all +of these are supported yet. + +Directories +----------- + +A directory is a filesystem object and has an inode just like a file. +It is a specially formatted file containing records which associate +each name with an inode number. Later revisions of the filesystem also +encode the type of the object (file, directory, symlink, device, fifo, +socket) to avoid the need to check the inode itself for this information +(support for taking advantage of this feature does not yet exist in +Glibc 2.2). + +The inode allocation code tries to assign inodes which are in the same +block group as the directory in which they are first created. + +The current implementation of ext2 uses a singly-linked list to store +the filenames in the directory; a pending enhancement uses hashing of the +filenames to allow lookup without the need to scan the entire directory. + +The current implementation never removes empty directory blocks once they +have been allocated to hold more files. + +Special files +------------- + +Symbolic links are also filesystem objects with inodes. They deserve +special mention because the data for them is stored within the inode +itself if the symlink is less than 60 bytes long. It uses the fields +which would normally be used to store the pointers to data blocks. +This is a worthwhile optimisation as it we avoid allocating a full +block for the symlink, and most symlinks are less than 60 characters long. + +Character and block special devices never have data blocks assigned to +them. Instead, their device number is stored in the inode, again reusing +the fields which would be used to point to the data blocks. + +Reserved Space +-------------- + +In ext2, there is a mechanism for reserving a certain number of blocks +for a particular user (normally the super-user). This is intended to +allow for the system to continue functioning even if non-privileged users +fill up all the space available to them (this is independent of filesystem +quotas). It also keeps the filesystem from filling up entirely which +helps combat fragmentation. + +Filesystem check +---------------- + +At boot time, most systems run a consistency check (e2fsck) on their +filesystems. The superblock of the ext2 filesystem contains several +fields which indicate whether fsck should actually run (since checking +the filesystem at boot can take a long time if it is large). fsck will +run if the filesystem was not cleanly unmounted, if the maximum mount +count has been exceeded or if the maximum time between checks has been +exceeded. + +Feature Compatibility +--------------------- + +The compatibility feature mechanism used in ext2 is sophisticated. +It safely allows features to be added to the filesystem, without +unnecessarily sacrificing compatibility with older versions of the +filesystem code. The feature compatibility mechanism is not supported by +the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in +revision 1. There are three 32-bit fields, one for compatible features +(COMPAT), one for read-only compatible (RO_COMPAT) features and one for +incompatible (INCOMPAT) features. + +These feature flags have specific meanings for the kernel as follows: + +A COMPAT flag indicates that a feature is present in the filesystem, +but the on-disk format is 100% compatible with older on-disk formats, so +a kernel which didn't know anything about this feature could read/write +the filesystem without any chance of corrupting the filesystem (or even +making it inconsistent). This is essentially just a flag which says +"this filesystem has a (hidden) feature" that the kernel or e2fsck may +want to be aware of (more on e2fsck and feature flags later). The ext3 +HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply +a regular file with data blocks in it so the kernel does not need to +take any special notice of it if it doesn't understand ext3 journaling. + +An RO_COMPAT flag indicates that the on-disk format is 100% compatible +with older on-disk formats for reading (i.e. the feature does not change +the visible on-disk format). However, an old kernel writing to such a +filesystem would/could corrupt the filesystem, so this is prevented. The +most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because +sparse groups allow file data blocks where superblock/group descriptor +backups used to live, and ext2_free_blocks() refuses to free these blocks, +which would leading to inconsistent bitmaps. An old kernel would also +get an error if it tried to free a series of blocks which crossed a group +boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem. + +An INCOMPAT flag indicates the on-disk format has changed in some +way that makes it unreadable by older kernels, or would otherwise +cause a problem if an old kernel tried to mount it. FILETYPE is an +INCOMPAT flag because older kernels would think a filename was longer +than 256 characters, which would lead to corrupt directory listings. +The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel +doesn't understand compression, you would just get garbage back from +read() instead of it automatically decompressing your data. The ext3 +RECOVER flag is needed to prevent a kernel which does not understand the +ext3 journal from mounting the filesystem without replaying the journal. + +For e2fsck, it needs to be more strict with the handling of these +flags than the kernel. If it doesn't understand ANY of the COMPAT, +RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem, +because it has no way of verifying whether a given feature is valid +or not. Allowing e2fsck to succeed on a filesystem with an unknown +feature is a false sense of security for the user. Refusing to check +a filesystem with unknown features is a good incentive for the user to +update to the latest e2fsck. This also means that anyone adding feature +flags to ext2 also needs to update e2fsck to verify these features. + +Metadata +-------- + +It is frequently claimed that the ext2 implementation of writing +asynchronous metadata is faster than the ffs synchronous metadata +scheme but less reliable. Both methods are equally resolvable by their +respective fsck programs. + +If you're exceptionally paranoid, there are 3 ways of making metadata +writes synchronous on ext2: + +- per-file if you have the program source: use the O_SYNC flag to open() +- per-file if you don't have the source: use "chattr +S" on the file +- per-filesystem: add the "sync" option to mount (or in /etc/fstab) + +the first and last are not ext2 specific but do force the metadata to +be written synchronously. See also Journaling below. + +Limitations +----------- + +There are various limits imposed by the on-disk layout of ext2. Other +limits are imposed by the current implementation of the kernel code. +Many of the limits are determined at the time the filesystem is first +created, and depend upon the block size chosen. The ratio of inodes to +data blocks is fixed at filesystem creation time, so the only way to +increase the number of inodes is to increase the size of the filesystem. +No tools currently exist which can change the ratio of inodes to blocks. + +Most of these limits could be overcome with slight changes in the on-disk +format and using a compatibility flag to signal the format change (at +the expense of some compatibility). + +===================== ======= ======= ======= ======== +Filesystem block size 1kB 2kB 4kB 8kB +===================== ======= ======= ======= ======== +File size limit 16GB 256GB 2048GB 2048GB +Filesystem size limit 2047GB 8192GB 16384GB 32768GB +===================== ======= ======= ======= ======== + +There is a 2.4 kernel limit of 2048GB for a single block device, so no +filesystem larger than that can be created at this time. There is also +an upper limit on the block size imposed by the page size of the kernel, +so 8kB blocks are only allowed on Alpha systems (and other architectures +which support larger pages). + +There is an upper limit of 32000 subdirectories in a single directory. + +There is a "soft" upper limit of about 10-15k files in a single directory +with the current linear linked-list directory implementation. This limit +stems from performance problems when creating and deleting (and also +finding) files in such large directories. Using a hashed directory index +(under development) allows 100k-1M+ files in a single directory without +performance problems (although RAM size becomes an issue at this point). + +The (meaningless) absolute upper limit of files in a single directory +(imposed by the file size, the realistic limit is obviously much less) +is over 130 trillion files. It would be higher except there are not +enough 4-character names to make up unique directory entries, so they +have to be 8 character filenames, even then we are fairly close to +running out of unique filenames. + +Journaling +---------- + +A journaling extension to the ext2 code has been developed by Stephen +Tweedie. It avoids the risks of metadata corruption and the need to +wait for e2fsck to complete after a crash, without requiring a change +to the on-disk ext2 layout. In a nutshell, the journal is a regular +file which stores whole metadata (and optionally data) blocks that have +been modified, prior to writing them into the filesystem. This means +it is possible to add a journal to an existing ext2 filesystem without +the need for data conversion. + +When changes to the filesystem (e.g. a file is renamed) they are stored in +a transaction in the journal and can either be complete or incomplete at +the time of a crash. If a transaction is complete at the time of a crash +(or in the normal case where the system does not crash), then any blocks +in that transaction are guaranteed to represent a valid filesystem state, +and are copied into the filesystem. If a transaction is incomplete at +the time of the crash, then there is no guarantee of consistency for +the blocks in that transaction so they are discarded (which means any +filesystem changes they represent are also lost). +Check Documentation/filesystems/ext4/ if you want to read more about +ext4 and journaling. + +References +========== + +======================= =============================================== +The kernel source file:/usr/src/linux/fs/ext2/ +e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/ +Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html +Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/ +Filesystem Resizing http://ext2resize.sourceforge.net/ +Compression [1]_ http://e2compr.sourceforge.net/ +======================= =============================================== + +Implementations for: + +======================= =========================================================== +Windows 95/98/NT/2000 http://www.chrysocome.net/explore2fs +Windows 95 [1]_ http://www.yipton.net/content.html#FSDEXT2 +DOS client [1]_ ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ +OS/2 [2]_ ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ +RISC OS client http://www.esw-heim.tu-clausthal.de/~marco/smorbrod/IscaFS/ +======================= =========================================================== + +.. [1] no longer actively developed/supported (as of Apr 2001) +.. [2] no longer actively developed/supported (as of Mar 2009) diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt deleted file mode 100644 index 94c2cf0292f5..000000000000 --- a/Documentation/filesystems/ext2.txt +++ /dev/null @@ -1,388 +0,0 @@ - -The Second Extended Filesystem -============================== - -ext2 was originally released in January 1993. Written by R\'emy Card, -Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the -Extended Filesystem. It is currently still (April 2001) the predominant -filesystem in use by Linux. There are also implementations available -for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS. - -Options -======= - -Most defaults are determined by the filesystem superblock, and can be -set using tune2fs(8). Kernel-determined defaults are indicated by (*). - -bsddf (*) Makes `df' act like BSD. -minixdf Makes `df' act like Minix. - -check=none, nocheck (*) Don't do extra checking of bitmaps on mount - (check=normal and check=strict options removed) - -dax Use direct access (no page cache). See - Documentation/filesystems/dax.txt. - -debug Extra debugging information is sent to the - kernel syslog. Useful for developers. - -errors=continue Keep going on a filesystem error. -errors=remount-ro Remount the filesystem read-only on an error. -errors=panic Panic and halt the machine if an error occurs. - -grpid, bsdgroups Give objects the same group ID as their parent. -nogrpid, sysvgroups New objects have the group ID of their creator. - -nouid32 Use 16-bit UIDs and GIDs. - -oldalloc Enable the old block allocator. Orlov should - have better performance, we'd like to get some - feedback if it's the contrary for you. -orlov (*) Use the Orlov block allocator. - (See http://lwn.net/Articles/14633/ and - http://lwn.net/Articles/14446/.) - -resuid=n The user ID which may use the reserved blocks. -resgid=n The group ID which may use the reserved blocks. - -sb=n Use alternate superblock at this location. - -user_xattr Enable "user." POSIX Extended Attributes - (requires CONFIG_EXT2_FS_XATTR). -nouser_xattr Don't support "user." extended attributes. - -acl Enable POSIX Access Control Lists support - (requires CONFIG_EXT2_FS_POSIX_ACL). -noacl Don't support POSIX ACLs. - -nobh Do not attach buffer_heads to file pagecache. - -quota, usrquota Enable user disk quota support - (requires CONFIG_QUOTA). - -grpquota Enable group disk quota support - (requires CONFIG_QUOTA). - -noquota option ls silently ignored by ext2. - - -Specification -============= - -ext2 shares many properties with traditional Unix filesystems. It has -the concepts of blocks, inodes and directories. It has space in the -specification for Access Control Lists (ACLs), fragments, undeletion and -compression though these are not yet implemented (some are available as -separate patches). There is also a versioning mechanism to allow new -features (such as journalling) to be added in a maximally compatible -manner. - -Blocks ------- - -The space in the device or file is split up into blocks. These are -a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems), -which is decided when the filesystem is created. Smaller blocks mean -less wasted space per file, but require slightly more accounting overhead, -and also impose other limits on the size of files and the filesystem. - -Block Groups ------------- - -Blocks are clustered into block groups in order to reduce fragmentation -and minimise the amount of head seeking when reading a large amount -of consecutive data. Information about each block group is kept in a -descriptor table stored in the block(s) immediately after the superblock. -Two blocks near the start of each group are reserved for the block usage -bitmap and the inode usage bitmap which show which blocks and inodes -are in use. Since each bitmap is limited to a single block, this means -that the maximum size of a block group is 8 times the size of a block. - -The block(s) following the bitmaps in each block group are designated -as the inode table for that block group and the remainder are the data -blocks. The block allocation algorithm attempts to allocate data blocks -in the same block group as the inode which contains them. - -The Superblock --------------- - -The superblock contains all the information about the configuration of -the filing system. The primary copy of the superblock is stored at an -offset of 1024 bytes from the start of the device, and it is essential -to mounting the filesystem. Since it is so important, backup copies of -the superblock are stored in block groups throughout the filesystem. -The first version of ext2 (revision 0) stores a copy at the start of -every block group, along with backups of the group descriptor block(s). -Because this can consume a considerable amount of space for large -filesystems, later revisions can optionally reduce the number of backup -copies by only putting backups in specific groups (this is the sparse -superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7. - -The information in the superblock contains fields such as the total -number of inodes and blocks in the filesystem and how many are free, -how many inodes and blocks are in each block group, when the filesystem -was mounted (and if it was cleanly unmounted), when it was modified, -what version of the filesystem it is (see the Revisions section below) -and which OS created it. - -If the filesystem is revision 1 or higher, then there are extra fields, -such as a volume name, a unique identification number, the inode size, -and space for optional filesystem features to store configuration info. - -All fields in the superblock (as in all other ext2 structures) are stored -on the disc in little endian format, so a filesystem is portable between -machines without having to know what machine it was created on. - -Inodes ------- - -The inode (index node) is a fundamental concept in the ext2 filesystem. -Each object in the filesystem is represented by an inode. The inode -structure contains pointers to the filesystem blocks which contain the -data held in the object and all of the metadata about an object except -its name. The metadata about an object includes the permissions, owner, -group, flags, size, number of blocks used, access time, change time, -modification time, deletion time, number of links, fragments, version -(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs). - -There are some reserved fields which are currently unused in the inode -structure and several which are overloaded. One field is reserved for the -directory ACL if the inode is a directory and alternately for the top 32 -bits of the file size if the inode is a regular file (allowing file sizes -larger than 2GB). The translator field is unused under Linux, but is used -by the HURD to reference the inode of a program which will be used to -interpret this object. Most of the remaining reserved fields have been -used up for both Linux and the HURD for larger owner and group fields, -The HURD also has a larger mode field so it uses another of the remaining -fields to store the extra more bits. - -There are pointers to the first 12 blocks which contain the file's data -in the inode. There is a pointer to an indirect block (which contains -pointers to the next set of blocks), a pointer to a doubly-indirect -block (which contains pointers to indirect blocks) and a pointer to a -trebly-indirect block (which contains pointers to doubly-indirect blocks). - -The flags field contains some ext2-specific flags which aren't catered -for by the standard chmod flags. These flags can be listed with lsattr -and changed with the chattr command, and allow specific filesystem -behaviour on a per-file basis. There are flags for secure deletion, -undeletable, compression, synchronous updates, immutability, append-only, -dumpable, no-atime, indexed directories, and data-journaling. Not all -of these are supported yet. - -Directories ------------ - -A directory is a filesystem object and has an inode just like a file. -It is a specially formatted file containing records which associate -each name with an inode number. Later revisions of the filesystem also -encode the type of the object (file, directory, symlink, device, fifo, -socket) to avoid the need to check the inode itself for this information -(support for taking advantage of this feature does not yet exist in -Glibc 2.2). - -The inode allocation code tries to assign inodes which are in the same -block group as the directory in which they are first created. - -The current implementation of ext2 uses a singly-linked list to store -the filenames in the directory; a pending enhancement uses hashing of the -filenames to allow lookup without the need to scan the entire directory. - -The current implementation never removes empty directory blocks once they -have been allocated to hold more files. - -Special files -------------- - -Symbolic links are also filesystem objects with inodes. They deserve -special mention because the data for them is stored within the inode -itself if the symlink is less than 60 bytes long. It uses the fields -which would normally be used to store the pointers to data blocks. -This is a worthwhile optimisation as it we avoid allocating a full -block for the symlink, and most symlinks are less than 60 characters long. - -Character and block special devices never have data blocks assigned to -them. Instead, their device number is stored in the inode, again reusing -the fields which would be used to point to the data blocks. - -Reserved Space --------------- - -In ext2, there is a mechanism for reserving a certain number of blocks -for a particular user (normally the super-user). This is intended to -allow for the system to continue functioning even if non-privileged users -fill up all the space available to them (this is independent of filesystem -quotas). It also keeps the filesystem from filling up entirely which -helps combat fragmentation. - -Filesystem check ----------------- - -At boot time, most systems run a consistency check (e2fsck) on their -filesystems. The superblock of the ext2 filesystem contains several -fields which indicate whether fsck should actually run (since checking -the filesystem at boot can take a long time if it is large). fsck will -run if the filesystem was not cleanly unmounted, if the maximum mount -count has been exceeded or if the maximum time between checks has been -exceeded. - -Feature Compatibility ---------------------- - -The compatibility feature mechanism used in ext2 is sophisticated. -It safely allows features to be added to the filesystem, without -unnecessarily sacrificing compatibility with older versions of the -filesystem code. The feature compatibility mechanism is not supported by -the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in -revision 1. There are three 32-bit fields, one for compatible features -(COMPAT), one for read-only compatible (RO_COMPAT) features and one for -incompatible (INCOMPAT) features. - -These feature flags have specific meanings for the kernel as follows: - -A COMPAT flag indicates that a feature is present in the filesystem, -but the on-disk format is 100% compatible with older on-disk formats, so -a kernel which didn't know anything about this feature could read/write -the filesystem without any chance of corrupting the filesystem (or even -making it inconsistent). This is essentially just a flag which says -"this filesystem has a (hidden) feature" that the kernel or e2fsck may -want to be aware of (more on e2fsck and feature flags later). The ext3 -HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply -a regular file with data blocks in it so the kernel does not need to -take any special notice of it if it doesn't understand ext3 journaling. - -An RO_COMPAT flag indicates that the on-disk format is 100% compatible -with older on-disk formats for reading (i.e. the feature does not change -the visible on-disk format). However, an old kernel writing to such a -filesystem would/could corrupt the filesystem, so this is prevented. The -most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because -sparse groups allow file data blocks where superblock/group descriptor -backups used to live, and ext2_free_blocks() refuses to free these blocks, -which would leading to inconsistent bitmaps. An old kernel would also -get an error if it tried to free a series of blocks which crossed a group -boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem. - -An INCOMPAT flag indicates the on-disk format has changed in some -way that makes it unreadable by older kernels, or would otherwise -cause a problem if an old kernel tried to mount it. FILETYPE is an -INCOMPAT flag because older kernels would think a filename was longer -than 256 characters, which would lead to corrupt directory listings. -The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel -doesn't understand compression, you would just get garbage back from -read() instead of it automatically decompressing your data. The ext3 -RECOVER flag is needed to prevent a kernel which does not understand the -ext3 journal from mounting the filesystem without replaying the journal. - -For e2fsck, it needs to be more strict with the handling of these -flags than the kernel. If it doesn't understand ANY of the COMPAT, -RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem, -because it has no way of verifying whether a given feature is valid -or not. Allowing e2fsck to succeed on a filesystem with an unknown -feature is a false sense of security for the user. Refusing to check -a filesystem with unknown features is a good incentive for the user to -update to the latest e2fsck. This also means that anyone adding feature -flags to ext2 also needs to update e2fsck to verify these features. - -Metadata --------- - -It is frequently claimed that the ext2 implementation of writing -asynchronous metadata is faster than the ffs synchronous metadata -scheme but less reliable. Both methods are equally resolvable by their -respective fsck programs. - -If you're exceptionally paranoid, there are 3 ways of making metadata -writes synchronous on ext2: - -per-file if you have the program source: use the O_SYNC flag to open() -per-file if you don't have the source: use "chattr +S" on the file -per-filesystem: add the "sync" option to mount (or in /etc/fstab) - -the first and last are not ext2 specific but do force the metadata to -be written synchronously. See also Journaling below. - -Limitations ------------ - -There are various limits imposed by the on-disk layout of ext2. Other -limits are imposed by the current implementation of the kernel code. -Many of the limits are determined at the time the filesystem is first -created, and depend upon the block size chosen. The ratio of inodes to -data blocks is fixed at filesystem creation time, so the only way to -increase the number of inodes is to increase the size of the filesystem. -No tools currently exist which can change the ratio of inodes to blocks. - -Most of these limits could be overcome with slight changes in the on-disk -format and using a compatibility flag to signal the format change (at -the expense of some compatibility). - -Filesystem block size: 1kB 2kB 4kB 8kB - -File size limit: 16GB 256GB 2048GB 2048GB -Filesystem size limit: 2047GB 8192GB 16384GB 32768GB - -There is a 2.4 kernel limit of 2048GB for a single block device, so no -filesystem larger than that can be created at this time. There is also -an upper limit on the block size imposed by the page size of the kernel, -so 8kB blocks are only allowed on Alpha systems (and other architectures -which support larger pages). - -There is an upper limit of 32000 subdirectories in a single directory. - -There is a "soft" upper limit of about 10-15k files in a single directory -with the current linear linked-list directory implementation. This limit -stems from performance problems when creating and deleting (and also -finding) files in such large directories. Using a hashed directory index -(under development) allows 100k-1M+ files in a single directory without -performance problems (although RAM size becomes an issue at this point). - -The (meaningless) absolute upper limit of files in a single directory -(imposed by the file size, the realistic limit is obviously much less) -is over 130 trillion files. It would be higher except there are not -enough 4-character names to make up unique directory entries, so they -have to be 8 character filenames, even then we are fairly close to -running out of unique filenames. - -Journaling ----------- - -A journaling extension to the ext2 code has been developed by Stephen -Tweedie. It avoids the risks of metadata corruption and the need to -wait for e2fsck to complete after a crash, without requiring a change -to the on-disk ext2 layout. In a nutshell, the journal is a regular -file which stores whole metadata (and optionally data) blocks that have -been modified, prior to writing them into the filesystem. This means -it is possible to add a journal to an existing ext2 filesystem without -the need for data conversion. - -When changes to the filesystem (e.g. a file is renamed) they are stored in -a transaction in the journal and can either be complete or incomplete at -the time of a crash. If a transaction is complete at the time of a crash -(or in the normal case where the system does not crash), then any blocks -in that transaction are guaranteed to represent a valid filesystem state, -and are copied into the filesystem. If a transaction is incomplete at -the time of the crash, then there is no guarantee of consistency for -the blocks in that transaction so they are discarded (which means any -filesystem changes they represent are also lost). -Check Documentation/filesystems/ext4/ if you want to read more about -ext4 and journaling. - -References -========== - -The kernel source file:/usr/src/linux/fs/ext2/ -e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/ -Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html -Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/ -Filesystem Resizing http://ext2resize.sourceforge.net/ -Compression (*) http://e2compr.sourceforge.net/ - -Implementations for: -Windows 95/98/NT/2000 http://www.chrysocome.net/explore2fs -Windows 95 (*) http://www.yipton.net/content.html#FSDEXT2 -DOS client (*) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ -OS/2 (+) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ -RISC OS client http://www.esw-heim.tu-clausthal.de/~marco/smorbrod/IscaFS/ - -(*) no longer actively developed/supported (as of Apr 2001) -(+) no longer actively developed/supported (as of Mar 2009) diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 03a493b27920..102b3b65486a 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -62,6 +62,7 @@ Documentation for filesystem implementations. ecryptfs efivarfs erofs + ext2 fuse overlayfs virtiofs -- cgit From 7dc62406320c4103bbdeeeecd0a7ef03e3e58009 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:03 +0100 Subject: docs: filesystems: convert ext3.txt to ReST Nothing really required here. Just renaming would be enough. Yet, while here, lets add a SPDX header and adjust document title to met the same standard we're using on most docs. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/26960235e3e7c972bd543f5dd59f1ef4f3a877c6.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/ext3.rst | 14 ++++++++++++++ Documentation/filesystems/ext3.txt | 12 ------------ Documentation/filesystems/index.rst | 1 + 3 files changed, 15 insertions(+), 12 deletions(-) create mode 100644 Documentation/filesystems/ext3.rst delete mode 100644 Documentation/filesystems/ext3.txt diff --git a/Documentation/filesystems/ext3.rst b/Documentation/filesystems/ext3.rst new file mode 100644 index 000000000000..c06cec3a8fdc --- /dev/null +++ b/Documentation/filesystems/ext3.rst @@ -0,0 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +Ext3 Filesystem +=============== + +Ext3 was originally released in September 1999. Written by Stephen Tweedie +for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger, +Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie. + +Ext3 is the ext2 filesystem enhanced with journalling capabilities. The +filesystem is a subset of ext4 filesystem so use ext4 driver for accessing +ext3 filesystems. + diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt deleted file mode 100644 index 58758fbef9e0..000000000000 --- a/Documentation/filesystems/ext3.txt +++ /dev/null @@ -1,12 +0,0 @@ - -Ext3 Filesystem -=============== - -Ext3 was originally released in September 1999. Written by Stephen Tweedie -for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger, -Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie. - -Ext3 is the ext2 filesystem enhanced with journalling capabilities. The -filesystem is a subset of ext4 filesystem so use ext4 driver for accessing -ext3 filesystems. - diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 102b3b65486a..aa2c3d1de3de 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -63,6 +63,7 @@ Documentation for filesystem implementations. efivarfs erofs ext2 + ext3 fuse overlayfs virtiofs -- cgit From 89272ca1102e000f7dbca724b7b106e688199a5d Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:04 +0100 Subject: docs: filesystems: convert f2fs.txt to ReST - Add a SPDX header; - Adjust document and section titles; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/8dd156320b0c015dec6d3f848d03ea057042a15b.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/f2fs.rst | 762 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/f2fs.txt | 730 ---------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 763 insertions(+), 730 deletions(-) create mode 100644 Documentation/filesystems/f2fs.rst delete mode 100644 Documentation/filesystems/f2fs.txt diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst new file mode 100644 index 000000000000..d681203728d7 --- /dev/null +++ b/Documentation/filesystems/f2fs.rst @@ -0,0 +1,762 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================== +WHAT IS Flash-Friendly File System (F2FS)? +========================================== + +NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have +been equipped on a variety systems ranging from mobile to server systems. Since +they are known to have different characteristics from the conventional rotating +disks, a file system, an upper layer to the storage device, should adapt to the +changes from the sketch in the design level. + +F2FS is a file system exploiting NAND flash memory-based storage devices, which +is based on Log-structured File System (LFS). The design has been focused on +addressing the fundamental issues in LFS, which are snowball effect of wandering +tree and high cleaning overhead. + +Since a NAND flash memory-based storage device shows different characteristic +according to its internal geometry or flash memory management scheme, namely FTL, +F2FS and its tools support various parameters not only for configuring on-disk +layout, but also for selecting allocation and cleaning algorithms. + +The following git tree provides the file system formatting tool (mkfs.f2fs), +a consistency checking tool (fsck.f2fs), and a debugging tool (dump.f2fs). + +- git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git + +For reporting bugs and sending patches, please use the following mailing list: + +- linux-f2fs-devel@lists.sourceforge.net + +Background and Design issues +============================ + +Log-structured File System (LFS) +-------------------------------- +"A log-structured file system writes all modifications to disk sequentially in +a log-like structure, thereby speeding up both file writing and crash recovery. +The log is the only structure on disk; it contains indexing information so that +files can be read back from the log efficiently. In order to maintain large free +areas on disk for fast writing, we divide the log into segments and use a +segment cleaner to compress the live information from heavily fragmented +segments." from Rosenblum, M. and Ousterhout, J. K., 1992, "The design and +implementation of a log-structured file system", ACM Trans. Computer Systems +10, 1, 26–52. + +Wandering Tree Problem +---------------------- +In LFS, when a file data is updated and written to the end of log, its direct +pointer block is updated due to the changed location. Then the indirect pointer +block is also updated due to the direct pointer block update. In this manner, +the upper index structures such as inode, inode map, and checkpoint block are +also updated recursively. This problem is called as wandering tree problem [1], +and in order to enhance the performance, it should eliminate or relax the update +propagation as much as possible. + +[1] Bityutskiy, A. 2005. JFFS3 design issues. http://www.linux-mtd.infradead.org/ + +Cleaning Overhead +----------------- +Since LFS is based on out-of-place writes, it produces so many obsolete blocks +scattered across the whole storage. In order to serve new empty log space, it +needs to reclaim these obsolete blocks seamlessly to users. This job is called +as a cleaning process. + +The process consists of three operations as follows. + +1. A victim segment is selected through referencing segment usage table. +2. It loads parent index structures of all the data in the victim identified by + segment summary blocks. +3. It checks the cross-reference between the data and its parent index structure. +4. It moves valid data selectively. + +This cleaning job may cause unexpected long delays, so the most important goal +is to hide the latencies to users. And also definitely, it should reduce the +amount of valid data to be moved, and move them quickly as well. + +Key Features +============ + +Flash Awareness +--------------- +- Enlarge the random write area for better performance, but provide the high + spatial locality +- Align FS data structures to the operational units in FTL as best efforts + +Wandering Tree Problem +---------------------- +- Use a term, “node”, that represents inodes as well as various pointer blocks +- Introduce Node Address Table (NAT) containing the locations of all the “node” + blocks; this will cut off the update propagation. + +Cleaning Overhead +----------------- +- Support a background cleaning process +- Support greedy and cost-benefit algorithms for victim selection policies +- Support multi-head logs for static/dynamic hot and cold data separation +- Introduce adaptive logging for efficient block allocation + +Mount Options +============= + + +====================== ============================================================ +background_gc=%s Turn on/off cleaning operations, namely garbage + collection, triggered in background when I/O subsystem is + idle. If background_gc=on, it will turn on the garbage + collection and if background_gc=off, garbage collection + will be turned off. If background_gc=sync, it will turn + on synchronous garbage collection running in background. + Default value for this option is on. So garbage + collection is on by default. +disable_roll_forward Disable the roll-forward recovery routine +norecovery Disable the roll-forward recovery routine, mounted read- + only (i.e., -o ro,disable_roll_forward) +discard/nodiscard Enable/disable real-time discard in f2fs, if discard is + enabled, f2fs will issue discard/TRIM commands when a + segment is cleaned. +no_heap Disable heap-style segment allocation which finds free + segments for data from the beginning of main area, while + for node from the end of main area. +nouser_xattr Disable Extended User Attributes. Note: xattr is enabled + by default if CONFIG_F2FS_FS_XATTR is selected. +noacl Disable POSIX Access Control List. Note: acl is enabled + by default if CONFIG_F2FS_FS_POSIX_ACL is selected. +active_logs=%u Support configuring the number of active logs. In the + current design, f2fs supports only 2, 4, and 6 logs. + Default number is 6. +disable_ext_identify Disable the extension list configured by mkfs, so f2fs + does not aware of cold files such as media files. +inline_xattr Enable the inline xattrs feature. +noinline_xattr Disable the inline xattrs feature. +inline_xattr_size=%u Support configuring inline xattr size, it depends on + flexible inline xattr feature. +inline_data Enable the inline data feature: New created small(<~3.4k) + files can be written into inode block. +inline_dentry Enable the inline dir feature: data in new created + directory entries can be written into inode block. The + space of inode block which is used to store inline + dentries is limited to ~3.4k. +noinline_dentry Disable the inline dentry feature. +flush_merge Merge concurrent cache_flush commands as much as possible + to eliminate redundant command issues. If the underlying + device handles the cache_flush command relatively slowly, + recommend to enable this option. +nobarrier This option can be used if underlying storage guarantees + its cached data should be written to the novolatile area. + If this option is set, no cache_flush commands are issued + but f2fs still guarantees the write ordering of all the + data writes. +fastboot This option is used when a system wants to reduce mount + time as much as possible, even though normal performance + can be sacrificed. +extent_cache Enable an extent cache based on rb-tree, it can cache + as many as extent which map between contiguous logical + address and physical address per inode, resulting in + increasing the cache hit ratio. Set by default. +noextent_cache Disable an extent cache based on rb-tree explicitly, see + the above extent_cache mount option. +noinline_data Disable the inline data feature, inline data feature is + enabled by default. +data_flush Enable data flushing before checkpoint in order to + persist data of regular and symlink. +reserve_root=%d Support configuring reserved space which is used for + allocation from a privileged user with specified uid or + gid, unit: 4KB, the default limit is 0.2% of user blocks. +resuid=%d The user ID which may use the reserved blocks. +resgid=%d The group ID which may use the reserved blocks. +fault_injection=%d Enable fault injection in all supported types with + specified injection rate. +fault_type=%d Support configuring fault injection type, should be + enabled with fault_injection option, fault type value + is shown below, it supports single or combined type. + + =================== =========== + Type_Name Type_Value + =================== =========== + FAULT_KMALLOC 0x000000001 + FAULT_KVMALLOC 0x000000002 + FAULT_PAGE_ALLOC 0x000000004 + FAULT_PAGE_GET 0x000000008 + FAULT_ALLOC_BIO 0x000000010 + FAULT_ALLOC_NID 0x000000020 + FAULT_ORPHAN 0x000000040 + FAULT_BLOCK 0x000000080 + FAULT_DIR_DEPTH 0x000000100 + FAULT_EVICT_INODE 0x000000200 + FAULT_TRUNCATE 0x000000400 + FAULT_READ_IO 0x000000800 + FAULT_CHECKPOINT 0x000001000 + FAULT_DISCARD 0x000002000 + FAULT_WRITE_IO 0x000004000 + =================== =========== +mode=%s Control block allocation mode which supports "adaptive" + and "lfs". In "lfs" mode, there should be no random + writes towards main area. +io_bits=%u Set the bit size of write IO requests. It should be set + with "mode=lfs". +usrquota Enable plain user disk quota accounting. +grpquota Enable plain group disk quota accounting. +prjquota Enable plain project quota accounting. +usrjquota= Appoint specified file and type during mount, so that quota +grpjquota= information can be properly updated during recovery flow, +prjjquota= : must be in root directory; +jqfmt= : [vfsold,vfsv0,vfsv1]. +offusrjquota Turn off user journelled quota. +offgrpjquota Turn off group journelled quota. +offprjjquota Turn off project journelled quota. +quota Enable plain user disk quota accounting. +noquota Disable all plain disk quota option. +whint_mode=%s Control which write hints are passed down to block + layer. This supports "off", "user-based", and + "fs-based". In "off" mode (default), f2fs does not pass + down hints. In "user-based" mode, f2fs tries to pass + down hints given by users. And in "fs-based" mode, f2fs + passes down hints with its policy. +alloc_mode=%s Adjust block allocation policy, which supports "reuse" + and "default". +fsync_mode=%s Control the policy of fsync. Currently supports "posix", + "strict", and "nobarrier". In "posix" mode, which is + default, fsync will follow POSIX semantics and does a + light operation to improve the filesystem performance. + In "strict" mode, fsync will be heavy and behaves in line + with xfs, ext4 and btrfs, where xfstest generic/342 will + pass, but the performance will regress. "nobarrier" is + based on "posix", but doesn't issue flush command for + non-atomic files likewise "nobarrier" mount option. +test_dummy_encryption Enable dummy encryption, which provides a fake fscrypt + context. The fake fscrypt context is used by xfstests. +checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enable" + to reenable checkpointing. Is enabled by default. While + disabled, any unmounting or unexpected shutdowns will cause + the filesystem contents to appear as they did when the + filesystem was mounted with that option. + While mounting with checkpoint=disabled, the filesystem must + run garbage collection to ensure that all available space can + be used. If this takes too much time, the mount may return + EAGAIN. You may optionally add a value to indicate how much + of the disk you would be willing to temporarily give up to + avoid additional garbage collection. This can be given as a + number of blocks, or as a percent. For instance, mounting + with checkpoint=disable:100% would always succeed, but it may + hide up to all remaining free space. The actual space that + would be unusable can be viewed at /sys/fs/f2fs//unusable + This space is reclaimed once checkpoint=enable. +compress_algorithm=%s Control compress algorithm, currently f2fs supports "lzo" + and "lz4" algorithm. +compress_log_size=%u Support configuring compress cluster size, the size will + be 4KB * (1 << %u), 16KB is minimum size, also it's + default size. +compress_extension=%s Support adding specified extension, so that f2fs can enable + compression on those corresponding files, e.g. if all files + with '.ext' has high compression rate, we can set the '.ext' + on compression extension list and enable compression on + these file by default rather than to enable it via ioctl. + For other files, we can still enable compression via ioctl. +====================== ============================================================ + +Debugfs Entries +=============== + +/sys/kernel/debug/f2fs/ contains information about all the partitions mounted as +f2fs. Each file shows the whole f2fs information. + +/sys/kernel/debug/f2fs/status includes: + + - major file system information managed by f2fs currently + - average SIT information about whole segments + - current memory footprint consumed by f2fs. + +Sysfs Entries +============= + +Information about mounted f2fs file systems can be found in +/sys/fs/f2fs. Each mounted filesystem will have a directory in +/sys/fs/f2fs based on its device name (i.e., /sys/fs/f2fs/sda). +The files in each per-device directory are shown in table below. + +Files in /sys/fs/f2fs/ +(see also Documentation/ABI/testing/sysfs-fs-f2fs) + +Usage +===== + +1. Download userland tools and compile them. + +2. Skip, if f2fs was compiled statically inside kernel. + Otherwise, insert the f2fs.ko module:: + + # insmod f2fs.ko + +3. Create a directory trying to mount:: + + # mkdir /mnt/f2fs + +4. Format the block device, and then mount as f2fs:: + + # mkfs.f2fs -l label /dev/block_device + # mount -t f2fs /dev/block_device /mnt/f2fs + +mkfs.f2fs +--------- +The mkfs.f2fs is for the use of formatting a partition as the f2fs filesystem, +which builds a basic on-disk layout. + +The options consist of: + +=============== =========================================================== +``-l [label]`` Give a volume label, up to 512 unicode name. +``-a [0 or 1]`` Split start location of each area for heap-based allocation. + + 1 is set by default, which performs this. +``-o [int]`` Set overprovision ratio in percent over volume size. + + 5 is set by default. +``-s [int]`` Set the number of segments per section. + + 1 is set by default. +``-z [int]`` Set the number of sections per zone. + + 1 is set by default. +``-e [str]`` Set basic extension list. e.g. "mp3,gif,mov" +``-t [0 or 1]`` Disable discard command or not. + + 1 is set by default, which conducts discard. +=============== =========================================================== + +fsck.f2fs +--------- +The fsck.f2fs is a tool to check the consistency of an f2fs-formatted +partition, which examines whether the filesystem metadata and user-made data +are cross-referenced correctly or not. +Note that, initial version of the tool does not fix any inconsistency. + +The options consist of:: + + -d debug level [default:0] + +dump.f2fs +--------- +The dump.f2fs shows the information of specific inode and dumps SSA and SIT to +file. Each file is dump_ssa and dump_sit. + +The dump.f2fs is used to debug on-disk data structures of the f2fs filesystem. +It shows on-disk inode information recognized by a given inode number, and is +able to dump all the SSA and SIT entries into predefined files, ./dump_ssa and +./dump_sit respectively. + +The options consist of:: + + -d debug level [default:0] + -i inode no (hex) + -s [SIT dump segno from #1~#2 (decimal), for all 0~-1] + -a [SSA dump segno from #1~#2 (decimal), for all 0~-1] + +Examples:: + + # dump.f2fs -i [ino] /dev/sdx + # dump.f2fs -s 0~-1 /dev/sdx (SIT dump) + # dump.f2fs -a 0~-1 /dev/sdx (SSA dump) + +Design +====== + +On-disk Layout +-------------- + +F2FS divides the whole volume into a number of segments, each of which is fixed +to 2MB in size. A section is composed of consecutive segments, and a zone +consists of a set of sections. By default, section and zone sizes are set to one +segment size identically, but users can easily modify the sizes by mkfs. + +F2FS splits the entire volume into six areas, and all the areas except superblock +consists of multiple segments as described below:: + + align with the zone size <-| + |-> align with the segment size + _________________________________________________________________________ + | | | Segment | Node | Segment | | + | Superblock | Checkpoint | Info. | Address | Summary | Main | + | (SB) | (CP) | Table (SIT) | Table (NAT) | Area (SSA) | | + |____________|_____2______|______N______|______N______|______N_____|__N___| + . . + . . + . . + ._________________________________________. + |_Segment_|_..._|_Segment_|_..._|_Segment_| + . . + ._________._________ + |_section_|__...__|_ + . . + .________. + |__zone__| + +- Superblock (SB) + It is located at the beginning of the partition, and there exist two copies + to avoid file system crash. It contains basic partition information and some + default parameters of f2fs. + +- Checkpoint (CP) + It contains file system information, bitmaps for valid NAT/SIT sets, orphan + inode lists, and summary entries of current active segments. + +- Segment Information Table (SIT) + It contains segment information such as valid block count and bitmap for the + validity of all the blocks. + +- Node Address Table (NAT) + It is composed of a block address table for all the node blocks stored in + Main area. + +- Segment Summary Area (SSA) + It contains summary entries which contains the owner information of all the + data and node blocks stored in Main area. + +- Main Area + It contains file and directory data including their indices. + +In order to avoid misalignment between file system and flash-based storage, F2FS +aligns the start block address of CP with the segment size. Also, it aligns the +start block address of Main area with the zone size by reserving some segments +in SSA area. + +Reference the following survey for additional technical details. +https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey + +File System Metadata Structure +------------------------------ + +F2FS adopts the checkpointing scheme to maintain file system consistency. At +mount time, F2FS first tries to find the last valid checkpoint data by scanning +CP area. In order to reduce the scanning time, F2FS uses only two copies of CP. +One of them always indicates the last valid data, which is called as shadow copy +mechanism. In addition to CP, NAT and SIT also adopt the shadow copy mechanism. + +For file system consistency, each CP points to which NAT and SIT copies are +valid, as shown as below:: + + +--------+----------+---------+ + | CP | SIT | NAT | + +--------+----------+---------+ + . . . . + . . . . + . . . . + +-------+-------+--------+--------+--------+--------+ + | CP #0 | CP #1 | SIT #0 | SIT #1 | NAT #0 | NAT #1 | + +-------+-------+--------+--------+--------+--------+ + | ^ ^ + | | | + `----------------------------------------' + +Index Structure +--------------- + +The key data structure to manage the data locations is a "node". Similar to +traditional file structures, F2FS has three types of node: inode, direct node, +indirect node. F2FS assigns 4KB to an inode block which contains 923 data block +indices, two direct node pointers, two indirect node pointers, and one double +indirect node pointer as described below. One direct node block contains 1018 +data blocks, and one indirect node block contains also 1018 node blocks. Thus, +one inode block (i.e., a file) covers:: + + 4KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB. + + Inode block (4KB) + |- data (923) + |- direct node (2) + | `- data (1018) + |- indirect node (2) + | `- direct node (1018) + | `- data (1018) + `- double indirect node (1) + `- indirect node (1018) + `- direct node (1018) + `- data (1018) + +Note that, all the node blocks are mapped by NAT which means the location of +each node is translated by the NAT table. In the consideration of the wandering +tree problem, F2FS is able to cut off the propagation of node updates caused by +leaf data writes. + +Directory Structure +------------------- + +A directory entry occupies 11 bytes, which consists of the following attributes. + +- hash hash value of the file name +- ino inode number +- len the length of file name +- type file type such as directory, symlink, etc + +A dentry block consists of 214 dentry slots and file names. Therein a bitmap is +used to represent whether each dentry is valid or not. A dentry block occupies +4KB with the following composition. + +:: + + Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) + + dentries(11 * 214 bytes) + file name (8 * 214 bytes) + + [Bucket] + +--------------------------------+ + |dentry block 1 | dentry block 2 | + +--------------------------------+ + . . + . . + . [Dentry Block Structure: 4KB] . + +--------+----------+----------+------------+ + | bitmap | reserved | dentries | file names | + +--------+----------+----------+------------+ + [Dentry Block: 4KB] . . + . . + . . + +------+------+-----+------+ + | hash | ino | len | type | + +------+------+-----+------+ + [Dentry Structure: 11 bytes] + +F2FS implements multi-level hash tables for directory structure. Each level has +a hash table with dedicated number of hash buckets as shown below. Note that +"A(2B)" means a bucket includes 2 data blocks. + +:: + + ---------------------- + A : bucket + B : block + N : MAX_DIR_HASH_DEPTH + ---------------------- + + level #0 | A(2B) + | + level #1 | A(2B) - A(2B) + | + level #2 | A(2B) - A(2B) - A(2B) - A(2B) + . | . . . . + level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B) + . | . . . . + level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B) + +The number of blocks and buckets are determined by:: + + ,- 2, if n < MAX_DIR_HASH_DEPTH / 2, + # of blocks in level #n = | + `- 4, Otherwise + + ,- 2^(n + dir_level), + | if n + dir_level < MAX_DIR_HASH_DEPTH / 2, + # of buckets in level #n = | + `- 2^((MAX_DIR_HASH_DEPTH / 2) - 1), + Otherwise + +When F2FS finds a file name in a directory, at first a hash value of the file +name is calculated. Then, F2FS scans the hash table in level #0 to find the +dentry consisting of the file name and its inode number. If not found, F2FS +scans the next hash table in level #1. In this way, F2FS scans hash tables in +each levels incrementally from 1 to N. In each levels F2FS needs to scan only +one bucket determined by the following equation, which shows O(log(# of files)) +complexity:: + + bucket number to scan in level #n = (hash value) % (# of buckets in level #n) + +In the case of file creation, F2FS finds empty consecutive slots that cover the +file name. F2FS searches the empty slots in the hash tables of whole levels from +1 to N in the same way as the lookup operation. + +The following figure shows an example of two cases holding children:: + + --------------> Dir <-------------- + | | + child child + + child - child [hole] - child + + child - child - child [hole] - [hole] - child + + Case 1: Case 2: + Number of children = 6, Number of children = 3, + File size = 7 File size = 7 + +Default Block Allocation +------------------------ + +At runtime, F2FS manages six active logs inside "Main" area: Hot/Warm/Cold node +and Hot/Warm/Cold data. + +- Hot node contains direct node blocks of directories. +- Warm node contains direct node blocks except hot node blocks. +- Cold node contains indirect node blocks +- Hot data contains dentry blocks +- Warm data contains data blocks except hot and cold data blocks +- Cold data contains multimedia data or migrated data blocks + +LFS has two schemes for free space management: threaded log and copy-and-compac- +tion. The copy-and-compaction scheme which is known as cleaning, is well-suited +for devices showing very good sequential write performance, since free segments +are served all the time for writing new data. However, it suffers from cleaning +overhead under high utilization. Contrarily, the threaded log scheme suffers +from random writes, but no cleaning process is needed. F2FS adopts a hybrid +scheme where the copy-and-compaction scheme is adopted by default, but the +policy is dynamically changed to the threaded log scheme according to the file +system status. + +In order to align F2FS with underlying flash-based storage, F2FS allocates a +segment in a unit of section. F2FS expects that the section size would be the +same as the unit size of garbage collection in FTL. Furthermore, with respect +to the mapping granularity in FTL, F2FS allocates each section of the active +logs from different zones as much as possible, since FTL can write the data in +the active logs into one allocation unit according to its mapping granularity. + +Cleaning process +---------------- + +F2FS does cleaning both on demand and in the background. On-demand cleaning is +triggered when there are not enough free segments to serve VFS calls. Background +cleaner is operated by a kernel thread, and triggers the cleaning job when the +system is idle. + +F2FS supports two victim selection policies: greedy and cost-benefit algorithms. +In the greedy algorithm, F2FS selects a victim segment having the smallest number +of valid blocks. In the cost-benefit algorithm, F2FS selects a victim segment +according to the segment age and the number of valid blocks in order to address +log block thrashing problem in the greedy algorithm. F2FS adopts the greedy +algorithm for on-demand cleaner, while background cleaner adopts cost-benefit +algorithm. + +In order to identify whether the data in the victim segment are valid or not, +F2FS manages a bitmap. Each bit represents the validity of a block, and the +bitmap is composed of a bit stream covering whole blocks in main area. + +Write-hint Policy +----------------- + +1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET. + +2) whint_mode=user-based. F2FS tries to pass down hints given by +users. + +===================== ======================== =================== +User F2FS Block +===================== ======================== =================== + META WRITE_LIFE_NOT_SET + HOT_NODE " + WARM_NODE " + COLD_NODE " +ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME +extension list " " + +-- buffered io +WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME +WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT +WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET +WRITE_LIFE_NONE " " +WRITE_LIFE_MEDIUM " " +WRITE_LIFE_LONG " " + +-- direct io +WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME +WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT +WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET +WRITE_LIFE_NONE " WRITE_LIFE_NONE +WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM +WRITE_LIFE_LONG " WRITE_LIFE_LONG +===================== ======================== =================== + +3) whint_mode=fs-based. F2FS passes down hints with its policy. + +===================== ======================== =================== +User F2FS Block +===================== ======================== =================== + META WRITE_LIFE_MEDIUM; + HOT_NODE WRITE_LIFE_NOT_SET + WARM_NODE " + COLD_NODE WRITE_LIFE_NONE +ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME +extension list " " + +-- buffered io +WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME +WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT +WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_LONG +WRITE_LIFE_NONE " " +WRITE_LIFE_MEDIUM " " +WRITE_LIFE_LONG " " + +-- direct io +WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME +WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT +WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET +WRITE_LIFE_NONE " WRITE_LIFE_NONE +WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM +WRITE_LIFE_LONG " WRITE_LIFE_LONG +===================== ======================== =================== + +Fallocate(2) Policy +------------------- + +The default policy follows the below posix rule. + +Allocating disk space + The default operation (i.e., mode is zero) of fallocate() allocates + the disk space within the range specified by offset and len. The + file size (as reported by stat(2)) will be changed if offset+len is + greater than the file size. Any subregion within the range specified + by offset and len that did not contain data before the call will be + initialized to zero. This default behavior closely resembles the + behavior of the posix_fallocate(3) library function, and is intended + as a method of optimally implementing that function. + +However, once F2FS receives ioctl(fd, F2FS_IOC_SET_PIN_FILE) in prior to +fallocate(fd, DEFAULT_MODE), it allocates on-disk blocks addressess having +zero or random data, which is useful to the below scenario where: + + 1. create(fd) + 2. ioctl(fd, F2FS_IOC_SET_PIN_FILE) + 3. fallocate(fd, 0, 0, size) + 4. address = fibmap(fd, offset) + 5. open(blkdev) + 6. write(blkdev, address) + +Compression implementation +-------------------------- + +- New term named cluster is defined as basic unit of compression, file can + be divided into multiple clusters logically. One cluster includes 4 << n + (n >= 0) logical pages, compression size is also cluster size, each of + cluster can be compressed or not. + +- In cluster metadata layout, one special block address is used to indicate + cluster is compressed one or normal one, for compressed cluster, following + metadata maps cluster to [1, 4 << n - 1] physical blocks, in where f2fs + stores data including compress header and compressed data. + +- In order to eliminate write amplification during overwrite, F2FS only + support compression on write-once file, data can be compressed only when + all logical blocks in file are valid and cluster compress ratio is lower + than specified threshold. + +- To enable compression on regular inode, there are three ways: + + * chattr +c file + * chattr +c dir; touch dir/file + * mount w/ -o compress_extension=ext; touch file.ext + +Compress metadata layout:: + + [Dnode Structure] + +-----------------------------------------------+ + | cluster 1 | cluster 2 | ......... | cluster N | + +-----------------------------------------------+ + . . . . + . . . . + . Compressed Cluster . . Normal Cluster . + +----------+---------+---------+---------+ +---------+---------+---------+---------+ + |compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 | + +----------+---------+---------+---------+ +---------+---------+---------+---------+ + . . + . . + . . + +-------------+-------------+----------+----------------------------+ + | data length | data chksum | reserved | compressed data | + +-------------+-------------+----------+----------------------------+ diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt deleted file mode 100644 index 4eb3e2ddd00e..000000000000 --- a/Documentation/filesystems/f2fs.txt +++ /dev/null @@ -1,730 +0,0 @@ -================================================================================ -WHAT IS Flash-Friendly File System (F2FS)? -================================================================================ - -NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have -been equipped on a variety systems ranging from mobile to server systems. Since -they are known to have different characteristics from the conventional rotating -disks, a file system, an upper layer to the storage device, should adapt to the -changes from the sketch in the design level. - -F2FS is a file system exploiting NAND flash memory-based storage devices, which -is based on Log-structured File System (LFS). The design has been focused on -addressing the fundamental issues in LFS, which are snowball effect of wandering -tree and high cleaning overhead. - -Since a NAND flash memory-based storage device shows different characteristic -according to its internal geometry or flash memory management scheme, namely FTL, -F2FS and its tools support various parameters not only for configuring on-disk -layout, but also for selecting allocation and cleaning algorithms. - -The following git tree provides the file system formatting tool (mkfs.f2fs), -a consistency checking tool (fsck.f2fs), and a debugging tool (dump.f2fs). ->> git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git - -For reporting bugs and sending patches, please use the following mailing list: ->> linux-f2fs-devel@lists.sourceforge.net - -================================================================================ -BACKGROUND AND DESIGN ISSUES -================================================================================ - -Log-structured File System (LFS) --------------------------------- -"A log-structured file system writes all modifications to disk sequentially in -a log-like structure, thereby speeding up both file writing and crash recovery. -The log is the only structure on disk; it contains indexing information so that -files can be read back from the log efficiently. In order to maintain large free -areas on disk for fast writing, we divide the log into segments and use a -segment cleaner to compress the live information from heavily fragmented -segments." from Rosenblum, M. and Ousterhout, J. K., 1992, "The design and -implementation of a log-structured file system", ACM Trans. Computer Systems -10, 1, 26–52. - -Wandering Tree Problem ----------------------- -In LFS, when a file data is updated and written to the end of log, its direct -pointer block is updated due to the changed location. Then the indirect pointer -block is also updated due to the direct pointer block update. In this manner, -the upper index structures such as inode, inode map, and checkpoint block are -also updated recursively. This problem is called as wandering tree problem [1], -and in order to enhance the performance, it should eliminate or relax the update -propagation as much as possible. - -[1] Bityutskiy, A. 2005. JFFS3 design issues. http://www.linux-mtd.infradead.org/ - -Cleaning Overhead ------------------ -Since LFS is based on out-of-place writes, it produces so many obsolete blocks -scattered across the whole storage. In order to serve new empty log space, it -needs to reclaim these obsolete blocks seamlessly to users. This job is called -as a cleaning process. - -The process consists of three operations as follows. -1. A victim segment is selected through referencing segment usage table. -2. It loads parent index structures of all the data in the victim identified by - segment summary blocks. -3. It checks the cross-reference between the data and its parent index structure. -4. It moves valid data selectively. - -This cleaning job may cause unexpected long delays, so the most important goal -is to hide the latencies to users. And also definitely, it should reduce the -amount of valid data to be moved, and move them quickly as well. - -================================================================================ -KEY FEATURES -================================================================================ - -Flash Awareness ---------------- -- Enlarge the random write area for better performance, but provide the high - spatial locality -- Align FS data structures to the operational units in FTL as best efforts - -Wandering Tree Problem ----------------------- -- Use a term, “node”, that represents inodes as well as various pointer blocks -- Introduce Node Address Table (NAT) containing the locations of all the “node” - blocks; this will cut off the update propagation. - -Cleaning Overhead ------------------ -- Support a background cleaning process -- Support greedy and cost-benefit algorithms for victim selection policies -- Support multi-head logs for static/dynamic hot and cold data separation -- Introduce adaptive logging for efficient block allocation - -================================================================================ -MOUNT OPTIONS -================================================================================ - -background_gc=%s Turn on/off cleaning operations, namely garbage - collection, triggered in background when I/O subsystem is - idle. If background_gc=on, it will turn on the garbage - collection and if background_gc=off, garbage collection - will be turned off. If background_gc=sync, it will turn - on synchronous garbage collection running in background. - Default value for this option is on. So garbage - collection is on by default. -disable_roll_forward Disable the roll-forward recovery routine -norecovery Disable the roll-forward recovery routine, mounted read- - only (i.e., -o ro,disable_roll_forward) -discard/nodiscard Enable/disable real-time discard in f2fs, if discard is - enabled, f2fs will issue discard/TRIM commands when a - segment is cleaned. -no_heap Disable heap-style segment allocation which finds free - segments for data from the beginning of main area, while - for node from the end of main area. -nouser_xattr Disable Extended User Attributes. Note: xattr is enabled - by default if CONFIG_F2FS_FS_XATTR is selected. -noacl Disable POSIX Access Control List. Note: acl is enabled - by default if CONFIG_F2FS_FS_POSIX_ACL is selected. -active_logs=%u Support configuring the number of active logs. In the - current design, f2fs supports only 2, 4, and 6 logs. - Default number is 6. -disable_ext_identify Disable the extension list configured by mkfs, so f2fs - does not aware of cold files such as media files. -inline_xattr Enable the inline xattrs feature. -noinline_xattr Disable the inline xattrs feature. -inline_xattr_size=%u Support configuring inline xattr size, it depends on - flexible inline xattr feature. -inline_data Enable the inline data feature: New created small(<~3.4k) - files can be written into inode block. -inline_dentry Enable the inline dir feature: data in new created - directory entries can be written into inode block. The - space of inode block which is used to store inline - dentries is limited to ~3.4k. -noinline_dentry Disable the inline dentry feature. -flush_merge Merge concurrent cache_flush commands as much as possible - to eliminate redundant command issues. If the underlying - device handles the cache_flush command relatively slowly, - recommend to enable this option. -nobarrier This option can be used if underlying storage guarantees - its cached data should be written to the novolatile area. - If this option is set, no cache_flush commands are issued - but f2fs still guarantees the write ordering of all the - data writes. -fastboot This option is used when a system wants to reduce mount - time as much as possible, even though normal performance - can be sacrificed. -extent_cache Enable an extent cache based on rb-tree, it can cache - as many as extent which map between contiguous logical - address and physical address per inode, resulting in - increasing the cache hit ratio. Set by default. -noextent_cache Disable an extent cache based on rb-tree explicitly, see - the above extent_cache mount option. -noinline_data Disable the inline data feature, inline data feature is - enabled by default. -data_flush Enable data flushing before checkpoint in order to - persist data of regular and symlink. -reserve_root=%d Support configuring reserved space which is used for - allocation from a privileged user with specified uid or - gid, unit: 4KB, the default limit is 0.2% of user blocks. -resuid=%d The user ID which may use the reserved blocks. -resgid=%d The group ID which may use the reserved blocks. -fault_injection=%d Enable fault injection in all supported types with - specified injection rate. -fault_type=%d Support configuring fault injection type, should be - enabled with fault_injection option, fault type value - is shown below, it supports single or combined type. - Type_Name Type_Value - FAULT_KMALLOC 0x000000001 - FAULT_KVMALLOC 0x000000002 - FAULT_PAGE_ALLOC 0x000000004 - FAULT_PAGE_GET 0x000000008 - FAULT_ALLOC_BIO 0x000000010 - FAULT_ALLOC_NID 0x000000020 - FAULT_ORPHAN 0x000000040 - FAULT_BLOCK 0x000000080 - FAULT_DIR_DEPTH 0x000000100 - FAULT_EVICT_INODE 0x000000200 - FAULT_TRUNCATE 0x000000400 - FAULT_READ_IO 0x000000800 - FAULT_CHECKPOINT 0x000001000 - FAULT_DISCARD 0x000002000 - FAULT_WRITE_IO 0x000004000 -mode=%s Control block allocation mode which supports "adaptive" - and "lfs". In "lfs" mode, there should be no random - writes towards main area. -io_bits=%u Set the bit size of write IO requests. It should be set - with "mode=lfs". -usrquota Enable plain user disk quota accounting. -grpquota Enable plain group disk quota accounting. -prjquota Enable plain project quota accounting. -usrjquota= Appoint specified file and type during mount, so that quota -grpjquota= information can be properly updated during recovery flow, -prjjquota= : must be in root directory; -jqfmt= : [vfsold,vfsv0,vfsv1]. -offusrjquota Turn off user journelled quota. -offgrpjquota Turn off group journelled quota. -offprjjquota Turn off project journelled quota. -quota Enable plain user disk quota accounting. -noquota Disable all plain disk quota option. -whint_mode=%s Control which write hints are passed down to block - layer. This supports "off", "user-based", and - "fs-based". In "off" mode (default), f2fs does not pass - down hints. In "user-based" mode, f2fs tries to pass - down hints given by users. And in "fs-based" mode, f2fs - passes down hints with its policy. -alloc_mode=%s Adjust block allocation policy, which supports "reuse" - and "default". -fsync_mode=%s Control the policy of fsync. Currently supports "posix", - "strict", and "nobarrier". In "posix" mode, which is - default, fsync will follow POSIX semantics and does a - light operation to improve the filesystem performance. - In "strict" mode, fsync will be heavy and behaves in line - with xfs, ext4 and btrfs, where xfstest generic/342 will - pass, but the performance will regress. "nobarrier" is - based on "posix", but doesn't issue flush command for - non-atomic files likewise "nobarrier" mount option. -test_dummy_encryption Enable dummy encryption, which provides a fake fscrypt - context. The fake fscrypt context is used by xfstests. -checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enable" - to reenable checkpointing. Is enabled by default. While - disabled, any unmounting or unexpected shutdowns will cause - the filesystem contents to appear as they did when the - filesystem was mounted with that option. - While mounting with checkpoint=disabled, the filesystem must - run garbage collection to ensure that all available space can - be used. If this takes too much time, the mount may return - EAGAIN. You may optionally add a value to indicate how much - of the disk you would be willing to temporarily give up to - avoid additional garbage collection. This can be given as a - number of blocks, or as a percent. For instance, mounting - with checkpoint=disable:100% would always succeed, but it may - hide up to all remaining free space. The actual space that - would be unusable can be viewed at /sys/fs/f2fs//unusable - This space is reclaimed once checkpoint=enable. -compress_algorithm=%s Control compress algorithm, currently f2fs supports "lzo" - and "lz4" algorithm. -compress_log_size=%u Support configuring compress cluster size, the size will - be 4KB * (1 << %u), 16KB is minimum size, also it's - default size. -compress_extension=%s Support adding specified extension, so that f2fs can enable - compression on those corresponding files, e.g. if all files - with '.ext' has high compression rate, we can set the '.ext' - on compression extension list and enable compression on - these file by default rather than to enable it via ioctl. - For other files, we can still enable compression via ioctl. - -================================================================================ -DEBUGFS ENTRIES -================================================================================ - -/sys/kernel/debug/f2fs/ contains information about all the partitions mounted as -f2fs. Each file shows the whole f2fs information. - -/sys/kernel/debug/f2fs/status includes: - - major file system information managed by f2fs currently - - average SIT information about whole segments - - current memory footprint consumed by f2fs. - -================================================================================ -SYSFS ENTRIES -================================================================================ - -Information about mounted f2fs file systems can be found in -/sys/fs/f2fs. Each mounted filesystem will have a directory in -/sys/fs/f2fs based on its device name (i.e., /sys/fs/f2fs/sda). -The files in each per-device directory are shown in table below. - -Files in /sys/fs/f2fs/ -(see also Documentation/ABI/testing/sysfs-fs-f2fs) - -================================================================================ -USAGE -================================================================================ - -1. Download userland tools and compile them. - -2. Skip, if f2fs was compiled statically inside kernel. - Otherwise, insert the f2fs.ko module. - # insmod f2fs.ko - -3. Create a directory trying to mount - # mkdir /mnt/f2fs - -4. Format the block device, and then mount as f2fs - # mkfs.f2fs -l label /dev/block_device - # mount -t f2fs /dev/block_device /mnt/f2fs - -mkfs.f2fs ---------- -The mkfs.f2fs is for the use of formatting a partition as the f2fs filesystem, -which builds a basic on-disk layout. - -The options consist of: --l [label] : Give a volume label, up to 512 unicode name. --a [0 or 1] : Split start location of each area for heap-based allocation. - 1 is set by default, which performs this. --o [int] : Set overprovision ratio in percent over volume size. - 5 is set by default. --s [int] : Set the number of segments per section. - 1 is set by default. --z [int] : Set the number of sections per zone. - 1 is set by default. --e [str] : Set basic extension list. e.g. "mp3,gif,mov" --t [0 or 1] : Disable discard command or not. - 1 is set by default, which conducts discard. - -fsck.f2fs ---------- -The fsck.f2fs is a tool to check the consistency of an f2fs-formatted -partition, which examines whether the filesystem metadata and user-made data -are cross-referenced correctly or not. -Note that, initial version of the tool does not fix any inconsistency. - -The options consist of: - -d debug level [default:0] - -dump.f2fs ---------- -The dump.f2fs shows the information of specific inode and dumps SSA and SIT to -file. Each file is dump_ssa and dump_sit. - -The dump.f2fs is used to debug on-disk data structures of the f2fs filesystem. -It shows on-disk inode information recognized by a given inode number, and is -able to dump all the SSA and SIT entries into predefined files, ./dump_ssa and -./dump_sit respectively. - -The options consist of: - -d debug level [default:0] - -i inode no (hex) - -s [SIT dump segno from #1~#2 (decimal), for all 0~-1] - -a [SSA dump segno from #1~#2 (decimal), for all 0~-1] - -Examples: -# dump.f2fs -i [ino] /dev/sdx -# dump.f2fs -s 0~-1 /dev/sdx (SIT dump) -# dump.f2fs -a 0~-1 /dev/sdx (SSA dump) - -================================================================================ -DESIGN -================================================================================ - -On-disk Layout --------------- - -F2FS divides the whole volume into a number of segments, each of which is fixed -to 2MB in size. A section is composed of consecutive segments, and a zone -consists of a set of sections. By default, section and zone sizes are set to one -segment size identically, but users can easily modify the sizes by mkfs. - -F2FS splits the entire volume into six areas, and all the areas except superblock -consists of multiple segments as described below. - - align with the zone size <-| - |-> align with the segment size - _________________________________________________________________________ - | | | Segment | Node | Segment | | - | Superblock | Checkpoint | Info. | Address | Summary | Main | - | (SB) | (CP) | Table (SIT) | Table (NAT) | Area (SSA) | | - |____________|_____2______|______N______|______N______|______N_____|__N___| - . . - . . - . . - ._________________________________________. - |_Segment_|_..._|_Segment_|_..._|_Segment_| - . . - ._________._________ - |_section_|__...__|_ - . . - .________. - |__zone__| - -- Superblock (SB) - : It is located at the beginning of the partition, and there exist two copies - to avoid file system crash. It contains basic partition information and some - default parameters of f2fs. - -- Checkpoint (CP) - : It contains file system information, bitmaps for valid NAT/SIT sets, orphan - inode lists, and summary entries of current active segments. - -- Segment Information Table (SIT) - : It contains segment information such as valid block count and bitmap for the - validity of all the blocks. - -- Node Address Table (NAT) - : It is composed of a block address table for all the node blocks stored in - Main area. - -- Segment Summary Area (SSA) - : It contains summary entries which contains the owner information of all the - data and node blocks stored in Main area. - -- Main Area - : It contains file and directory data including their indices. - -In order to avoid misalignment between file system and flash-based storage, F2FS -aligns the start block address of CP with the segment size. Also, it aligns the -start block address of Main area with the zone size by reserving some segments -in SSA area. - -Reference the following survey for additional technical details. -https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey - -File System Metadata Structure ------------------------------- - -F2FS adopts the checkpointing scheme to maintain file system consistency. At -mount time, F2FS first tries to find the last valid checkpoint data by scanning -CP area. In order to reduce the scanning time, F2FS uses only two copies of CP. -One of them always indicates the last valid data, which is called as shadow copy -mechanism. In addition to CP, NAT and SIT also adopt the shadow copy mechanism. - -For file system consistency, each CP points to which NAT and SIT copies are -valid, as shown as below. - - +--------+----------+---------+ - | CP | SIT | NAT | - +--------+----------+---------+ - . . . . - . . . . - . . . . - +-------+-------+--------+--------+--------+--------+ - | CP #0 | CP #1 | SIT #0 | SIT #1 | NAT #0 | NAT #1 | - +-------+-------+--------+--------+--------+--------+ - | ^ ^ - | | | - `----------------------------------------' - -Index Structure ---------------- - -The key data structure to manage the data locations is a "node". Similar to -traditional file structures, F2FS has three types of node: inode, direct node, -indirect node. F2FS assigns 4KB to an inode block which contains 923 data block -indices, two direct node pointers, two indirect node pointers, and one double -indirect node pointer as described below. One direct node block contains 1018 -data blocks, and one indirect node block contains also 1018 node blocks. Thus, -one inode block (i.e., a file) covers: - - 4KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB. - - Inode block (4KB) - |- data (923) - |- direct node (2) - | `- data (1018) - |- indirect node (2) - | `- direct node (1018) - | `- data (1018) - `- double indirect node (1) - `- indirect node (1018) - `- direct node (1018) - `- data (1018) - -Note that, all the node blocks are mapped by NAT which means the location of -each node is translated by the NAT table. In the consideration of the wandering -tree problem, F2FS is able to cut off the propagation of node updates caused by -leaf data writes. - -Directory Structure -------------------- - -A directory entry occupies 11 bytes, which consists of the following attributes. - -- hash hash value of the file name -- ino inode number -- len the length of file name -- type file type such as directory, symlink, etc - -A dentry block consists of 214 dentry slots and file names. Therein a bitmap is -used to represent whether each dentry is valid or not. A dentry block occupies -4KB with the following composition. - - Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) + - dentries(11 * 214 bytes) + file name (8 * 214 bytes) - - [Bucket] - +--------------------------------+ - |dentry block 1 | dentry block 2 | - +--------------------------------+ - . . - . . - . [Dentry Block Structure: 4KB] . - +--------+----------+----------+------------+ - | bitmap | reserved | dentries | file names | - +--------+----------+----------+------------+ - [Dentry Block: 4KB] . . - . . - . . - +------+------+-----+------+ - | hash | ino | len | type | - +------+------+-----+------+ - [Dentry Structure: 11 bytes] - -F2FS implements multi-level hash tables for directory structure. Each level has -a hash table with dedicated number of hash buckets as shown below. Note that -"A(2B)" means a bucket includes 2 data blocks. - ----------------------- -A : bucket -B : block -N : MAX_DIR_HASH_DEPTH ----------------------- - -level #0 | A(2B) - | -level #1 | A(2B) - A(2B) - | -level #2 | A(2B) - A(2B) - A(2B) - A(2B) - . | . . . . -level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B) - . | . . . . -level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B) - -The number of blocks and buckets are determined by, - - ,- 2, if n < MAX_DIR_HASH_DEPTH / 2, - # of blocks in level #n = | - `- 4, Otherwise - - ,- 2^(n + dir_level), - | if n + dir_level < MAX_DIR_HASH_DEPTH / 2, - # of buckets in level #n = | - `- 2^((MAX_DIR_HASH_DEPTH / 2) - 1), - Otherwise - -When F2FS finds a file name in a directory, at first a hash value of the file -name is calculated. Then, F2FS scans the hash table in level #0 to find the -dentry consisting of the file name and its inode number. If not found, F2FS -scans the next hash table in level #1. In this way, F2FS scans hash tables in -each levels incrementally from 1 to N. In each levels F2FS needs to scan only -one bucket determined by the following equation, which shows O(log(# of files)) -complexity. - - bucket number to scan in level #n = (hash value) % (# of buckets in level #n) - -In the case of file creation, F2FS finds empty consecutive slots that cover the -file name. F2FS searches the empty slots in the hash tables of whole levels from -1 to N in the same way as the lookup operation. - -The following figure shows an example of two cases holding children. - --------------> Dir <-------------- - | | - child child - - child - child [hole] - child - - child - child - child [hole] - [hole] - child - - Case 1: Case 2: - Number of children = 6, Number of children = 3, - File size = 7 File size = 7 - -Default Block Allocation ------------------------- - -At runtime, F2FS manages six active logs inside "Main" area: Hot/Warm/Cold node -and Hot/Warm/Cold data. - -- Hot node contains direct node blocks of directories. -- Warm node contains direct node blocks except hot node blocks. -- Cold node contains indirect node blocks -- Hot data contains dentry blocks -- Warm data contains data blocks except hot and cold data blocks -- Cold data contains multimedia data or migrated data blocks - -LFS has two schemes for free space management: threaded log and copy-and-compac- -tion. The copy-and-compaction scheme which is known as cleaning, is well-suited -for devices showing very good sequential write performance, since free segments -are served all the time for writing new data. However, it suffers from cleaning -overhead under high utilization. Contrarily, the threaded log scheme suffers -from random writes, but no cleaning process is needed. F2FS adopts a hybrid -scheme where the copy-and-compaction scheme is adopted by default, but the -policy is dynamically changed to the threaded log scheme according to the file -system status. - -In order to align F2FS with underlying flash-based storage, F2FS allocates a -segment in a unit of section. F2FS expects that the section size would be the -same as the unit size of garbage collection in FTL. Furthermore, with respect -to the mapping granularity in FTL, F2FS allocates each section of the active -logs from different zones as much as possible, since FTL can write the data in -the active logs into one allocation unit according to its mapping granularity. - -Cleaning process ----------------- - -F2FS does cleaning both on demand and in the background. On-demand cleaning is -triggered when there are not enough free segments to serve VFS calls. Background -cleaner is operated by a kernel thread, and triggers the cleaning job when the -system is idle. - -F2FS supports two victim selection policies: greedy and cost-benefit algorithms. -In the greedy algorithm, F2FS selects a victim segment having the smallest number -of valid blocks. In the cost-benefit algorithm, F2FS selects a victim segment -according to the segment age and the number of valid blocks in order to address -log block thrashing problem in the greedy algorithm. F2FS adopts the greedy -algorithm for on-demand cleaner, while background cleaner adopts cost-benefit -algorithm. - -In order to identify whether the data in the victim segment are valid or not, -F2FS manages a bitmap. Each bit represents the validity of a block, and the -bitmap is composed of a bit stream covering whole blocks in main area. - -Write-hint Policy ------------------ - -1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET. - -2) whint_mode=user-based. F2FS tries to pass down hints given by -users. - -User F2FS Block ----- ---- ----- - META WRITE_LIFE_NOT_SET - HOT_NODE " - WARM_NODE " - COLD_NODE " -*ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME -*extension list " " - --- buffered io -WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME -WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT -WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET -WRITE_LIFE_NONE " " -WRITE_LIFE_MEDIUM " " -WRITE_LIFE_LONG " " - --- direct io -WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME -WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT -WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET -WRITE_LIFE_NONE " WRITE_LIFE_NONE -WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM -WRITE_LIFE_LONG " WRITE_LIFE_LONG - -3) whint_mode=fs-based. F2FS passes down hints with its policy. - -User F2FS Block ----- ---- ----- - META WRITE_LIFE_MEDIUM; - HOT_NODE WRITE_LIFE_NOT_SET - WARM_NODE " - COLD_NODE WRITE_LIFE_NONE -ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME -extension list " " - --- buffered io -WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME -WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT -WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_LONG -WRITE_LIFE_NONE " " -WRITE_LIFE_MEDIUM " " -WRITE_LIFE_LONG " " - --- direct io -WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME -WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT -WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET -WRITE_LIFE_NONE " WRITE_LIFE_NONE -WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM -WRITE_LIFE_LONG " WRITE_LIFE_LONG - -Fallocate(2) Policy -------------------- - -The default policy follows the below posix rule. - -Allocating disk space - The default operation (i.e., mode is zero) of fallocate() allocates - the disk space within the range specified by offset and len. The - file size (as reported by stat(2)) will be changed if offset+len is - greater than the file size. Any subregion within the range specified - by offset and len that did not contain data before the call will be - initialized to zero. This default behavior closely resembles the - behavior of the posix_fallocate(3) library function, and is intended - as a method of optimally implementing that function. - -However, once F2FS receives ioctl(fd, F2FS_IOC_SET_PIN_FILE) in prior to -fallocate(fd, DEFAULT_MODE), it allocates on-disk blocks addressess having -zero or random data, which is useful to the below scenario where: - 1. create(fd) - 2. ioctl(fd, F2FS_IOC_SET_PIN_FILE) - 3. fallocate(fd, 0, 0, size) - 4. address = fibmap(fd, offset) - 5. open(blkdev) - 6. write(blkdev, address) - -Compression implementation --------------------------- - -- New term named cluster is defined as basic unit of compression, file can -be divided into multiple clusters logically. One cluster includes 4 << n -(n >= 0) logical pages, compression size is also cluster size, each of -cluster can be compressed or not. - -- In cluster metadata layout, one special block address is used to indicate -cluster is compressed one or normal one, for compressed cluster, following -metadata maps cluster to [1, 4 << n - 1] physical blocks, in where f2fs -stores data including compress header and compressed data. - -- In order to eliminate write amplification during overwrite, F2FS only -support compression on write-once file, data can be compressed only when -all logical blocks in file are valid and cluster compress ratio is lower -than specified threshold. - -- To enable compression on regular inode, there are three ways: -* chattr +c file -* chattr +c dir; touch dir/file -* mount w/ -o compress_extension=ext; touch file.ext - -Compress metadata layout: - [Dnode Structure] - +-----------------------------------------------+ - | cluster 1 | cluster 2 | ......... | cluster N | - +-----------------------------------------------+ - . . . . - . . . . - . Compressed Cluster . . Normal Cluster . -+----------+---------+---------+---------+ +---------+---------+---------+---------+ -|compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 | -+----------+---------+---------+---------+ +---------+---------+---------+---------+ - . . - . . - . . - +-------------+-------------+----------+----------------------------+ - | data length | data chksum | reserved | compressed data | - +-------------+-------------+----------+----------------------------+ diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index aa2c3d1de3de..f69d20406be0 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -64,6 +64,7 @@ Documentation for filesystem implementations. erofs ext2 ext3 + f2fs fuse overlayfs virtiofs -- cgit From 720c2fc1ec7cb36bfc5326603522bc3955534773 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:05 +0100 Subject: docs: filesystems: convert gfs2.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Bob Peterson Link: https://lore.kernel.org/r/6d7a296de025bcfed7a229da7f8cc1678944f304.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/gfs2.rst | 53 +++++++++++++++++++++++++++++++++++++ Documentation/filesystems/gfs2.txt | 45 ------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 54 insertions(+), 45 deletions(-) create mode 100644 Documentation/filesystems/gfs2.rst delete mode 100644 Documentation/filesystems/gfs2.txt diff --git a/Documentation/filesystems/gfs2.rst b/Documentation/filesystems/gfs2.rst new file mode 100644 index 000000000000..8d1ab589ce18 --- /dev/null +++ b/Documentation/filesystems/gfs2.rst @@ -0,0 +1,53 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================== +Global File System +================== + +https://fedorahosted.org/cluster/wiki/HomePage + +GFS is a cluster file system. It allows a cluster of computers to +simultaneously use a block device that is shared between them (with FC, +iSCSI, NBD, etc). GFS reads and writes to the block device like a local +file system, but also uses a lock module to allow the computers coordinate +their I/O so file system consistency is maintained. One of the nifty +features of GFS is perfect consistency -- changes made to the file system +on one machine show up immediately on all other machines in the cluster. + +GFS uses interchangeable inter-node locking mechanisms, the currently +supported mechanisms are: + + lock_nolock + - allows gfs to be used as a local file system + + lock_dlm + - uses a distributed lock manager (dlm) for inter-node locking. + The dlm is found at linux/fs/dlm/ + +Lock_dlm depends on user space cluster management systems found +at the URL above. + +To use gfs as a local file system, no external clustering systems are +needed, simply:: + + $ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device + $ mount -t gfs2 /dev/block_device /dir + +If you are using Fedora, you need to install the gfs2-utils package +and, for lock_dlm, you will also need to install the cman package +and write a cluster.conf as per the documentation. For F17 and above +cman has been replaced by the dlm package. + +GFS2 is not on-disk compatible with previous versions of GFS, but it +is pretty close. + +The following man pages can be found at the URL above: + + ============ ============================================= + fsck.gfs2 to repair a filesystem + gfs2_grow to expand a filesystem online + gfs2_jadd to add journals to a filesystem online + tunegfs2 to manipulate, examine and tune a filesystem + gfs2_convert to convert a gfs filesystem to gfs2 in-place + mkfs.gfs2 to make a filesystem + ============ ============================================= diff --git a/Documentation/filesystems/gfs2.txt b/Documentation/filesystems/gfs2.txt deleted file mode 100644 index cc4f2306609e..000000000000 --- a/Documentation/filesystems/gfs2.txt +++ /dev/null @@ -1,45 +0,0 @@ -Global File System ------------------- - -https://fedorahosted.org/cluster/wiki/HomePage - -GFS is a cluster file system. It allows a cluster of computers to -simultaneously use a block device that is shared between them (with FC, -iSCSI, NBD, etc). GFS reads and writes to the block device like a local -file system, but also uses a lock module to allow the computers coordinate -their I/O so file system consistency is maintained. One of the nifty -features of GFS is perfect consistency -- changes made to the file system -on one machine show up immediately on all other machines in the cluster. - -GFS uses interchangeable inter-node locking mechanisms, the currently -supported mechanisms are: - - lock_nolock -- allows gfs to be used as a local file system - - lock_dlm -- uses a distributed lock manager (dlm) for inter-node locking - The dlm is found at linux/fs/dlm/ - -Lock_dlm depends on user space cluster management systems found -at the URL above. - -To use gfs as a local file system, no external clustering systems are -needed, simply: - - $ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device - $ mount -t gfs2 /dev/block_device /dir - -If you are using Fedora, you need to install the gfs2-utils package -and, for lock_dlm, you will also need to install the cman package -and write a cluster.conf as per the documentation. For F17 and above -cman has been replaced by the dlm package. - -GFS2 is not on-disk compatible with previous versions of GFS, but it -is pretty close. - -The following man pages can be found at the URL above: - fsck.gfs2 to repair a filesystem - gfs2_grow to expand a filesystem online - gfs2_jadd to add journals to a filesystem online - tunegfs2 to manipulate, examine and tune a filesystem - gfs2_convert to convert a gfs filesystem to gfs2 in-place - mkfs.gfs2 to make a filesystem diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index f69d20406be0..f24befe78326 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -65,6 +65,7 @@ Documentation for filesystem implementations. ext2 ext3 f2fs + gfs2 fuse overlayfs virtiofs -- cgit From 5b7ac27a6e2c54cc09f479b616f1076afeae3c1b Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:06 +0100 Subject: docs: filesystems: convert gfs2-uevents.txt to ReST This document is almost in ReST format: all it needs is to have the titles adjusted and add a SPDX header. In other words: - Add a SPDX header; - Add a document title; - Adjust section titles; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Bob Peterson Link: https://lore.kernel.org/r/1d1c46b7e86bd0a18d9abbea0de0bc2be84e5e2b.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/gfs2-uevents.rst | 112 +++++++++++++++++++++++++++++ Documentation/filesystems/gfs2-uevents.txt | 100 -------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 113 insertions(+), 100 deletions(-) create mode 100644 Documentation/filesystems/gfs2-uevents.rst delete mode 100644 Documentation/filesystems/gfs2-uevents.txt diff --git a/Documentation/filesystems/gfs2-uevents.rst b/Documentation/filesystems/gfs2-uevents.rst new file mode 100644 index 000000000000..f162a2c76c69 --- /dev/null +++ b/Documentation/filesystems/gfs2-uevents.rst @@ -0,0 +1,112 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================ +uevents and GFS2 +================ + +During the lifetime of a GFS2 mount, a number of uevents are generated. +This document explains what the events are and what they are used +for (by gfs_controld in gfs2-utils). + +A list of GFS2 uevents +====================== + +1. ADD +------ + +The ADD event occurs at mount time. It will always be the first +uevent generated by the newly created filesystem. If the mount +is successful, an ONLINE uevent will follow. If it is not successful +then a REMOVE uevent will follow. + +The ADD uevent has two environment variables: SPECTATOR=[0|1] +and RDONLY=[0|1] that specify the spectator status (a read-only mount +with no journal assigned), and read-only (with journal assigned) status +of the filesystem respectively. + +2. ONLINE +--------- + +The ONLINE uevent is generated after a successful mount or remount. It +has the same environment variables as the ADD uevent. The ONLINE +uevent, along with the two environment variables for spectator and +RDONLY are a relatively recent addition (2.6.32-rc+) and will not +be generated by older kernels. + +3. CHANGE +--------- + +The CHANGE uevent is used in two places. One is when reporting the +successful mount of the filesystem by the first node (FIRSTMOUNT=Done). +This is used as a signal by gfs_controld that it is then ok for other +nodes in the cluster to mount the filesystem. + +The other CHANGE uevent is used to inform of the completion +of journal recovery for one of the filesystems journals. It has +two environment variables, JID= which specifies the journal id which +has just been recovered, and RECOVERY=[Done|Failed] to indicate the +success (or otherwise) of the operation. These uevents are generated +for every journal recovered, whether it is during the initial mount +process or as the result of gfs_controld requesting a specific journal +recovery via the /sys/fs/gfs2//lock_module/recovery file. + +Because the CHANGE uevent was used (in early versions of gfs_controld) +without checking the environment variables to discover the state, we +cannot add any more functions to it without running the risk of +someone using an older version of the user tools and breaking their +cluster. For this reason the ONLINE uevent was used when adding a new +uevent for a successful mount or remount. + +4. OFFLINE +---------- + +The OFFLINE uevent is only generated due to filesystem errors and is used +as part of the "withdraw" mechanism. Currently this doesn't give any +information about what the error is, which is something that needs to +be fixed. + +5. REMOVE +--------- + +The REMOVE uevent is generated at the end of an unsuccessful mount +or at the end of a umount of the filesystem. All REMOVE uevents will +have been preceded by at least an ADD uevent for the same filesystem, +and unlike the other uevents is generated automatically by the kernel's +kobject subsystem. + + +Information common to all GFS2 uevents (uevent environment variables) +===================================================================== + +1. LOCKTABLE= +-------------- + +The LOCKTABLE is a string, as supplied on the mount command +line (locktable=) or via fstab. It is used as a filesystem label +as well as providing the information for a lock_dlm mount to be +able to join the cluster. + +2. LOCKPROTO= +------------- + +The LOCKPROTO is a string, and its value depends on what is set +on the mount command line, or via fstab. It will be either +lock_nolock or lock_dlm. In the future other lock managers +may be supported. + +3. JOURNALID= +------------- + +If a journal is in use by the filesystem (journals are not +assigned for spectator mounts) then this will give the +numeric journal id in all GFS2 uevents. + +4. UUID= +-------- + +With recent versions of gfs2-utils, mkfs.gfs2 writes a UUID +into the filesystem superblock. If it exists, this will +be included in every uevent relating to the filesystem. + + + diff --git a/Documentation/filesystems/gfs2-uevents.txt b/Documentation/filesystems/gfs2-uevents.txt deleted file mode 100644 index 19a19ebebc34..000000000000 --- a/Documentation/filesystems/gfs2-uevents.txt +++ /dev/null @@ -1,100 +0,0 @@ - uevents and GFS2 - ================== - -During the lifetime of a GFS2 mount, a number of uevents are generated. -This document explains what the events are and what they are used -for (by gfs_controld in gfs2-utils). - -A list of GFS2 uevents ------------------------ - -1. ADD - -The ADD event occurs at mount time. It will always be the first -uevent generated by the newly created filesystem. If the mount -is successful, an ONLINE uevent will follow. If it is not successful -then a REMOVE uevent will follow. - -The ADD uevent has two environment variables: SPECTATOR=[0|1] -and RDONLY=[0|1] that specify the spectator status (a read-only mount -with no journal assigned), and read-only (with journal assigned) status -of the filesystem respectively. - -2. ONLINE - -The ONLINE uevent is generated after a successful mount or remount. It -has the same environment variables as the ADD uevent. The ONLINE -uevent, along with the two environment variables for spectator and -RDONLY are a relatively recent addition (2.6.32-rc+) and will not -be generated by older kernels. - -3. CHANGE - -The CHANGE uevent is used in two places. One is when reporting the -successful mount of the filesystem by the first node (FIRSTMOUNT=Done). -This is used as a signal by gfs_controld that it is then ok for other -nodes in the cluster to mount the filesystem. - -The other CHANGE uevent is used to inform of the completion -of journal recovery for one of the filesystems journals. It has -two environment variables, JID= which specifies the journal id which -has just been recovered, and RECOVERY=[Done|Failed] to indicate the -success (or otherwise) of the operation. These uevents are generated -for every journal recovered, whether it is during the initial mount -process or as the result of gfs_controld requesting a specific journal -recovery via the /sys/fs/gfs2//lock_module/recovery file. - -Because the CHANGE uevent was used (in early versions of gfs_controld) -without checking the environment variables to discover the state, we -cannot add any more functions to it without running the risk of -someone using an older version of the user tools and breaking their -cluster. For this reason the ONLINE uevent was used when adding a new -uevent for a successful mount or remount. - -4. OFFLINE - -The OFFLINE uevent is only generated due to filesystem errors and is used -as part of the "withdraw" mechanism. Currently this doesn't give any -information about what the error is, which is something that needs to -be fixed. - -5. REMOVE - -The REMOVE uevent is generated at the end of an unsuccessful mount -or at the end of a umount of the filesystem. All REMOVE uevents will -have been preceded by at least an ADD uevent for the same filesystem, -and unlike the other uevents is generated automatically by the kernel's -kobject subsystem. - - -Information common to all GFS2 uevents (uevent environment variables) ----------------------------------------------------------------------- - -1. LOCKTABLE= - -The LOCKTABLE is a string, as supplied on the mount command -line (locktable=) or via fstab. It is used as a filesystem label -as well as providing the information for a lock_dlm mount to be -able to join the cluster. - -2. LOCKPROTO= - -The LOCKPROTO is a string, and its value depends on what is set -on the mount command line, or via fstab. It will be either -lock_nolock or lock_dlm. In the future other lock managers -may be supported. - -3. JOURNALID= - -If a journal is in use by the filesystem (journals are not -assigned for spectator mounts) then this will give the -numeric journal id in all GFS2 uevents. - -4. UUID= - -With recent versions of gfs2-utils, mkfs.gfs2 writes a UUID -into the filesystem superblock. If it exists, this will -be included in every uevent relating to the filesystem. - - - diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index f24befe78326..c16e517e37c5 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -66,6 +66,7 @@ Documentation for filesystem implementations. ext3 f2fs gfs2 + gfs2-uevents fuse overlayfs virtiofs -- cgit From cdded7db3625c98e66316911947bd3a1941992e2 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:07 +0100 Subject: docs: filesystems: convert hfsplus.txt to ReST Just trivial changes: - Add a SPDX header; - Add it to filesystems/index.rst. While here, adjust document title, just to make it use the same style of the other docs. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/4298409da951fbee000201a6c8d9c85e961b2b79.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/hfsplus.rst | 61 +++++++++++++++++++++++++++++++++++ Documentation/filesystems/hfsplus.txt | 59 --------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 62 insertions(+), 59 deletions(-) create mode 100644 Documentation/filesystems/hfsplus.rst delete mode 100644 Documentation/filesystems/hfsplus.txt diff --git a/Documentation/filesystems/hfsplus.rst b/Documentation/filesystems/hfsplus.rst new file mode 100644 index 000000000000..f02f4f5fc020 --- /dev/null +++ b/Documentation/filesystems/hfsplus.rst @@ -0,0 +1,61 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================================== +Macintosh HFSPlus Filesystem for Linux +====================================== + +HFSPlus is a filesystem first introduced in MacOS 8.1. +HFSPlus has several extensions to HFS, including 32-bit allocation +blocks, 255-character unicode filenames, and file sizes of 2^63 bytes. + + +Mount options +============= + +When mounting an HFSPlus filesystem, the following options are accepted: + + creator=cccc, type=cccc + Specifies the creator/type values as shown by the MacOS finder + used for creating new files. Default values: '????'. + + uid=n, gid=n + Specifies the user/group that owns all files on the filesystem + that have uninitialized permissions structures. + Default: user/group id of the mounting process. + + umask=n + Specifies the umask (in octal) used for files and directories + that have uninitialized permissions structures. + Default: umask of the mounting process. + + session=n + Select the CDROM session to mount as HFSPlus filesystem. Defaults to + leaving that decision to the CDROM driver. This option will fail + with anything but a CDROM as underlying devices. + + part=n + Select partition number n from the devices. This option only makes + sense for CDROMs because they can't be partitioned under Linux. + For disk devices the generic partition parsing code does this + for us. Defaults to not parsing the partition table at all. + + decompose + Decompose file name characters. + + nodecompose + Do not decompose file name characters. + + force + Used to force write access to volumes that are marked as journalled + or locked. Use at your own risk. + + nls=cccc + Encoding to use when presenting file names. + + +References +========== + +kernel source: + +Apple Technote 1150 https://developer.apple.com/legacy/library/technotes/tn/tn1150.html diff --git a/Documentation/filesystems/hfsplus.txt b/Documentation/filesystems/hfsplus.txt deleted file mode 100644 index 59f7569fc9ed..000000000000 --- a/Documentation/filesystems/hfsplus.txt +++ /dev/null @@ -1,59 +0,0 @@ - -Macintosh HFSPlus Filesystem for Linux -====================================== - -HFSPlus is a filesystem first introduced in MacOS 8.1. -HFSPlus has several extensions to HFS, including 32-bit allocation -blocks, 255-character unicode filenames, and file sizes of 2^63 bytes. - - -Mount options -============= - -When mounting an HFSPlus filesystem, the following options are accepted: - - creator=cccc, type=cccc - Specifies the creator/type values as shown by the MacOS finder - used for creating new files. Default values: '????'. - - uid=n, gid=n - Specifies the user/group that owns all files on the filesystem - that have uninitialized permissions structures. - Default: user/group id of the mounting process. - - umask=n - Specifies the umask (in octal) used for files and directories - that have uninitialized permissions structures. - Default: umask of the mounting process. - - session=n - Select the CDROM session to mount as HFSPlus filesystem. Defaults to - leaving that decision to the CDROM driver. This option will fail - with anything but a CDROM as underlying devices. - - part=n - Select partition number n from the devices. This option only makes - sense for CDROMs because they can't be partitioned under Linux. - For disk devices the generic partition parsing code does this - for us. Defaults to not parsing the partition table at all. - - decompose - Decompose file name characters. - - nodecompose - Do not decompose file name characters. - - force - Used to force write access to volumes that are marked as journalled - or locked. Use at your own risk. - - nls=cccc - Encoding to use when presenting file names. - - -References -========== - -kernel source: - -Apple Technote 1150 https://developer.apple.com/legacy/library/technotes/tn/tn1150.html diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index c16e517e37c5..c351bc8a8c85 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -67,6 +67,7 @@ Documentation for filesystem implementations. f2fs gfs2 gfs2-uevents + hfsplus fuse overlayfs virtiofs -- cgit From 5040a0acc8f2300ef35a1d9cc1c50a25235e061d Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:08 +0100 Subject: docs: filesystems: convert hfs.txt to ReST - Add a SPDX header; - Adjust document and section titles; - Use notes markups; - Add lists markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/8a625d6652d88809730020048d26c3b9333ddbdf.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/hfs.rst | 87 +++++++++++++++++++++++++++++++++++++ Documentation/filesystems/hfs.txt | 82 ---------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 88 insertions(+), 82 deletions(-) create mode 100644 Documentation/filesystems/hfs.rst delete mode 100644 Documentation/filesystems/hfs.txt diff --git a/Documentation/filesystems/hfs.rst b/Documentation/filesystems/hfs.rst new file mode 100644 index 000000000000..ab17a005e9b1 --- /dev/null +++ b/Documentation/filesystems/hfs.rst @@ -0,0 +1,87 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================== +Macintosh HFS Filesystem for Linux +================================== + + +.. Note:: This filesystem doesn't have a maintainer. + + +HFS stands for ``Hierarchical File System`` and is the filesystem used +by the Mac Plus and all later Macintosh models. Earlier Macintosh +models used MFS (``Macintosh File System``), which is not supported, +MacOS 8.1 and newer support a filesystem called HFS+ that's similar to +HFS but is extended in various areas. Use the hfsplus filesystem driver +to access such filesystems from Linux. + + +Mount options +============= + +When mounting an HFS filesystem, the following options are accepted: + + creator=cccc, type=cccc + Specifies the creator/type values as shown by the MacOS finder + used for creating new files. Default values: '????'. + + uid=n, gid=n + Specifies the user/group that owns all files on the filesystems. + Default: user/group id of the mounting process. + + dir_umask=n, file_umask=n, umask=n + Specifies the umask used for all files , all directories or all + files and directories. Defaults to the umask of the mounting process. + + session=n + Select the CDROM session to mount as HFS filesystem. Defaults to + leaving that decision to the CDROM driver. This option will fail + with anything but a CDROM as underlying devices. + + part=n + Select partition number n from the devices. Does only makes + sense for CDROMS because they can't be partitioned under Linux. + For disk devices the generic partition parsing code does this + for us. Defaults to not parsing the partition table at all. + + quiet + Ignore invalid mount options instead of complaining. + + +Writing to HFS Filesystems +========================== + +HFS is not a UNIX filesystem, thus it does not have the usual features you'd +expect: + + * You can't modify the set-uid, set-gid, sticky or executable bits or the uid + and gid of files. + * You can't create hard- or symlinks, device files, sockets or FIFOs. + +HFS does on the other have the concepts of multiple forks per file. These +non-standard forks are represented as hidden additional files in the normal +filesystems namespace which is kind of a cludge and makes the semantics for +the a little strange: + + * You can't create, delete or rename resource forks of files or the + Finder's metadata. + * They are however created (with default values), deleted and renamed + along with the corresponding data fork or directory. + * Copying files to a different filesystem will loose those attributes + that are essential for MacOS to work. + + +Creating HFS filesystems +======================== + +The hfsutils package from Robert Leslie contains a program called +hformat that can be used to create HFS filesystem. See + for details. + + +Credits +======= + +The HFS drivers was written by Paul H. Hargrovea (hargrove@sccm.Stanford.EDU). +Roman Zippel (roman@ardistech.com) rewrote large parts of the code and brought +in btree routines derived from Brad Boyer's hfsplus driver. diff --git a/Documentation/filesystems/hfs.txt b/Documentation/filesystems/hfs.txt deleted file mode 100644 index d096df6db07a..000000000000 --- a/Documentation/filesystems/hfs.txt +++ /dev/null @@ -1,82 +0,0 @@ -Note: This filesystem doesn't have a maintainer. - -Macintosh HFS Filesystem for Linux -================================== - -HFS stands for ``Hierarchical File System'' and is the filesystem used -by the Mac Plus and all later Macintosh models. Earlier Macintosh -models used MFS (``Macintosh File System''), which is not supported, -MacOS 8.1 and newer support a filesystem called HFS+ that's similar to -HFS but is extended in various areas. Use the hfsplus filesystem driver -to access such filesystems from Linux. - - -Mount options -============= - -When mounting an HFS filesystem, the following options are accepted: - - creator=cccc, type=cccc - Specifies the creator/type values as shown by the MacOS finder - used for creating new files. Default values: '????'. - - uid=n, gid=n - Specifies the user/group that owns all files on the filesystems. - Default: user/group id of the mounting process. - - dir_umask=n, file_umask=n, umask=n - Specifies the umask used for all files , all directories or all - files and directories. Defaults to the umask of the mounting process. - - session=n - Select the CDROM session to mount as HFS filesystem. Defaults to - leaving that decision to the CDROM driver. This option will fail - with anything but a CDROM as underlying devices. - - part=n - Select partition number n from the devices. Does only makes - sense for CDROMS because they can't be partitioned under Linux. - For disk devices the generic partition parsing code does this - for us. Defaults to not parsing the partition table at all. - - quiet - Ignore invalid mount options instead of complaining. - - -Writing to HFS Filesystems -========================== - -HFS is not a UNIX filesystem, thus it does not have the usual features you'd -expect: - - o You can't modify the set-uid, set-gid, sticky or executable bits or the uid - and gid of files. - o You can't create hard- or symlinks, device files, sockets or FIFOs. - -HFS does on the other have the concepts of multiple forks per file. These -non-standard forks are represented as hidden additional files in the normal -filesystems namespace which is kind of a cludge and makes the semantics for -the a little strange: - - o You can't create, delete or rename resource forks of files or the - Finder's metadata. - o They are however created (with default values), deleted and renamed - along with the corresponding data fork or directory. - o Copying files to a different filesystem will loose those attributes - that are essential for MacOS to work. - - -Creating HFS filesystems -=================================== - -The hfsutils package from Robert Leslie contains a program called -hformat that can be used to create HFS filesystem. See - for details. - - -Credits -======= - -The HFS drivers was written by Paul H. Hargrovea (hargrove@sccm.Stanford.EDU). -Roman Zippel (roman@ardistech.com) rewrote large parts of the code and brought -in btree routines derived from Brad Boyer's hfsplus driver. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index c351bc8a8c85..f776411340cb 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -67,6 +67,7 @@ Documentation for filesystem implementations. f2fs gfs2 gfs2-uevents + hfs hfsplus fuse overlayfs -- cgit From a1ef4bcd1664a9c1ae5191598b769ab37b93aa57 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:09 +0100 Subject: docs: filesystems: convert hpfs.txt to ReST - Add a SPDX header; - Adjust document and section titles; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/581019c3120938118aa55ba28902b62083c3f37a.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/hpfs.rst | 353 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/hpfs.txt | 296 ------------------------------ Documentation/filesystems/index.rst | 1 + 3 files changed, 354 insertions(+), 296 deletions(-) create mode 100644 Documentation/filesystems/hpfs.rst delete mode 100644 Documentation/filesystems/hpfs.txt diff --git a/Documentation/filesystems/hpfs.rst b/Documentation/filesystems/hpfs.rst new file mode 100644 index 000000000000..0db152278572 --- /dev/null +++ b/Documentation/filesystems/hpfs.rst @@ -0,0 +1,353 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +Read/Write HPFS 2.09 +==================== + +1998-2004, Mikulas Patocka + +:email: mikulas@artax.karlin.mff.cuni.cz +:homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi + +Credits +======= +Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file + is taken from it + +Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993) + +Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion + +Mount options + +uid=xxx,gid=xxx,umask=xxx (default uid=gid=0 umask=default_system_umask) + Set owner/group/mode for files that do not have it specified in extended + attributes. Mode is inverted umask - for example umask 027 gives owner + all permission, group read permission and anybody else no access. Note + that for files mode is anded with 0666. If you want files to have 'x' + rights, you must use extended attributes. +case=lower,asis (default asis) + File name lowercasing in readdir. +conv=binary,text,auto (default binary) + CR/LF -> LF conversion, if auto, decision is made according to extension + - there is a list of text extensions (I thing it's better to not convert + text file than to damage binary file). If you want to change that list, + change it in the source. Original readonly HPFS contained some strange + heuristic algorithm that I removed. I thing it's danger to let the + computer decide whether file is text or binary. For example, DJGPP + binaries contain small text message at the beginning and they could be + misidentified and damaged under some circumstances. +check=none,normal,strict (default normal) + Check level. Selecting none will cause only little speedup and big + danger. I tried to write it so that it won't crash if check=normal on + corrupted filesystems. check=strict means many superfluous checks - + used for debugging (for example it checks if file is allocated in + bitmaps when accessing it). +errors=continue,remount-ro,panic (default remount-ro) + Behaviour when filesystem errors found. +chkdsk=no,errors,always (default errors) + When to mark filesystem dirty so that OS/2 checks it. +eas=no,ro,rw (default rw) + What to do with extended attributes. 'no' - ignore them and use always + values specified in uid/gid/mode options. 'ro' - read extended + attributes but do not create them. 'rw' - create extended attributes + when you use chmod/chown/chgrp/mknod/ln -s on the filesystem. +timeshift=(-)nnn (default 0) + Shifts the time by nnn seconds. For example, if you see under linux + one hour more, than under os/2, use timeshift=-3600. + + +File names +========== + +As in OS/2, filenames are case insensitive. However, shell thinks that names +are case sensitive, so for example when you create a file FOO, you can use +'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note, that you +also won't be able to compile linux kernel (and maybe other things) on HPFS +because kernel creates different files with names like bootsect.S and +bootsect.s. When searching for file thats name has characters >= 128, codepages +are used - see below. +OS/2 ignores dots and spaces at the end of file name, so this driver does as +well. If you create 'a. ...', the file 'a' will be created, but you can still +access it under names 'a.', 'a..', 'a . . . ' etc. + + +Extended attributes +=================== + +On HPFS partitions, OS/2 can associate to each file a special information called +extended attributes. Extended attributes are pairs of (key,value) where key is +an ascii string identifying that attribute and value is any string of bytes of +variable length. OS/2 stores window and icon positions and file types there. So +why not use it for unix-specific info like file owner or access rights? This +driver can do it. If you chown/chgrp/chmod on a hpfs partition, extended +attributes with keys "UID", "GID" or "MODE" and 2-byte values are created. Only +that extended attributes those value differs from defaults specified in mount +options are created. Once created, the extended attributes are never deleted, +they're just changed. It means that when your default uid=0 and you type +something like 'chown luser file; chown root file' the file will contain +extended attribute UID=0. And when you umount the fs and mount it again with +uid=luser_uid, the file will be still owned by root! If you chmod file to 444, +extended attribute "MODE" will not be set, this special case is done by setting +read-only flag. When you mknod a block or char device, besides "MODE", the +special 4-byte extended attribute "DEV" will be created containing the device +number. Currently this driver cannot resize extended attributes - it means +that if somebody (I don't know who?) has set "UID", "GID", "MODE" or "DEV" +attributes with different sizes, they won't be rewritten and changing these +values doesn't work. + + +Symlinks +======== + +You can do symlinks on HPFS partition, symlinks are achieved by setting extended +attribute named "SYMLINK" with symlink value. Like on ext2, you can chown and +chgrp symlinks but I don't know what is it good for. chmoding symlink results +in chmoding file where symlink points. These symlinks are just for Linux use and +incompatible with OS/2. OS/2 PmShell symlinks are not supported because they are +stored in very crazy way. They tried to do it so that link changes when file is +moved ... sometimes it works. But the link is partly stored in directory +extended attributes and partly in OS2SYS.INI. I don't want (and don't know how) +to analyze or change OS2SYS.INI. + + +Codepages +========= + +HPFS can contain several uppercasing tables for several codepages and each +file has a pointer to codepage its name is in. However OS/2 was created in +America where people don't care much about codepages and so multiple codepages +support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk. +Once I booted English OS/2 working in cp 850 and I created a file on my 852 +partition. It marked file name codepage as 850 - good. But when I again booted +Czech OS/2, the file was completely inaccessible under any name. It seems that +OS/2 uppercases the search pattern with its system code page (852) and file +name it's comparing to with its code page (850). These could never match. Is it +really what IBM developers wanted? But problems continued. When I created in +Czech OS/2 another file in that directory, that file was inaccessible too. OS/2 +probably uses different uppercasing method when searching where to place a file +(note, that files in HPFS directory must be sorted) and when searching for +a file. Finally when I opened this directory in PmShell, PmShell crashed (the +funny thing was that, when rebooted, PmShell tried to reopen this directory +again :-). chkdsk happily ignores these errors and only low-level disk +modification saved me. Never mix different language versions of OS/2 on one +system although HPFS was designed to allow that. +OK, I could implement complex codepage support to this driver but I think it +would cause more problems than benefit with such buggy implementation in OS/2. +So this driver simply uses first codepage it finds for uppercasing and +lowercasing no matter what's file codepage index. Usually all file names are in +this codepage - if you don't try to do what I described above :-) + + +Known bugs +========== + +HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client +should work. If you have OS/2 server, use only read-only mode. I don't know how +to handle some HPFS386 structures like access control list or extended perm +list, I don't know how to delete them when file is deleted and how to not +overwrite them with extended attributes. Send me some info on these structures +and I'll make it. However, this driver should detect presence of HPFS386 +structures, remount read-only and not destroy them (I hope). + +When there's not enough space for extended attributes, they will be truncated +and no error is returned. + +OS/2 can't access files if the path is longer than about 256 chars but this +driver allows you to do it. chkdsk ignores such errors. + +Sometimes you won't be able to delete some files on a very full filesystem +(returning error ENOSPC). That's because file in non-leaf node in directory tree +(one directory, if it's large, has dirents in tree on HPFS) must be replaced +with another node when deleted. And that new file might have larger name than +the old one so the new name doesn't fit in directory node (dnode). And that +would result in directory tree splitting, that takes disk space. Workaround is +to delete other files that are leaf (probability that the file is non-leaf is +about 1/50) or to truncate file first to make some space. +You encounter this problem only if you have many directories so that +preallocated directory band is full i.e.:: + + number_of_directories / size_of_filesystem_in_mb > 4. + +You can't delete open directories. + +You can't rename over directories (what is it good for?). + +Renaming files so that only case changes doesn't work. This driver supports it +but vfs doesn't. Something like 'mv file FILE' won't work. + +All atimes and directory mtimes are not updated. That's because of performance +reasons. If you extremely wish to update them, let me know, I'll write it (but +it will be slow). + +When the system is out of memory and swap, it may slightly corrupt filesystem +(lost files, unbalanced directories). (I guess all filesystem may do it). + +When compiled, you get warning: function declaration isn't a prototype. Does +anybody know what does it mean? + + +What does "unbalanced tree" message mean? +========================================= + +Old versions of this driver created sometimes unbalanced dnode trees. OS/2 +chkdsk doesn't scream if the tree is unbalanced (and sometimes creates +unbalanced trees too :-) but both HPFS and HPFS386 contain bug that it rarely +crashes when the tree is not balanced. This driver handles unbalanced trees +correctly and writes warning if it finds them. If you see this message, this is +probably because of directories created with old version of this driver. +Workaround is to move all files from that directory to another and then back +again. Do it in Linux, not OS/2! If you see this message in directory that is +whole created by this driver, it is BUG - let me know about it. + + +Bugs in OS/2 +============ + +When you have two (or more) lost directories pointing each to other, chkdsk +locks up when repairing filesystem. + +Sometimes (I think it's random) when you create a file with one-char name under +OS/2, OS/2 marks it as 'long'. chkdsk then removes this flag saying "Minor fs +error corrected". + +File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and +marks them as short (and writes "minor fs error corrected"). This bug is not in +HPFS386. + +Codepage bugs described above +============================= + +If you don't install fixpacks, there are many, many more... + + +History +======= + +====== ========================================================================= +0.90 First public release +0.91 Fixed bug that caused shooting to memory when write_inode was called on + open inode (rarely happened) +0.92 Fixed a little memory leak in freeing directory inodes +0.93 Fixed bug that locked up the machine when there were too many filenames + with first 15 characters same + Fixed write_file to zero file when writing behind file end +0.94 Fixed a little memory leak when trying to delete busy file or directory +0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files +1.90 First version for 2.1.1xx kernels +1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk + Fixed a race-condition when write_inode is called while deleting file + Fixed a bug that could possibly happen (with very low probability) when + using 0xff in filenames. + + Rewritten locking to avoid race-conditions + + Mount option 'eas' now works + + Fsync no longer returns error + + Files beginning with '.' are marked hidden + + Remount support added + + Alloc is not so slow when filesystem becomes full + + Atimes are no more updated because it slows down operation + + Code cleanup (removed all commented debug prints) +1.92 Corrected a bug when sync was called just before closing file +1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it + works with previous versions + + Fixed a possible problem with disks > 64G (but I don't have one, so I can't + test it) + + Fixed a file overflow at 2G + + Added new option 'timeshift' + + Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in + read-only mode + + Fixed a bug that slowed down alloc and prevented allocating 100% space + (this bug was not destructive) +1.94 Added workaround for one bug in Linux + + Fixed one buffer leak + + Fixed some incompatibilities with large extended attributes (but it's still + not 100% ok, I have no info on it and OS/2 doesn't want to create them) + + Rewritten allocation + + Fixed a bug with i_blocks (du sometimes didn't display correct values) + + Directories have no longer archive attribute set (some programs don't like + it) + + Fixed a bug that it set badly one flag in large anode tree (it was not + destructive) +1.95 Fixed one buffer leak, that could happen on corrupted filesystem + + Fixed one bug in allocation in 1.94 +1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported + error sometimes when opening directories in PMSHELL) + + Fixed a possible bitmap race + + Fixed possible problem on large disks + + You can now delete open files + + Fixed a nondestructive race in rename +1.97 Support for HPFS v3 (on large partitions) + + ZFixed a bug that it didn't allow creation of files > 128M + (it should be 2G) +1.97.1 Changed names of global symbols + + Fixed a bug when chmoding or chowning root directory +1.98 Fixed a deadlock when using old_readdir + Better directory handling; workaround for "unbalanced tree" bug in OS/2 +1.99 Corrected a possible problem when there's not enough space while deleting + file + + Now it tries to truncate the file if there's not enough space when + deleting + + Removed a lot of redundant code +2.00 Fixed a bug in rename (it was there since 1.96) + Better anti-fragmentation strategy +2.01 Fixed problem with directory listing over NFS + + Directory lseek now checks for proper parameters + + Fixed race-condition in buffer code - it is in all filesystems in Linux; + when reading device (cat /dev/hda) while creating files on it, files + could be damaged +2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond + end of partition +2.03 Char, block devices and pipes are correctly created + + Fixed non-crashing race in unlink (Alexander Viro) + + Now it works with Japanese version of OS/2 +2.04 Fixed error when ftruncate used to extend file +2.05 Fixed crash when got mount parameters without = + + Fixed crash when allocation of anode failed due to full disk + + Fixed some crashes when block io or inode allocation failed +2.06 Fixed some crash on corrupted disk structures + + Better allocation strategy + + Reschedule points added so that it doesn't lock CPU long time + + It should work in read-only mode on Warp Server +2.07 More fixes for Warp Server. Now it really works +2.08 Creating new files is not so slow on large disks + + An attempt to sync deleted file does not generate filesystem error +2.09 Fixed error on extremely fragmented files +====== ========================================================================= diff --git a/Documentation/filesystems/hpfs.txt b/Documentation/filesystems/hpfs.txt deleted file mode 100644 index 74630bd504fb..000000000000 --- a/Documentation/filesystems/hpfs.txt +++ /dev/null @@ -1,296 +0,0 @@ -Read/Write HPFS 2.09 -1998-2004, Mikulas Patocka - -email: mikulas@artax.karlin.mff.cuni.cz -homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi - -CREDITS: -Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file - is taken from it -Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993) -Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion - -Mount options - -uid=xxx,gid=xxx,umask=xxx (default uid=gid=0 umask=default_system_umask) - Set owner/group/mode for files that do not have it specified in extended - attributes. Mode is inverted umask - for example umask 027 gives owner - all permission, group read permission and anybody else no access. Note - that for files mode is anded with 0666. If you want files to have 'x' - rights, you must use extended attributes. -case=lower,asis (default asis) - File name lowercasing in readdir. -conv=binary,text,auto (default binary) - CR/LF -> LF conversion, if auto, decision is made according to extension - - there is a list of text extensions (I thing it's better to not convert - text file than to damage binary file). If you want to change that list, - change it in the source. Original readonly HPFS contained some strange - heuristic algorithm that I removed. I thing it's danger to let the - computer decide whether file is text or binary. For example, DJGPP - binaries contain small text message at the beginning and they could be - misidentified and damaged under some circumstances. -check=none,normal,strict (default normal) - Check level. Selecting none will cause only little speedup and big - danger. I tried to write it so that it won't crash if check=normal on - corrupted filesystems. check=strict means many superfluous checks - - used for debugging (for example it checks if file is allocated in - bitmaps when accessing it). -errors=continue,remount-ro,panic (default remount-ro) - Behaviour when filesystem errors found. -chkdsk=no,errors,always (default errors) - When to mark filesystem dirty so that OS/2 checks it. -eas=no,ro,rw (default rw) - What to do with extended attributes. 'no' - ignore them and use always - values specified in uid/gid/mode options. 'ro' - read extended - attributes but do not create them. 'rw' - create extended attributes - when you use chmod/chown/chgrp/mknod/ln -s on the filesystem. -timeshift=(-)nnn (default 0) - Shifts the time by nnn seconds. For example, if you see under linux - one hour more, than under os/2, use timeshift=-3600. - - -File names - -As in OS/2, filenames are case insensitive. However, shell thinks that names -are case sensitive, so for example when you create a file FOO, you can use -'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note, that you -also won't be able to compile linux kernel (and maybe other things) on HPFS -because kernel creates different files with names like bootsect.S and -bootsect.s. When searching for file thats name has characters >= 128, codepages -are used - see below. -OS/2 ignores dots and spaces at the end of file name, so this driver does as -well. If you create 'a. ...', the file 'a' will be created, but you can still -access it under names 'a.', 'a..', 'a . . . ' etc. - - -Extended attributes - -On HPFS partitions, OS/2 can associate to each file a special information called -extended attributes. Extended attributes are pairs of (key,value) where key is -an ascii string identifying that attribute and value is any string of bytes of -variable length. OS/2 stores window and icon positions and file types there. So -why not use it for unix-specific info like file owner or access rights? This -driver can do it. If you chown/chgrp/chmod on a hpfs partition, extended -attributes with keys "UID", "GID" or "MODE" and 2-byte values are created. Only -that extended attributes those value differs from defaults specified in mount -options are created. Once created, the extended attributes are never deleted, -they're just changed. It means that when your default uid=0 and you type -something like 'chown luser file; chown root file' the file will contain -extended attribute UID=0. And when you umount the fs and mount it again with -uid=luser_uid, the file will be still owned by root! If you chmod file to 444, -extended attribute "MODE" will not be set, this special case is done by setting -read-only flag. When you mknod a block or char device, besides "MODE", the -special 4-byte extended attribute "DEV" will be created containing the device -number. Currently this driver cannot resize extended attributes - it means -that if somebody (I don't know who?) has set "UID", "GID", "MODE" or "DEV" -attributes with different sizes, they won't be rewritten and changing these -values doesn't work. - - -Symlinks - -You can do symlinks on HPFS partition, symlinks are achieved by setting extended -attribute named "SYMLINK" with symlink value. Like on ext2, you can chown and -chgrp symlinks but I don't know what is it good for. chmoding symlink results -in chmoding file where symlink points. These symlinks are just for Linux use and -incompatible with OS/2. OS/2 PmShell symlinks are not supported because they are -stored in very crazy way. They tried to do it so that link changes when file is -moved ... sometimes it works. But the link is partly stored in directory -extended attributes and partly in OS2SYS.INI. I don't want (and don't know how) -to analyze or change OS2SYS.INI. - - -Codepages - -HPFS can contain several uppercasing tables for several codepages and each -file has a pointer to codepage its name is in. However OS/2 was created in -America where people don't care much about codepages and so multiple codepages -support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk. -Once I booted English OS/2 working in cp 850 and I created a file on my 852 -partition. It marked file name codepage as 850 - good. But when I again booted -Czech OS/2, the file was completely inaccessible under any name. It seems that -OS/2 uppercases the search pattern with its system code page (852) and file -name it's comparing to with its code page (850). These could never match. Is it -really what IBM developers wanted? But problems continued. When I created in -Czech OS/2 another file in that directory, that file was inaccessible too. OS/2 -probably uses different uppercasing method when searching where to place a file -(note, that files in HPFS directory must be sorted) and when searching for -a file. Finally when I opened this directory in PmShell, PmShell crashed (the -funny thing was that, when rebooted, PmShell tried to reopen this directory -again :-). chkdsk happily ignores these errors and only low-level disk -modification saved me. Never mix different language versions of OS/2 on one -system although HPFS was designed to allow that. -OK, I could implement complex codepage support to this driver but I think it -would cause more problems than benefit with such buggy implementation in OS/2. -So this driver simply uses first codepage it finds for uppercasing and -lowercasing no matter what's file codepage index. Usually all file names are in -this codepage - if you don't try to do what I described above :-) - - -Known bugs - -HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client -should work. If you have OS/2 server, use only read-only mode. I don't know how -to handle some HPFS386 structures like access control list or extended perm -list, I don't know how to delete them when file is deleted and how to not -overwrite them with extended attributes. Send me some info on these structures -and I'll make it. However, this driver should detect presence of HPFS386 -structures, remount read-only and not destroy them (I hope). - -When there's not enough space for extended attributes, they will be truncated -and no error is returned. - -OS/2 can't access files if the path is longer than about 256 chars but this -driver allows you to do it. chkdsk ignores such errors. - -Sometimes you won't be able to delete some files on a very full filesystem -(returning error ENOSPC). That's because file in non-leaf node in directory tree -(one directory, if it's large, has dirents in tree on HPFS) must be replaced -with another node when deleted. And that new file might have larger name than -the old one so the new name doesn't fit in directory node (dnode). And that -would result in directory tree splitting, that takes disk space. Workaround is -to delete other files that are leaf (probability that the file is non-leaf is -about 1/50) or to truncate file first to make some space. -You encounter this problem only if you have many directories so that -preallocated directory band is full i.e. - number_of_directories / size_of_filesystem_in_mb > 4. - -You can't delete open directories. - -You can't rename over directories (what is it good for?). - -Renaming files so that only case changes doesn't work. This driver supports it -but vfs doesn't. Something like 'mv file FILE' won't work. - -All atimes and directory mtimes are not updated. That's because of performance -reasons. If you extremely wish to update them, let me know, I'll write it (but -it will be slow). - -When the system is out of memory and swap, it may slightly corrupt filesystem -(lost files, unbalanced directories). (I guess all filesystem may do it). - -When compiled, you get warning: function declaration isn't a prototype. Does -anybody know what does it mean? - - -What does "unbalanced tree" message mean? - -Old versions of this driver created sometimes unbalanced dnode trees. OS/2 -chkdsk doesn't scream if the tree is unbalanced (and sometimes creates -unbalanced trees too :-) but both HPFS and HPFS386 contain bug that it rarely -crashes when the tree is not balanced. This driver handles unbalanced trees -correctly and writes warning if it finds them. If you see this message, this is -probably because of directories created with old version of this driver. -Workaround is to move all files from that directory to another and then back -again. Do it in Linux, not OS/2! If you see this message in directory that is -whole created by this driver, it is BUG - let me know about it. - - -Bugs in OS/2 - -When you have two (or more) lost directories pointing each to other, chkdsk -locks up when repairing filesystem. - -Sometimes (I think it's random) when you create a file with one-char name under -OS/2, OS/2 marks it as 'long'. chkdsk then removes this flag saying "Minor fs -error corrected". - -File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and -marks them as short (and writes "minor fs error corrected"). This bug is not in -HPFS386. - -Codepage bugs described above. - -If you don't install fixpacks, there are many, many more... - - -History - -0.90 First public release -0.91 Fixed bug that caused shooting to memory when write_inode was called on - open inode (rarely happened) -0.92 Fixed a little memory leak in freeing directory inodes -0.93 Fixed bug that locked up the machine when there were too many filenames - with first 15 characters same - Fixed write_file to zero file when writing behind file end -0.94 Fixed a little memory leak when trying to delete busy file or directory -0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files -1.90 First version for 2.1.1xx kernels -1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk - Fixed a race-condition when write_inode is called while deleting file - Fixed a bug that could possibly happen (with very low probability) when - using 0xff in filenames - Rewritten locking to avoid race-conditions - Mount option 'eas' now works - Fsync no longer returns error - Files beginning with '.' are marked hidden - Remount support added - Alloc is not so slow when filesystem becomes full - Atimes are no more updated because it slows down operation - Code cleanup (removed all commented debug prints) -1.92 Corrected a bug when sync was called just before closing file -1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it - works with previous versions - Fixed a possible problem with disks > 64G (but I don't have one, so I can't - test it) - Fixed a file overflow at 2G - Added new option 'timeshift' - Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in - read-only mode - Fixed a bug that slowed down alloc and prevented allocating 100% space - (this bug was not destructive) -1.94 Added workaround for one bug in Linux - Fixed one buffer leak - Fixed some incompatibilities with large extended attributes (but it's still - not 100% ok, I have no info on it and OS/2 doesn't want to create them) - Rewritten allocation - Fixed a bug with i_blocks (du sometimes didn't display correct values) - Directories have no longer archive attribute set (some programs don't like - it) - Fixed a bug that it set badly one flag in large anode tree (it was not - destructive) -1.95 Fixed one buffer leak, that could happen on corrupted filesystem - Fixed one bug in allocation in 1.94 -1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported - error sometimes when opening directories in PMSHELL) - Fixed a possible bitmap race - Fixed possible problem on large disks - You can now delete open files - Fixed a nondestructive race in rename -1.97 Support for HPFS v3 (on large partitions) - Fixed a bug that it didn't allow creation of files > 128M (it should be 2G) -1.97.1 Changed names of global symbols - Fixed a bug when chmoding or chowning root directory -1.98 Fixed a deadlock when using old_readdir - Better directory handling; workaround for "unbalanced tree" bug in OS/2 -1.99 Corrected a possible problem when there's not enough space while deleting - file - Now it tries to truncate the file if there's not enough space when deleting - Removed a lot of redundant code -2.00 Fixed a bug in rename (it was there since 1.96) - Better anti-fragmentation strategy -2.01 Fixed problem with directory listing over NFS - Directory lseek now checks for proper parameters - Fixed race-condition in buffer code - it is in all filesystems in Linux; - when reading device (cat /dev/hda) while creating files on it, files - could be damaged -2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond - end of partition -2.03 Char, block devices and pipes are correctly created - Fixed non-crashing race in unlink (Alexander Viro) - Now it works with Japanese version of OS/2 -2.04 Fixed error when ftruncate used to extend file -2.05 Fixed crash when got mount parameters without = - Fixed crash when allocation of anode failed due to full disk - Fixed some crashes when block io or inode allocation failed -2.06 Fixed some crash on corrupted disk structures - Better allocation strategy - Reschedule points added so that it doesn't lock CPU long time - It should work in read-only mode on Warp Server -2.07 More fixes for Warp Server. Now it really works -2.08 Creating new files is not so slow on large disks - An attempt to sync deleted file does not generate filesystem error -2.09 Fixed error on extremely fragmented files - - - vim: set textwidth=80: diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index f776411340cb..3fbe2fa0b5c5 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -69,6 +69,7 @@ Documentation for filesystem implementations. gfs2-uevents hfs hfsplus + hpfs fuse overlayfs virtiofs -- cgit From de389cf08d4708d0a03516e5ce0e193f49f0b358 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:10 +0100 Subject: docs: filesystems: convert inotify.txt to ReST - Add a SPDX header; - Add a document title; - Adjust document title; - Fix list markups; - Some whitespace fixes and new line breaks; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Jan Kara Link: https://lore.kernel.org/r/8f846843ecf1914988feb4d001e3a53d27dc1a65.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/inotify.rst | 90 +++++++++++++++++++++++++++++++++++ Documentation/filesystems/inotify.txt | 79 ------------------------------ 3 files changed, 91 insertions(+), 79 deletions(-) create mode 100644 Documentation/filesystems/inotify.rst delete mode 100644 Documentation/filesystems/inotify.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 3fbe2fa0b5c5..5a737722652c 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -70,6 +70,7 @@ Documentation for filesystem implementations. hfs hfsplus hpfs + inotify fuse overlayfs virtiofs diff --git a/Documentation/filesystems/inotify.rst b/Documentation/filesystems/inotify.rst new file mode 100644 index 000000000000..7f7ef8af0e1e --- /dev/null +++ b/Documentation/filesystems/inotify.rst @@ -0,0 +1,90 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================================================== +Inotify - A Powerful yet Simple File Change Notification System +=============================================================== + + + +Document started 15 Mar 2005 by Robert Love + +Document updated 4 Jan 2015 by Zhang Zhen + + - Deleted obsoleted interface, just refer to manpages for user interface. + +(i) Rationale + +Q: + What is the design decision behind not tying the watch to the open fd of + the watched object? + +A: + Watches are associated with an open inotify device, not an open file. + This solves the primary problem with dnotify: keeping the file open pins + the file and thus, worse, pins the mount. Dnotify is therefore infeasible + for use on a desktop system with removable media as the media cannot be + unmounted. Watching a file should not require that it be open. + +Q: + What is the design decision behind using an-fd-per-instance as opposed to + an fd-per-watch? + +A: + An fd-per-watch quickly consumes more file descriptors than are allowed, + more fd's than are feasible to manage, and more fd's than are optimally + select()-able. Yes, root can bump the per-process fd limit and yes, users + can use epoll, but requiring both is a silly and extraneous requirement. + A watch consumes less memory than an open file, separating the number + spaces is thus sensible. The current design is what user-space developers + want: Users initialize inotify, once, and add n watches, requiring but one + fd and no twiddling with fd limits. Initializing an inotify instance two + thousand times is silly. If we can implement user-space's preferences + cleanly--and we can, the idr layer makes stuff like this trivial--then we + should. + + There are other good arguments. With a single fd, there is a single + item to block on, which is mapped to a single queue of events. The single + fd returns all watch events and also any potential out-of-band data. If + every fd was a separate watch, + + - There would be no way to get event ordering. Events on file foo and + file bar would pop poll() on both fd's, but there would be no way to tell + which happened first. A single queue trivially gives you ordering. Such + ordering is crucial to existing applications such as Beagle. Imagine + "mv a b ; mv b a" events without ordering. + + - We'd have to maintain n fd's and n internal queues with state, + versus just one. It is a lot messier in the kernel. A single, linear + queue is the data structure that makes sense. + + - User-space developers prefer the current API. The Beagle guys, for + example, love it. Trust me, I asked. It is not a surprise: Who'd want + to manage and block on 1000 fd's via select? + + - No way to get out of band data. + + - 1024 is still too low. ;-) + + When you talk about designing a file change notification system that + scales to 1000s of directories, juggling 1000s of fd's just does not seem + the right interface. It is too heavy. + + Additionally, it _is_ possible to more than one instance and + juggle more than one queue and thus more than one associated fd. There + need not be a one-fd-per-process mapping; it is one-fd-per-queue and a + process can easily want more than one queue. + +Q: + Why the system call approach? + +A: + The poor user-space interface is the second biggest problem with dnotify. + Signals are a terrible, terrible interface for file notification. Or for + anything, for that matter. The ideal solution, from all perspectives, is a + file descriptor-based one that allows basic file I/O and poll/select. + Obtaining the fd and managing the watches could have been done either via a + device file or a family of new system calls. We decided to implement a + family of system calls because that is the preferred approach for new kernel + interfaces. The only real difference was whether we wanted to use open(2) + and ioctl(2) or a couple of new system calls. System calls beat ioctls. + diff --git a/Documentation/filesystems/inotify.txt b/Documentation/filesystems/inotify.txt deleted file mode 100644 index 51f61db787fb..000000000000 --- a/Documentation/filesystems/inotify.txt +++ /dev/null @@ -1,79 +0,0 @@ - inotify - a powerful yet simple file change notification system - - - -Document started 15 Mar 2005 by Robert Love -Document updated 4 Jan 2015 by Zhang Zhen - --Deleted obsoleted interface, just refer to manpages for user interface. - -(i) Rationale - -Q: What is the design decision behind not tying the watch to the open fd of - the watched object? - -A: Watches are associated with an open inotify device, not an open file. - This solves the primary problem with dnotify: keeping the file open pins - the file and thus, worse, pins the mount. Dnotify is therefore infeasible - for use on a desktop system with removable media as the media cannot be - unmounted. Watching a file should not require that it be open. - -Q: What is the design decision behind using an-fd-per-instance as opposed to - an fd-per-watch? - -A: An fd-per-watch quickly consumes more file descriptors than are allowed, - more fd's than are feasible to manage, and more fd's than are optimally - select()-able. Yes, root can bump the per-process fd limit and yes, users - can use epoll, but requiring both is a silly and extraneous requirement. - A watch consumes less memory than an open file, separating the number - spaces is thus sensible. The current design is what user-space developers - want: Users initialize inotify, once, and add n watches, requiring but one - fd and no twiddling with fd limits. Initializing an inotify instance two - thousand times is silly. If we can implement user-space's preferences - cleanly--and we can, the idr layer makes stuff like this trivial--then we - should. - - There are other good arguments. With a single fd, there is a single - item to block on, which is mapped to a single queue of events. The single - fd returns all watch events and also any potential out-of-band data. If - every fd was a separate watch, - - - There would be no way to get event ordering. Events on file foo and - file bar would pop poll() on both fd's, but there would be no way to tell - which happened first. A single queue trivially gives you ordering. Such - ordering is crucial to existing applications such as Beagle. Imagine - "mv a b ; mv b a" events without ordering. - - - We'd have to maintain n fd's and n internal queues with state, - versus just one. It is a lot messier in the kernel. A single, linear - queue is the data structure that makes sense. - - - User-space developers prefer the current API. The Beagle guys, for - example, love it. Trust me, I asked. It is not a surprise: Who'd want - to manage and block on 1000 fd's via select? - - - No way to get out of band data. - - - 1024 is still too low. ;-) - - When you talk about designing a file change notification system that - scales to 1000s of directories, juggling 1000s of fd's just does not seem - the right interface. It is too heavy. - - Additionally, it _is_ possible to more than one instance and - juggle more than one queue and thus more than one associated fd. There - need not be a one-fd-per-process mapping; it is one-fd-per-queue and a - process can easily want more than one queue. - -Q: Why the system call approach? - -A: The poor user-space interface is the second biggest problem with dnotify. - Signals are a terrible, terrible interface for file notification. Or for - anything, for that matter. The ideal solution, from all perspectives, is a - file descriptor-based one that allows basic file I/O and poll/select. - Obtaining the fd and managing the watches could have been done either via a - device file or a family of new system calls. We decided to implement a - family of system calls because that is the preferred approach for new kernel - interfaces. The only real difference was whether we wanted to use open(2) - and ioctl(2) or a couple of new system calls. System calls beat ioctls. - -- cgit From 76f216855b6bd1027e236b29cd7fece7336c37eb Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:11 +0100 Subject: docs: filesystems: convert isofs.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Add table markups; - Add lists markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/ec16dc09d0c23bb0c1af3d3f33a96896083a1d36.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/isofs.rst | 64 +++++++++++++++++++++++++++++++++++++ Documentation/filesystems/isofs.txt | 48 ---------------------------- 3 files changed, 65 insertions(+), 48 deletions(-) create mode 100644 Documentation/filesystems/isofs.rst delete mode 100644 Documentation/filesystems/isofs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 5a737722652c..8c8813ada53f 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -71,6 +71,7 @@ Documentation for filesystem implementations. hfsplus hpfs inotify + isofs fuse overlayfs virtiofs diff --git a/Documentation/filesystems/isofs.rst b/Documentation/filesystems/isofs.rst new file mode 100644 index 000000000000..08fd469091d4 --- /dev/null +++ b/Documentation/filesystems/isofs.rst @@ -0,0 +1,64 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================== +ISO9660 Filesystem +================== + +Mount options that are the same as for msdos and vfat partitions. + + ========= ======================================================== + gid=nnn All files in the partition will be in group nnn. + uid=nnn All files in the partition will be owned by user id nnn. + umask=nnn The permission mask (see umask(1)) for the partition. + ========= ======================================================== + +Mount options that are the same as vfat partitions. These are only useful +when using discs encoded using Microsoft's Joliet extensions. + + ============== ============================================================= + iocharset=name Character set to use for converting from Unicode to + ASCII. Joliet filenames are stored in Unicode format, but + Unix for the most part doesn't know how to deal with Unicode. + There is also an option of doing UTF-8 translations with the + utf8 option. + utf8 Encode Unicode names in UTF-8 format. Default is no. + ============== ============================================================= + +Mount options unique to the isofs filesystem. + + ================= ============================================================ + block=512 Set the block size for the disk to 512 bytes + block=1024 Set the block size for the disk to 1024 bytes + block=2048 Set the block size for the disk to 2048 bytes + check=relaxed Matches filenames with different cases + check=strict Matches only filenames with the exact same case + cruft Try to handle badly formatted CDs. + map=off Do not map non-Rock Ridge filenames to lower case + map=normal Map non-Rock Ridge filenames to lower case + map=acorn As map=normal but also apply Acorn extensions if present + mode=xxx Sets the permissions on files to xxx unless Rock Ridge + extensions set the permissions otherwise + dmode=xxx Sets the permissions on directories to xxx unless Rock Ridge + extensions set the permissions otherwise + overriderockperm Set permissions on files and directories according to + 'mode' and 'dmode' even though Rock Ridge extensions are + present. + nojoliet Ignore Joliet extensions if they are present. + norock Ignore Rock Ridge extensions if they are present. + hide Completely strip hidden files from the file system. + showassoc Show files marked with the 'associated' bit + unhide Deprecated; showing hidden files is now default; + If given, it is a synonym for 'showassoc' which will + recreate previous unhide behavior + session=x Select number of session on multisession CD + sbsector=xxx Session begins from sector xxx + ================= ============================================================ + +Recommended documents about ISO 9660 standard are located at: + +- http://www.y-adagio.com/ +- ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf + +Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically +identical with ISO 9660.", so it is a valid and gratis substitute of the +official ISO specification. diff --git a/Documentation/filesystems/isofs.txt b/Documentation/filesystems/isofs.txt deleted file mode 100644 index ba0a93384de0..000000000000 --- a/Documentation/filesystems/isofs.txt +++ /dev/null @@ -1,48 +0,0 @@ -Mount options that are the same as for msdos and vfat partitions. - - gid=nnn All files in the partition will be in group nnn. - uid=nnn All files in the partition will be owned by user id nnn. - umask=nnn The permission mask (see umask(1)) for the partition. - -Mount options that are the same as vfat partitions. These are only useful -when using discs encoded using Microsoft's Joliet extensions. - iocharset=name Character set to use for converting from Unicode to - ASCII. Joliet filenames are stored in Unicode format, but - Unix for the most part doesn't know how to deal with Unicode. - There is also an option of doing UTF-8 translations with the - utf8 option. - utf8 Encode Unicode names in UTF-8 format. Default is no. - -Mount options unique to the isofs filesystem. - block=512 Set the block size for the disk to 512 bytes - block=1024 Set the block size for the disk to 1024 bytes - block=2048 Set the block size for the disk to 2048 bytes - check=relaxed Matches filenames with different cases - check=strict Matches only filenames with the exact same case - cruft Try to handle badly formatted CDs. - map=off Do not map non-Rock Ridge filenames to lower case - map=normal Map non-Rock Ridge filenames to lower case - map=acorn As map=normal but also apply Acorn extensions if present - mode=xxx Sets the permissions on files to xxx unless Rock Ridge - extensions set the permissions otherwise - dmode=xxx Sets the permissions on directories to xxx unless Rock Ridge - extensions set the permissions otherwise - overriderockperm Set permissions on files and directories according to - 'mode' and 'dmode' even though Rock Ridge extensions are - present. - nojoliet Ignore Joliet extensions if they are present. - norock Ignore Rock Ridge extensions if they are present. - hide Completely strip hidden files from the file system. - showassoc Show files marked with the 'associated' bit - unhide Deprecated; showing hidden files is now default; - If given, it is a synonym for 'showassoc' which will - recreate previous unhide behavior - session=x Select number of session on multisession CD - sbsector=xxx Session begins from sector xxx - -Recommended documents about ISO 9660 standard are located at: -http://www.y-adagio.com/ -ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf -Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically -identical with ISO 9660.", so it is a valid and gratis substitute of the -official ISO specification. -- cgit From 2640c19dcab0f6530007dfb4ee5870f5d61b0772 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:12 +0100 Subject: docs: filesystems: convert nilfs2.txt to ReST - Add a SPDX header; - Add a document title; - Adjust document title; - Mark literal blocks as such; - use :field: markup; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/f7989ca501585f5990fffd2d365cfca4fe9fdd6f.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 3 +- Documentation/filesystems/nilfs2.rst | 286 +++++++++++++++++++++++++++++++++++ Documentation/filesystems/nilfs2.txt | 276 --------------------------------- 3 files changed, 288 insertions(+), 277 deletions(-) create mode 100644 Documentation/filesystems/nilfs2.rst delete mode 100644 Documentation/filesystems/nilfs2.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 8c8813ada53f..01587704fcc9 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -70,9 +70,10 @@ Documentation for filesystem implementations. hfs hfsplus hpfs + fuse inotify isofs - fuse + nilfs2 overlayfs virtiofs vfat diff --git a/Documentation/filesystems/nilfs2.rst b/Documentation/filesystems/nilfs2.rst new file mode 100644 index 000000000000..6c49f04e9e0a --- /dev/null +++ b/Documentation/filesystems/nilfs2.rst @@ -0,0 +1,286 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====== +NILFS2 +====== + +NILFS2 is a log-structured file system (LFS) supporting continuous +snapshotting. In addition to versioning capability of the entire file +system, users can even restore files mistakenly overwritten or +destroyed just a few seconds ago. Since NILFS2 can keep consistency +like conventional LFS, it achieves quick recovery after system +crashes. + +NILFS2 creates a number of checkpoints every few seconds or per +synchronous write basis (unless there is no change). Users can select +significant versions among continuously created checkpoints, and can +change them into snapshots which will be preserved until they are +changed back to checkpoints. + +There is no limit on the number of snapshots until the volume gets +full. Each snapshot is mountable as a read-only file system +concurrently with its writable mount, and this feature is convenient +for online backup. + +The userland tools are included in nilfs-utils package, which is +available from the following download page. At least "mkfs.nilfs2", +"mount.nilfs2", "umount.nilfs2", and "nilfs_cleanerd" (so called +cleaner or garbage collector) are required. Details on the tools are +described in the man pages included in the package. + +:Project web page: https://nilfs.sourceforge.io/ +:Download page: https://nilfs.sourceforge.io/en/download.html +:List info: http://vger.kernel.org/vger-lists.html#linux-nilfs + +Caveats +======= + +Features which NILFS2 does not support yet: + + - atime + - extended attributes + - POSIX ACLs + - quotas + - fsck + - defragmentation + +Mount options +============= + +NILFS2 supports the following mount options: +(*) == default + +======================= ======================================================= +barrier(*) This enables/disables the use of write barriers. This +nobarrier requires an IO stack which can support barriers, and + if nilfs gets an error on a barrier write, it will + disable again with a warning. +errors=continue Keep going on a filesystem error. +errors=remount-ro(*) Remount the filesystem read-only on an error. +errors=panic Panic and halt the machine if an error occurs. +cp=n Specify the checkpoint-number of the snapshot to be + mounted. Checkpoints and snapshots are listed by lscp + user command. Only the checkpoints marked as snapshot + are mountable with this option. Snapshot is read-only, + so a read-only mount option must be specified together. +order=relaxed(*) Apply relaxed order semantics that allows modified data + blocks to be written to disk without making a + checkpoint if no metadata update is going. This mode + is equivalent to the ordered data mode of the ext3 + filesystem except for the updates on data blocks still + conserve atomicity. This will improve synchronous + write performance for overwriting. +order=strict Apply strict in-order semantics that preserves sequence + of all file operations including overwriting of data + blocks. That means, it is guaranteed that no + overtaking of events occurs in the recovered file + system after a crash. +norecovery Disable recovery of the filesystem on mount. + This disables every write access on the device for + read-only mounts or snapshots. This option will fail + for r/w mounts on an unclean volume. +discard This enables/disables the use of discard/TRIM commands. +nodiscard(*) The discard/TRIM commands are sent to the underlying + block device when blocks are freed. This is useful + for SSD devices and sparse/thinly-provisioned LUNs. +======================= ======================================================= + +Ioctls +====== + +There is some NILFS2 specific functionality which can be accessed by applications +through the system call interfaces. The list of all NILFS2 specific ioctls are +shown in the table below. + +Table of NILFS2 specific ioctls: + + ============================== =============================================== + Ioctl Description + ============================== =============================================== + NILFS_IOCTL_CHANGE_CPMODE Change mode of given checkpoint between + checkpoint and snapshot state. This ioctl is + used in chcp and mkcp utilities. + + NILFS_IOCTL_DELETE_CHECKPOINT Remove checkpoint from NILFS2 file system. + This ioctl is used in rmcp utility. + + NILFS_IOCTL_GET_CPINFO Return info about requested checkpoints. This + ioctl is used in lscp utility and by + nilfs_cleanerd daemon. + + NILFS_IOCTL_GET_CPSTAT Return checkpoints statistics. This ioctl is + used by lscp, rmcp utilities and by + nilfs_cleanerd daemon. + + NILFS_IOCTL_GET_SUINFO Return segment usage info about requested + segments. This ioctl is used in lssu, + nilfs_resize utilities and by nilfs_cleanerd + daemon. + + NILFS_IOCTL_SET_SUINFO Modify segment usage info of requested + segments. This ioctl is used by + nilfs_cleanerd daemon to skip unnecessary + cleaning operation of segments and reduce + performance penalty or wear of flash device + due to redundant move of in-use blocks. + + NILFS_IOCTL_GET_SUSTAT Return segment usage statistics. This ioctl + is used in lssu, nilfs_resize utilities and + by nilfs_cleanerd daemon. + + NILFS_IOCTL_GET_VINFO Return information on virtual block addresses. + This ioctl is used by nilfs_cleanerd daemon. + + NILFS_IOCTL_GET_BDESCS Return information about descriptors of disk + block numbers. This ioctl is used by + nilfs_cleanerd daemon. + + NILFS_IOCTL_CLEAN_SEGMENTS Do garbage collection operation in the + environment of requested parameters from + userspace. This ioctl is used by + nilfs_cleanerd daemon. + + NILFS_IOCTL_SYNC Make a checkpoint. This ioctl is used in + mkcp utility. + + NILFS_IOCTL_RESIZE Resize NILFS2 volume. This ioctl is used + by nilfs_resize utility. + + NILFS_IOCTL_SET_ALLOC_RANGE Define lower limit of segments in bytes and + upper limit of segments in bytes. This ioctl + is used by nilfs_resize utility. + ============================== =============================================== + +NILFS2 usage +============ + +To use nilfs2 as a local file system, simply:: + + # mkfs -t nilfs2 /dev/block_device + # mount -t nilfs2 /dev/block_device /dir + +This will also invoke the cleaner through the mount helper program +(mount.nilfs2). + +Checkpoints and snapshots are managed by the following commands. +Their manpages are included in the nilfs-utils package above. + + ==== =========================================================== + lscp list checkpoints or snapshots. + mkcp make a checkpoint or a snapshot. + chcp change an existing checkpoint to a snapshot or vice versa. + rmcp invalidate specified checkpoint(s). + ==== =========================================================== + +To mount a snapshot:: + + # mount -t nilfs2 -r -o cp= /dev/block_device /snap_dir + +where is the checkpoint number of the snapshot. + +To unmount the NILFS2 mount point or snapshot, simply:: + + # umount /dir + +Then, the cleaner daemon is automatically shut down by the umount +helper program (umount.nilfs2). + +Disk format +=========== + +A nilfs2 volume is equally divided into a number of segments except +for the super block (SB) and segment #0. A segment is the container +of logs. Each log is composed of summary information blocks, payload +blocks, and an optional super root block (SR):: + + ______________________________________________________ + | |SB| | Segment | Segment | Segment | ... | Segment | | + |_|__|_|____0____|____1____|____2____|_____|____N____|_| + 0 +1K +4K +8M +16M +24M +(8MB x N) + . . (Typical offsets for 4KB-block) + . . + .______________________. + | log | log |... | log | + |__1__|__2__|____|__m__| + . . + . . + . . + .______________________________. + | Summary | Payload blocks |SR| + |_blocks__|_________________|__| + +The payload blocks are organized per file, and each file consists of +data blocks and B-tree node blocks:: + + |<--- File-A --->|<--- File-B --->| + _______________________________________________________________ + | Data blocks | B-tree blocks | Data blocks | B-tree blocks | ... + _|_____________|_______________|_____________|_______________|_ + + +Since only the modified blocks are written in the log, it may have +files without data blocks or B-tree node blocks. + +The organization of the blocks is recorded in the summary information +blocks, which contains a header structure (nilfs_segment_summary), per +file structures (nilfs_finfo), and per block structures (nilfs_binfo):: + + _________________________________________________________________________ + | Summary | finfo | binfo | ... | binfo | finfo | binfo | ... | binfo |... + |_blocks__|___A___|_(A,1)_|_____|(A,Na)_|___B___|_(B,1)_|_____|(B,Nb)_|___ + + +The logs include regular files, directory files, symbolic link files +and several meta data files. The mata data files are the files used +to maintain file system meta data. The current version of NILFS2 uses +the following meta data files:: + + 1) Inode file (ifile) -- Stores on-disk inodes + 2) Checkpoint file (cpfile) -- Stores checkpoints + 3) Segment usage file (sufile) -- Stores allocation state of segments + 4) Data address translation file -- Maps virtual block numbers to usual + (DAT) block numbers. This file serves to + make on-disk blocks relocatable. + +The following figure shows a typical organization of the logs:: + + _________________________________________________________________________ + | Summary | regular file | file | ... | ifile | cpfile | sufile | DAT |SR| + |_blocks__|_or_directory_|_______|_____|_______|________|________|_____|__| + + +To stride over segment boundaries, this sequence of files may be split +into multiple logs. The sequence of logs that should be treated as +logically one log, is delimited with flags marked in the segment +summary. The recovery code of nilfs2 looks this boundary information +to ensure atomicity of updates. + +The super root block is inserted for every checkpoints. It includes +three special inodes, inodes for the DAT, cpfile, and sufile. Inodes +of regular files, directories, symlinks and other special files, are +included in the ifile. The inode of ifile itself is included in the +corresponding checkpoint entry in the cpfile. Thus, the hierarchy +among NILFS2 files can be depicted as follows:: + + Super block (SB) + | + v + Super root block (the latest cno=xx) + |-- DAT + |-- sufile + `-- cpfile + |-- ifile (cno=c1) + |-- ifile (cno=c2) ---- file (ino=i1) + : : |-- file (ino=i2) + `-- ifile (cno=xx) |-- file (ino=i3) + : : + `-- file (ino=yy) + ( regular file, directory, or symlink ) + +For detail on the format of each file, please see nilfs2_ondisk.h +located at include/uapi/linux directory. + +There are no patents or other intellectual property that we protect +with regard to the design of NILFS2. It is allowed to replicate the +design in hopes that other operating systems could share (mount, read, +write, etc.) data stored in this format. diff --git a/Documentation/filesystems/nilfs2.txt b/Documentation/filesystems/nilfs2.txt deleted file mode 100644 index f2f3f8592a6f..000000000000 --- a/Documentation/filesystems/nilfs2.txt +++ /dev/null @@ -1,276 +0,0 @@ -NILFS2 ------- - -NILFS2 is a log-structured file system (LFS) supporting continuous -snapshotting. In addition to versioning capability of the entire file -system, users can even restore files mistakenly overwritten or -destroyed just a few seconds ago. Since NILFS2 can keep consistency -like conventional LFS, it achieves quick recovery after system -crashes. - -NILFS2 creates a number of checkpoints every few seconds or per -synchronous write basis (unless there is no change). Users can select -significant versions among continuously created checkpoints, and can -change them into snapshots which will be preserved until they are -changed back to checkpoints. - -There is no limit on the number of snapshots until the volume gets -full. Each snapshot is mountable as a read-only file system -concurrently with its writable mount, and this feature is convenient -for online backup. - -The userland tools are included in nilfs-utils package, which is -available from the following download page. At least "mkfs.nilfs2", -"mount.nilfs2", "umount.nilfs2", and "nilfs_cleanerd" (so called -cleaner or garbage collector) are required. Details on the tools are -described in the man pages included in the package. - -Project web page: https://nilfs.sourceforge.io/ -Download page: https://nilfs.sourceforge.io/en/download.html -List info: http://vger.kernel.org/vger-lists.html#linux-nilfs - -Caveats -======= - -Features which NILFS2 does not support yet: - - - atime - - extended attributes - - POSIX ACLs - - quotas - - fsck - - defragmentation - -Mount options -============= - -NILFS2 supports the following mount options: -(*) == default - -barrier(*) This enables/disables the use of write barriers. This -nobarrier requires an IO stack which can support barriers, and - if nilfs gets an error on a barrier write, it will - disable again with a warning. -errors=continue Keep going on a filesystem error. -errors=remount-ro(*) Remount the filesystem read-only on an error. -errors=panic Panic and halt the machine if an error occurs. -cp=n Specify the checkpoint-number of the snapshot to be - mounted. Checkpoints and snapshots are listed by lscp - user command. Only the checkpoints marked as snapshot - are mountable with this option. Snapshot is read-only, - so a read-only mount option must be specified together. -order=relaxed(*) Apply relaxed order semantics that allows modified data - blocks to be written to disk without making a - checkpoint if no metadata update is going. This mode - is equivalent to the ordered data mode of the ext3 - filesystem except for the updates on data blocks still - conserve atomicity. This will improve synchronous - write performance for overwriting. -order=strict Apply strict in-order semantics that preserves sequence - of all file operations including overwriting of data - blocks. That means, it is guaranteed that no - overtaking of events occurs in the recovered file - system after a crash. -norecovery Disable recovery of the filesystem on mount. - This disables every write access on the device for - read-only mounts or snapshots. This option will fail - for r/w mounts on an unclean volume. -discard This enables/disables the use of discard/TRIM commands. -nodiscard(*) The discard/TRIM commands are sent to the underlying - block device when blocks are freed. This is useful - for SSD devices and sparse/thinly-provisioned LUNs. - -Ioctls -====== - -There is some NILFS2 specific functionality which can be accessed by applications -through the system call interfaces. The list of all NILFS2 specific ioctls are -shown in the table below. - -Table of NILFS2 specific ioctls -.............................................................................. - Ioctl Description - NILFS_IOCTL_CHANGE_CPMODE Change mode of given checkpoint between - checkpoint and snapshot state. This ioctl is - used in chcp and mkcp utilities. - - NILFS_IOCTL_DELETE_CHECKPOINT Remove checkpoint from NILFS2 file system. - This ioctl is used in rmcp utility. - - NILFS_IOCTL_GET_CPINFO Return info about requested checkpoints. This - ioctl is used in lscp utility and by - nilfs_cleanerd daemon. - - NILFS_IOCTL_GET_CPSTAT Return checkpoints statistics. This ioctl is - used by lscp, rmcp utilities and by - nilfs_cleanerd daemon. - - NILFS_IOCTL_GET_SUINFO Return segment usage info about requested - segments. This ioctl is used in lssu, - nilfs_resize utilities and by nilfs_cleanerd - daemon. - - NILFS_IOCTL_SET_SUINFO Modify segment usage info of requested - segments. This ioctl is used by - nilfs_cleanerd daemon to skip unnecessary - cleaning operation of segments and reduce - performance penalty or wear of flash device - due to redundant move of in-use blocks. - - NILFS_IOCTL_GET_SUSTAT Return segment usage statistics. This ioctl - is used in lssu, nilfs_resize utilities and - by nilfs_cleanerd daemon. - - NILFS_IOCTL_GET_VINFO Return information on virtual block addresses. - This ioctl is used by nilfs_cleanerd daemon. - - NILFS_IOCTL_GET_BDESCS Return information about descriptors of disk - block numbers. This ioctl is used by - nilfs_cleanerd daemon. - - NILFS_IOCTL_CLEAN_SEGMENTS Do garbage collection operation in the - environment of requested parameters from - userspace. This ioctl is used by - nilfs_cleanerd daemon. - - NILFS_IOCTL_SYNC Make a checkpoint. This ioctl is used in - mkcp utility. - - NILFS_IOCTL_RESIZE Resize NILFS2 volume. This ioctl is used - by nilfs_resize utility. - - NILFS_IOCTL_SET_ALLOC_RANGE Define lower limit of segments in bytes and - upper limit of segments in bytes. This ioctl - is used by nilfs_resize utility. - -NILFS2 usage -============ - -To use nilfs2 as a local file system, simply: - - # mkfs -t nilfs2 /dev/block_device - # mount -t nilfs2 /dev/block_device /dir - -This will also invoke the cleaner through the mount helper program -(mount.nilfs2). - -Checkpoints and snapshots are managed by the following commands. -Their manpages are included in the nilfs-utils package above. - - lscp list checkpoints or snapshots. - mkcp make a checkpoint or a snapshot. - chcp change an existing checkpoint to a snapshot or vice versa. - rmcp invalidate specified checkpoint(s). - -To mount a snapshot, - - # mount -t nilfs2 -r -o cp= /dev/block_device /snap_dir - -where is the checkpoint number of the snapshot. - -To unmount the NILFS2 mount point or snapshot, simply: - - # umount /dir - -Then, the cleaner daemon is automatically shut down by the umount -helper program (umount.nilfs2). - -Disk format -=========== - -A nilfs2 volume is equally divided into a number of segments except -for the super block (SB) and segment #0. A segment is the container -of logs. Each log is composed of summary information blocks, payload -blocks, and an optional super root block (SR): - - ______________________________________________________ - | |SB| | Segment | Segment | Segment | ... | Segment | | - |_|__|_|____0____|____1____|____2____|_____|____N____|_| - 0 +1K +4K +8M +16M +24M +(8MB x N) - . . (Typical offsets for 4KB-block) - . . - .______________________. - | log | log |... | log | - |__1__|__2__|____|__m__| - . . - . . - . . - .______________________________. - | Summary | Payload blocks |SR| - |_blocks__|_________________|__| - -The payload blocks are organized per file, and each file consists of -data blocks and B-tree node blocks: - - |<--- File-A --->|<--- File-B --->| - _______________________________________________________________ - | Data blocks | B-tree blocks | Data blocks | B-tree blocks | ... - _|_____________|_______________|_____________|_______________|_ - - -Since only the modified blocks are written in the log, it may have -files without data blocks or B-tree node blocks. - -The organization of the blocks is recorded in the summary information -blocks, which contains a header structure (nilfs_segment_summary), per -file structures (nilfs_finfo), and per block structures (nilfs_binfo): - - _________________________________________________________________________ - | Summary | finfo | binfo | ... | binfo | finfo | binfo | ... | binfo |... - |_blocks__|___A___|_(A,1)_|_____|(A,Na)_|___B___|_(B,1)_|_____|(B,Nb)_|___ - - -The logs include regular files, directory files, symbolic link files -and several meta data files. The mata data files are the files used -to maintain file system meta data. The current version of NILFS2 uses -the following meta data files: - - 1) Inode file (ifile) -- Stores on-disk inodes - 2) Checkpoint file (cpfile) -- Stores checkpoints - 3) Segment usage file (sufile) -- Stores allocation state of segments - 4) Data address translation file -- Maps virtual block numbers to usual - (DAT) block numbers. This file serves to - make on-disk blocks relocatable. - -The following figure shows a typical organization of the logs: - - _________________________________________________________________________ - | Summary | regular file | file | ... | ifile | cpfile | sufile | DAT |SR| - |_blocks__|_or_directory_|_______|_____|_______|________|________|_____|__| - - -To stride over segment boundaries, this sequence of files may be split -into multiple logs. The sequence of logs that should be treated as -logically one log, is delimited with flags marked in the segment -summary. The recovery code of nilfs2 looks this boundary information -to ensure atomicity of updates. - -The super root block is inserted for every checkpoints. It includes -three special inodes, inodes for the DAT, cpfile, and sufile. Inodes -of regular files, directories, symlinks and other special files, are -included in the ifile. The inode of ifile itself is included in the -corresponding checkpoint entry in the cpfile. Thus, the hierarchy -among NILFS2 files can be depicted as follows: - - Super block (SB) - | - v - Super root block (the latest cno=xx) - |-- DAT - |-- sufile - `-- cpfile - |-- ifile (cno=c1) - |-- ifile (cno=c2) ---- file (ino=i1) - : : |-- file (ino=i2) - `-- ifile (cno=xx) |-- file (ino=i3) - : : - `-- file (ino=yy) - ( regular file, directory, or symlink ) - -For detail on the format of each file, please see nilfs2_ondisk.h -located at include/uapi/linux directory. - -There are no patents or other intellectual property that we protect -with regard to the design of NILFS2. It is allowed to replicate the -design in hopes that other operating systems could share (mount, read, -write, etc.) data stored in this format. -- cgit From 461f2c8f13fcc0d349e4acac46aacf63dbeb34ca Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:13 +0100 Subject: docs: filesystems: convert ntfs.txt to ReST - Add a SPDX header; - Adjust document title; - Comment out text-only ToC; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/f09ca6c9bdd4e7aa7208f3dba0b8753080b38d03.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 3 +- Documentation/filesystems/ntfs.rst | 466 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/ntfs.txt | 451 ---------------------------------- 3 files changed, 468 insertions(+), 452 deletions(-) create mode 100644 Documentation/filesystems/ntfs.rst delete mode 100644 Documentation/filesystems/ntfs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 01587704fcc9..62be53c4755d 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -74,7 +74,8 @@ Documentation for filesystem implementations. inotify isofs nilfs2 + nfs/index + ntfs overlayfs virtiofs vfat - nfs/index diff --git a/Documentation/filesystems/ntfs.rst b/Documentation/filesystems/ntfs.rst new file mode 100644 index 000000000000..5bb093a26485 --- /dev/null +++ b/Documentation/filesystems/ntfs.rst @@ -0,0 +1,466 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================ +The Linux NTFS filesystem driver +================================ + + +.. Table of contents + + - Overview + - Web site + - Features + - Supported mount options + - Known bugs and (mis-)features + - Using NTFS volume and stripe sets + - The Device-Mapper driver + - The Software RAID / MD driver + - Limitations when using the MD driver + + +Overview +======== + +Linux-NTFS comes with a number of user-space programs known as ntfsprogs. +These include mkntfs, a full-featured ntfs filesystem format utility, +ntfsundelete used for recovering files that were unintentionally deleted +from an NTFS volume and ntfsresize which is used to resize an NTFS partition. +See the web site for more information. + +To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file +system type 'ntfs'. The driver currently supports read-only mode (with no +fault-tolerance, encryption or journalling) and very limited, but safe, write +support. + +For fault tolerance and raid support (i.e. volume and stripe sets), you can +use the kernel's Software RAID / MD driver. See section "Using Software RAID +with NTFS" for details. + + +Web site +======== + +There is plenty of additional information on the linux-ntfs web site +at http://www.linux-ntfs.org/ + +The web site has a lot of additional information, such as a comprehensive +FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS +userspace utilities, etc. + + +Features +======== + +- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and + earlier kernels. This new driver implements NTFS read support and is + functionally equivalent to the old ntfs driver and it also implements limited + write support. The biggest limitation at present is that files/directories + cannot be created or deleted. See below for the list of write features that + are so far supported. Another limitation is that writing to compressed files + is not implemented at all. Also, neither read nor write access to encrypted + files is so far implemented. +- The new driver has full support for sparse files on NTFS 3.x volumes which + the old driver isn't happy with. +- The new driver supports execution of binaries due to mmap() now being + supported. +- The new driver supports loopback mounting of files on NTFS which is used by + some Linux distributions to enable the user to run Linux from an NTFS + partition by creating a large file while in Windows and then loopback + mounting the file while in Linux and creating a Linux filesystem on it that + is used to install Linux on it. +- A comparison of the two drivers using:: + + time find . -type f -exec md5sum "{}" \; + + run three times in sequence with each driver (after a reboot) on a 1.4GiB + NTFS partition, showed the new driver to be 20% faster in total time elapsed + (from 9:43 minutes on average down to 7:53). The time spent in user space + was unchanged but the time spent in the kernel was decreased by a factor of + 2.5 (from 85 CPU seconds down to 33). +- The driver does not support short file names in general. For backwards + compatibility, we implement access to files using their short file names if + they exist. The driver will not create short file names however, and a + rename will discard any existing short file name. +- The new driver supports exporting of mounted NTFS volumes via NFS. +- The new driver supports async io (aio). +- The new driver supports fsync(2), fdatasync(2), and msync(2). +- The new driver supports readv(2) and writev(2). +- The new driver supports access time updates (including mtime and ctime). +- The new driver supports truncate(2) and open(2) with O_TRUNC. But at present + only very limited support for highly fragmented files, i.e. ones which have + their data attribute split across multiple extents, is included. Another + limitation is that at present truncate(2) will never create sparse files, + since to mark a file sparse we need to modify the directory entry for the + file and we do not implement directory modifications yet. +- The new driver supports write(2) which can both overwrite existing data and + extend the file size so that you can write beyond the existing data. Also, + writing into sparse regions is supported and the holes are filled in with + clusters. But at present only limited support for highly fragmented files, + i.e. ones which have their data attribute split across multiple extents, is + included. Another limitation is that write(2) will never create sparse + files, since to mark a file sparse we need to modify the directory entry for + the file and we do not implement directory modifications yet. + +Supported mount options +======================= + +In addition to the generic mount options described by the manual page for the +mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the +following mount options: + +======================= ======================================================= +iocharset=name Deprecated option. Still supported but please use + nls=name in the future. See description for nls=name. + +nls=name Character set to use when returning file names. + Unlike VFAT, NTFS suppresses names that contain + unconvertible characters. Note that most character + sets contain insufficient characters to represent all + possible Unicode characters that can exist on NTFS. + To be sure you are not missing any files, you are + advised to use nls=utf8 which is capable of + representing all Unicode characters. + +utf8= Option no longer supported. Currently mapped to + nls=utf8 but please use nls=utf8 in the future and + make sure utf8 is compiled either as module or into + the kernel. See description for nls=name. + +uid= +gid= +umask= Provide default owner, group, and access mode mask. + These options work as documented in mount(8). By + default, the files/directories are owned by root and + he/she has read and write permissions, as well as + browse permission for directories. No one else has any + access permissions. I.e. the mode on all files is by + default rw------- and for directories rwx------, a + consequence of the default fmask=0177 and dmask=0077. + Using a umask of zero will grant all permissions to + everyone, i.e. all files and directories will have mode + rwxrwxrwx. + +fmask= +dmask= Instead of specifying umask which applies both to + files and directories, fmask applies only to files and + dmask only to directories. + +sloppy= If sloppy is specified, ignore unknown mount options. + Otherwise the default behaviour is to abort mount if + any unknown options are found. + +show_sys_files= If show_sys_files is specified, show the system files + in directory listings. Otherwise the default behaviour + is to hide the system files. + Note that even when show_sys_files is specified, "$MFT" + will not be visible due to bugs/mis-features in glibc. + Further, note that irrespective of show_sys_files, all + files are accessible by name, i.e. you can always do + "ls -l \$UpCase" for example to specifically show the + system file containing the Unicode upcase table. + +case_sensitive= If case_sensitive is specified, treat all file names as + case sensitive and create file names in the POSIX + namespace. Otherwise the default behaviour is to treat + file names as case insensitive and to create file names + in the WIN32/LONG name space. Note, the Linux NTFS + driver will never create short file names and will + remove them on rename/delete of the corresponding long + file name. + Note that files remain accessible via their short file + name, if it exists. If case_sensitive, you will need + to provide the correct case of the short file name. + +disable_sparse= If disable_sparse is specified, creation of sparse + regions, i.e. holes, inside files is disabled for the + volume (for the duration of this mount only). By + default, creation of sparse regions is enabled, which + is consistent with the behaviour of traditional Unix + filesystems. + +errors=opt What to do when critical filesystem errors are found. + Following values can be used for "opt": + + ======== ========================================= + continue DEFAULT, try to clean-up as much as + possible, e.g. marking a corrupt inode as + bad so it is no longer accessed, and then + continue. + recover At present only supported is recovery of + the boot sector from the backup copy. + If read-only mount, the recovery is done + in memory only and not written to disk. + ======== ========================================= + + Note that the options are additive, i.e. specifying:: + + errors=continue,errors=recover + + means the driver will attempt to recover and if that + fails it will clean-up as much as possible and + continue. + +mft_zone_multiplier= Set the MFT zone multiplier for the volume (this + setting is not persistent across mounts and can be + changed from mount to mount but cannot be changed on + remount). Values of 1 to 4 are allowed, 1 being the + default. The MFT zone multiplier determines how much + space is reserved for the MFT on the volume. If all + other space is used up, then the MFT zone will be + shrunk dynamically, so this has no impact on the + amount of free space. However, it can have an impact + on performance by affecting fragmentation of the MFT. + In general use the default. If you have a lot of small + files then use a higher value. The values have the + following meaning: + + ===== ================================= + Value MFT zone size (% of volume size) + ===== ================================= + 1 12.5% + 2 25% + 3 37.5% + 4 50% + ===== ================================= + + Note this option is irrelevant for read-only mounts. +======================= ======================================================= + + +Known bugs and (mis-)features +============================= + +- The link count on each directory inode entry is set to 1, due to Linux not + supporting directory hard links. This may well confuse some user space + applications, since the directory names will have the same inode numbers. + This also speeds up ntfs_read_inode() immensely. And we haven't found any + problems with this approach so far. If you find a problem with this, please + let us know. + + +Please send bug reports/comments/feedback/abuse to the Linux-NTFS development +list at sourceforge: linux-ntfs-dev@lists.sourceforge.net + + +Using NTFS volume and stripe sets +================================= + +For support of volume and stripe sets, you can either use the kernel's +Device-Mapper driver or the kernel's Software RAID / MD driver. The former is +the recommended one to use for linear raid. But the latter is required for +raid level 5. For striping and mirroring, either driver should work fine. + + +The Device-Mapper driver +------------------------ + +You will need to create a table of the components of the volume/stripe set and +how they fit together and load this into the kernel using the dmsetup utility +(see man 8 dmsetup). + +Linear volume sets, i.e. linear raid, has been tested and works fine. Even +though untested, there is no reason why stripe sets, i.e. raid level 0, and +mirrors, i.e. raid level 1 should not work, too. Stripes with parity, i.e. +raid level 5, unfortunately cannot work yet because the current version of the +Device-Mapper driver does not support raid level 5. You may be able to use the +Software RAID / MD driver for raid level 5, see the next section for details. + +To create the table describing your volume you will need to know each of its +components and their sizes in sectors, i.e. multiples of 512-byte blocks. + +For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for +example if one of your partitions is /dev/hda2 you would do:: + + $ fdisk -ul /dev/hda + + Disk /dev/hda: 81.9 GB, 81964302336 bytes + 255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors + Units = sectors of 1 * 512 = 512 bytes + + Device Boot Start End Blocks Id System + /dev/hda1 * 63 4209029 2104483+ 83 Linux + /dev/hda2 4209030 37768814 16779892+ 86 NTFS + /dev/hda3 37768815 46170809 4200997+ 83 Linux + +And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 = +33559785 sectors. + +For Win2k and later dynamic disks, you can for example use the ldminfo utility +which is part of the Linux LDM tools (the latest version at the time of +writing is linux-ldm-0.0.8.tar.bz2). You can download it from: + + http://www.linux-ntfs.org/ + +Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go +into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You +will find the precompiled (i386) ldminfo utility there. NOTE: You will not be +able to compile this yourself easily so use the binary version! + +Then you would use ldminfo in dump mode to obtain the necessary information:: + + $ ./ldminfo --dump /dev/hda + +This would dump the LDM database found on /dev/hda which describes all of your +dynamic disks and all the volumes on them. At the bottom you will see the +VOLUME DEFINITIONS section which is all you really need. You may need to look +further above to determine which of the disks in the volume definitions is +which device in Linux. Hint: Run ldminfo on each of your dynamic disks and +look at the Disk Id close to the top of the output for each (the PRIVATE HEADER +section). You can then find these Disk Ids in the VBLK DATABASE section in the + components where you will get the LDM Name for the disk that is found in +the VOLUME DEFINITIONS section. + +Note you will also need to enable the LDM driver in the Linux kernel. If your +distribution did not enable it, you will need to recompile the kernel with it +enabled. This will create the LDM partitions on each device at boot time. You +would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc) +in the Device-Mapper table. + +You can also bypass using the LDM driver by using the main device (e.g. +/dev/hda) and then using the offsets of the LDM partitions into this device as +the "Start sector of device" when creating the table. Once again ldminfo would +give you the correct information to do this. + +Assuming you know all your devices and their sizes things are easy. + +For a linear raid the table would look like this (note all values are in +512-byte sectors):: + + # Offset into Size of this Raid type Device Start sector + # volume device of device + 0 1028161 linear /dev/hda1 0 + 1028161 3903762 linear /dev/hdb2 0 + 4931923 2103211 linear /dev/hdc1 0 + +For a striped volume, i.e. raid level 0, you will need to know the chunk size +you used when creating the volume. Windows uses 64kiB as the default, so it +will probably be this unless you changes the defaults when creating the array. + +For a raid level 0 the table would look like this (note all values are in +512-byte sectors):: + + # Offset Size Raid Number Chunk 1st Start 2nd Start + # into of the type of size Device in Device in + # volume volume stripes device device + 0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0 + +If there are more than two devices, just add each of them to the end of the +line. + +Finally, for a mirrored volume, i.e. raid level 1, the table would look like +this (note all values are in 512-byte sectors):: + + # Ofs Size Raid Log Number Region Should Number Source Start Target Start + # in of the type type of log size sync? of Device in Device in + # vol volume params mirrors Device Device + 0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0 + +If you are mirroring to multiple devices you can specify further targets at the +end of the line. + +Note the "Should sync?" parameter "nosync" means that the two mirrors are +already in sync which will be the case on a clean shutdown of Windows. If the +mirrors are not clean, you can specify the "sync" option instead of "nosync" +and the Device-Mapper driver will then copy the entirety of the "Source Device" +to the "Target Device" or if you specified multiple target devices to all of +them. + +Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1), +and hand it over to dmsetup to work with, like so:: + + $ dmsetup create myvolume1 /etc/ntfsvolume1 + +You can obviously replace "myvolume1" with whatever name you like. + +If it all worked, you will now have the device /dev/device-mapper/myvolume1 +which you can then just use as an argument to the mount command as usual to +mount the ntfs volume. For example:: + + $ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1 + +(You need to create the directory /mnt/myvol1 first and of course you can use +anything you like instead of /mnt/myvol1 as long as it is an existing +directory.) + +It is advisable to do the mount read-only to see if the volume has been setup +correctly to avoid the possibility of causing damage to the data on the ntfs +volume. + + +The Software RAID / MD driver +----------------------------- + +An alternative to using the Device-Mapper driver is to use the kernel's +Software RAID / MD driver. For which you need to set up your /etc/raidtab +appropriately (see man 5 raidtab). + +Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level +0, have been tested and work fine (though see section "Limitations when using +the MD driver with NTFS volumes" especially if you want to use linear raid). +Even though untested, there is no reason why mirrors, i.e. raid level 1, and +stripes with parity, i.e. raid level 5, should not work, too. + +You have to use the "persistent-superblock 0" option for each raid-disk in the +NTFS volume/stripe you are configuring in /etc/raidtab as the persistent +superblock used by the MD driver would damage the NTFS volume. + +Windows by default uses a stripe chunk size of 64k, so you probably want the +"chunk-size 64k" option for each raid-disk, too. + +For example, if you have a stripe set consisting of two partitions /dev/hda5 +and /dev/hdb1 your /etc/raidtab would look like this:: + + raiddev /dev/md0 + raid-level 0 + nr-raid-disks 2 + nr-spare-disks 0 + persistent-superblock 0 + chunk-size 64k + device /dev/hda5 + raid-disk 0 + device /dev/hdb1 + raid-disk 1 + +For linear raid, just change the raid-level above to "raid-level linear", for +mirrors, change it to "raid-level 1", and for stripe sets with parity, change +it to "raid-level 5". + +Note for stripe sets with parity you will also need to tell the MD driver +which parity algorithm to use by specifying the option "parity-algorithm +which", where you need to replace "which" with the name of the algorithm to +use (see man 5 raidtab for available algorithms) and you will have to try the +different available algorithms until you find one that works. Make sure you +are working read-only when playing with this as you may damage your data +otherwise. If you find which algorithm works please let us know (email the +linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on +IRC in channel #ntfs on the irc.freenode.net network) so we can update this +documentation. + +Once the raidtab is setup, run for example raid0run -a to start all devices or +raid0run /dev/md0 to start a particular md device, in this case /dev/md0. + +Then just use the mount command as usual to mount the ntfs volume using for +example:: + + mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume + +It is advisable to do the mount read-only to see if the md volume has been +setup correctly to avoid the possibility of causing damage to the data on the +ntfs volume. + + +Limitations when using the Software RAID / MD driver +----------------------------------------------------- + +Using the md driver will not work properly if any of your NTFS partitions have +an odd number of sectors. This is especially important for linear raid as all +data after the first partition with an odd number of sectors will be offset by +one or more sectors so if you mount such a partition with write support you +will cause massive damage to the data on the volume which will only become +apparent when you try to use the volume again under Windows. + +So when using linear raid, make sure that all your partitions have an even +number of sectors BEFORE attempting to use it. You have been warned! + +Even better is to simply use the Device-Mapper for linear raid and then you do +not have this problem with odd numbers of sectors. diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.txt deleted file mode 100644 index 553f10d03076..000000000000 --- a/Documentation/filesystems/ntfs.txt +++ /dev/null @@ -1,451 +0,0 @@ -The Linux NTFS filesystem driver -================================ - - -Table of contents -================= - -- Overview -- Web site -- Features -- Supported mount options -- Known bugs and (mis-)features -- Using NTFS volume and stripe sets - - The Device-Mapper driver - - The Software RAID / MD driver - - Limitations when using the MD driver - - -Overview -======== - -Linux-NTFS comes with a number of user-space programs known as ntfsprogs. -These include mkntfs, a full-featured ntfs filesystem format utility, -ntfsundelete used for recovering files that were unintentionally deleted -from an NTFS volume and ntfsresize which is used to resize an NTFS partition. -See the web site for more information. - -To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file -system type 'ntfs'. The driver currently supports read-only mode (with no -fault-tolerance, encryption or journalling) and very limited, but safe, write -support. - -For fault tolerance and raid support (i.e. volume and stripe sets), you can -use the kernel's Software RAID / MD driver. See section "Using Software RAID -with NTFS" for details. - - -Web site -======== - -There is plenty of additional information on the linux-ntfs web site -at http://www.linux-ntfs.org/ - -The web site has a lot of additional information, such as a comprehensive -FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS -userspace utilities, etc. - - -Features -======== - -- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and - earlier kernels. This new driver implements NTFS read support and is - functionally equivalent to the old ntfs driver and it also implements limited - write support. The biggest limitation at present is that files/directories - cannot be created or deleted. See below for the list of write features that - are so far supported. Another limitation is that writing to compressed files - is not implemented at all. Also, neither read nor write access to encrypted - files is so far implemented. -- The new driver has full support for sparse files on NTFS 3.x volumes which - the old driver isn't happy with. -- The new driver supports execution of binaries due to mmap() now being - supported. -- The new driver supports loopback mounting of files on NTFS which is used by - some Linux distributions to enable the user to run Linux from an NTFS - partition by creating a large file while in Windows and then loopback - mounting the file while in Linux and creating a Linux filesystem on it that - is used to install Linux on it. -- A comparison of the two drivers using: - time find . -type f -exec md5sum "{}" \; - run three times in sequence with each driver (after a reboot) on a 1.4GiB - NTFS partition, showed the new driver to be 20% faster in total time elapsed - (from 9:43 minutes on average down to 7:53). The time spent in user space - was unchanged but the time spent in the kernel was decreased by a factor of - 2.5 (from 85 CPU seconds down to 33). -- The driver does not support short file names in general. For backwards - compatibility, we implement access to files using their short file names if - they exist. The driver will not create short file names however, and a - rename will discard any existing short file name. -- The new driver supports exporting of mounted NTFS volumes via NFS. -- The new driver supports async io (aio). -- The new driver supports fsync(2), fdatasync(2), and msync(2). -- The new driver supports readv(2) and writev(2). -- The new driver supports access time updates (including mtime and ctime). -- The new driver supports truncate(2) and open(2) with O_TRUNC. But at present - only very limited support for highly fragmented files, i.e. ones which have - their data attribute split across multiple extents, is included. Another - limitation is that at present truncate(2) will never create sparse files, - since to mark a file sparse we need to modify the directory entry for the - file and we do not implement directory modifications yet. -- The new driver supports write(2) which can both overwrite existing data and - extend the file size so that you can write beyond the existing data. Also, - writing into sparse regions is supported and the holes are filled in with - clusters. But at present only limited support for highly fragmented files, - i.e. ones which have their data attribute split across multiple extents, is - included. Another limitation is that write(2) will never create sparse - files, since to mark a file sparse we need to modify the directory entry for - the file and we do not implement directory modifications yet. - -Supported mount options -======================= - -In addition to the generic mount options described by the manual page for the -mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the -following mount options: - -iocharset=name Deprecated option. Still supported but please use - nls=name in the future. See description for nls=name. - -nls=name Character set to use when returning file names. - Unlike VFAT, NTFS suppresses names that contain - unconvertible characters. Note that most character - sets contain insufficient characters to represent all - possible Unicode characters that can exist on NTFS. - To be sure you are not missing any files, you are - advised to use nls=utf8 which is capable of - representing all Unicode characters. - -utf8= Option no longer supported. Currently mapped to - nls=utf8 but please use nls=utf8 in the future and - make sure utf8 is compiled either as module or into - the kernel. See description for nls=name. - -uid= -gid= -umask= Provide default owner, group, and access mode mask. - These options work as documented in mount(8). By - default, the files/directories are owned by root and - he/she has read and write permissions, as well as - browse permission for directories. No one else has any - access permissions. I.e. the mode on all files is by - default rw------- and for directories rwx------, a - consequence of the default fmask=0177 and dmask=0077. - Using a umask of zero will grant all permissions to - everyone, i.e. all files and directories will have mode - rwxrwxrwx. - -fmask= -dmask= Instead of specifying umask which applies both to - files and directories, fmask applies only to files and - dmask only to directories. - -sloppy= If sloppy is specified, ignore unknown mount options. - Otherwise the default behaviour is to abort mount if - any unknown options are found. - -show_sys_files= If show_sys_files is specified, show the system files - in directory listings. Otherwise the default behaviour - is to hide the system files. - Note that even when show_sys_files is specified, "$MFT" - will not be visible due to bugs/mis-features in glibc. - Further, note that irrespective of show_sys_files, all - files are accessible by name, i.e. you can always do - "ls -l \$UpCase" for example to specifically show the - system file containing the Unicode upcase table. - -case_sensitive= If case_sensitive is specified, treat all file names as - case sensitive and create file names in the POSIX - namespace. Otherwise the default behaviour is to treat - file names as case insensitive and to create file names - in the WIN32/LONG name space. Note, the Linux NTFS - driver will never create short file names and will - remove them on rename/delete of the corresponding long - file name. - Note that files remain accessible via their short file - name, if it exists. If case_sensitive, you will need - to provide the correct case of the short file name. - -disable_sparse= If disable_sparse is specified, creation of sparse - regions, i.e. holes, inside files is disabled for the - volume (for the duration of this mount only). By - default, creation of sparse regions is enabled, which - is consistent with the behaviour of traditional Unix - filesystems. - -errors=opt What to do when critical filesystem errors are found. - Following values can be used for "opt": - continue: DEFAULT, try to clean-up as much as - possible, e.g. marking a corrupt inode as - bad so it is no longer accessed, and then - continue. - recover: At present only supported is recovery of - the boot sector from the backup copy. - If read-only mount, the recovery is done - in memory only and not written to disk. - Note that the options are additive, i.e. specifying: - errors=continue,errors=recover - means the driver will attempt to recover and if that - fails it will clean-up as much as possible and - continue. - -mft_zone_multiplier= Set the MFT zone multiplier for the volume (this - setting is not persistent across mounts and can be - changed from mount to mount but cannot be changed on - remount). Values of 1 to 4 are allowed, 1 being the - default. The MFT zone multiplier determines how much - space is reserved for the MFT on the volume. If all - other space is used up, then the MFT zone will be - shrunk dynamically, so this has no impact on the - amount of free space. However, it can have an impact - on performance by affecting fragmentation of the MFT. - In general use the default. If you have a lot of small - files then use a higher value. The values have the - following meaning: - Value MFT zone size (% of volume size) - 1 12.5% - 2 25% - 3 37.5% - 4 50% - Note this option is irrelevant for read-only mounts. - - -Known bugs and (mis-)features -============================= - -- The link count on each directory inode entry is set to 1, due to Linux not - supporting directory hard links. This may well confuse some user space - applications, since the directory names will have the same inode numbers. - This also speeds up ntfs_read_inode() immensely. And we haven't found any - problems with this approach so far. If you find a problem with this, please - let us know. - - -Please send bug reports/comments/feedback/abuse to the Linux-NTFS development -list at sourceforge: linux-ntfs-dev@lists.sourceforge.net - - -Using NTFS volume and stripe sets -================================= - -For support of volume and stripe sets, you can either use the kernel's -Device-Mapper driver or the kernel's Software RAID / MD driver. The former is -the recommended one to use for linear raid. But the latter is required for -raid level 5. For striping and mirroring, either driver should work fine. - - -The Device-Mapper driver ------------------------- - -You will need to create a table of the components of the volume/stripe set and -how they fit together and load this into the kernel using the dmsetup utility -(see man 8 dmsetup). - -Linear volume sets, i.e. linear raid, has been tested and works fine. Even -though untested, there is no reason why stripe sets, i.e. raid level 0, and -mirrors, i.e. raid level 1 should not work, too. Stripes with parity, i.e. -raid level 5, unfortunately cannot work yet because the current version of the -Device-Mapper driver does not support raid level 5. You may be able to use the -Software RAID / MD driver for raid level 5, see the next section for details. - -To create the table describing your volume you will need to know each of its -components and their sizes in sectors, i.e. multiples of 512-byte blocks. - -For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for -example if one of your partitions is /dev/hda2 you would do: - -$ fdisk -ul /dev/hda - -Disk /dev/hda: 81.9 GB, 81964302336 bytes -255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors -Units = sectors of 1 * 512 = 512 bytes - - Device Boot Start End Blocks Id System - /dev/hda1 * 63 4209029 2104483+ 83 Linux - /dev/hda2 4209030 37768814 16779892+ 86 NTFS - /dev/hda3 37768815 46170809 4200997+ 83 Linux - -And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 = -33559785 sectors. - -For Win2k and later dynamic disks, you can for example use the ldminfo utility -which is part of the Linux LDM tools (the latest version at the time of -writing is linux-ldm-0.0.8.tar.bz2). You can download it from: - http://www.linux-ntfs.org/ -Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go -into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You -will find the precompiled (i386) ldminfo utility there. NOTE: You will not be -able to compile this yourself easily so use the binary version! - -Then you would use ldminfo in dump mode to obtain the necessary information: - -$ ./ldminfo --dump /dev/hda - -This would dump the LDM database found on /dev/hda which describes all of your -dynamic disks and all the volumes on them. At the bottom you will see the -VOLUME DEFINITIONS section which is all you really need. You may need to look -further above to determine which of the disks in the volume definitions is -which device in Linux. Hint: Run ldminfo on each of your dynamic disks and -look at the Disk Id close to the top of the output for each (the PRIVATE HEADER -section). You can then find these Disk Ids in the VBLK DATABASE section in the - components where you will get the LDM Name for the disk that is found in -the VOLUME DEFINITIONS section. - -Note you will also need to enable the LDM driver in the Linux kernel. If your -distribution did not enable it, you will need to recompile the kernel with it -enabled. This will create the LDM partitions on each device at boot time. You -would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc) -in the Device-Mapper table. - -You can also bypass using the LDM driver by using the main device (e.g. -/dev/hda) and then using the offsets of the LDM partitions into this device as -the "Start sector of device" when creating the table. Once again ldminfo would -give you the correct information to do this. - -Assuming you know all your devices and their sizes things are easy. - -For a linear raid the table would look like this (note all values are in -512-byte sectors): - ---- cut here --- -# Offset into Size of this Raid type Device Start sector -# volume device of device -0 1028161 linear /dev/hda1 0 -1028161 3903762 linear /dev/hdb2 0 -4931923 2103211 linear /dev/hdc1 0 ---- cut here --- - -For a striped volume, i.e. raid level 0, you will need to know the chunk size -you used when creating the volume. Windows uses 64kiB as the default, so it -will probably be this unless you changes the defaults when creating the array. - -For a raid level 0 the table would look like this (note all values are in -512-byte sectors): - ---- cut here --- -# Offset Size Raid Number Chunk 1st Start 2nd Start -# into of the type of size Device in Device in -# volume volume stripes device device -0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0 ---- cut here --- - -If there are more than two devices, just add each of them to the end of the -line. - -Finally, for a mirrored volume, i.e. raid level 1, the table would look like -this (note all values are in 512-byte sectors): - ---- cut here --- -# Ofs Size Raid Log Number Region Should Number Source Start Target Start -# in of the type type of log size sync? of Device in Device in -# vol volume params mirrors Device Device -0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0 ---- cut here --- - -If you are mirroring to multiple devices you can specify further targets at the -end of the line. - -Note the "Should sync?" parameter "nosync" means that the two mirrors are -already in sync which will be the case on a clean shutdown of Windows. If the -mirrors are not clean, you can specify the "sync" option instead of "nosync" -and the Device-Mapper driver will then copy the entirety of the "Source Device" -to the "Target Device" or if you specified multiple target devices to all of -them. - -Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1), -and hand it over to dmsetup to work with, like so: - -$ dmsetup create myvolume1 /etc/ntfsvolume1 - -You can obviously replace "myvolume1" with whatever name you like. - -If it all worked, you will now have the device /dev/device-mapper/myvolume1 -which you can then just use as an argument to the mount command as usual to -mount the ntfs volume. For example: - -$ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1 - -(You need to create the directory /mnt/myvol1 first and of course you can use -anything you like instead of /mnt/myvol1 as long as it is an existing -directory.) - -It is advisable to do the mount read-only to see if the volume has been setup -correctly to avoid the possibility of causing damage to the data on the ntfs -volume. - - -The Software RAID / MD driver ------------------------------ - -An alternative to using the Device-Mapper driver is to use the kernel's -Software RAID / MD driver. For which you need to set up your /etc/raidtab -appropriately (see man 5 raidtab). - -Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level -0, have been tested and work fine (though see section "Limitations when using -the MD driver with NTFS volumes" especially if you want to use linear raid). -Even though untested, there is no reason why mirrors, i.e. raid level 1, and -stripes with parity, i.e. raid level 5, should not work, too. - -You have to use the "persistent-superblock 0" option for each raid-disk in the -NTFS volume/stripe you are configuring in /etc/raidtab as the persistent -superblock used by the MD driver would damage the NTFS volume. - -Windows by default uses a stripe chunk size of 64k, so you probably want the -"chunk-size 64k" option for each raid-disk, too. - -For example, if you have a stripe set consisting of two partitions /dev/hda5 -and /dev/hdb1 your /etc/raidtab would look like this: - -raiddev /dev/md0 - raid-level 0 - nr-raid-disks 2 - nr-spare-disks 0 - persistent-superblock 0 - chunk-size 64k - device /dev/hda5 - raid-disk 0 - device /dev/hdb1 - raid-disk 1 - -For linear raid, just change the raid-level above to "raid-level linear", for -mirrors, change it to "raid-level 1", and for stripe sets with parity, change -it to "raid-level 5". - -Note for stripe sets with parity you will also need to tell the MD driver -which parity algorithm to use by specifying the option "parity-algorithm -which", where you need to replace "which" with the name of the algorithm to -use (see man 5 raidtab for available algorithms) and you will have to try the -different available algorithms until you find one that works. Make sure you -are working read-only when playing with this as you may damage your data -otherwise. If you find which algorithm works please let us know (email the -linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on -IRC in channel #ntfs on the irc.freenode.net network) so we can update this -documentation. - -Once the raidtab is setup, run for example raid0run -a to start all devices or -raid0run /dev/md0 to start a particular md device, in this case /dev/md0. - -Then just use the mount command as usual to mount the ntfs volume using for -example: mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume - -It is advisable to do the mount read-only to see if the md volume has been -setup correctly to avoid the possibility of causing damage to the data on the -ntfs volume. - - -Limitations when using the Software RAID / MD driver ------------------------------------------------------ - -Using the md driver will not work properly if any of your NTFS partitions have -an odd number of sectors. This is especially important for linear raid as all -data after the first partition with an odd number of sectors will be offset by -one or more sectors so if you mount such a partition with write support you -will cause massive damage to the data on the volume which will only become -apparent when you try to use the volume again under Windows. - -So when using linear raid, make sure that all your partitions have an even -number of sectors BEFORE attempting to use it. You have been warned! - -Even better is to simply use the Device-Mapper for linear raid and then you do -not have this problem with odd numbers of sectors. -- cgit From 3d0c60d004644630f1431ce486e76adcc829e288 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:14 +0100 Subject: docs: filesystems: convert ocfs2-online-filecheck.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/6007166acc3252697755836354bd29b5a5fb82aa.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + .../filesystems/ocfs2-online-filecheck.rst | 99 ++++++++++++++++++++++ .../filesystems/ocfs2-online-filecheck.txt | 94 -------------------- 3 files changed, 100 insertions(+), 94 deletions(-) create mode 100644 Documentation/filesystems/ocfs2-online-filecheck.rst delete mode 100644 Documentation/filesystems/ocfs2-online-filecheck.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 62be53c4755d..f3a26fdbd04f 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -76,6 +76,7 @@ Documentation for filesystem implementations. nilfs2 nfs/index ntfs + ocfs2-online-filecheck overlayfs virtiofs vfat diff --git a/Documentation/filesystems/ocfs2-online-filecheck.rst b/Documentation/filesystems/ocfs2-online-filecheck.rst new file mode 100644 index 000000000000..2257bb53edc1 --- /dev/null +++ b/Documentation/filesystems/ocfs2-online-filecheck.rst @@ -0,0 +1,99 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================== +OCFS2 file system - online file check +===================================== + +This document will describe OCFS2 online file check feature. + +Introduction +============ +OCFS2 is often used in high-availability systems. However, OCFS2 usually +converts the filesystem to read-only when encounters an error. This may not be +necessary, since turning the filesystem read-only would affect other running +processes as well, decreasing availability. +Then, a mount option (errors=continue) is introduced, which would return the +-EIO errno to the calling process and terminate further processing so that the +filesystem is not corrupted further. The filesystem is not converted to +read-only, and the problematic file's inode number is reported in the kernel +log. The user can try to check/fix this file via online filecheck feature. + +Scope +===== +This effort is to check/fix small issues which may hinder day-to-day operations +of a cluster filesystem by turning the filesystem read-only. The scope of +checking/fixing is at the file level, initially for regular files and eventually +to all files (including system files) of the filesystem. + +In case of directory to file links is incorrect, the directory inode is +reported as erroneous. + +This feature is not suited for extravagant checks which involve dependency of +other components of the filesystem, such as but not limited to, checking if the +bits for file blocks in the allocation has been set. In case of such an error, +the offline fsck should/would be recommended. + +Finally, such an operation/feature should not be automated lest the filesystem +may end up with more damage than before the repair attempt. So, this has to +be performed using user interaction and consent. + +User interface +============== +When there are errors in the OCFS2 filesystem, they are usually accompanied +by the inode number which caused the error. This inode number would be the +input to check/fix the file. + +There is a sysfs directory for each OCFS2 file system mounting:: + + /sys/fs/ocfs2//filecheck + +Here, indicates the name of OCFS2 volume device which has been already +mounted. The file above would accept inode numbers. This could be used to +communicate with kernel space, tell which file(inode number) will be checked or +fixed. Currently, three operations are supported, which includes checking +inode, fixing inode and setting the size of result record history. + +1. If you want to know what error exactly happened to before fixing, do:: + + # echo "" > /sys/fs/ocfs2//filecheck/check + # cat /sys/fs/ocfs2//filecheck/check + +The output is like this:: + + INO DONE ERROR + 39502 1 GENERATION + + lists the inode numbers. + indicates whether the operation has been finished. + says what kind of errors was found. For the detailed error numbers, + please refer to the file linux/fs/ocfs2/filecheck.h. + +2. If you determine to fix this inode, do:: + + # echo "" > /sys/fs/ocfs2//filecheck/fix + # cat /sys/fs/ocfs2//filecheck/fix + +The output is like this::: + + INO DONE ERROR + 39502 1 SUCCESS + +This time, the column indicates whether this fix is successful or not. + +3. The record cache is used to store the history of check/fix results. It's +default size is 10, and can be adjust between the range of 10 ~ 100. You can +adjust the size like this:: + + # echo "" > /sys/fs/ocfs2//filecheck/set + +Fixing stuff +============ +On receiving the inode, the filesystem would read the inode and the +file metadata. In case of errors, the filesystem would fix the errors +and report the problems it fixed in the kernel log. As a precautionary measure, +the inode must first be checked for errors before performing a final fix. + +The inode and the result history will be maintained temporarily in a +small linked list buffer which would contain the last (N) inodes +fixed/checked, the detailed errors which were fixed/checked are printed in the +kernel log. diff --git a/Documentation/filesystems/ocfs2-online-filecheck.txt b/Documentation/filesystems/ocfs2-online-filecheck.txt deleted file mode 100644 index 139fab175c8a..000000000000 --- a/Documentation/filesystems/ocfs2-online-filecheck.txt +++ /dev/null @@ -1,94 +0,0 @@ - OCFS2 online file check - ----------------------- - -This document will describe OCFS2 online file check feature. - -Introduction -============ -OCFS2 is often used in high-availability systems. However, OCFS2 usually -converts the filesystem to read-only when encounters an error. This may not be -necessary, since turning the filesystem read-only would affect other running -processes as well, decreasing availability. -Then, a mount option (errors=continue) is introduced, which would return the --EIO errno to the calling process and terminate further processing so that the -filesystem is not corrupted further. The filesystem is not converted to -read-only, and the problematic file's inode number is reported in the kernel -log. The user can try to check/fix this file via online filecheck feature. - -Scope -===== -This effort is to check/fix small issues which may hinder day-to-day operations -of a cluster filesystem by turning the filesystem read-only. The scope of -checking/fixing is at the file level, initially for regular files and eventually -to all files (including system files) of the filesystem. - -In case of directory to file links is incorrect, the directory inode is -reported as erroneous. - -This feature is not suited for extravagant checks which involve dependency of -other components of the filesystem, such as but not limited to, checking if the -bits for file blocks in the allocation has been set. In case of such an error, -the offline fsck should/would be recommended. - -Finally, such an operation/feature should not be automated lest the filesystem -may end up with more damage than before the repair attempt. So, this has to -be performed using user interaction and consent. - -User interface -============== -When there are errors in the OCFS2 filesystem, they are usually accompanied -by the inode number which caused the error. This inode number would be the -input to check/fix the file. - -There is a sysfs directory for each OCFS2 file system mounting: - - /sys/fs/ocfs2//filecheck - -Here, indicates the name of OCFS2 volume device which has been already -mounted. The file above would accept inode numbers. This could be used to -communicate with kernel space, tell which file(inode number) will be checked or -fixed. Currently, three operations are supported, which includes checking -inode, fixing inode and setting the size of result record history. - -1. If you want to know what error exactly happened to before fixing, do - - # echo "" > /sys/fs/ocfs2//filecheck/check - # cat /sys/fs/ocfs2//filecheck/check - -The output is like this: - INO DONE ERROR -39502 1 GENERATION - - lists the inode numbers. - indicates whether the operation has been finished. - says what kind of errors was found. For the detailed error numbers, -please refer to the file linux/fs/ocfs2/filecheck.h. - -2. If you determine to fix this inode, do - - # echo "" > /sys/fs/ocfs2//filecheck/fix - # cat /sys/fs/ocfs2//filecheck/fix - -The output is like this: - INO DONE ERROR -39502 1 SUCCESS - -This time, the column indicates whether this fix is successful or not. - -3. The record cache is used to store the history of check/fix results. It's -default size is 10, and can be adjust between the range of 10 ~ 100. You can -adjust the size like this: - - # echo "" > /sys/fs/ocfs2//filecheck/set - -Fixing stuff -============ -On receiving the inode, the filesystem would read the inode and the -file metadata. In case of errors, the filesystem would fix the errors -and report the problems it fixed in the kernel log. As a precautionary measure, -the inode must first be checked for errors before performing a final fix. - -The inode and the result history will be maintained temporarily in a -small linked list buffer which would contain the last (N) inodes -fixed/checked, the detailed errors which were fixed/checked are printed in the -kernel log. -- cgit From fa95e087ff69468b4e452c50c3f4c59a45846b8d Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:15 +0100 Subject: docs: filesystems: convert ocfs2.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Joseph Qi Link: https://lore.kernel.org/r/e29a8120bf1d847f23fb68e915f10a7d43bed9e3.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/ocfs2.rst | 117 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/ocfs2.txt | 106 -------------------------------- 3 files changed, 118 insertions(+), 106 deletions(-) create mode 100644 Documentation/filesystems/ocfs2.rst delete mode 100644 Documentation/filesystems/ocfs2.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index f3a26fdbd04f..3b2b07491c98 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -76,6 +76,7 @@ Documentation for filesystem implementations. nilfs2 nfs/index ntfs + ocfs2 ocfs2-online-filecheck overlayfs virtiofs diff --git a/Documentation/filesystems/ocfs2.rst b/Documentation/filesystems/ocfs2.rst new file mode 100644 index 000000000000..412386bc6506 --- /dev/null +++ b/Documentation/filesystems/ocfs2.rst @@ -0,0 +1,117 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================ +OCFS2 filesystem +================ + +OCFS2 is a general purpose extent based shared disk cluster file +system with many similarities to ext3. It supports 64 bit inode +numbers, and has automatically extending metadata groups which may +also make it attractive for non-clustered use. + +You'll want to install the ocfs2-tools package in order to at least +get "mount.ocfs2" and "ocfs2_hb_ctl". + +Project web page: http://ocfs2.wiki.kernel.org +Tools git tree: https://github.com/markfasheh/ocfs2-tools +OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ + +All code copyright 2005 Oracle except when otherwise noted. + +Credits +======= + +Lots of code taken from ext3 and other projects. + +Authors in alphabetical order: + +- Joel Becker +- Zach Brown +- Mark Fasheh +- Kurt Hackel +- Tao Ma +- Sunil Mushran +- Manish Singh +- Tiger Yang + +Caveats +======= +Features which OCFS2 does not support yet: + + - Directory change notification (F_NOTIFY) + - Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease) + +Mount options +============= + +OCFS2 supports the following mount options: + +(*) == default + +======================= ======================================================== +barrier=1 This enables/disables barriers. barrier=0 disables it, + barrier=1 enables it. +errors=remount-ro(*) Remount the filesystem read-only on an error. +errors=panic Panic and halt the machine if an error occurs. +intr (*) Allow signals to interrupt cluster operations. +nointr Do not allow signals to interrupt cluster + operations. +noatime Do not update access time. +relatime(*) Update atime if the previous atime is older than + mtime or ctime +strictatime Always update atime, but the minimum update interval + is specified by atime_quantum. +atime_quantum=60(*) OCFS2 will not update atime unless this number + of seconds has passed since the last update. + Set to zero to always update atime. This option need + work with strictatime. +data=ordered (*) All data are forced directly out to the main file + system prior to its metadata being committed to the + journal. +data=writeback Data ordering is not preserved, data may be written + into the main file system after its metadata has been + committed to the journal. +preferred_slot=0(*) During mount, try to use this filesystem slot first. If + it is in use by another node, the first empty one found + will be chosen. Invalid values will be ignored. +commit=nrsec (*) Ocfs2 can be told to sync all its data and metadata + every 'nrsec' seconds. The default value is 5 seconds. + This means that if you lose your power, you will lose + as much as the latest 5 seconds of work (your + filesystem will not be damaged though, thanks to the + journaling). This default value (or any low value) + will hurt performance, but it's good for data-safety. + Setting it to 0 will have the same effect as leaving + it at the default (5 seconds). + Setting it to very large values will improve + performance. +localalloc=8(*) Allows custom localalloc size in MB. If the value is too + large, the fs will silently revert it to the default. +localflocks This disables cluster aware flock. +inode64 Indicates that Ocfs2 is allowed to create inodes at + any location in the filesystem, including those which + will result in inode numbers occupying more than 32 + bits of significance. +user_xattr (*) Enables Extended User Attributes. +nouser_xattr Disables Extended User Attributes. +acl Enables POSIX Access Control Lists support. +noacl (*) Disables POSIX Access Control Lists support. +resv_level=2 (*) Set how aggressive allocation reservations will be. + Valid values are between 0 (reservations off) to 8 + (maximum space for reservations). +dir_resv_level= (*) By default, directory reservations will scale with file + reservations - users should rarely need to change this + value. If allocation reservations are turned off, this + option will have no effect. +coherency=full (*) Disallow concurrent O_DIRECT writes, cluster inode + lock will be taken to force other nodes drop cache, + therefore full cluster coherency is guaranteed even + for O_DIRECT writes. +coherency=buffered Allow concurrent O_DIRECT writes without EX lock among + nodes, which gains high performance at risk of getting + stale data on other nodes. +journal_async_commit Commit block can be written to disk without waiting + for descriptor blocks. If enabled older kernels cannot + mount the device. This will enable 'journal_checksum' + internally. +======================= ======================================================== diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt deleted file mode 100644 index 4c49e5410595..000000000000 --- a/Documentation/filesystems/ocfs2.txt +++ /dev/null @@ -1,106 +0,0 @@ -OCFS2 filesystem -================== -OCFS2 is a general purpose extent based shared disk cluster file -system with many similarities to ext3. It supports 64 bit inode -numbers, and has automatically extending metadata groups which may -also make it attractive for non-clustered use. - -You'll want to install the ocfs2-tools package in order to at least -get "mount.ocfs2" and "ocfs2_hb_ctl". - -Project web page: http://ocfs2.wiki.kernel.org -Tools git tree: https://github.com/markfasheh/ocfs2-tools -OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ - -All code copyright 2005 Oracle except when otherwise noted. - -CREDITS: -Lots of code taken from ext3 and other projects. - -Authors in alphabetical order: -Joel Becker -Zach Brown -Mark Fasheh -Kurt Hackel -Tao Ma -Sunil Mushran -Manish Singh -Tiger Yang - -Caveats -======= -Features which OCFS2 does not support yet: - - Directory change notification (F_NOTIFY) - - Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease) - -Mount options -============= - -OCFS2 supports the following mount options: -(*) == default - -barrier=1 This enables/disables barriers. barrier=0 disables it, - barrier=1 enables it. -errors=remount-ro(*) Remount the filesystem read-only on an error. -errors=panic Panic and halt the machine if an error occurs. -intr (*) Allow signals to interrupt cluster operations. -nointr Do not allow signals to interrupt cluster - operations. -noatime Do not update access time. -relatime(*) Update atime if the previous atime is older than - mtime or ctime -strictatime Always update atime, but the minimum update interval - is specified by atime_quantum. -atime_quantum=60(*) OCFS2 will not update atime unless this number - of seconds has passed since the last update. - Set to zero to always update atime. This option need - work with strictatime. -data=ordered (*) All data are forced directly out to the main file - system prior to its metadata being committed to the - journal. -data=writeback Data ordering is not preserved, data may be written - into the main file system after its metadata has been - committed to the journal. -preferred_slot=0(*) During mount, try to use this filesystem slot first. If - it is in use by another node, the first empty one found - will be chosen. Invalid values will be ignored. -commit=nrsec (*) Ocfs2 can be told to sync all its data and metadata - every 'nrsec' seconds. The default value is 5 seconds. - This means that if you lose your power, you will lose - as much as the latest 5 seconds of work (your - filesystem will not be damaged though, thanks to the - journaling). This default value (or any low value) - will hurt performance, but it's good for data-safety. - Setting it to 0 will have the same effect as leaving - it at the default (5 seconds). - Setting it to very large values will improve - performance. -localalloc=8(*) Allows custom localalloc size in MB. If the value is too - large, the fs will silently revert it to the default. -localflocks This disables cluster aware flock. -inode64 Indicates that Ocfs2 is allowed to create inodes at - any location in the filesystem, including those which - will result in inode numbers occupying more than 32 - bits of significance. -user_xattr (*) Enables Extended User Attributes. -nouser_xattr Disables Extended User Attributes. -acl Enables POSIX Access Control Lists support. -noacl (*) Disables POSIX Access Control Lists support. -resv_level=2 (*) Set how aggressive allocation reservations will be. - Valid values are between 0 (reservations off) to 8 - (maximum space for reservations). -dir_resv_level= (*) By default, directory reservations will scale with file - reservations - users should rarely need to change this - value. If allocation reservations are turned off, this - option will have no effect. -coherency=full (*) Disallow concurrent O_DIRECT writes, cluster inode - lock will be taken to force other nodes drop cache, - therefore full cluster coherency is guaranteed even - for O_DIRECT writes. -coherency=buffered Allow concurrent O_DIRECT writes without EX lock among - nodes, which gains high performance at risk of getting - stale data on other nodes. -journal_async_commit Commit block can be written to disk without waiting - for descriptor blocks. If enabled older kernels cannot - mount the device. This will enable 'journal_checksum' - internally. -- cgit From 7cbb468f0c70878fe64d324790ee049c1881af7c Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:16 +0100 Subject: docs: filesystems: convert omfs.txt to ReST - Add a SPDX header; - Adjust document title; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Bob Copeland Link: https://lore.kernel.org/r/0c125c7c971d81a557ca954992b8d770a9d1e3e8.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/omfs.rst | 112 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/omfs.txt | 106 ---------------------------------- 3 files changed, 113 insertions(+), 106 deletions(-) create mode 100644 Documentation/filesystems/omfs.rst delete mode 100644 Documentation/filesystems/omfs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 3b2b07491c98..fbee77175840 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -78,6 +78,7 @@ Documentation for filesystem implementations. ntfs ocfs2 ocfs2-online-filecheck + omfs overlayfs virtiofs vfat diff --git a/Documentation/filesystems/omfs.rst b/Documentation/filesystems/omfs.rst new file mode 100644 index 000000000000..4c8bb3074169 --- /dev/null +++ b/Documentation/filesystems/omfs.rst @@ -0,0 +1,112 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================ +Optimized MPEG Filesystem (OMFS) +================================ + +Overview +======== + +OMFS is a filesystem created by SonicBlue for use in the ReplayTV DVR +and Rio Karma MP3 player. The filesystem is extent-based, utilizing +block sizes from 2k to 8k, with hash-based directories. This +filesystem driver may be used to read and write disks from these +devices. + +Note, it is not recommended that this FS be used in place of a general +filesystem for your own streaming media device. Native Linux filesystems +will likely perform better. + +More information is available at: + + http://linux-karma.sf.net/ + +Various utilities, including mkomfs and omfsck, are included with +omfsprogs, available at: + + http://bobcopeland.com/karma/ + +Instructions are included in its README. + +Options +======= + +OMFS supports the following mount-time options: + + ============ ======================================== + uid=n make all files owned by specified user + gid=n make all files owned by specified group + umask=xxx set permission umask to xxx + fmask=xxx set umask to xxx for files + dmask=xxx set umask to xxx for directories + ============ ======================================== + +Disk format +=========== + +OMFS discriminates between "sysblocks" and normal data blocks. The sysblock +group consists of super block information, file metadata, directory structures, +and extents. Each sysblock has a header containing CRCs of the entire +sysblock, and may be mirrored in successive blocks on the disk. A sysblock may +have a smaller size than a data block, but since they are both addressed by the +same 64-bit block number, any remaining space in the smaller sysblock is +unused. + +Sysblock header information:: + + struct omfs_header { + __be64 h_self; /* FS block where this is located */ + __be32 h_body_size; /* size of useful data after header */ + __be16 h_crc; /* crc-ccitt of body_size bytes */ + char h_fill1[2]; + u8 h_version; /* version, always 1 */ + char h_type; /* OMFS_INODE_X */ + u8 h_magic; /* OMFS_IMAGIC */ + u8 h_check_xor; /* XOR of header bytes before this */ + __be32 h_fill2; + }; + +Files and directories are both represented by omfs_inode:: + + struct omfs_inode { + struct omfs_header i_head; /* header */ + __be64 i_parent; /* parent containing this inode */ + __be64 i_sibling; /* next inode in hash bucket */ + __be64 i_ctime; /* ctime, in milliseconds */ + char i_fill1[35]; + char i_type; /* OMFS_[DIR,FILE] */ + __be32 i_fill2; + char i_fill3[64]; + char i_name[OMFS_NAMELEN]; /* filename */ + __be64 i_size; /* size of file, in bytes */ + }; + +Directories in OMFS are implemented as a large hash table. Filenames are +hashed then prepended into the bucket list beginning at OMFS_DIR_START. +Lookup requires hashing the filename, then seeking across i_sibling pointers +until a match is found on i_name. Empty buckets are represented by block +pointers with all-1s (~0). + +A file is an omfs_inode structure followed by an extent table beginning at +OMFS_EXTENT_START:: + + struct omfs_extent_entry { + __be64 e_cluster; /* start location of a set of blocks */ + __be64 e_blocks; /* number of blocks after e_cluster */ + }; + + struct omfs_extent { + __be64 e_next; /* next extent table location */ + __be32 e_extent_count; /* total # extents in this table */ + __be32 e_fill; + struct omfs_extent_entry e_entry; /* start of extent entries */ + }; + +Each extent holds the block offset followed by number of blocks allocated to +the extent. The final extent in each table is a terminator with e_cluster +being ~0 and e_blocks being ones'-complement of the total number of blocks +in the table. + +If this table overflows, a continuation inode is written and pointed to by +e_next. These have a header but lack the rest of the inode structure. + diff --git a/Documentation/filesystems/omfs.txt b/Documentation/filesystems/omfs.txt deleted file mode 100644 index 1d0d41ff5c65..000000000000 --- a/Documentation/filesystems/omfs.txt +++ /dev/null @@ -1,106 +0,0 @@ -Optimized MPEG Filesystem (OMFS) - -Overview -======== - -OMFS is a filesystem created by SonicBlue for use in the ReplayTV DVR -and Rio Karma MP3 player. The filesystem is extent-based, utilizing -block sizes from 2k to 8k, with hash-based directories. This -filesystem driver may be used to read and write disks from these -devices. - -Note, it is not recommended that this FS be used in place of a general -filesystem for your own streaming media device. Native Linux filesystems -will likely perform better. - -More information is available at: - - http://linux-karma.sf.net/ - -Various utilities, including mkomfs and omfsck, are included with -omfsprogs, available at: - - http://bobcopeland.com/karma/ - -Instructions are included in its README. - -Options -======= - -OMFS supports the following mount-time options: - - uid=n - make all files owned by specified user - gid=n - make all files owned by specified group - umask=xxx - set permission umask to xxx - fmask=xxx - set umask to xxx for files - dmask=xxx - set umask to xxx for directories - -Disk format -=========== - -OMFS discriminates between "sysblocks" and normal data blocks. The sysblock -group consists of super block information, file metadata, directory structures, -and extents. Each sysblock has a header containing CRCs of the entire -sysblock, and may be mirrored in successive blocks on the disk. A sysblock may -have a smaller size than a data block, but since they are both addressed by the -same 64-bit block number, any remaining space in the smaller sysblock is -unused. - -Sysblock header information: - -struct omfs_header { - __be64 h_self; /* FS block where this is located */ - __be32 h_body_size; /* size of useful data after header */ - __be16 h_crc; /* crc-ccitt of body_size bytes */ - char h_fill1[2]; - u8 h_version; /* version, always 1 */ - char h_type; /* OMFS_INODE_X */ - u8 h_magic; /* OMFS_IMAGIC */ - u8 h_check_xor; /* XOR of header bytes before this */ - __be32 h_fill2; -}; - -Files and directories are both represented by omfs_inode: - -struct omfs_inode { - struct omfs_header i_head; /* header */ - __be64 i_parent; /* parent containing this inode */ - __be64 i_sibling; /* next inode in hash bucket */ - __be64 i_ctime; /* ctime, in milliseconds */ - char i_fill1[35]; - char i_type; /* OMFS_[DIR,FILE] */ - __be32 i_fill2; - char i_fill3[64]; - char i_name[OMFS_NAMELEN]; /* filename */ - __be64 i_size; /* size of file, in bytes */ -}; - -Directories in OMFS are implemented as a large hash table. Filenames are -hashed then prepended into the bucket list beginning at OMFS_DIR_START. -Lookup requires hashing the filename, then seeking across i_sibling pointers -until a match is found on i_name. Empty buckets are represented by block -pointers with all-1s (~0). - -A file is an omfs_inode structure followed by an extent table beginning at -OMFS_EXTENT_START: - -struct omfs_extent_entry { - __be64 e_cluster; /* start location of a set of blocks */ - __be64 e_blocks; /* number of blocks after e_cluster */ -}; - -struct omfs_extent { - __be64 e_next; /* next extent table location */ - __be32 e_extent_count; /* total # extents in this table */ - __be32 e_fill; - struct omfs_extent_entry e_entry; /* start of extent entries */ -}; - -Each extent holds the block offset followed by number of blocks allocated to -the extent. The final extent in each table is a terminator with e_cluster -being ~0 and e_blocks being ones'-complement of the total number of blocks -in the table. - -If this table overflows, a continuation inode is written and pointed to by -e_next. These have a header but lack the rest of the inode structure. - -- cgit From 18ccb2233fc5f7c27b5be17f5b6585c2fa62d919 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:17 +0100 Subject: docs: filesystems: convert orangefs.txt to ReST - Add a SPDX header; - Adjust document and section titles; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/6f438eeff5b029d229197a602bd9b74004fe9b63.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/orangefs.rst | 554 +++++++++++++++++++++++++++++++++ Documentation/filesystems/orangefs.txt | 529 ------------------------------- 3 files changed, 555 insertions(+), 529 deletions(-) create mode 100644 Documentation/filesystems/orangefs.rst delete mode 100644 Documentation/filesystems/orangefs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index fbee77175840..fed53f831192 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -79,6 +79,7 @@ Documentation for filesystem implementations. ocfs2 ocfs2-online-filecheck omfs + orangefs overlayfs virtiofs vfat diff --git a/Documentation/filesystems/orangefs.rst b/Documentation/filesystems/orangefs.rst new file mode 100644 index 000000000000..7d6d4cad73c4 --- /dev/null +++ b/Documentation/filesystems/orangefs.rst @@ -0,0 +1,554 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======== +ORANGEFS +======== + +OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal +for large storage problems faced by HPC, BigData, Streaming Video, +Genomics, Bioinformatics. + +Orangefs, originally called PVFS, was first developed in 1993 by +Walt Ligon and Eric Blumer as a parallel file system for Parallel +Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns +of parallel programs. + +Orangefs features include: + + * Distributes file data among multiple file servers + * Supports simultaneous access by multiple clients + * Stores file data and metadata on servers using local file system + and access methods + * Userspace implementation is easy to install and maintain + * Direct MPI support + * Stateless + + +Mailing List Archives +===================== + +http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/ + + +Mailing List Submissions +======================== + +devel@lists.orangefs.org + + +Documentation +============= + +http://www.orangefs.org/documentation/ + + +Userspace Filesystem Source +=========================== + +http://www.orangefs.org/download + +Orangefs versions prior to 2.9.3 would not be compatible with the +upstream version of the kernel client. + + +Running ORANGEFS On a Single Server +=================================== + +OrangeFS is usually run in large installations with multiple servers and +clients, but a complete filesystem can be run on a single machine for +development and testing. + +On Fedora, install orangefs and orangefs-server:: + + dnf -y install orangefs orangefs-server + +There is an example server configuration file in +/etc/orangefs/orangefs.conf. Change localhost to your hostname if +necessary. + +To generate a filesystem to run xfstests against, see below. + +There is an example client configuration file in /etc/pvfs2tab. It is a +single line. Uncomment it and change the hostname if necessary. This +controls clients which use libpvfs2. This does not control the +pvfs2-client-core. + +Create the filesystem:: + + pvfs2-server -f /etc/orangefs/orangefs.conf + +Start the server:: + + systemctl start orangefs-server + +Test the server:: + + pvfs2-ping -m /pvfsmnt + +Start the client. The module must be compiled in or loaded before this +point:: + + systemctl start orangefs-client + +Mount the filesystem:: + + mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt + + +Building ORANGEFS on a Single Server +==================================== + +Where OrangeFS cannot be installed from distribution packages, it may be +built from source. + +You can omit --prefix if you don't care that things are sprinkled around +in /usr/local. As of version 2.9.6, OrangeFS uses Berkeley DB by +default, we will probably be changing the default to LMDB soon. + +:: + + ./configure --prefix=/opt/ofs --with-db-backend=lmdb + + make + + make install + +Create an orangefs config file:: + + /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf + +Create an /etc/pvfs2tab file:: + + echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \ + /etc/pvfs2tab + +Create the mount point you specified in the tab file if needed:: + + mkdir /pvfsmnt + +Bootstrap the server:: + + /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf + +Start the server:: + + /opt/osf/sbin/pvfs2-server /etc/pvfs2.conf + +Now the server should be running. Pvfs2-ls is a simple +test to verify that the server is running:: + + /opt/ofs/bin/pvfs2-ls /pvfsmnt + +If stuff seems to be working, load the kernel module and +turn on the client core:: + + /opt/ofs/sbin/pvfs2-client -p /opt/osf/sbin/pvfs2-client-core + +Mount your filesystem:: + + mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt + + +Running xfstests +================ + +It is useful to use a scratch filesystem with xfstests. This can be +done with only one server. + +Make a second copy of the FileSystem section in the server configuration +file, which is /etc/orangefs/orangefs.conf. Change the Name to scratch. +Change the ID to something other than the ID of the first FileSystem +section (2 is usually a good choice). + +Then there are two FileSystem sections: orangefs and scratch. + +This change should be made before creating the filesystem. + +:: + + pvfs2-server -f /etc/orangefs/orangefs.conf + +To run xfstests, create /etc/xfsqa.config:: + + TEST_DIR=/orangefs + TEST_DEV=tcp://localhost:3334/orangefs + SCRATCH_MNT=/scratch + SCRATCH_DEV=tcp://localhost:3334/scratch + +Then xfstests can be run:: + + ./check -pvfs2 + + +Options +======= + +The following mount options are accepted: + + acl + Allow the use of Access Control Lists on files and directories. + + intr + Some operations between the kernel client and the user space + filesystem can be interruptible, such as changes in debug levels + and the setting of tunable parameters. + + local_lock + Enable posix locking from the perspective of "this" kernel. The + default file_operations lock action is to return ENOSYS. Posix + locking kicks in if the filesystem is mounted with -o local_lock. + Distributed locking is being worked on for the future. + + +Debugging +========= + +If you want the debug (GOSSIP) statements in a particular +source file (inode.c for example) go to syslog:: + + echo inode > /sys/kernel/debug/orangefs/kernel-debug + +No debugging (the default):: + + echo none > /sys/kernel/debug/orangefs/kernel-debug + +Debugging from several source files:: + + echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug + +All debugging:: + + echo all > /sys/kernel/debug/orangefs/kernel-debug + +Get a list of all debugging keywords:: + + cat /sys/kernel/debug/orangefs/debug-help + + +Protocol between Kernel Module and Userspace +============================================ + +Orangefs is a user space filesystem and an associated kernel module. +We'll just refer to the user space part of Orangefs as "userspace" +from here on out. Orangefs descends from PVFS, and userspace code +still uses PVFS for function and variable names. Userspace typedefs +many of the important structures. Function and variable names in +the kernel module have been transitioned to "orangefs", and The Linux +Coding Style avoids typedefs, so kernel module structures that +correspond to userspace structures are not typedefed. + +The kernel module implements a pseudo device that userspace +can read from and write to. Userspace can also manipulate the +kernel module through the pseudo device with ioctl. + +The Bufmap +---------- + +At startup userspace allocates two page-size-aligned (posix_memalign) +mlocked memory buffers, one is used for IO and one is used for readdir +operations. The IO buffer is 41943040 bytes and the readdir buffer is +4194304 bytes. Each buffer contains logical chunks, or partitions, and +a pointer to each buffer is added to its own PVFS_dev_map_desc structure +which also describes its total size, as well as the size and number of +the partitions. + +A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a +mapping routine in the kernel module with an ioctl. The structure is +copied from user space to kernel space with copy_from_user and is used +to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which +then contains: + + * refcnt + - a reference counter + * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's + partition size, which represents the filesystem's block size and + is used for s_blocksize in super blocks. + * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of + partitions in the IO buffer. + * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks. + * total_size - the total size of the IO buffer. + * page_count - the number of 4096 byte pages in the IO buffer. + * page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes + of kcalloced memory. This memory is used as an array of pointers + to each of the pages in the IO buffer through a call to get_user_pages. + * desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))`` + bytes of kcalloced memory. This memory is further intialized: + + user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc + structure. user_desc->ptr points to the IO buffer. + + :: + + pages_per_desc = bufmap->desc_size / PAGE_SIZE + offset = 0 + + bufmap->desc_array[0].page_array = &bufmap->page_array[offset] + bufmap->desc_array[0].array_count = pages_per_desc = 1024 + bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096) + offset += 1024 + . + . + . + bufmap->desc_array[9].page_array = &bufmap->page_array[offset] + bufmap->desc_array[9].array_count = pages_per_desc = 1024 + bufmap->desc_array[9].uaddr = (user_desc->ptr) + + (9 * 1024 * 4096) + offset += 1024 + + * buffer_index_array - a desc_count sized array of ints, used to + indicate which of the IO buffer's partitions are available to use. + * buffer_index_lock - a spinlock to protect buffer_index_array during update. + * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element + int array used to indicate which of the readdir buffer's partitions are + available to use. + * readdir_index_lock - a spinlock to protect readdir_index_array during + update. + +Operations +---------- + +The kernel module builds an "op" (struct orangefs_kernel_op_s) when it +needs to communicate with userspace. Part of the op contains the "upcall" +which expresses the request to userspace. Part of the op eventually +contains the "downcall" which expresses the results of the request. + +The slab allocator is used to keep a cache of op structures handy. + +At init time the kernel module defines and initializes a request list +and an in_progress hash table to keep track of all the ops that are +in flight at any given time. + +Ops are stateful: + + * unknown + - op was just initialized + * waiting + - op is on request_list (upward bound) + * inprogr + - op is in progress (waiting for downcall) + * serviced + - op has matching downcall; ok + * purged + - op has to start a timer since client-core + exited uncleanly before servicing op + * given up + - submitter has given up waiting for it + +When some arbitrary userspace program needs to perform a +filesystem operation on Orangefs (readdir, I/O, create, whatever) +an op structure is initialized and tagged with a distinguishing ID +number. The upcall part of the op is filled out, and the op is +passed to the "service_operation" function. + +Service_operation changes the op's state to "waiting", puts +it on the request list, and signals the Orangefs file_operations.poll +function through a wait queue. Userspace is polling the pseudo-device +and thus becomes aware of the upcall request that needs to be read. + +When the Orangefs file_operations.read function is triggered, the +request list is searched for an op that seems ready-to-process. +The op is removed from the request list. The tag from the op and +the filled-out upcall struct are copy_to_user'ed back to userspace. + +If any of these (and some additional protocol) copy_to_users fail, +the op's state is set to "waiting" and the op is added back to +the request list. Otherwise, the op's state is changed to "in progress", +and the op is hashed on its tag and put onto the end of a list in the +in_progress hash table at the index the tag hashed to. + +When userspace has assembled the response to the upcall, it +writes the response, which includes the distinguishing tag, back to +the pseudo device in a series of io_vecs. This triggers the Orangefs +file_operations.write_iter function to find the op with the associated +tag and remove it from the in_progress hash table. As long as the op's +state is not "canceled" or "given up", its state is set to "serviced". +The file_operations.write_iter function returns to the waiting vfs, +and back to service_operation through wait_for_matching_downcall. + +Service operation returns to its caller with the op's downcall +part (the response to the upcall) filled out. + +The "client-core" is the bridge between the kernel module and +userspace. The client-core is a daemon. The client-core has an +associated watchdog daemon. If the client-core is ever signaled +to die, the watchdog daemon restarts the client-core. Even though +the client-core is restarted "right away", there is a period of +time during such an event that the client-core is dead. A dead client-core +can't be triggered by the Orangefs file_operations.poll function. +Ops that pass through service_operation during a "dead spell" can timeout +on the wait queue and one attempt is made to recycle them. Obviously, +if the client-core stays dead too long, the arbitrary userspace processes +trying to use Orangefs will be negatively affected. Waiting ops +that can't be serviced will be removed from the request list and +have their states set to "given up". In-progress ops that can't +be serviced will be removed from the in_progress hash table and +have their states set to "given up". + +Readdir and I/O ops are atypical with respect to their payloads. + + - readdir ops use the smaller of the two pre-allocated pre-partitioned + memory buffers. The readdir buffer is only available to userspace. + The kernel module obtains an index to a free partition before launching + a readdir op. Userspace deposits the results into the indexed partition + and then writes them to back to the pvfs device. + + - io (read and write) ops use the larger of the two pre-allocated + pre-partitioned memory buffers. The IO buffer is accessible from + both userspace and the kernel module. The kernel module obtains an + index to a free partition before launching an io op. The kernel module + deposits write data into the indexed partition, to be consumed + directly by userspace. Userspace deposits the results of read + requests into the indexed partition, to be consumed directly + by the kernel module. + +Responses to kernel requests are all packaged in pvfs2_downcall_t +structs. Besides a few other members, pvfs2_downcall_t contains a +union of structs, each of which is associated with a particular +response type. + +The several members outside of the union are: + + ``int32_t type`` + - type of operation. + ``int32_t status`` + - return code for the operation. + ``int64_t trailer_size`` + - 0 unless readdir operation. + ``char *trailer_buf`` + - initialized to NULL, used during readdir operations. + +The appropriate member inside the union is filled out for any +particular response. + + PVFS2_VFS_OP_FILE_IO + fill a pvfs2_io_response_t + + PVFS2_VFS_OP_LOOKUP + fill a PVFS_object_kref + + PVFS2_VFS_OP_CREATE + fill a PVFS_object_kref + + PVFS2_VFS_OP_SYMLINK + fill a PVFS_object_kref + + PVFS2_VFS_OP_GETATTR + fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need) + fill in a string with the link target when the object is a symlink. + + PVFS2_VFS_OP_MKDIR + fill a PVFS_object_kref + + PVFS2_VFS_OP_STATFS + fill a pvfs2_statfs_response_t with useless info . It is hard for + us to know, in a timely fashion, these statistics about our + distributed network filesystem. + + PVFS2_VFS_OP_FS_MOUNT + fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref + except its members are in a different order and "__pad1" is replaced + with "id". + + PVFS2_VFS_OP_GETXATTR + fill a pvfs2_getxattr_response_t + + PVFS2_VFS_OP_LISTXATTR + fill a pvfs2_listxattr_response_t + + PVFS2_VFS_OP_PARAM + fill a pvfs2_param_response_t + + PVFS2_VFS_OP_PERF_COUNT + fill a pvfs2_perf_count_response_t + + PVFS2_VFS_OP_FSKEY + file a pvfs2_fs_key_response_t + + PVFS2_VFS_OP_READDIR + jamb everything needed to represent a pvfs2_readdir_response_t into + the readdir buffer descriptor specified in the upcall. + +Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests +made by the kernel side. + +A buffer_list containing: + + - a pointer to the prepared response to the request from the + kernel (struct pvfs2_downcall_t). + - and also, in the case of a readdir request, a pointer to a + buffer containing descriptors for the objects in the target + directory. + +... is sent to the function (PINT_dev_write_list) which performs +the writev. + +PINT_dev_write_list has a local iovec array: struct iovec io_array[10]; + +The first four elements of io_array are initialized like this for all +responses:: + + io_array[0].iov_base = address of local variable "proto_ver" (int32_t) + io_array[0].iov_len = sizeof(int32_t) + + io_array[1].iov_base = address of global variable "pdev_magic" (int32_t) + io_array[1].iov_len = sizeof(int32_t) + + io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t) + io_array[2].iov_len = sizeof(int64_t) + + io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t) + of global variable vfs_request (vfs_request_t) + io_array[3].iov_len = sizeof(pvfs2_downcall_t) + +Readdir responses initialize the fifth element io_array like this:: + + io_array[4].iov_base = contents of member trailer_buf (char *) + from out_downcall member of global variable + vfs_request + io_array[4].iov_len = contents of member trailer_size (PVFS_size) + from out_downcall member of global variable + vfs_request + +Orangefs exploits the dcache in order to avoid sending redundant +requests to userspace. We keep object inode attributes up-to-date with +orangefs_inode_getattr. Orangefs_inode_getattr uses two arguments to +help it decide whether or not to update an inode: "new" and "bypass". +Orangefs keeps private data in an object's inode that includes a short +timeout value, getattr_time, which allows any iteration of +orangefs_inode_getattr to know how long it has been since the inode was +updated. When the object is not new (new == 0) and the bypass flag is not +set (bypass == 0) orangefs_inode_getattr returns without updating the inode +if getattr_time has not timed out. Getattr_time is updated each time the +inode is updated. + +Creation of a new object (file, dir, sym-link) includes the evaluation of +its pathname, resulting in a negative directory entry for the object. +A new inode is allocated and associated with the dentry, turning it from +a negative dentry into a "productive full member of society". Orangefs +obtains the new inode from Linux with new_inode() and associates +the inode with the dentry by sending the pair back to Linux with +d_instantiate(). + +The evaluation of a pathname for an object resolves to its corresponding +dentry. If there is no corresponding dentry, one is created for it in +the dcache. Whenever a dentry is modified or verified Orangefs stores a +short timeout value in the dentry's d_time, and the dentry will be trusted +for that amount of time. Orangefs is a network filesystem, and objects +can potentially change out-of-band with any particular Orangefs kernel module +instance, so trusting a dentry is risky. The alternative to trusting +dentries is to always obtain the needed information from userspace - at +least a trip to the client-core, maybe to the servers. Obtaining information +from a dentry is cheap, obtaining it from userspace is relatively expensive, +hence the motivation to use the dentry when possible. + +The timeout values d_time and getattr_time are jiffy based, and the +code is designed to avoid the jiffy-wrap problem:: + + "In general, if the clock may have wrapped around more than once, there + is no way to tell how much time has elapsed. However, if the times t1 + and t2 are known to be fairly close, we can reliably compute the + difference in a way that takes into account the possibility that the + clock may have wrapped between times." + +from course notes by instructor Andy Wang + diff --git a/Documentation/filesystems/orangefs.txt b/Documentation/filesystems/orangefs.txt deleted file mode 100644 index f4ba94950e3f..000000000000 --- a/Documentation/filesystems/orangefs.txt +++ /dev/null @@ -1,529 +0,0 @@ -ORANGEFS -======== - -OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal -for large storage problems faced by HPC, BigData, Streaming Video, -Genomics, Bioinformatics. - -Orangefs, originally called PVFS, was first developed in 1993 by -Walt Ligon and Eric Blumer as a parallel file system for Parallel -Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns -of parallel programs. - -Orangefs features include: - - * Distributes file data among multiple file servers - * Supports simultaneous access by multiple clients - * Stores file data and metadata on servers using local file system - and access methods - * Userspace implementation is easy to install and maintain - * Direct MPI support - * Stateless - - -MAILING LIST ARCHIVES -===================== - -http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/ - - -MAILING LIST SUBMISSIONS -======================== - -devel@lists.orangefs.org - - -DOCUMENTATION -============= - -http://www.orangefs.org/documentation/ - - -USERSPACE FILESYSTEM SOURCE -=========================== - -http://www.orangefs.org/download - -Orangefs versions prior to 2.9.3 would not be compatible with the -upstream version of the kernel client. - - -RUNNING ORANGEFS ON A SINGLE SERVER -=================================== - -OrangeFS is usually run in large installations with multiple servers and -clients, but a complete filesystem can be run on a single machine for -development and testing. - -On Fedora, install orangefs and orangefs-server. - -dnf -y install orangefs orangefs-server - -There is an example server configuration file in -/etc/orangefs/orangefs.conf. Change localhost to your hostname if -necessary. - -To generate a filesystem to run xfstests against, see below. - -There is an example client configuration file in /etc/pvfs2tab. It is a -single line. Uncomment it and change the hostname if necessary. This -controls clients which use libpvfs2. This does not control the -pvfs2-client-core. - -Create the filesystem. - -pvfs2-server -f /etc/orangefs/orangefs.conf - -Start the server. - -systemctl start orangefs-server - -Test the server. - -pvfs2-ping -m /pvfsmnt - -Start the client. The module must be compiled in or loaded before this -point. - -systemctl start orangefs-client - -Mount the filesystem. - -mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt - - -BUILDING ORANGEFS ON A SINGLE SERVER -==================================== - -Where OrangeFS cannot be installed from distribution packages, it may be -built from source. - -You can omit --prefix if you don't care that things are sprinkled around -in /usr/local. As of version 2.9.6, OrangeFS uses Berkeley DB by -default, we will probably be changing the default to LMDB soon. - -./configure --prefix=/opt/ofs --with-db-backend=lmdb - -make - -make install - -Create an orangefs config file. - -/opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf - -Create an /etc/pvfs2tab file. - -echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \ - /etc/pvfs2tab - -Create the mount point you specified in the tab file if needed. - -mkdir /pvfsmnt - -Bootstrap the server. - -/opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf - -Start the server. - -/opt/osf/sbin/pvfs2-server /etc/pvfs2.conf - -Now the server should be running. Pvfs2-ls is a simple -test to verify that the server is running. - -/opt/ofs/bin/pvfs2-ls /pvfsmnt - -If stuff seems to be working, load the kernel module and -turn on the client core. - -/opt/ofs/sbin/pvfs2-client -p /opt/osf/sbin/pvfs2-client-core - -Mount your filesystem. - -mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt - - -RUNNING XFSTESTS -================ - -It is useful to use a scratch filesystem with xfstests. This can be -done with only one server. - -Make a second copy of the FileSystem section in the server configuration -file, which is /etc/orangefs/orangefs.conf. Change the Name to scratch. -Change the ID to something other than the ID of the first FileSystem -section (2 is usually a good choice). - -Then there are two FileSystem sections: orangefs and scratch. - -This change should be made before creating the filesystem. - -pvfs2-server -f /etc/orangefs/orangefs.conf - -To run xfstests, create /etc/xfsqa.config. - -TEST_DIR=/orangefs -TEST_DEV=tcp://localhost:3334/orangefs -SCRATCH_MNT=/scratch -SCRATCH_DEV=tcp://localhost:3334/scratch - -Then xfstests can be run - -./check -pvfs2 - - -OPTIONS -======= - -The following mount options are accepted: - - acl - Allow the use of Access Control Lists on files and directories. - - intr - Some operations between the kernel client and the user space - filesystem can be interruptible, such as changes in debug levels - and the setting of tunable parameters. - - local_lock - Enable posix locking from the perspective of "this" kernel. The - default file_operations lock action is to return ENOSYS. Posix - locking kicks in if the filesystem is mounted with -o local_lock. - Distributed locking is being worked on for the future. - - -DEBUGGING -========= - -If you want the debug (GOSSIP) statements in a particular -source file (inode.c for example) go to syslog: - - echo inode > /sys/kernel/debug/orangefs/kernel-debug - -No debugging (the default): - - echo none > /sys/kernel/debug/orangefs/kernel-debug - -Debugging from several source files: - - echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug - -All debugging: - - echo all > /sys/kernel/debug/orangefs/kernel-debug - -Get a list of all debugging keywords: - - cat /sys/kernel/debug/orangefs/debug-help - - -PROTOCOL BETWEEN KERNEL MODULE AND USERSPACE -============================================ - -Orangefs is a user space filesystem and an associated kernel module. -We'll just refer to the user space part of Orangefs as "userspace" -from here on out. Orangefs descends from PVFS, and userspace code -still uses PVFS for function and variable names. Userspace typedefs -many of the important structures. Function and variable names in -the kernel module have been transitioned to "orangefs", and The Linux -Coding Style avoids typedefs, so kernel module structures that -correspond to userspace structures are not typedefed. - -The kernel module implements a pseudo device that userspace -can read from and write to. Userspace can also manipulate the -kernel module through the pseudo device with ioctl. - -THE BUFMAP: - -At startup userspace allocates two page-size-aligned (posix_memalign) -mlocked memory buffers, one is used for IO and one is used for readdir -operations. The IO buffer is 41943040 bytes and the readdir buffer is -4194304 bytes. Each buffer contains logical chunks, or partitions, and -a pointer to each buffer is added to its own PVFS_dev_map_desc structure -which also describes its total size, as well as the size and number of -the partitions. - -A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a -mapping routine in the kernel module with an ioctl. The structure is -copied from user space to kernel space with copy_from_user and is used -to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which -then contains: - - * refcnt - a reference counter - * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's - partition size, which represents the filesystem's block size and - is used for s_blocksize in super blocks. - * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of - partitions in the IO buffer. - * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks. - * total_size - the total size of the IO buffer. - * page_count - the number of 4096 byte pages in the IO buffer. - * page_array - a pointer to page_count * (sizeof(struct page*)) bytes - of kcalloced memory. This memory is used as an array of pointers - to each of the pages in the IO buffer through a call to get_user_pages. - * desc_array - a pointer to desc_count * (sizeof(struct orangefs_bufmap_desc)) - bytes of kcalloced memory. This memory is further intialized: - - user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc - structure. user_desc->ptr points to the IO buffer. - - pages_per_desc = bufmap->desc_size / PAGE_SIZE - offset = 0 - - bufmap->desc_array[0].page_array = &bufmap->page_array[offset] - bufmap->desc_array[0].array_count = pages_per_desc = 1024 - bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096) - offset += 1024 - . - . - . - bufmap->desc_array[9].page_array = &bufmap->page_array[offset] - bufmap->desc_array[9].array_count = pages_per_desc = 1024 - bufmap->desc_array[9].uaddr = (user_desc->ptr) + - (9 * 1024 * 4096) - offset += 1024 - - * buffer_index_array - a desc_count sized array of ints, used to - indicate which of the IO buffer's partitions are available to use. - * buffer_index_lock - a spinlock to protect buffer_index_array during update. - * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element - int array used to indicate which of the readdir buffer's partitions are - available to use. - * readdir_index_lock - a spinlock to protect readdir_index_array during - update. - -OPERATIONS: - -The kernel module builds an "op" (struct orangefs_kernel_op_s) when it -needs to communicate with userspace. Part of the op contains the "upcall" -which expresses the request to userspace. Part of the op eventually -contains the "downcall" which expresses the results of the request. - -The slab allocator is used to keep a cache of op structures handy. - -At init time the kernel module defines and initializes a request list -and an in_progress hash table to keep track of all the ops that are -in flight at any given time. - -Ops are stateful: - - * unknown - op was just initialized - * waiting - op is on request_list (upward bound) - * inprogr - op is in progress (waiting for downcall) - * serviced - op has matching downcall; ok - * purged - op has to start a timer since client-core - exited uncleanly before servicing op - * given up - submitter has given up waiting for it - -When some arbitrary userspace program needs to perform a -filesystem operation on Orangefs (readdir, I/O, create, whatever) -an op structure is initialized and tagged with a distinguishing ID -number. The upcall part of the op is filled out, and the op is -passed to the "service_operation" function. - -Service_operation changes the op's state to "waiting", puts -it on the request list, and signals the Orangefs file_operations.poll -function through a wait queue. Userspace is polling the pseudo-device -and thus becomes aware of the upcall request that needs to be read. - -When the Orangefs file_operations.read function is triggered, the -request list is searched for an op that seems ready-to-process. -The op is removed from the request list. The tag from the op and -the filled-out upcall struct are copy_to_user'ed back to userspace. - -If any of these (and some additional protocol) copy_to_users fail, -the op's state is set to "waiting" and the op is added back to -the request list. Otherwise, the op's state is changed to "in progress", -and the op is hashed on its tag and put onto the end of a list in the -in_progress hash table at the index the tag hashed to. - -When userspace has assembled the response to the upcall, it -writes the response, which includes the distinguishing tag, back to -the pseudo device in a series of io_vecs. This triggers the Orangefs -file_operations.write_iter function to find the op with the associated -tag and remove it from the in_progress hash table. As long as the op's -state is not "canceled" or "given up", its state is set to "serviced". -The file_operations.write_iter function returns to the waiting vfs, -and back to service_operation through wait_for_matching_downcall. - -Service operation returns to its caller with the op's downcall -part (the response to the upcall) filled out. - -The "client-core" is the bridge between the kernel module and -userspace. The client-core is a daemon. The client-core has an -associated watchdog daemon. If the client-core is ever signaled -to die, the watchdog daemon restarts the client-core. Even though -the client-core is restarted "right away", there is a period of -time during such an event that the client-core is dead. A dead client-core -can't be triggered by the Orangefs file_operations.poll function. -Ops that pass through service_operation during a "dead spell" can timeout -on the wait queue and one attempt is made to recycle them. Obviously, -if the client-core stays dead too long, the arbitrary userspace processes -trying to use Orangefs will be negatively affected. Waiting ops -that can't be serviced will be removed from the request list and -have their states set to "given up". In-progress ops that can't -be serviced will be removed from the in_progress hash table and -have their states set to "given up". - -Readdir and I/O ops are atypical with respect to their payloads. - - - readdir ops use the smaller of the two pre-allocated pre-partitioned - memory buffers. The readdir buffer is only available to userspace. - The kernel module obtains an index to a free partition before launching - a readdir op. Userspace deposits the results into the indexed partition - and then writes them to back to the pvfs device. - - - io (read and write) ops use the larger of the two pre-allocated - pre-partitioned memory buffers. The IO buffer is accessible from - both userspace and the kernel module. The kernel module obtains an - index to a free partition before launching an io op. The kernel module - deposits write data into the indexed partition, to be consumed - directly by userspace. Userspace deposits the results of read - requests into the indexed partition, to be consumed directly - by the kernel module. - -Responses to kernel requests are all packaged in pvfs2_downcall_t -structs. Besides a few other members, pvfs2_downcall_t contains a -union of structs, each of which is associated with a particular -response type. - -The several members outside of the union are: - - int32_t type - type of operation. - - int32_t status - return code for the operation. - - int64_t trailer_size - 0 unless readdir operation. - - char *trailer_buf - initialized to NULL, used during readdir operations. - -The appropriate member inside the union is filled out for any -particular response. - - PVFS2_VFS_OP_FILE_IO - fill a pvfs2_io_response_t - - PVFS2_VFS_OP_LOOKUP - fill a PVFS_object_kref - - PVFS2_VFS_OP_CREATE - fill a PVFS_object_kref - - PVFS2_VFS_OP_SYMLINK - fill a PVFS_object_kref - - PVFS2_VFS_OP_GETATTR - fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need) - fill in a string with the link target when the object is a symlink. - - PVFS2_VFS_OP_MKDIR - fill a PVFS_object_kref - - PVFS2_VFS_OP_STATFS - fill a pvfs2_statfs_response_t with useless info . It is hard for - us to know, in a timely fashion, these statistics about our - distributed network filesystem. - - PVFS2_VFS_OP_FS_MOUNT - fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref - except its members are in a different order and "__pad1" is replaced - with "id". - - PVFS2_VFS_OP_GETXATTR - fill a pvfs2_getxattr_response_t - - PVFS2_VFS_OP_LISTXATTR - fill a pvfs2_listxattr_response_t - - PVFS2_VFS_OP_PARAM - fill a pvfs2_param_response_t - - PVFS2_VFS_OP_PERF_COUNT - fill a pvfs2_perf_count_response_t - - PVFS2_VFS_OP_FSKEY - file a pvfs2_fs_key_response_t - - PVFS2_VFS_OP_READDIR - jamb everything needed to represent a pvfs2_readdir_response_t into - the readdir buffer descriptor specified in the upcall. - -Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests -made by the kernel side. - -A buffer_list containing: - - a pointer to the prepared response to the request from the - kernel (struct pvfs2_downcall_t). - - and also, in the case of a readdir request, a pointer to a - buffer containing descriptors for the objects in the target - directory. -... is sent to the function (PINT_dev_write_list) which performs -the writev. - -PINT_dev_write_list has a local iovec array: struct iovec io_array[10]; - -The first four elements of io_array are initialized like this for all -responses: - - io_array[0].iov_base = address of local variable "proto_ver" (int32_t) - io_array[0].iov_len = sizeof(int32_t) - - io_array[1].iov_base = address of global variable "pdev_magic" (int32_t) - io_array[1].iov_len = sizeof(int32_t) - - io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t) - io_array[2].iov_len = sizeof(int64_t) - - io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t) - of global variable vfs_request (vfs_request_t) - io_array[3].iov_len = sizeof(pvfs2_downcall_t) - -Readdir responses initialize the fifth element io_array like this: - - io_array[4].iov_base = contents of member trailer_buf (char *) - from out_downcall member of global variable - vfs_request - io_array[4].iov_len = contents of member trailer_size (PVFS_size) - from out_downcall member of global variable - vfs_request - -Orangefs exploits the dcache in order to avoid sending redundant -requests to userspace. We keep object inode attributes up-to-date with -orangefs_inode_getattr. Orangefs_inode_getattr uses two arguments to -help it decide whether or not to update an inode: "new" and "bypass". -Orangefs keeps private data in an object's inode that includes a short -timeout value, getattr_time, which allows any iteration of -orangefs_inode_getattr to know how long it has been since the inode was -updated. When the object is not new (new == 0) and the bypass flag is not -set (bypass == 0) orangefs_inode_getattr returns without updating the inode -if getattr_time has not timed out. Getattr_time is updated each time the -inode is updated. - -Creation of a new object (file, dir, sym-link) includes the evaluation of -its pathname, resulting in a negative directory entry for the object. -A new inode is allocated and associated with the dentry, turning it from -a negative dentry into a "productive full member of society". Orangefs -obtains the new inode from Linux with new_inode() and associates -the inode with the dentry by sending the pair back to Linux with -d_instantiate(). - -The evaluation of a pathname for an object resolves to its corresponding -dentry. If there is no corresponding dentry, one is created for it in -the dcache. Whenever a dentry is modified or verified Orangefs stores a -short timeout value in the dentry's d_time, and the dentry will be trusted -for that amount of time. Orangefs is a network filesystem, and objects -can potentially change out-of-band with any particular Orangefs kernel module -instance, so trusting a dentry is risky. The alternative to trusting -dentries is to always obtain the needed information from userspace - at -least a trip to the client-core, maybe to the servers. Obtaining information -from a dentry is cheap, obtaining it from userspace is relatively expensive, -hence the motivation to use the dentry when possible. - -The timeout values d_time and getattr_time are jiffy based, and the -code is designed to avoid the jiffy-wrap problem: - -"In general, if the clock may have wrapped around more than once, there -is no way to tell how much time has elapsed. However, if the times t1 -and t2 are known to be fairly close, we can reliably compute the -difference in a way that takes into account the possibility that the -clock may have wrapped between times." - - from course notes by instructor Andy Wang - -- cgit From c33e97efa9d9de538e5f0afe6cb07f83afcd5b68 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:18 +0100 Subject: docs: filesystems: convert proc.txt to ReST This document has a nice format! Unfortunately, not exactly ReST. So, several adjustments were required: - Add a SPDX header; - Adjust document and section titles; - Whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add table captions; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/1d113d860188de416ca3b0b97371dc2195433d5b.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/proc.rst | 2169 +++++++++++++++++++++++++++++++++++ Documentation/filesystems/proc.txt | 2047 --------------------------------- 3 files changed, 2170 insertions(+), 2047 deletions(-) create mode 100644 Documentation/filesystems/proc.rst delete mode 100644 Documentation/filesystems/proc.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index fed53f831192..671906e2fee6 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -81,5 +81,6 @@ Documentation for filesystem implementations. omfs orangefs overlayfs + proc virtiofs vfat diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst new file mode 100644 index 000000000000..38b606991065 --- /dev/null +++ b/Documentation/filesystems/proc.rst @@ -0,0 +1,2169 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +The /proc Filesystem +==================== + +===================== ======================================= ================ +/proc/sys Terrehon Bowden , October 7 1999 + Bodo Bauer +2.4.x update Jorge Nerin November 14 2000 +move /proc/sys Shen Feng April 1 2009 +fixes/update part 1.1 Stefani Seibold June 9 2009 +===================== ======================================= ================ + + + +.. Table of Contents + + 0 Preface + 0.1 Introduction/Credits + 0.2 Legal Stuff + + 1 Collecting System Information + 1.1 Process-Specific Subdirectories + 1.2 Kernel data + 1.3 IDE devices in /proc/ide + 1.4 Networking info in /proc/net + 1.5 SCSI info + 1.6 Parallel port info in /proc/parport + 1.7 TTY info in /proc/tty + 1.8 Miscellaneous kernel statistics in /proc/stat + 1.9 Ext4 file system parameters + + 2 Modifying System Parameters + + 3 Per-Process Parameters + 3.1 /proc//oom_adj & /proc//oom_score_adj - Adjust the oom-killer + score + 3.2 /proc//oom_score - Display current oom-killer score + 3.3 /proc//io - Display the IO accounting fields + 3.4 /proc//coredump_filter - Core dump filtering settings + 3.5 /proc//mountinfo - Information about mounts + 3.6 /proc//comm & /proc//task//comm + 3.7 /proc//task//children - Information about task children + 3.8 /proc//fdinfo/ - Information about opened file + 3.9 /proc//map_files - Information about memory mapped files + 3.10 /proc//timerslack_ns - Task timerslack value + 3.11 /proc//patch_state - Livepatch patch operation state + 3.12 /proc//arch_status - Task architecture specific information + + 4 Configuring procfs + 4.1 Mount options + +Preface +======= + +0.1 Introduction/Credits +------------------------ + +This documentation is part of a soon (or so we hope) to be released book on +the SuSE Linux distribution. As there is no complete documentation for the +/proc file system and we've used many freely available sources to write these +chapters, it seems only fair to give the work back to the Linux community. +This work is based on the 2.2.* kernel version and the upcoming 2.4.*. I'm +afraid it's still far from complete, but we hope it will be useful. As far as +we know, it is the first 'all-in-one' document about the /proc file system. It +is focused on the Intel x86 hardware, so if you are looking for PPC, ARM, +SPARC, AXP, etc., features, you probably won't find what you are looking for. +It also only covers IPv4 networking, not IPv6 nor other protocols - sorry. But +additions and patches are welcome and will be added to this document if you +mail them to Bodo. + +We'd like to thank Alan Cox, Rik van Riel, and Alexey Kuznetsov and a lot of +other people for help compiling this documentation. We'd also like to extend a +special thank you to Andi Kleen for documentation, which we relied on heavily +to create this document, as well as the additional information he provided. +Thanks to everybody else who contributed source or docs to the Linux kernel +and helped create a great piece of software... :) + +If you have any comments, corrections or additions, please don't hesitate to +contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this +document. + +The latest version of this document is available online at +http://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html + +If the above direction does not works for you, you could try the kernel +mailing list at linux-kernel@vger.kernel.org and/or try to reach me at +comandante@zaralinux.com. + +0.2 Legal Stuff +--------------- + +We don't guarantee the correctness of this document, and if you come to us +complaining about how you screwed up your system because of incorrect +documentation, we won't feel responsible... + +Chapter 1: Collecting System Information +======================================== + +In This Chapter +--------------- +* Investigating the properties of the pseudo file system /proc and its + ability to provide information on the running Linux system +* Examining /proc's structure +* Uncovering various information about the kernel and the processes running + on the system + +------------------------------------------------------------------------------ + +The proc file system acts as an interface to internal data structures in the +kernel. It can be used to obtain information about the system and to change +certain kernel parameters at runtime (sysctl). + +First, we'll take a look at the read-only parts of /proc. In Chapter 2, we +show you how you can use /proc/sys to change settings. + +1.1 Process-Specific Subdirectories +----------------------------------- + +The directory /proc contains (among other things) one subdirectory for each +process running on the system, which is named after the process ID (PID). + +The link self points to the process reading the file system. Each process +subdirectory has the entries listed in Table 1-1. + +Note that an open a file descriptor to /proc/ or to any of its +contained files or subdirectories does not prevent being reused +for some other process in the event that exits. Operations on +open /proc/ file descriptors corresponding to dead processes +never act on any new process that the kernel may, through chance, have +also assigned the process ID . Instead, operations on these FDs +usually fail with ESRCH. + +.. table:: Table 1-1: Process specific entries in /proc + + ============= =============================================================== + File Content + ============= =============================================================== + clear_refs Clears page referenced bits shown in smaps output + cmdline Command line arguments + cpu Current and last cpu in which it was executed (2.4)(smp) + cwd Link to the current working directory + environ Values of environment variables + exe Link to the executable of this process + fd Directory, which contains all file descriptors + maps Memory maps to executables and library files (2.4) + mem Memory held by this process + root Link to the root directory of this process + stat Process status + statm Process memory status information + status Process status in human readable form + wchan Present with CONFIG_KALLSYMS=y: it shows the kernel function + symbol the task is blocked in - or "0" if not blocked. + pagemap Page table + stack Report full stack trace, enable via CONFIG_STACKTRACE + smaps An extension based on maps, showing the memory consumption of + each mapping and flags associated with it + smaps_rollup Accumulated smaps stats for all mappings of the process. This + can be derived from smaps, but is faster and more convenient + numa_maps An extension based on maps, showing the memory locality and + binding policy as well as mem usage (in pages) of each mapping. + ============= =============================================================== + +For example, to get the status information of a process, all you have to do is +read the file /proc/PID/status:: + + >cat /proc/self/status + Name: cat + State: R (running) + Tgid: 5452 + Pid: 5452 + PPid: 743 + TracerPid: 0 (2.4) + Uid: 501 501 501 501 + Gid: 100 100 100 100 + FDSize: 256 + Groups: 100 14 16 + VmPeak: 5004 kB + VmSize: 5004 kB + VmLck: 0 kB + VmHWM: 476 kB + VmRSS: 476 kB + RssAnon: 352 kB + RssFile: 120 kB + RssShmem: 4 kB + VmData: 156 kB + VmStk: 88 kB + VmExe: 68 kB + VmLib: 1412 kB + VmPTE: 20 kb + VmSwap: 0 kB + HugetlbPages: 0 kB + CoreDumping: 0 + THP_enabled: 1 + Threads: 1 + SigQ: 0/28578 + SigPnd: 0000000000000000 + ShdPnd: 0000000000000000 + SigBlk: 0000000000000000 + SigIgn: 0000000000000000 + SigCgt: 0000000000000000 + CapInh: 00000000fffffeff + CapPrm: 0000000000000000 + CapEff: 0000000000000000 + CapBnd: ffffffffffffffff + CapAmb: 0000000000000000 + NoNewPrivs: 0 + Seccomp: 0 + Speculation_Store_Bypass: thread vulnerable + voluntary_ctxt_switches: 0 + nonvoluntary_ctxt_switches: 1 + +This shows you nearly the same information you would get if you viewed it with +the ps command. In fact, ps uses the proc file system to obtain its +information. But you get a more detailed view of the process by reading the +file /proc/PID/status. It fields are described in table 1-2. + +The statm file contains more detailed information about the process +memory usage. Its seven fields are explained in Table 1-3. The stat file +contains details information about the process itself. Its fields are +explained in Table 1-4. + +(for SMP CONFIG users) + +For making accounting scalable, RSS related information are handled in an +asynchronous manner and the value may not be very precise. To see a precise +snapshot of a moment, you can see /proc//smaps file and scan page table. +It's slow but very precise. + +.. table:: Table 1-2: Contents of the status files (as of 4.19) + + ========================== =================================================== + Field Content + ========================== =================================================== + Name filename of the executable + Umask file mode creation mask + State state (R is running, S is sleeping, D is sleeping + in an uninterruptible wait, Z is zombie, + T is traced or stopped) + Tgid thread group ID + Ngid NUMA group ID (0 if none) + Pid process id + PPid process id of the parent process + TracerPid PID of process tracing this process (0 if not) + Uid Real, effective, saved set, and file system UIDs + Gid Real, effective, saved set, and file system GIDs + FDSize number of file descriptor slots currently allocated + Groups supplementary group list + NStgid descendant namespace thread group ID hierarchy + NSpid descendant namespace process ID hierarchy + NSpgid descendant namespace process group ID hierarchy + NSsid descendant namespace session ID hierarchy + VmPeak peak virtual memory size + VmSize total program size + VmLck locked memory size + VmPin pinned memory size + VmHWM peak resident set size ("high water mark") + VmRSS size of memory portions. It contains the three + following parts + (VmRSS = RssAnon + RssFile + RssShmem) + RssAnon size of resident anonymous memory + RssFile size of resident file mappings + RssShmem size of resident shmem memory (includes SysV shm, + mapping of tmpfs and shared anonymous mappings) + VmData size of private data segments + VmStk size of stack segments + VmExe size of text segment + VmLib size of shared library code + VmPTE size of page table entries + VmSwap amount of swap used by anonymous private data + (shmem swap usage is not included) + HugetlbPages size of hugetlb memory portions + CoreDumping process's memory is currently being dumped + (killing the process may lead to a corrupted core) + THP_enabled process is allowed to use THP (returns 0 when + PR_SET_THP_DISABLE is set on the process + Threads number of threads + SigQ number of signals queued/max. number for queue + SigPnd bitmap of pending signals for the thread + ShdPnd bitmap of shared pending signals for the process + SigBlk bitmap of blocked signals + SigIgn bitmap of ignored signals + SigCgt bitmap of caught signals + CapInh bitmap of inheritable capabilities + CapPrm bitmap of permitted capabilities + CapEff bitmap of effective capabilities + CapBnd bitmap of capabilities bounding set + CapAmb bitmap of ambient capabilities + NoNewPrivs no_new_privs, like prctl(PR_GET_NO_NEW_PRIV, ...) + Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...) + Speculation_Store_Bypass speculative store bypass mitigation status + Cpus_allowed mask of CPUs on which this process may run + Cpus_allowed_list Same as previous, but in "list format" + Mems_allowed mask of memory nodes allowed to this process + Mems_allowed_list Same as previous, but in "list format" + voluntary_ctxt_switches number of voluntary context switches + nonvoluntary_ctxt_switches number of non voluntary context switches + ========================== =================================================== + + +.. table:: Table 1-3: Contents of the statm files (as of 2.6.8-rc3) + + ======== =============================== ============================== + Field Content + ======== =============================== ============================== + size total program size (pages) (same as VmSize in status) + resident size of memory portions (pages) (same as VmRSS in status) + shared number of pages that are shared (i.e. backed by a file, same + as RssFile+RssShmem in status) + trs number of pages that are 'code' (not including libs; broken, + includes data segment) + lrs number of pages of library (always 0 on 2.6) + drs number of pages of data/stack (including libs; broken, + includes library text) + dt number of dirty pages (always 0 on 2.6) + ======== =============================== ============================== + + +.. table:: Table 1-4: Contents of the stat files (as of 2.6.30-rc7) + + ============= =============================================================== + Field Content + ============= =============================================================== + pid process id + tcomm filename of the executable + state state (R is running, S is sleeping, D is sleeping in an + uninterruptible wait, Z is zombie, T is traced or stopped) + ppid process id of the parent process + pgrp pgrp of the process + sid session id + tty_nr tty the process uses + tty_pgrp pgrp of the tty + flags task flags + min_flt number of minor faults + cmin_flt number of minor faults with child's + maj_flt number of major faults + cmaj_flt number of major faults with child's + utime user mode jiffies + stime kernel mode jiffies + cutime user mode jiffies with child's + cstime kernel mode jiffies with child's + priority priority level + nice nice level + num_threads number of threads + it_real_value (obsolete, always 0) + start_time time the process started after system boot + vsize virtual memory size + rss resident set memory size + rsslim current limit in bytes on the rss + start_code address above which program text can run + end_code address below which program text can run + start_stack address of the start of the main process stack + esp current value of ESP + eip current value of EIP + pending bitmap of pending signals + blocked bitmap of blocked signals + sigign bitmap of ignored signals + sigcatch bitmap of caught signals + 0 (place holder, used to be the wchan address, + use /proc/PID/wchan instead) + 0 (place holder) + 0 (place holder) + exit_signal signal to send to parent thread on exit + task_cpu which CPU the task is scheduled on + rt_priority realtime priority + policy scheduling policy (man sched_setscheduler) + blkio_ticks time spent waiting for block IO + gtime guest time of the task in jiffies + cgtime guest time of the task children in jiffies + start_data address above which program data+bss is placed + end_data address below which program data+bss is placed + start_brk address above which program heap can be expanded with brk() + arg_start address above which program command line is placed + arg_end address below which program command line is placed + env_start address above which program environment is placed + env_end address below which program environment is placed + exit_code the thread's exit_code in the form reported by the waitpid + system call + ============= =============================================================== + +The /proc/PID/maps file contains the currently mapped memory regions and +their access permissions. + +The format is:: + + address perms offset dev inode pathname + + 08048000-08049000 r-xp 00000000 03:00 8312 /opt/test + 08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test + 0804a000-0806b000 rw-p 00000000 00:00 0 [heap] + a7cb1000-a7cb2000 ---p 00000000 00:00 0 + a7cb2000-a7eb2000 rw-p 00000000 00:00 0 + a7eb2000-a7eb3000 ---p 00000000 00:00 0 + a7eb3000-a7ed5000 rw-p 00000000 00:00 0 + a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 + a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6 + a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6 + a800b000-a800e000 rw-p 00000000 00:00 0 + a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 + a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 + a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 + a8024000-a8027000 rw-p 00000000 00:00 0 + a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 + a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 + a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 + aff35000-aff4a000 rw-p 00000000 00:00 0 [stack] + ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] + +where "address" is the address space in the process that it occupies, "perms" +is a set of permissions:: + + r = read + w = write + x = execute + s = shared + p = private (copy on write) + +"offset" is the offset into the mapping, "dev" is the device (major:minor), and +"inode" is the inode on that device. 0 indicates that no inode is associated +with the memory region, as the case would be with BSS (uninitialized data). +The "pathname" shows the name associated file for this mapping. If the mapping +is not associated with a file: + + ======= ==================================== + [heap] the heap of the program + [stack] the stack of the main process + [vdso] the "virtual dynamic shared object", + the kernel system call handler + ======= ==================================== + + or if empty, the mapping is anonymous. + +The /proc/PID/smaps is an extension based on maps, showing the memory +consumption for each of the process's mappings. For each mapping (aka Virtual +Memory Area, or VMA) there is a series of lines such as the following:: + + 08048000-080bc000 r-xp 00000000 03:02 13130 /bin/bash + + Size: 1084 kB + KernelPageSize: 4 kB + MMUPageSize: 4 kB + Rss: 892 kB + Pss: 374 kB + Shared_Clean: 892 kB + Shared_Dirty: 0 kB + Private_Clean: 0 kB + Private_Dirty: 0 kB + Referenced: 892 kB + Anonymous: 0 kB + LazyFree: 0 kB + AnonHugePages: 0 kB + ShmemPmdMapped: 0 kB + Shared_Hugetlb: 0 kB + Private_Hugetlb: 0 kB + Swap: 0 kB + SwapPss: 0 kB + KernelPageSize: 4 kB + MMUPageSize: 4 kB + Locked: 0 kB + THPeligible: 0 + VmFlags: rd ex mr mw me dw + +The first of these lines shows the same information as is displayed for the +mapping in /proc/PID/maps. Following lines show the size of the mapping +(size); the size of each page allocated when backing a VMA (KernelPageSize), +which is usually the same as the size in the page table entries; the page size +used by the MMU when backing a VMA (in most cases, the same as KernelPageSize); +the amount of the mapping that is currently resident in RAM (RSS); the +process' proportional share of this mapping (PSS); and the number of clean and +dirty shared and private pages in the mapping. + +The "proportional set size" (PSS) of a process is the count of pages it has +in memory, where each page is divided by the number of processes sharing it. +So if a process has 1000 pages all to itself, and 1000 shared with one other +process, its PSS will be 1500. + +Note that even a page which is part of a MAP_SHARED mapping, but has only +a single pte mapped, i.e. is currently used by only one process, is accounted +as private and not as shared. + +"Referenced" indicates the amount of memory currently marked as referenced or +accessed. + +"Anonymous" shows the amount of memory that does not belong to any file. Even +a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE +and a page is modified, the file page is replaced by a private anonymous copy. + +"LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE). +The memory isn't freed immediately with madvise(). It's freed in memory +pressure if the memory is clean. Please note that the printed value might +be lower than the real value due to optimizations used in the current +implementation. If this is not desirable please file a bug report. + +"AnonHugePages" shows the ammount of memory backed by transparent hugepage. + +"ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by +huge pages. + +"Shared_Hugetlb" and "Private_Hugetlb" show the ammounts of memory backed by +hugetlbfs page which is *not* counted in "RSS" or "PSS" field for historical +reasons. And these are not included in {Shared,Private}_{Clean,Dirty} field. + +"Swap" shows how much would-be-anonymous memory is also used, but out on swap. + +For shmem mappings, "Swap" includes also the size of the mapped (and not +replaced by copy-on-write) part of the underlying shmem object out on swap. +"SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this +does not take into account swapped out page of underlying shmem objects. +"Locked" indicates whether the mapping is locked in memory or not. +"THPeligible" indicates whether the mapping is eligible for allocating THP +pages - 1 if true, 0 otherwise. It just shows the current status. + +"VmFlags" field deserves a separate description. This member represents the +kernel flags associated with the particular virtual memory area in two letter +encoded manner. The codes are the following: + + == ======================================= + rd readable + wr writeable + ex executable + sh shared + mr may read + mw may write + me may execute + ms may share + gd stack segment growns down + pf pure PFN range + dw disabled write to the mapped file + lo pages are locked in memory + io memory mapped I/O area + sr sequential read advise provided + rr random read advise provided + dc do not copy area on fork + de do not expand area on remapping + ac area is accountable + nr swap space is not reserved for the area + ht area uses huge tlb pages + ar architecture specific flag + dd do not include area into core dump + sd soft dirty flag + mm mixed map area + hg huge page advise flag + nh no huge page advise flag + mg mergable advise flag + == ======================================= + +Note that there is no guarantee that every flag and associated mnemonic will +be present in all further kernel releases. Things get changed, the flags may +be vanished or the reverse -- new added. Interpretation of their meaning +might change in future as well. So each consumer of these flags has to +follow each specific kernel version for the exact semantic. + +This file is only present if the CONFIG_MMU kernel configuration option is +enabled. + +Note: reading /proc/PID/maps or /proc/PID/smaps is inherently racy (consistent +output can be achieved only in the single read call). + +This typically manifests when doing partial reads of these files while the +memory map is being modified. Despite the races, we do provide the following +guarantees: + +1) The mapped addresses never go backwards, which implies no two + regions will ever overlap. +2) If there is something at a given vaddr during the entirety of the + life of the smaps/maps walk, there will be some output for it. + +The /proc/PID/smaps_rollup file includes the same fields as /proc/PID/smaps, +but their values are the sums of the corresponding values for all mappings of +the process. Additionally, it contains these fields: + +- Pss_Anon +- Pss_File +- Pss_Shmem + +They represent the proportional shares of anonymous, file, and shmem pages, as +described for smaps above. These fields are omitted in smaps since each +mapping identifies the type (anon, file, or shmem) of all pages it contains. +Thus all information in smaps_rollup can be derived from smaps, but at a +significantly higher cost. + +The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG +bits on both physical and virtual pages associated with a process, and the +soft-dirty bit on pte (see Documentation/admin-guide/mm/soft-dirty.rst +for details). +To clear the bits for all the pages associated with the process:: + + > echo 1 > /proc/PID/clear_refs + +To clear the bits for the anonymous pages associated with the process:: + + > echo 2 > /proc/PID/clear_refs + +To clear the bits for the file mapped pages associated with the process:: + + > echo 3 > /proc/PID/clear_refs + +To clear the soft-dirty bit:: + + > echo 4 > /proc/PID/clear_refs + +To reset the peak resident set size ("high water mark") to the process's +current value:: + + > echo 5 > /proc/PID/clear_refs + +Any other value written to /proc/PID/clear_refs will have no effect. + +The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags +using /proc/kpageflags and number of times a page is mapped using +/proc/kpagecount. For detailed explanation, see +Documentation/admin-guide/mm/pagemap.rst. + +The /proc/pid/numa_maps is an extension based on maps, showing the memory +locality and binding policy, as well as the memory usage (in pages) of +each mapping. The output follows a general format where mapping details get +summarized separated by blank spaces, one mapping per each file line:: + + address policy mapping details + + 00400000 default file=/usr/local/bin/app mapped=1 active=0 N3=1 kernelpagesize_kB=4 + 00600000 default file=/usr/local/bin/app anon=1 dirty=1 N3=1 kernelpagesize_kB=4 + 3206000000 default file=/lib64/ld-2.12.so mapped=26 mapmax=6 N0=24 N3=2 kernelpagesize_kB=4 + 320621f000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 + 3206220000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 + 3206221000 default anon=1 dirty=1 N3=1 kernelpagesize_kB=4 + 3206800000 default file=/lib64/libc-2.12.so mapped=59 mapmax=21 active=55 N0=41 N3=18 kernelpagesize_kB=4 + 320698b000 default file=/lib64/libc-2.12.so + 3206b8a000 default file=/lib64/libc-2.12.so anon=2 dirty=2 N3=2 kernelpagesize_kB=4 + 3206b8e000 default file=/lib64/libc-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 + 3206b8f000 default anon=3 dirty=3 active=1 N3=3 kernelpagesize_kB=4 + 7f4dc10a2000 default anon=3 dirty=3 N3=3 kernelpagesize_kB=4 + 7f4dc10b4000 default anon=2 dirty=2 active=1 N3=2 kernelpagesize_kB=4 + 7f4dc1200000 default file=/anon_hugepage\040(deleted) huge anon=1 dirty=1 N3=1 kernelpagesize_kB=2048 + 7fff335f0000 default stack anon=3 dirty=3 N3=3 kernelpagesize_kB=4 + 7fff3369d000 default mapped=1 mapmax=35 active=0 N3=1 kernelpagesize_kB=4 + +Where: + +"address" is the starting address for the mapping; + +"policy" reports the NUMA memory policy set for the mapping (see Documentation/admin-guide/mm/numa_memory_policy.rst); + +"mapping details" summarizes mapping data such as mapping type, page usage counters, +node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page +size, in KB, that is backing the mapping up. + +1.2 Kernel data +--------------- + +Similar to the process entries, the kernel data files give information about +the running kernel. The files used to obtain this information are contained in +/proc and are listed in Table 1-5. Not all of these will be present in your +system. It depends on the kernel configuration and the loaded modules, which +files are there, and which are missing. + +.. table:: Table 1-5: Kernel info in /proc + + ============ =============================================================== + File Content + ============ =============================================================== + apm Advanced power management info + buddyinfo Kernel memory allocator information (see text) (2.5) + bus Directory containing bus specific information + cmdline Kernel command line + cpuinfo Info about the CPU + devices Available devices (block and character) + dma Used DMS channels + filesystems Supported filesystems + driver Various drivers grouped here, currently rtc (2.4) + execdomains Execdomains, related to security (2.4) + fb Frame Buffer devices (2.4) + fs File system parameters, currently nfs/exports (2.4) + ide Directory containing info about the IDE subsystem + interrupts Interrupt usage + iomem Memory map (2.4) + ioports I/O port usage + irq Masks for irq to cpu affinity (2.4)(smp?) + isapnp ISA PnP (Plug&Play) Info (2.4) + kcore Kernel core image (can be ELF or A.OUT(deprecated in 2.4)) + kmsg Kernel messages + ksyms Kernel symbol table + loadavg Load average of last 1, 5 & 15 minutes + locks Kernel locks + meminfo Memory info + misc Miscellaneous + modules List of loaded modules + mounts Mounted filesystems + net Networking info (see text) + pagetypeinfo Additional page allocator information (see text) (2.5) + partitions Table of partitions known to the system + pci Deprecated info of PCI bus (new way -> /proc/bus/pci/, + decoupled by lspci (2.4) + rtc Real time clock + scsi SCSI info (see text) + slabinfo Slab pool info + softirqs softirq usage + stat Overall statistics + swaps Swap space utilization + sys See chapter 2 + sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4) + tty Info of tty drivers + uptime Wall clock since boot, combined idle time of all cpus + version Kernel version + video bttv info of video resources (2.4) + vmallocinfo Show vmalloced areas + ============ =============================================================== + +You can, for example, check which interrupts are currently in use and what +they are used for by looking in the file /proc/interrupts:: + + > cat /proc/interrupts + CPU0 + 0: 8728810 XT-PIC timer + 1: 895 XT-PIC keyboard + 2: 0 XT-PIC cascade + 3: 531695 XT-PIC aha152x + 4: 2014133 XT-PIC serial + 5: 44401 XT-PIC pcnet_cs + 8: 2 XT-PIC rtc + 11: 8 XT-PIC i82365 + 12: 182918 XT-PIC PS/2 Mouse + 13: 1 XT-PIC fpu + 14: 1232265 XT-PIC ide0 + 15: 7 XT-PIC ide1 + NMI: 0 + +In 2.4.* a couple of lines where added to this file LOC & ERR (this time is the +output of a SMP machine):: + + > cat /proc/interrupts + + CPU0 CPU1 + 0: 1243498 1214548 IO-APIC-edge timer + 1: 8949 8958 IO-APIC-edge keyboard + 2: 0 0 XT-PIC cascade + 5: 11286 10161 IO-APIC-edge soundblaster + 8: 1 0 IO-APIC-edge rtc + 9: 27422 27407 IO-APIC-edge 3c503 + 12: 113645 113873 IO-APIC-edge PS/2 Mouse + 13: 0 0 XT-PIC fpu + 14: 22491 24012 IO-APIC-edge ide0 + 15: 2183 2415 IO-APIC-edge ide1 + 17: 30564 30414 IO-APIC-level eth0 + 18: 177 164 IO-APIC-level bttv + NMI: 2457961 2457959 + LOC: 2457882 2457881 + ERR: 2155 + +NMI is incremented in this case because every timer interrupt generates a NMI +(Non Maskable Interrupt) which is used by the NMI Watchdog to detect lockups. + +LOC is the local interrupt counter of the internal APIC of every CPU. + +ERR is incremented in the case of errors in the IO-APIC bus (the bus that +connects the CPUs in a SMP system. This means that an error has been detected, +the IO-APIC automatically retry the transmission, so it should not be a big +problem, but you should read the SMP-FAQ. + +In 2.6.2* /proc/interrupts was expanded again. This time the goal was for +/proc/interrupts to display every IRQ vector in use by the system, not +just those considered 'most important'. The new vectors are: + +THR + interrupt raised when a machine check threshold counter + (typically counting ECC corrected errors of memory or cache) exceeds + a configurable threshold. Only available on some systems. + +TRM + a thermal event interrupt occurs when a temperature threshold + has been exceeded for the CPU. This interrupt may also be generated + when the temperature drops back to normal. + +SPU + a spurious interrupt is some interrupt that was raised then lowered + by some IO device before it could be fully processed by the APIC. Hence + the APIC sees the interrupt but does not know what device it came from. + For this case the APIC will generate the interrupt with a IRQ vector + of 0xff. This might also be generated by chipset bugs. + +RES, CAL, TLB] + rescheduling, call and TLB flush interrupts are + sent from one CPU to another per the needs of the OS. Typically, + their statistics are used by kernel developers and interested users to + determine the occurrence of interrupts of the given type. + +The above IRQ vectors are displayed only when relevant. For example, +the threshold vector does not exist on x86_64 platforms. Others are +suppressed when the system is a uniprocessor. As of this writing, only +i386 and x86_64 platforms support the new IRQ vector displays. + +Of some interest is the introduction of the /proc/irq directory to 2.4. +It could be used to set IRQ to CPU affinity, this means that you can "hook" an +IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the +irq subdir is one subdir for each IRQ, and two files; default_smp_affinity and +prof_cpu_mask. + +For example:: + + > ls /proc/irq/ + 0 10 12 14 16 18 2 4 6 8 prof_cpu_mask + 1 11 13 15 17 19 3 5 7 9 default_smp_affinity + > ls /proc/irq/0/ + smp_affinity + +smp_affinity is a bitmask, in which you can specify which CPUs can handle the +IRQ, you can set it by doing:: + + > echo 1 > /proc/irq/10/smp_affinity + +This means that only the first CPU will handle the IRQ, but you can also echo +5 which means that only the first and third CPU can handle the IRQ. + +The contents of each smp_affinity file is the same by default:: + + > cat /proc/irq/0/smp_affinity + ffffffff + +There is an alternate interface, smp_affinity_list which allows specifying +a cpu range instead of a bitmask:: + + > cat /proc/irq/0/smp_affinity_list + 1024-1031 + +The default_smp_affinity mask applies to all non-active IRQs, which are the +IRQs which have not yet been allocated/activated, and hence which lack a +/proc/irq/[0-9]* directory. + +The node file on an SMP system shows the node to which the device using the IRQ +reports itself as being attached. This hardware locality information does not +include information about any possible driver locality preference. + +prof_cpu_mask specifies which CPUs are to be profiled by the system wide +profiler. Default value is ffffffff (all cpus if there are only 32 of them). + +The way IRQs are routed is handled by the IO-APIC, and it's Round Robin +between all the CPUs which are allowed to handle it. As usual the kernel has +more info than you and does a better job than you, so the defaults are the +best choice for almost everyone. [Note this applies only to those IO-APIC's +that support "Round Robin" interrupt distribution.] + +There are three more important subdirectories in /proc: net, scsi, and sys. +The general rule is that the contents, or even the existence of these +directories, depend on your kernel configuration. If SCSI is not enabled, the +directory scsi may not exist. The same is true with the net, which is there +only when networking support is present in the running kernel. + +The slabinfo file gives information about memory usage at the slab level. +Linux uses slab pools for memory management above page level in version 2.2. +Commonly used objects have their own slab pool (such as network buffers, +directory cache, and so on). + +:: + + > cat /proc/buddyinfo + + Node 0, zone DMA 0 4 5 4 4 3 ... + Node 0, zone Normal 1 0 0 1 101 8 ... + Node 0, zone HighMem 2 0 0 1 1 0 ... + +External fragmentation is a problem under some workloads, and buddyinfo is a +useful tool for helping diagnose these problems. Buddyinfo will give you a +clue as to how big an area you can safely allocate, or why a previous +allocation failed. + +Each column represents the number of pages of a certain order which are +available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in +ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE +available in ZONE_NORMAL, etc... + +More information relevant to external fragmentation can be found in +pagetypeinfo:: + + > cat /proc/pagetypeinfo + Page block order: 9 + Pages per block: 512 + + Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 + Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0 + Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 + Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2 + Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0 + Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0 + Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9 + Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0 + Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452 + Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0 + Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0 + + Number of blocks type Unmovable Reclaimable Movable Reserve Isolate + Node 0, zone DMA 2 0 5 1 0 + Node 0, zone DMA32 41 6 967 2 0 + +Fragmentation avoidance in the kernel works by grouping pages of different +migrate types into the same contiguous regions of memory called page blocks. +A page block is typically the size of the default hugepage size e.g. 2MB on +X86-64. By keeping pages grouped based on their ability to move, the kernel +can reclaim pages within a page block to satisfy a high-order allocation. + +The pagetypinfo begins with information on the size of a page block. It +then gives the same type of information as buddyinfo except broken down +by migrate-type and finishes with details on how many page blocks of each +type exist. + +If min_free_kbytes has been tuned correctly (recommendations made by hugeadm +from libhugetlbfs https://github.com/libhugetlbfs/libhugetlbfs/), one can +make an estimate of the likely number of huge pages that can be allocated +at a given point in time. All the "Movable" blocks should be allocatable +unless memory has been mlock()'d. Some of the Reclaimable blocks should +also be allocatable although a lot of filesystem metadata may have to be +reclaimed to achieve this. + + +meminfo +~~~~~~~ + +Provides information about distribution and utilization of memory. This +varies by architecture and compile options. The following is from a +16GB PIII, which has highmem enabled. You may not have all of these fields. + +:: + + > cat /proc/meminfo + + MemTotal: 16344972 kB + MemFree: 13634064 kB + MemAvailable: 14836172 kB + Buffers: 3656 kB + Cached: 1195708 kB + SwapCached: 0 kB + Active: 891636 kB + Inactive: 1077224 kB + HighTotal: 15597528 kB + HighFree: 13629632 kB + LowTotal: 747444 kB + LowFree: 4432 kB + SwapTotal: 0 kB + SwapFree: 0 kB + Dirty: 968 kB + Writeback: 0 kB + AnonPages: 861800 kB + Mapped: 280372 kB + Shmem: 644 kB + KReclaimable: 168048 kB + Slab: 284364 kB + SReclaimable: 159856 kB + SUnreclaim: 124508 kB + PageTables: 24448 kB + NFS_Unstable: 0 kB + Bounce: 0 kB + WritebackTmp: 0 kB + CommitLimit: 7669796 kB + Committed_AS: 100056 kB + VmallocTotal: 112216 kB + VmallocUsed: 428 kB + VmallocChunk: 111088 kB + Percpu: 62080 kB + HardwareCorrupted: 0 kB + AnonHugePages: 49152 kB + ShmemHugePages: 0 kB + ShmemPmdMapped: 0 kB + +MemTotal + Total usable ram (i.e. physical ram minus a few reserved + bits and the kernel binary code) +MemFree + The sum of LowFree+HighFree +MemAvailable + An estimate of how much memory is available for starting new + applications, without swapping. Calculated from MemFree, + SReclaimable, the size of the file LRU lists, and the low + watermarks in each zone. + The estimate takes into account that the system needs some + page cache to function well, and that not all reclaimable + slab will be reclaimable, due to items being in use. The + impact of those factors will vary from system to system. +Buffers + Relatively temporary storage for raw disk blocks + shouldn't get tremendously large (20MB or so) +Cached + in-memory cache for files read from the disk (the + pagecache). Doesn't include SwapCached +SwapCached + Memory that once was swapped out, is swapped back in but + still also is in the swapfile (if memory is needed it + doesn't need to be swapped out AGAIN because it is already + in the swapfile. This saves I/O) +Active + Memory that has been used more recently and usually not + reclaimed unless absolutely necessary. +Inactive + Memory which has been less recently used. It is more + eligible to be reclaimed for other purposes +HighTotal, HighFree + Highmem is all memory above ~860MB of physical memory + Highmem areas are for use by userspace programs, or + for the pagecache. The kernel must use tricks to access + this memory, making it slower to access than lowmem. +LowTotal, LowFree + Lowmem is memory which can be used for everything that + highmem can be used for, but it is also available for the + kernel's use for its own data structures. Among many + other things, it is where everything from the Slab is + allocated. Bad things happen when you're out of lowmem. +SwapTotal + total amount of swap space available +SwapFree + Memory which has been evicted from RAM, and is temporarily + on the disk +Dirty + Memory which is waiting to get written back to the disk +Writeback + Memory which is actively being written back to the disk +AnonPages + Non-file backed pages mapped into userspace page tables +HardwareCorrupted + The amount of RAM/memory in KB, the kernel identifies as + corrupted. +AnonHugePages + Non-file backed huge pages mapped into userspace page tables +Mapped + files which have been mmaped, such as libraries +Shmem + Total memory used by shared memory (shmem) and tmpfs +ShmemHugePages + Memory used by shared memory (shmem) and tmpfs allocated + with huge pages +ShmemPmdMapped + Shared memory mapped into userspace with huge pages +KReclaimable + Kernel allocations that the kernel will attempt to reclaim + under memory pressure. Includes SReclaimable (below), and other + direct allocations with a shrinker. +Slab + in-kernel data structures cache +SReclaimable + Part of Slab, that might be reclaimed, such as caches +SUnreclaim + Part of Slab, that cannot be reclaimed on memory pressure +PageTables + amount of memory dedicated to the lowest level of page + tables. +NFS_Unstable + NFS pages sent to the server, but not yet committed to stable + storage +Bounce + Memory used for block device "bounce buffers" +WritebackTmp + Memory used by FUSE for temporary writeback buffers +CommitLimit + Based on the overcommit ratio ('vm.overcommit_ratio'), + this is the total amount of memory currently available to + be allocated on the system. This limit is only adhered to + if strict overcommit accounting is enabled (mode 2 in + 'vm.overcommit_memory'). + + The CommitLimit is calculated with the following formula:: + + CommitLimit = ([total RAM pages] - [total huge TLB pages]) * + overcommit_ratio / 100 + [total swap pages] + + For example, on a system with 1G of physical RAM and 7G + of swap with a `vm.overcommit_ratio` of 30 it would + yield a CommitLimit of 7.3G. + + For more details, see the memory overcommit documentation + in vm/overcommit-accounting. +Committed_AS + The amount of memory presently allocated on the system. + The committed memory is a sum of all of the memory which + has been allocated by processes, even if it has not been + "used" by them as of yet. A process which malloc()'s 1G + of memory, but only touches 300M of it will show up as + using 1G. This 1G is memory which has been "committed" to + by the VM and can be used at any time by the allocating + application. With strict overcommit enabled on the system + (mode 2 in 'vm.overcommit_memory'),allocations which would + exceed the CommitLimit (detailed above) will not be permitted. + This is useful if one needs to guarantee that processes will + not fail due to lack of memory once that memory has been + successfully allocated. +VmallocTotal + total size of vmalloc memory area +VmallocUsed + amount of vmalloc area which is used +VmallocChunk + largest contiguous block of vmalloc area which is free +Percpu + Memory allocated to the percpu allocator used to back percpu + allocations. This stat excludes the cost of metadata. + +vmallocinfo +~~~~~~~~~~~ + +Provides information about vmalloced/vmaped areas. One line per area, +containing the virtual address range of the area, size in bytes, +caller information of the creator, and optional information depending +on the kind of area : + + ========== =================================================== + pages=nr number of pages + phys=addr if a physical address was specified + ioremap I/O mapping (ioremap() and friends) + vmalloc vmalloc() area + vmap vmap()ed pages + user VM_USERMAP area + vpages buffer for pages pointers was vmalloced (huge area) + N=nr (Only on NUMA kernels) + Number of pages allocated on memory node + ========== =================================================== + +:: + + > cat /proc/vmallocinfo + 0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204 ... + /0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128 + 0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204 ... + /0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64 + 0xffffc20000302000-0xffffc20000304000 8192 acpi_tb_verify_table+0x21/0x4f... + phys=7fee8000 ioremap + 0xffffc20000304000-0xffffc20000307000 12288 acpi_tb_verify_table+0x21/0x4f... + phys=7fee7000 ioremap + 0xffffc2000031d000-0xffffc2000031f000 8192 init_vdso_vars+0x112/0x210 + 0xffffc2000031f000-0xffffc2000032b000 49152 cramfs_uncompress_init+0x2e ... + /0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3 + 0xffffc2000033a000-0xffffc2000033d000 12288 sys_swapon+0x640/0xac0 ... + pages=2 vmalloc N1=2 + 0xffffc20000347000-0xffffc2000034c000 20480 xt_alloc_table_info+0xfe ... + /0x130 [x_tables] pages=4 vmalloc N0=4 + 0xffffffffa0000000-0xffffffffa000f000 61440 sys_init_module+0xc27/0x1d00 ... + pages=14 vmalloc N2=14 + 0xffffffffa000f000-0xffffffffa0014000 20480 sys_init_module+0xc27/0x1d00 ... + pages=4 vmalloc N1=4 + 0xffffffffa0014000-0xffffffffa0017000 12288 sys_init_module+0xc27/0x1d00 ... + pages=2 vmalloc N1=2 + 0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 ... + pages=10 vmalloc N0=10 + + +softirqs +~~~~~~~~ + +Provides counts of softirq handlers serviced since boot time, for each cpu. + +:: + + > cat /proc/softirqs + CPU0 CPU1 CPU2 CPU3 + HI: 0 0 0 0 + TIMER: 27166 27120 27097 27034 + NET_TX: 0 0 0 17 + NET_RX: 42 0 0 39 + BLOCK: 0 0 107 1121 + TASKLET: 0 0 0 290 + SCHED: 27035 26983 26971 26746 + HRTIMER: 0 0 0 0 + RCU: 1678 1769 2178 2250 + + +1.3 IDE devices in /proc/ide +---------------------------- + +The subdirectory /proc/ide contains information about all IDE devices of which +the kernel is aware. There is one subdirectory for each IDE controller, the +file drivers and a link for each IDE device, pointing to the device directory +in the controller specific subtree. + +The file drivers contains general information about the drivers used for the +IDE devices:: + + > cat /proc/ide/drivers + ide-cdrom version 4.53 + ide-disk version 1.08 + +More detailed information can be found in the controller specific +subdirectories. These are named ide0, ide1 and so on. Each of these +directories contains the files shown in table 1-6. + + +.. table:: Table 1-6: IDE controller info in /proc/ide/ide? + + ======= ======================================= + File Content + ======= ======================================= + channel IDE channel (0 or 1) + config Configuration (only for PCI/IDE bridge) + mate Mate name + model Type/Chipset of IDE controller + ======= ======================================= + +Each device connected to a controller has a separate subdirectory in the +controllers directory. The files listed in table 1-7 are contained in these +directories. + + +.. table:: Table 1-7: IDE device information + + ================ ========================================== + File Content + ================ ========================================== + cache The cache + capacity Capacity of the medium (in 512Byte blocks) + driver driver and version + geometry physical and logical geometry + identify device identify block + media media type + model device identifier + settings device setup + smart_thresholds IDE disk management thresholds + smart_values IDE disk management values + ================ ========================================== + +The most interesting file is ``settings``. This file contains a nice +overview of the drive parameters:: + + # cat /proc/ide/ide0/hda/settings + name value min max mode + ---- ----- --- --- ---- + bios_cyl 526 0 65535 rw + bios_head 255 0 255 rw + bios_sect 63 0 63 rw + breada_readahead 4 0 127 rw + bswap 0 0 1 r + file_readahead 72 0 2097151 rw + io_32bit 0 0 3 rw + keepsettings 0 0 1 rw + max_kb_per_request 122 1 127 rw + multcount 0 0 8 rw + nice1 1 0 1 rw + nowerr 0 0 1 rw + pio_mode write-only 0 255 w + slow 0 0 1 rw + unmaskirq 0 0 1 rw + using_dma 0 0 1 rw + + +1.4 Networking info in /proc/net +-------------------------------- + +The subdirectory /proc/net follows the usual pattern. Table 1-8 shows the +additional values you get for IP version 6 if you configure the kernel to +support this. Table 1-9 lists the files and their meaning. + + +.. table:: Table 1-8: IPv6 info in /proc/net + + ========== ===================================================== + File Content + ========== ===================================================== + udp6 UDP sockets (IPv6) + tcp6 TCP sockets (IPv6) + raw6 Raw device statistics (IPv6) + igmp6 IP multicast addresses, which this host joined (IPv6) + if_inet6 List of IPv6 interface addresses + ipv6_route Kernel routing table for IPv6 + rt6_stats Global IPv6 routing tables statistics + sockstat6 Socket statistics (IPv6) + snmp6 Snmp data (IPv6) + ========== ===================================================== + +.. table:: Table 1-9: Network info in /proc/net + + ============= ================================================================ + File Content + ============= ================================================================ + arp Kernel ARP table + dev network devices with statistics + dev_mcast the Layer2 multicast groups a device is listening too + (interface index, label, number of references, number of bound + addresses). + dev_stat network device status + ip_fwchains Firewall chain linkage + ip_fwnames Firewall chain names + ip_masq Directory containing the masquerading tables + ip_masquerade Major masquerading table + netstat Network statistics + raw raw device statistics + route Kernel routing table + rpc Directory containing rpc info + rt_cache Routing cache + snmp SNMP data + sockstat Socket statistics + tcp TCP sockets + udp UDP sockets + unix UNIX domain sockets + wireless Wireless interface data (Wavelan etc) + igmp IP multicast addresses, which this host joined + psched Global packet scheduler parameters. + netlink List of PF_NETLINK sockets + ip_mr_vifs List of multicast virtual interfaces + ip_mr_cache List of multicast routing cache + ============= ================================================================ + +You can use this information to see which network devices are available in +your system and how much traffic was routed over those devices:: + + > cat /proc/net/dev + Inter-|Receive |[... + face |bytes packets errs drop fifo frame compressed multicast|[... + lo: 908188 5596 0 0 0 0 0 0 [... + ppp0:15475140 20721 410 0 0 410 0 0 [... + eth0: 614530 7085 0 0 0 0 0 1 [... + + ...] Transmit + ...] bytes packets errs drop fifo colls carrier compressed + ...] 908188 5596 0 0 0 0 0 0 + ...] 1375103 17405 0 0 0 0 0 0 + ...] 1703981 5535 0 0 0 3 0 0 + +In addition, each Channel Bond interface has its own directory. For +example, the bond0 device will have a directory called /proc/net/bond0/. +It will contain information that is specific to that bond, such as the +current slaves of the bond, the link status of the slaves, and how +many times the slaves link has failed. + +1.5 SCSI info +------------- + +If you have a SCSI host adapter in your system, you'll find a subdirectory +named after the driver for this adapter in /proc/scsi. You'll also see a list +of all recognized SCSI devices in /proc/scsi:: + + >cat /proc/scsi/scsi + Attached devices: + Host: scsi0 Channel: 00 Id: 00 Lun: 00 + Vendor: IBM Model: DGHS09U Rev: 03E0 + Type: Direct-Access ANSI SCSI revision: 03 + Host: scsi0 Channel: 00 Id: 06 Lun: 00 + Vendor: PIONEER Model: CD-ROM DR-U06S Rev: 1.04 + Type: CD-ROM ANSI SCSI revision: 02 + + +The directory named after the driver has one file for each adapter found in +the system. These files contain information about the controller, including +the used IRQ and the IO address range. The amount of information shown is +dependent on the adapter you use. The example shows the output for an Adaptec +AHA-2940 SCSI adapter:: + + > cat /proc/scsi/aic7xxx/0 + + Adaptec AIC7xxx driver version: 5.1.19/3.2.4 + Compile Options: + TCQ Enabled By Default : Disabled + AIC7XXX_PROC_STATS : Disabled + AIC7XXX_RESET_DELAY : 5 + Adapter Configuration: + SCSI Adapter: Adaptec AHA-294X Ultra SCSI host adapter + Ultra Wide Controller + PCI MMAPed I/O Base: 0xeb001000 + Adapter SEEPROM Config: SEEPROM found and used. + Adaptec SCSI BIOS: Enabled + IRQ: 10 + SCBs: Active 0, Max Active 2, + Allocated 15, HW 16, Page 255 + Interrupts: 160328 + BIOS Control Word: 0x18b6 + Adapter Control Word: 0x005b + Extended Translation: Enabled + Disconnect Enable Flags: 0xffff + Ultra Enable Flags: 0x0001 + Tag Queue Enable Flags: 0x0000 + Ordered Queue Tag Flags: 0x0000 + Default Tag Queue Depth: 8 + Tagged Queue By Device array for aic7xxx host instance 0: + {255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255} + Actual queue depth per device for aic7xxx host instance 0: + {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} + Statistics: + (scsi0:0:0:0) + Device using Wide/Sync transfers at 40.0 MByte/sec, offset 8 + Transinfo settings: current(12/8/1/0), goal(12/8/1/0), user(12/15/1/0) + Total transfers 160151 (74577 reads and 85574 writes) + (scsi0:0:6:0) + Device using Narrow/Sync transfers at 5.0 MByte/sec, offset 15 + Transinfo settings: current(50/15/0/0), goal(50/15/0/0), user(50/15/0/0) + Total transfers 0 (0 reads and 0 writes) + + +1.6 Parallel port info in /proc/parport +--------------------------------------- + +The directory /proc/parport contains information about the parallel ports of +your system. It has one subdirectory for each port, named after the port +number (0,1,2,...). + +These directories contain the four files shown in Table 1-10. + + +.. table:: Table 1-10: Files in /proc/parport + + ========= ==================================================================== + File Content + ========= ==================================================================== + autoprobe Any IEEE-1284 device ID information that has been acquired. + devices list of the device drivers using that port. A + will appear by the + name of the device currently using the port (it might not appear + against any). + hardware Parallel port's base address, IRQ line and DMA channel. + irq IRQ that parport is using for that port. This is in a separate + file to allow you to alter it by writing a new value in (IRQ + number or none). + ========= ==================================================================== + +1.7 TTY info in /proc/tty +------------------------- + +Information about the available and actually used tty's can be found in the +directory /proc/tty.You'll find entries for drivers and line disciplines in +this directory, as shown in Table 1-11. + + +.. table:: Table 1-11: Files in /proc/tty + + ============= ============================================== + File Content + ============= ============================================== + drivers list of drivers and their usage + ldiscs registered line disciplines + driver/serial usage statistic and status of single tty lines + ============= ============================================== + +To see which tty's are currently in use, you can simply look into the file +/proc/tty/drivers:: + + > cat /proc/tty/drivers + pty_slave /dev/pts 136 0-255 pty:slave + pty_master /dev/ptm 128 0-255 pty:master + pty_slave /dev/ttyp 3 0-255 pty:slave + pty_master /dev/pty 2 0-255 pty:master + serial /dev/cua 5 64-67 serial:callout + serial /dev/ttyS 4 64-67 serial + /dev/tty0 /dev/tty0 4 0 system:vtmaster + /dev/ptmx /dev/ptmx 5 2 system + /dev/console /dev/console 5 1 system:console + /dev/tty /dev/tty 5 0 system:/dev/tty + unknown /dev/tty 4 1-63 console + + +1.8 Miscellaneous kernel statistics in /proc/stat +------------------------------------------------- + +Various pieces of information about kernel activity are available in the +/proc/stat file. All of the numbers reported in this file are aggregates +since the system first booted. For a quick look, simply cat the file:: + + > cat /proc/stat + cpu 2255 34 2290 22625563 6290 127 456 0 0 0 + cpu0 1132 34 1441 11311718 3675 127 438 0 0 0 + cpu1 1123 0 849 11313845 2614 0 18 0 0 0 + intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...] + ctxt 1990473 + btime 1062191376 + processes 2915 + procs_running 1 + procs_blocked 0 + softirq 183433 0 21755 12 39 1137 231 21459 2263 + +The very first "cpu" line aggregates the numbers in all of the other "cpuN" +lines. These numbers identify the amount of time the CPU has spent performing +different kinds of work. Time units are in USER_HZ (typically hundredths of a +second). The meanings of the columns are as follows, from left to right: + +- user: normal processes executing in user mode +- nice: niced processes executing in user mode +- system: processes executing in kernel mode +- idle: twiddling thumbs +- iowait: In a word, iowait stands for waiting for I/O to complete. But there + are several problems: + + 1. Cpu will not wait for I/O to complete, iowait is the time that a task is + waiting for I/O to complete. When cpu goes into idle state for + outstanding task io, another task will be scheduled on this CPU. + 2. In a multi-core CPU, the task waiting for I/O to complete is not running + on any CPU, so the iowait of each CPU is difficult to calculate. + 3. The value of iowait field in /proc/stat will decrease in certain + conditions. + + So, the iowait is not reliable by reading from /proc/stat. +- irq: servicing interrupts +- softirq: servicing softirqs +- steal: involuntary wait +- guest: running a normal guest +- guest_nice: running a niced guest + +The "intr" line gives counts of interrupts serviced since boot time, for each +of the possible system interrupts. The first column is the total of all +interrupts serviced including unnumbered architecture specific interrupts; +each subsequent column is the total for that particular numbered interrupt. +Unnumbered interrupts are not shown, only summed into the total. + +The "ctxt" line gives the total number of context switches across all CPUs. + +The "btime" line gives the time at which the system booted, in seconds since +the Unix epoch. + +The "processes" line gives the number of processes and threads created, which +includes (but is not limited to) those created by calls to the fork() and +clone() system calls. + +The "procs_running" line gives the total number of threads that are +running or ready to run (i.e., the total number of runnable threads). + +The "procs_blocked" line gives the number of processes currently blocked, +waiting for I/O to complete. + +The "softirq" line gives counts of softirqs serviced since boot time, for each +of the possible system softirqs. The first column is the total of all +softirqs serviced; each subsequent column is the total for that particular +softirq. + + +1.9 Ext4 file system parameters +------------------------------- + +Information about mounted ext4 file systems can be found in +/proc/fs/ext4. Each mounted filesystem will have a directory in +/proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or +/proc/fs/ext4/dm-0). The files in each per-device directory are shown +in Table 1-12, below. + +.. table:: Table 1-12: Files in /proc/fs/ext4/ + + ============== ========================================================== + File Content + mb_groups details of multiblock allocator buddy cache of free blocks + ============== ========================================================== + +2.0 /proc/consoles +------------------ +Shows registered system console lines. + +To see which character device lines are currently used for the system console +/dev/console, you may simply look into the file /proc/consoles:: + + > cat /proc/consoles + tty0 -WU (ECp) 4:7 + ttyS0 -W- (Ep) 4:64 + +The columns are: + ++--------------------+-------------------------------------------------------+ +| device | name of the device | ++====================+=======================================================+ +| operations | * R = can do read operations | +| | * W = can do write operations | +| | * U = can do unblank | ++--------------------+-------------------------------------------------------+ +| flags | * E = it is enabled | +| | * C = it is preferred console | +| | * B = it is primary boot console | +| | * p = it is used for printk buffer | +| | * b = it is not a TTY but a Braille device | +| | * a = it is safe to use when cpu is offline | ++--------------------+-------------------------------------------------------+ +| major:minor | major and minor number of the device separated by a | +| | colon | ++--------------------+-------------------------------------------------------+ + +Summary +------- + +The /proc file system serves information about the running system. It not only +allows access to process data but also allows you to request the kernel status +by reading files in the hierarchy. + +The directory structure of /proc reflects the types of information and makes +it easy, if not obvious, where to look for specific data. + +Chapter 2: Modifying System Parameters +====================================== + +In This Chapter +--------------- + +* Modifying kernel parameters by writing into files found in /proc/sys +* Exploring the files which modify certain parameters +* Review of the /proc/sys file tree + +------------------------------------------------------------------------------ + +A very interesting part of /proc is the directory /proc/sys. This is not only +a source of information, it also allows you to change parameters within the +kernel. Be very careful when attempting this. You can optimize your system, +but you can also cause it to crash. Never alter kernel parameters on a +production system. Set up a development machine and test to make sure that +everything works the way you want it to. You may have no alternative but to +reboot the machine once an error has been made. + +To change a value, simply echo the new value into the file. An example is +given below in the section on the file system data. You need to be root to do +this. You can create your own boot script to perform this every time your +system boots. + +The files in /proc/sys can be used to fine tune and monitor miscellaneous and +general things in the operation of the Linux kernel. Since some of the files +can inadvertently disrupt your system, it is advisable to read both +documentation and source before actually making adjustments. In any case, be +very careful when writing to any of these files. The entries in /proc may +change slightly between the 2.1.* and the 2.2 kernel, so if there is any doubt +review the kernel documentation in the directory /usr/src/linux/Documentation. +This chapter is heavily based on the documentation included in the pre 2.2 +kernels, and became part of it in version 2.2.1 of the Linux kernel. + +Please see: Documentation/admin-guide/sysctl/ directory for descriptions of these +entries. + +Summary +------- + +Certain aspects of kernel behavior can be modified at runtime, without the +need to recompile the kernel, or even to reboot the system. The files in the +/proc/sys tree can not only be read, but also modified. You can use the echo +command to write value into these files, thereby changing the default settings +of the kernel. + + +Chapter 3: Per-process Parameters +================================= + +3.1 /proc//oom_adj & /proc//oom_score_adj- Adjust the oom-killer score +-------------------------------------------------------------------------------- + +These file can be used to adjust the badness heuristic used to select which +process gets killed in out of memory conditions. + +The badness heuristic assigns a value to each candidate task ranging from 0 +(never kill) to 1000 (always kill) to determine which process is targeted. The +units are roughly a proportion along that range of allowed memory the process +may allocate from based on an estimation of its current memory and swap use. +For example, if a task is using all allowed memory, its badness score will be +1000. If it is using half of its allowed memory, its score will be 500. + +There is an additional factor included in the badness score: the current memory +and swap usage is discounted by 3% for root processes. + +The amount of "allowed" memory depends on the context in which the oom killer +was called. If it is due to the memory assigned to the allocating task's cpuset +being exhausted, the allowed memory represents the set of mems assigned to that +cpuset. If it is due to a mempolicy's node(s) being exhausted, the allowed +memory represents the set of mempolicy nodes. If it is due to a memory +limit (or swap limit) being reached, the allowed memory is that configured +limit. Finally, if it is due to the entire system being out of memory, the +allowed memory represents all allocatable resources. + +The value of /proc//oom_score_adj is added to the badness score before it +is used to determine which task to kill. Acceptable values range from -1000 +(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows userspace to +polarize the preference for oom killing either by always preferring a certain +task or completely disabling it. The lowest possible value, -1000, is +equivalent to disabling oom killing entirely for that task since it will always +report a badness score of 0. + +Consequently, it is very simple for userspace to define the amount of memory to +consider for each task. Setting a /proc//oom_score_adj value of +500, for +example, is roughly equivalent to allowing the remainder of tasks sharing the +same system, cpuset, mempolicy, or memory controller resources to use at least +50% more memory. A value of -500, on the other hand, would be roughly +equivalent to discounting 50% of the task's allowed memory from being considered +as scoring against the task. + +For backwards compatibility with previous kernels, /proc//oom_adj may also +be used to tune the badness score. Its acceptable values range from -16 +(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17 +(OOM_DISABLE) to disable oom killing entirely for that task. Its value is +scaled linearly with /proc//oom_score_adj. + +The value of /proc//oom_score_adj may be reduced no lower than the last +value set by a CAP_SYS_RESOURCE process. To reduce the value any lower +requires CAP_SYS_RESOURCE. + +Caveat: when a parent task is selected, the oom killer will sacrifice any first +generation children with separate address spaces instead, if possible. This +avoids servers and important system daemons from being killed and loses the +minimal amount of work. + + +3.2 /proc//oom_score - Display current oom-killer score +------------------------------------------------------------- + +This file can be used to check the current score used by the oom-killer is for +any given . Use it together with /proc//oom_score_adj to tune which +process should be killed in an out-of-memory situation. + + +3.3 /proc//io - Display the IO accounting fields +------------------------------------------------------- + +This file contains IO statistics for each running process + +Example +~~~~~~~ + +:: + + test:/tmp # dd if=/dev/zero of=/tmp/test.dat & + [1] 3828 + + test:/tmp # cat /proc/3828/io + rchar: 323934931 + wchar: 323929600 + syscr: 632687 + syscw: 632675 + read_bytes: 0 + write_bytes: 323932160 + cancelled_write_bytes: 0 + + +Description +~~~~~~~~~~~ + +rchar +^^^^^ + +I/O counter: chars read +The number of bytes which this task has caused to be read from storage. This +is simply the sum of bytes which this process passed to read() and pread(). +It includes things like tty IO and it is unaffected by whether or not actual +physical disk IO was required (the read might have been satisfied from +pagecache) + + +wchar +^^^^^ + +I/O counter: chars written +The number of bytes which this task has caused, or shall cause to be written +to disk. Similar caveats apply here as with rchar. + + +syscr +^^^^^ + +I/O counter: read syscalls +Attempt to count the number of read I/O operations, i.e. syscalls like read() +and pread(). + + +syscw +^^^^^ + +I/O counter: write syscalls +Attempt to count the number of write I/O operations, i.e. syscalls like +write() and pwrite(). + + +read_bytes +^^^^^^^^^^ + +I/O counter: bytes read +Attempt to count the number of bytes which this process really did cause to +be fetched from the storage layer. Done at the submit_bio() level, so it is +accurate for block-backed filesystems. + + +write_bytes +^^^^^^^^^^^ + +I/O counter: bytes written +Attempt to count the number of bytes which this process caused to be sent to +the storage layer. This is done at page-dirtying time. + + +cancelled_write_bytes +^^^^^^^^^^^^^^^^^^^^^ + +The big inaccuracy here is truncate. If a process writes 1MB to a file and +then deletes the file, it will in fact perform no writeout. But it will have +been accounted as having caused 1MB of write. +In other words: The number of bytes which this process caused to not happen, +by truncating pagecache. A task can cause "negative" IO too. If this task +truncates some dirty pagecache, some IO which another task has been accounted +for (in its write_bytes) will not be happening. We _could_ just subtract that +from the truncating task's write_bytes, but there is information loss in doing +that. + + +.. Note:: + + At its current implementation state, this is a bit racy on 32-bit machines: + if process A reads process B's /proc/pid/io while process B is updating one + of those 64-bit counters, process A could see an intermediate result. + + +More information about this can be found within the taskstats documentation in +Documentation/accounting. + +3.4 /proc//coredump_filter - Core dump filtering settings +--------------------------------------------------------------- +When a process is dumped, all anonymous memory is written to a core file as +long as the size of the core file isn't limited. But sometimes we don't want +to dump some memory segments, for example, huge shared memory or DAX. +Conversely, sometimes we want to save file-backed memory segments into a core +file, not only the individual files. + +/proc//coredump_filter allows you to customize which memory segments +will be dumped when the process is dumped. coredump_filter is a bitmask +of memory types. If a bit of the bitmask is set, memory segments of the +corresponding memory type are dumped, otherwise they are not dumped. + +The following 9 memory types are supported: + + - (bit 0) anonymous private memory + - (bit 1) anonymous shared memory + - (bit 2) file-backed private memory + - (bit 3) file-backed shared memory + - (bit 4) ELF header pages in file-backed private memory areas (it is + effective only if the bit 2 is cleared) + - (bit 5) hugetlb private memory + - (bit 6) hugetlb shared memory + - (bit 7) DAX private memory + - (bit 8) DAX shared memory + + Note that MMIO pages such as frame buffer are never dumped and vDSO pages + are always dumped regardless of the bitmask status. + + Note that bits 0-4 don't affect hugetlb or DAX memory. hugetlb memory is + only affected by bit 5-6, and DAX is only affected by bits 7-8. + +The default value of coredump_filter is 0x33; this means all anonymous memory +segments, ELF header pages and hugetlb private memory are dumped. + +If you don't want to dump all shared memory segments attached to pid 1234, +write 0x31 to the process's proc file:: + + $ echo 0x31 > /proc/1234/coredump_filter + +When a new process is created, the process inherits the bitmask status from its +parent. It is useful to set up coredump_filter before the program runs. +For example:: + + $ echo 0x7 > /proc/self/coredump_filter + $ ./some_program + +3.5 /proc//mountinfo - Information about mounts +-------------------------------------------------------- + +This file contains lines of the form:: + + 36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue + (1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11) + + (1) mount ID: unique identifier of the mount (may be reused after umount) + (2) parent ID: ID of parent (or of self for the top of the mount tree) + (3) major:minor: value of st_dev for files on filesystem + (4) root: root of the mount within the filesystem + (5) mount point: mount point relative to the process's root + (6) mount options: per mount options + (7) optional fields: zero or more fields of the form "tag[:value]" + (8) separator: marks the end of the optional fields + (9) filesystem type: name of filesystem of the form "type[.subtype]" + (10) mount source: filesystem specific information or "none" + (11) super options: per super block options + +Parsers should ignore all unrecognised optional fields. Currently the +possible optional fields are: + +================ ============================================================== +shared:X mount is shared in peer group X +master:X mount is slave to peer group X +propagate_from:X mount is slave and receives propagation from peer group X [#]_ +unbindable mount is unbindable +================ ============================================================== + +.. [#] X is the closest dominant peer group under the process's root. If + X is the immediate master of the mount, or if there's no dominant peer + group under the same root, then only the "master:X" field is present + and not the "propagate_from:X" field. + +For more information on mount propagation see: + + Documentation/filesystems/sharedsubtree.txt + + +3.6 /proc//comm & /proc//task//comm +-------------------------------------------------------- +These files provide a method to access a tasks comm value. It also allows for +a task to set its own or one of its thread siblings comm value. The comm value +is limited in size compared to the cmdline value, so writing anything longer +then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated +comm value. + + +3.7 /proc//task//children - Information about task children +------------------------------------------------------------------------- +This file provides a fast way to retrieve first level children pids +of a task pointed by / pair. The format is a space separated +stream of pids. + +Note the "first level" here -- if a child has own children they will +not be listed here, one needs to read /proc//task//children +to obtain the descendants. + +Since this interface is intended to be fast and cheap it doesn't +guarantee to provide precise results and some children might be +skipped, especially if they've exited right after we printed their +pids, so one need to either stop or freeze processes being inspected +if precise results are needed. + + +3.8 /proc//fdinfo/ - Information about opened file +--------------------------------------------------------------- +This file provides information associated with an opened file. The regular +files have at least three fields -- 'pos', 'flags' and mnt_id. The 'pos' +represents the current offset of the opened file in decimal form [see lseek(2) +for details], 'flags' denotes the octal O_xxx mask the file has been +created with [see open(2) for details] and 'mnt_id' represents mount ID of +the file system containing the opened file [see 3.5 /proc//mountinfo +for details]. + +A typical output is:: + + pos: 0 + flags: 0100002 + mnt_id: 19 + +All locks associated with a file descriptor are shown in its fdinfo too:: + + lock: 1: FLOCK ADVISORY WRITE 359 00:13:11691 0 EOF + +The files such as eventfd, fsnotify, signalfd, epoll among the regular pos/flags +pair provide additional information particular to the objects they represent. + +Eventfd files +~~~~~~~~~~~~~ + +:: + + pos: 0 + flags: 04002 + mnt_id: 9 + eventfd-count: 5a + +where 'eventfd-count' is hex value of a counter. + +Signalfd files +~~~~~~~~~~~~~~ + +:: + + pos: 0 + flags: 04002 + mnt_id: 9 + sigmask: 0000000000000200 + +where 'sigmask' is hex value of the signal mask associated +with a file. + +Epoll files +~~~~~~~~~~~ + +:: + + pos: 0 + flags: 02 + mnt_id: 9 + tfd: 5 events: 1d data: ffffffffffffffff pos:0 ino:61af sdev:7 + +where 'tfd' is a target file descriptor number in decimal form, +'events' is events mask being watched and the 'data' is data +associated with a target [see epoll(7) for more details]. + +The 'pos' is current offset of the target file in decimal form +[see lseek(2)], 'ino' and 'sdev' are inode and device numbers +where target file resides, all in hex format. + +Fsnotify files +~~~~~~~~~~~~~~ +For inotify files the format is the following:: + + pos: 0 + flags: 02000000 + inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d + +where 'wd' is a watch descriptor in decimal form, ie a target file +descriptor number, 'ino' and 'sdev' are inode and device where the +target file resides and the 'mask' is the mask of events, all in hex +form [see inotify(7) for more details]. + +If the kernel was built with exportfs support, the path to the target +file is encoded as a file handle. The file handle is provided by three +fields 'fhandle-bytes', 'fhandle-type' and 'f_handle', all in hex +format. + +If the kernel is built without exportfs support the file handle won't be +printed out. + +If there is no inotify mark attached yet the 'inotify' line will be omitted. + +For fanotify files the format is:: + + pos: 0 + flags: 02 + mnt_id: 9 + fanotify flags:10 event-flags:0 + fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003 + fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4 + +where fanotify 'flags' and 'event-flags' are values used in fanotify_init +call, 'mnt_id' is the mount point identifier, 'mflags' is the value of +flags associated with mark which are tracked separately from events +mask. 'ino', 'sdev' are target inode and device, 'mask' is the events +mask and 'ignored_mask' is the mask of events which are to be ignored. +All in hex format. Incorporation of 'mflags', 'mask' and 'ignored_mask' +does provide information about flags and mask used in fanotify_mark +call [see fsnotify manpage for details]. + +While the first three lines are mandatory and always printed, the rest is +optional and may be omitted if no marks created yet. + +Timerfd files +~~~~~~~~~~~~~ + +:: + + pos: 0 + flags: 02 + mnt_id: 9 + clockid: 0 + ticks: 0 + settime flags: 01 + it_value: (0, 49406829) + it_interval: (1, 0) + +where 'clockid' is the clock type and 'ticks' is the number of the timer expirations +that have occurred [see timerfd_create(2) for details]. 'settime flags' are +flags in octal form been used to setup the timer [see timerfd_settime(2) for +details]. 'it_value' is remaining time until the timer exiration. +'it_interval' is the interval for the timer. Note the timer might be set up +with TIMER_ABSTIME option which will be shown in 'settime flags', but 'it_value' +still exhibits timer's remaining time. + +3.9 /proc//map_files - Information about memory mapped files +--------------------------------------------------------------------- +This directory contains symbolic links which represent memory mapped files +the process is maintaining. Example output:: + + | lr-------- 1 root root 64 Jan 27 11:24 333c600000-333c620000 -> /usr/lib64/ld-2.18.so + | lr-------- 1 root root 64 Jan 27 11:24 333c81f000-333c820000 -> /usr/lib64/ld-2.18.so + | lr-------- 1 root root 64 Jan 27 11:24 333c820000-333c821000 -> /usr/lib64/ld-2.18.so + | ... + | lr-------- 1 root root 64 Jan 27 11:24 35d0421000-35d0422000 -> /usr/lib64/libselinux.so.1 + | lr-------- 1 root root 64 Jan 27 11:24 400000-41a000 -> /usr/bin/ls + +The name of a link represents the virtual memory bounds of a mapping, i.e. +vm_area_struct::vm_start-vm_area_struct::vm_end. + +The main purpose of the map_files is to retrieve a set of memory mapped +files in a fast way instead of parsing /proc//maps or +/proc//smaps, both of which contain many more records. At the same +time one can open(2) mappings from the listings of two processes and +comparing their inode numbers to figure out which anonymous memory areas +are actually shared. + +3.10 /proc//timerslack_ns - Task timerslack value +--------------------------------------------------------- +This file provides the value of the task's timerslack value in nanoseconds. +This value specifies a amount of time that normal timers may be deferred +in order to coalesce timers and avoid unnecessary wakeups. + +This allows a task's interactivity vs power consumption trade off to be +adjusted. + +Writing 0 to the file will set the tasks timerslack to the default value. + +Valid values are from 0 - ULLONG_MAX + +An application setting the value must have PTRACE_MODE_ATTACH_FSCREDS level +permissions on the task specified to change its timerslack_ns value. + +3.11 /proc//patch_state - Livepatch patch operation state +----------------------------------------------------------------- +When CONFIG_LIVEPATCH is enabled, this file displays the value of the +patch state for the task. + +A value of '-1' indicates that no patch is in transition. + +A value of '0' indicates that a patch is in transition and the task is +unpatched. If the patch is being enabled, then the task hasn't been +patched yet. If the patch is being disabled, then the task has already +been unpatched. + +A value of '1' indicates that a patch is in transition and the task is +patched. If the patch is being enabled, then the task has already been +patched. If the patch is being disabled, then the task hasn't been +unpatched yet. + +3.12 /proc//arch_status - task architecture specific status +------------------------------------------------------------------- +When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the +architecture specific status of the task. + +Example +~~~~~~~ + +:: + + $ cat /proc/6753/arch_status + AVX512_elapsed_ms: 8 + +Description +~~~~~~~~~~~ + +x86 specific entries: +~~~~~~~~~~~~~~~~~~~~~ + +AVX512_elapsed_ms: +^^^^^^^^^^^^^^^^^^ + + If AVX512 is supported on the machine, this entry shows the milliseconds + elapsed since the last time AVX512 usage was recorded. The recording + happens on a best effort basis when a task is scheduled out. This means + that the value depends on two factors: + + 1) The time which the task spent on the CPU without being scheduled + out. With CPU isolation and a single runnable task this can take + several seconds. + + 2) The time since the task was scheduled out last. Depending on the + reason for being scheduled out (time slice exhausted, syscall ...) + this can be arbitrary long time. + + As a consequence the value cannot be considered precise and authoritative + information. The application which uses this information has to be aware + of the overall scenario on the system in order to determine whether a + task is a real AVX512 user or not. Precise information can be obtained + with performance counters. + + A special value of '-1' indicates that no AVX512 usage was recorded, thus + the task is unlikely an AVX512 user, but depends on the workload and the + scheduling scenario, it also could be a false negative mentioned above. + +Configuring procfs +------------------ + +4.1 Mount options +--------------------- + +The following mount options are supported: + + ========= ======================================================== + hidepid= Set /proc// access mode. + gid= Set the group authorized to learn processes information. + ========= ======================================================== + +hidepid=0 means classic mode - everybody may access all /proc// directories +(default). + +hidepid=1 means users may not access any /proc// directories but their +own. Sensitive files like cmdline, sched*, status are now protected against +other users. This makes it impossible to learn whether any user runs +specific program (given the program doesn't reveal itself by its behaviour). +As an additional bonus, as /proc//cmdline is unaccessible for other users, +poorly written programs passing sensitive information via program arguments are +now protected against local eavesdroppers. + +hidepid=2 means hidepid=1 plus all /proc// will be fully invisible to other +users. It doesn't mean that it hides a fact whether a process with a specific +pid value exists (it can be learned by other means, e.g. by "kill -0 $PID"), +but it hides process' uid and gid, which may be learned by stat()'ing +/proc// otherwise. It greatly complicates an intruder's task of gathering +information about running processes, whether some daemon runs with elevated +privileges, whether other user runs some sensitive program, whether other users +run any program at all, etc. + +gid= defines a group authorized to learn processes information otherwise +prohibited by hidepid=. If you use some daemon like identd which needs to learn +information about processes information, just add identd to this group. diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt deleted file mode 100644 index 99ca040e3f90..000000000000 --- a/Documentation/filesystems/proc.txt +++ /dev/null @@ -1,2047 +0,0 @@ ------------------------------------------------------------------------------- - T H E /proc F I L E S Y S T E M ------------------------------------------------------------------------------- -/proc/sys Terrehon Bowden October 7 1999 - Bodo Bauer - -2.4.x update Jorge Nerin November 14 2000 -move /proc/sys Shen Feng April 1 2009 ------------------------------------------------------------------------------- -Version 1.3 Kernel version 2.2.12 - Kernel version 2.4.0-test11-pre4 ------------------------------------------------------------------------------- -fixes/update part 1.1 Stefani Seibold June 9 2009 - -Table of Contents ------------------ - - 0 Preface - 0.1 Introduction/Credits - 0.2 Legal Stuff - - 1 Collecting System Information - 1.1 Process-Specific Subdirectories - 1.2 Kernel data - 1.3 IDE devices in /proc/ide - 1.4 Networking info in /proc/net - 1.5 SCSI info - 1.6 Parallel port info in /proc/parport - 1.7 TTY info in /proc/tty - 1.8 Miscellaneous kernel statistics in /proc/stat - 1.9 Ext4 file system parameters - - 2 Modifying System Parameters - - 3 Per-Process Parameters - 3.1 /proc//oom_adj & /proc//oom_score_adj - Adjust the oom-killer - score - 3.2 /proc//oom_score - Display current oom-killer score - 3.3 /proc//io - Display the IO accounting fields - 3.4 /proc//coredump_filter - Core dump filtering settings - 3.5 /proc//mountinfo - Information about mounts - 3.6 /proc//comm & /proc//task//comm - 3.7 /proc//task//children - Information about task children - 3.8 /proc//fdinfo/ - Information about opened file - 3.9 /proc//map_files - Information about memory mapped files - 3.10 /proc//timerslack_ns - Task timerslack value - 3.11 /proc//patch_state - Livepatch patch operation state - 3.12 /proc//arch_status - Task architecture specific information - - 4 Configuring procfs - 4.1 Mount options - ------------------------------------------------------------------------------- -Preface ------------------------------------------------------------------------------- - -0.1 Introduction/Credits ------------------------- - -This documentation is part of a soon (or so we hope) to be released book on -the SuSE Linux distribution. As there is no complete documentation for the -/proc file system and we've used many freely available sources to write these -chapters, it seems only fair to give the work back to the Linux community. -This work is based on the 2.2.* kernel version and the upcoming 2.4.*. I'm -afraid it's still far from complete, but we hope it will be useful. As far as -we know, it is the first 'all-in-one' document about the /proc file system. It -is focused on the Intel x86 hardware, so if you are looking for PPC, ARM, -SPARC, AXP, etc., features, you probably won't find what you are looking for. -It also only covers IPv4 networking, not IPv6 nor other protocols - sorry. But -additions and patches are welcome and will be added to this document if you -mail them to Bodo. - -We'd like to thank Alan Cox, Rik van Riel, and Alexey Kuznetsov and a lot of -other people for help compiling this documentation. We'd also like to extend a -special thank you to Andi Kleen for documentation, which we relied on heavily -to create this document, as well as the additional information he provided. -Thanks to everybody else who contributed source or docs to the Linux kernel -and helped create a great piece of software... :) - -If you have any comments, corrections or additions, please don't hesitate to -contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this -document. - -The latest version of this document is available online at -http://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html - -If the above direction does not works for you, you could try the kernel -mailing list at linux-kernel@vger.kernel.org and/or try to reach me at -comandante@zaralinux.com. - -0.2 Legal Stuff ---------------- - -We don't guarantee the correctness of this document, and if you come to us -complaining about how you screwed up your system because of incorrect -documentation, we won't feel responsible... - ------------------------------------------------------------------------------- -CHAPTER 1: COLLECTING SYSTEM INFORMATION ------------------------------------------------------------------------------- - ------------------------------------------------------------------------------- -In This Chapter ------------------------------------------------------------------------------- -* Investigating the properties of the pseudo file system /proc and its - ability to provide information on the running Linux system -* Examining /proc's structure -* Uncovering various information about the kernel and the processes running - on the system ------------------------------------------------------------------------------- - - -The proc file system acts as an interface to internal data structures in the -kernel. It can be used to obtain information about the system and to change -certain kernel parameters at runtime (sysctl). - -First, we'll take a look at the read-only parts of /proc. In Chapter 2, we -show you how you can use /proc/sys to change settings. - -1.1 Process-Specific Subdirectories ------------------------------------ - -The directory /proc contains (among other things) one subdirectory for each -process running on the system, which is named after the process ID (PID). - -The link self points to the process reading the file system. Each process -subdirectory has the entries listed in Table 1-1. - -Note that an open a file descriptor to /proc/ or to any of its -contained files or subdirectories does not prevent being reused -for some other process in the event that exits. Operations on -open /proc/ file descriptors corresponding to dead processes -never act on any new process that the kernel may, through chance, have -also assigned the process ID . Instead, operations on these FDs -usually fail with ESRCH. - -Table 1-1: Process specific entries in /proc -.............................................................................. - File Content - clear_refs Clears page referenced bits shown in smaps output - cmdline Command line arguments - cpu Current and last cpu in which it was executed (2.4)(smp) - cwd Link to the current working directory - environ Values of environment variables - exe Link to the executable of this process - fd Directory, which contains all file descriptors - maps Memory maps to executables and library files (2.4) - mem Memory held by this process - root Link to the root directory of this process - stat Process status - statm Process memory status information - status Process status in human readable form - wchan Present with CONFIG_KALLSYMS=y: it shows the kernel function - symbol the task is blocked in - or "0" if not blocked. - pagemap Page table - stack Report full stack trace, enable via CONFIG_STACKTRACE - smaps An extension based on maps, showing the memory consumption of - each mapping and flags associated with it - smaps_rollup Accumulated smaps stats for all mappings of the process. This - can be derived from smaps, but is faster and more convenient - numa_maps An extension based on maps, showing the memory locality and - binding policy as well as mem usage (in pages) of each mapping. -.............................................................................. - -For example, to get the status information of a process, all you have to do is -read the file /proc/PID/status: - - >cat /proc/self/status - Name: cat - State: R (running) - Tgid: 5452 - Pid: 5452 - PPid: 743 - TracerPid: 0 (2.4) - Uid: 501 501 501 501 - Gid: 100 100 100 100 - FDSize: 256 - Groups: 100 14 16 - VmPeak: 5004 kB - VmSize: 5004 kB - VmLck: 0 kB - VmHWM: 476 kB - VmRSS: 476 kB - RssAnon: 352 kB - RssFile: 120 kB - RssShmem: 4 kB - VmData: 156 kB - VmStk: 88 kB - VmExe: 68 kB - VmLib: 1412 kB - VmPTE: 20 kb - VmSwap: 0 kB - HugetlbPages: 0 kB - CoreDumping: 0 - THP_enabled: 1 - Threads: 1 - SigQ: 0/28578 - SigPnd: 0000000000000000 - ShdPnd: 0000000000000000 - SigBlk: 0000000000000000 - SigIgn: 0000000000000000 - SigCgt: 0000000000000000 - CapInh: 00000000fffffeff - CapPrm: 0000000000000000 - CapEff: 0000000000000000 - CapBnd: ffffffffffffffff - CapAmb: 0000000000000000 - NoNewPrivs: 0 - Seccomp: 0 - Speculation_Store_Bypass: thread vulnerable - voluntary_ctxt_switches: 0 - nonvoluntary_ctxt_switches: 1 - -This shows you nearly the same information you would get if you viewed it with -the ps command. In fact, ps uses the proc file system to obtain its -information. But you get a more detailed view of the process by reading the -file /proc/PID/status. It fields are described in table 1-2. - -The statm file contains more detailed information about the process -memory usage. Its seven fields are explained in Table 1-3. The stat file -contains details information about the process itself. Its fields are -explained in Table 1-4. - -(for SMP CONFIG users) -For making accounting scalable, RSS related information are handled in an -asynchronous manner and the value may not be very precise. To see a precise -snapshot of a moment, you can see /proc//smaps file and scan page table. -It's slow but very precise. - -Table 1-2: Contents of the status files (as of 4.19) -.............................................................................. - Field Content - Name filename of the executable - Umask file mode creation mask - State state (R is running, S is sleeping, D is sleeping - in an uninterruptible wait, Z is zombie, - T is traced or stopped) - Tgid thread group ID - Ngid NUMA group ID (0 if none) - Pid process id - PPid process id of the parent process - TracerPid PID of process tracing this process (0 if not) - Uid Real, effective, saved set, and file system UIDs - Gid Real, effective, saved set, and file system GIDs - FDSize number of file descriptor slots currently allocated - Groups supplementary group list - NStgid descendant namespace thread group ID hierarchy - NSpid descendant namespace process ID hierarchy - NSpgid descendant namespace process group ID hierarchy - NSsid descendant namespace session ID hierarchy - VmPeak peak virtual memory size - VmSize total program size - VmLck locked memory size - VmPin pinned memory size - VmHWM peak resident set size ("high water mark") - VmRSS size of memory portions. It contains the three - following parts (VmRSS = RssAnon + RssFile + RssShmem) - RssAnon size of resident anonymous memory - RssFile size of resident file mappings - RssShmem size of resident shmem memory (includes SysV shm, - mapping of tmpfs and shared anonymous mappings) - VmData size of private data segments - VmStk size of stack segments - VmExe size of text segment - VmLib size of shared library code - VmPTE size of page table entries - VmSwap amount of swap used by anonymous private data - (shmem swap usage is not included) - HugetlbPages size of hugetlb memory portions - CoreDumping process's memory is currently being dumped - (killing the process may lead to a corrupted core) - THP_enabled process is allowed to use THP (returns 0 when - PR_SET_THP_DISABLE is set on the process - Threads number of threads - SigQ number of signals queued/max. number for queue - SigPnd bitmap of pending signals for the thread - ShdPnd bitmap of shared pending signals for the process - SigBlk bitmap of blocked signals - SigIgn bitmap of ignored signals - SigCgt bitmap of caught signals - CapInh bitmap of inheritable capabilities - CapPrm bitmap of permitted capabilities - CapEff bitmap of effective capabilities - CapBnd bitmap of capabilities bounding set - CapAmb bitmap of ambient capabilities - NoNewPrivs no_new_privs, like prctl(PR_GET_NO_NEW_PRIV, ...) - Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...) - Speculation_Store_Bypass speculative store bypass mitigation status - Cpus_allowed mask of CPUs on which this process may run - Cpus_allowed_list Same as previous, but in "list format" - Mems_allowed mask of memory nodes allowed to this process - Mems_allowed_list Same as previous, but in "list format" - voluntary_ctxt_switches number of voluntary context switches - nonvoluntary_ctxt_switches number of non voluntary context switches -.............................................................................. - -Table 1-3: Contents of the statm files (as of 2.6.8-rc3) -.............................................................................. - Field Content - size total program size (pages) (same as VmSize in status) - resident size of memory portions (pages) (same as VmRSS in status) - shared number of pages that are shared (i.e. backed by a file, same - as RssFile+RssShmem in status) - trs number of pages that are 'code' (not including libs; broken, - includes data segment) - lrs number of pages of library (always 0 on 2.6) - drs number of pages of data/stack (including libs; broken, - includes library text) - dt number of dirty pages (always 0 on 2.6) -.............................................................................. - - -Table 1-4: Contents of the stat files (as of 2.6.30-rc7) -.............................................................................. - Field Content - pid process id - tcomm filename of the executable - state state (R is running, S is sleeping, D is sleeping in an - uninterruptible wait, Z is zombie, T is traced or stopped) - ppid process id of the parent process - pgrp pgrp of the process - sid session id - tty_nr tty the process uses - tty_pgrp pgrp of the tty - flags task flags - min_flt number of minor faults - cmin_flt number of minor faults with child's - maj_flt number of major faults - cmaj_flt number of major faults with child's - utime user mode jiffies - stime kernel mode jiffies - cutime user mode jiffies with child's - cstime kernel mode jiffies with child's - priority priority level - nice nice level - num_threads number of threads - it_real_value (obsolete, always 0) - start_time time the process started after system boot - vsize virtual memory size - rss resident set memory size - rsslim current limit in bytes on the rss - start_code address above which program text can run - end_code address below which program text can run - start_stack address of the start of the main process stack - esp current value of ESP - eip current value of EIP - pending bitmap of pending signals - blocked bitmap of blocked signals - sigign bitmap of ignored signals - sigcatch bitmap of caught signals - 0 (place holder, used to be the wchan address, use /proc/PID/wchan instead) - 0 (place holder) - 0 (place holder) - exit_signal signal to send to parent thread on exit - task_cpu which CPU the task is scheduled on - rt_priority realtime priority - policy scheduling policy (man sched_setscheduler) - blkio_ticks time spent waiting for block IO - gtime guest time of the task in jiffies - cgtime guest time of the task children in jiffies - start_data address above which program data+bss is placed - end_data address below which program data+bss is placed - start_brk address above which program heap can be expanded with brk() - arg_start address above which program command line is placed - arg_end address below which program command line is placed - env_start address above which program environment is placed - env_end address below which program environment is placed - exit_code the thread's exit_code in the form reported by the waitpid system call -.............................................................................. - -The /proc/PID/maps file contains the currently mapped memory regions and -their access permissions. - -The format is: - -address perms offset dev inode pathname - -08048000-08049000 r-xp 00000000 03:00 8312 /opt/test -08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test -0804a000-0806b000 rw-p 00000000 00:00 0 [heap] -a7cb1000-a7cb2000 ---p 00000000 00:00 0 -a7cb2000-a7eb2000 rw-p 00000000 00:00 0 -a7eb2000-a7eb3000 ---p 00000000 00:00 0 -a7eb3000-a7ed5000 rw-p 00000000 00:00 0 -a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 -a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6 -a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6 -a800b000-a800e000 rw-p 00000000 00:00 0 -a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 -a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 -a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 -a8024000-a8027000 rw-p 00000000 00:00 0 -a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 -a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 -a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 -aff35000-aff4a000 rw-p 00000000 00:00 0 [stack] -ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] - -where "address" is the address space in the process that it occupies, "perms" -is a set of permissions: - - r = read - w = write - x = execute - s = shared - p = private (copy on write) - -"offset" is the offset into the mapping, "dev" is the device (major:minor), and -"inode" is the inode on that device. 0 indicates that no inode is associated -with the memory region, as the case would be with BSS (uninitialized data). -The "pathname" shows the name associated file for this mapping. If the mapping -is not associated with a file: - - [heap] = the heap of the program - [stack] = the stack of the main process - [vdso] = the "virtual dynamic shared object", - the kernel system call handler - - or if empty, the mapping is anonymous. - -The /proc/PID/smaps is an extension based on maps, showing the memory -consumption for each of the process's mappings. For each mapping (aka Virtual -Memory Area, or VMA) there is a series of lines such as the following: - -08048000-080bc000 r-xp 00000000 03:02 13130 /bin/bash - -Size: 1084 kB -KernelPageSize: 4 kB -MMUPageSize: 4 kB -Rss: 892 kB -Pss: 374 kB -Shared_Clean: 892 kB -Shared_Dirty: 0 kB -Private_Clean: 0 kB -Private_Dirty: 0 kB -Referenced: 892 kB -Anonymous: 0 kB -LazyFree: 0 kB -AnonHugePages: 0 kB -ShmemPmdMapped: 0 kB -Shared_Hugetlb: 0 kB -Private_Hugetlb: 0 kB -Swap: 0 kB -SwapPss: 0 kB -KernelPageSize: 4 kB -MMUPageSize: 4 kB -Locked: 0 kB -THPeligible: 0 -VmFlags: rd ex mr mw me dw - -The first of these lines shows the same information as is displayed for the -mapping in /proc/PID/maps. Following lines show the size of the mapping -(size); the size of each page allocated when backing a VMA (KernelPageSize), -which is usually the same as the size in the page table entries; the page size -used by the MMU when backing a VMA (in most cases, the same as KernelPageSize); -the amount of the mapping that is currently resident in RAM (RSS); the -process' proportional share of this mapping (PSS); and the number of clean and -dirty shared and private pages in the mapping. - -The "proportional set size" (PSS) of a process is the count of pages it has -in memory, where each page is divided by the number of processes sharing it. -So if a process has 1000 pages all to itself, and 1000 shared with one other -process, its PSS will be 1500. -Note that even a page which is part of a MAP_SHARED mapping, but has only -a single pte mapped, i.e. is currently used by only one process, is accounted -as private and not as shared. -"Referenced" indicates the amount of memory currently marked as referenced or -accessed. -"Anonymous" shows the amount of memory that does not belong to any file. Even -a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE -and a page is modified, the file page is replaced by a private anonymous copy. -"LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE). -The memory isn't freed immediately with madvise(). It's freed in memory -pressure if the memory is clean. Please note that the printed value might -be lower than the real value due to optimizations used in the current -implementation. If this is not desirable please file a bug report. -"AnonHugePages" shows the ammount of memory backed by transparent hugepage. -"ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by -huge pages. -"Shared_Hugetlb" and "Private_Hugetlb" show the ammounts of memory backed by -hugetlbfs page which is *not* counted in "RSS" or "PSS" field for historical -reasons. And these are not included in {Shared,Private}_{Clean,Dirty} field. -"Swap" shows how much would-be-anonymous memory is also used, but out on swap. -For shmem mappings, "Swap" includes also the size of the mapped (and not -replaced by copy-on-write) part of the underlying shmem object out on swap. -"SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this -does not take into account swapped out page of underlying shmem objects. -"Locked" indicates whether the mapping is locked in memory or not. -"THPeligible" indicates whether the mapping is eligible for allocating THP -pages - 1 if true, 0 otherwise. It just shows the current status. - -"VmFlags" field deserves a separate description. This member represents the kernel -flags associated with the particular virtual memory area in two letter encoded -manner. The codes are the following: - rd - readable - wr - writeable - ex - executable - sh - shared - mr - may read - mw - may write - me - may execute - ms - may share - gd - stack segment growns down - pf - pure PFN range - dw - disabled write to the mapped file - lo - pages are locked in memory - io - memory mapped I/O area - sr - sequential read advise provided - rr - random read advise provided - dc - do not copy area on fork - de - do not expand area on remapping - ac - area is accountable - nr - swap space is not reserved for the area - ht - area uses huge tlb pages - ar - architecture specific flag - dd - do not include area into core dump - sd - soft-dirty flag - mm - mixed map area - hg - huge page advise flag - nh - no-huge page advise flag - mg - mergable advise flag - -Note that there is no guarantee that every flag and associated mnemonic will -be present in all further kernel releases. Things get changed, the flags may -be vanished or the reverse -- new added. Interpretation of their meaning -might change in future as well. So each consumer of these flags has to -follow each specific kernel version for the exact semantic. - -This file is only present if the CONFIG_MMU kernel configuration option is -enabled. - -Note: reading /proc/PID/maps or /proc/PID/smaps is inherently racy (consistent -output can be achieved only in the single read call). -This typically manifests when doing partial reads of these files while the -memory map is being modified. Despite the races, we do provide the following -guarantees: - -1) The mapped addresses never go backwards, which implies no two - regions will ever overlap. -2) If there is something at a given vaddr during the entirety of the - life of the smaps/maps walk, there will be some output for it. - -The /proc/PID/smaps_rollup file includes the same fields as /proc/PID/smaps, -but their values are the sums of the corresponding values for all mappings of -the process. Additionally, it contains these fields: - -Pss_Anon -Pss_File -Pss_Shmem - -They represent the proportional shares of anonymous, file, and shmem pages, as -described for smaps above. These fields are omitted in smaps since each -mapping identifies the type (anon, file, or shmem) of all pages it contains. -Thus all information in smaps_rollup can be derived from smaps, but at a -significantly higher cost. - -The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG -bits on both physical and virtual pages associated with a process, and the -soft-dirty bit on pte (see Documentation/admin-guide/mm/soft-dirty.rst -for details). -To clear the bits for all the pages associated with the process - > echo 1 > /proc/PID/clear_refs - -To clear the bits for the anonymous pages associated with the process - > echo 2 > /proc/PID/clear_refs - -To clear the bits for the file mapped pages associated with the process - > echo 3 > /proc/PID/clear_refs - -To clear the soft-dirty bit - > echo 4 > /proc/PID/clear_refs - -To reset the peak resident set size ("high water mark") to the process's -current value: - > echo 5 > /proc/PID/clear_refs - -Any other value written to /proc/PID/clear_refs will have no effect. - -The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags -using /proc/kpageflags and number of times a page is mapped using -/proc/kpagecount. For detailed explanation, see -Documentation/admin-guide/mm/pagemap.rst. - -The /proc/pid/numa_maps is an extension based on maps, showing the memory -locality and binding policy, as well as the memory usage (in pages) of -each mapping. The output follows a general format where mapping details get -summarized separated by blank spaces, one mapping per each file line: - -address policy mapping details - -00400000 default file=/usr/local/bin/app mapped=1 active=0 N3=1 kernelpagesize_kB=4 -00600000 default file=/usr/local/bin/app anon=1 dirty=1 N3=1 kernelpagesize_kB=4 -3206000000 default file=/lib64/ld-2.12.so mapped=26 mapmax=6 N0=24 N3=2 kernelpagesize_kB=4 -320621f000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 -3206220000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 -3206221000 default anon=1 dirty=1 N3=1 kernelpagesize_kB=4 -3206800000 default file=/lib64/libc-2.12.so mapped=59 mapmax=21 active=55 N0=41 N3=18 kernelpagesize_kB=4 -320698b000 default file=/lib64/libc-2.12.so -3206b8a000 default file=/lib64/libc-2.12.so anon=2 dirty=2 N3=2 kernelpagesize_kB=4 -3206b8e000 default file=/lib64/libc-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 -3206b8f000 default anon=3 dirty=3 active=1 N3=3 kernelpagesize_kB=4 -7f4dc10a2000 default anon=3 dirty=3 N3=3 kernelpagesize_kB=4 -7f4dc10b4000 default anon=2 dirty=2 active=1 N3=2 kernelpagesize_kB=4 -7f4dc1200000 default file=/anon_hugepage\040(deleted) huge anon=1 dirty=1 N3=1 kernelpagesize_kB=2048 -7fff335f0000 default stack anon=3 dirty=3 N3=3 kernelpagesize_kB=4 -7fff3369d000 default mapped=1 mapmax=35 active=0 N3=1 kernelpagesize_kB=4 - -Where: -"address" is the starting address for the mapping; -"policy" reports the NUMA memory policy set for the mapping (see Documentation/admin-guide/mm/numa_memory_policy.rst); -"mapping details" summarizes mapping data such as mapping type, page usage counters, -node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page -size, in KB, that is backing the mapping up. - -1.2 Kernel data ---------------- - -Similar to the process entries, the kernel data files give information about -the running kernel. The files used to obtain this information are contained in -/proc and are listed in Table 1-5. Not all of these will be present in your -system. It depends on the kernel configuration and the loaded modules, which -files are there, and which are missing. - -Table 1-5: Kernel info in /proc -.............................................................................. - File Content - apm Advanced power management info - buddyinfo Kernel memory allocator information (see text) (2.5) - bus Directory containing bus specific information - cmdline Kernel command line - cpuinfo Info about the CPU - devices Available devices (block and character) - dma Used DMS channels - filesystems Supported filesystems - driver Various drivers grouped here, currently rtc (2.4) - execdomains Execdomains, related to security (2.4) - fb Frame Buffer devices (2.4) - fs File system parameters, currently nfs/exports (2.4) - ide Directory containing info about the IDE subsystem - interrupts Interrupt usage - iomem Memory map (2.4) - ioports I/O port usage - irq Masks for irq to cpu affinity (2.4)(smp?) - isapnp ISA PnP (Plug&Play) Info (2.4) - kcore Kernel core image (can be ELF or A.OUT(deprecated in 2.4)) - kmsg Kernel messages - ksyms Kernel symbol table - loadavg Load average of last 1, 5 & 15 minutes - locks Kernel locks - meminfo Memory info - misc Miscellaneous - modules List of loaded modules - mounts Mounted filesystems - net Networking info (see text) - pagetypeinfo Additional page allocator information (see text) (2.5) - partitions Table of partitions known to the system - pci Deprecated info of PCI bus (new way -> /proc/bus/pci/, - decoupled by lspci (2.4) - rtc Real time clock - scsi SCSI info (see text) - slabinfo Slab pool info - softirqs softirq usage - stat Overall statistics - swaps Swap space utilization - sys See chapter 2 - sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4) - tty Info of tty drivers - uptime Wall clock since boot, combined idle time of all cpus - version Kernel version - video bttv info of video resources (2.4) - vmallocinfo Show vmalloced areas -.............................................................................. - -You can, for example, check which interrupts are currently in use and what -they are used for by looking in the file /proc/interrupts: - - > cat /proc/interrupts - CPU0 - 0: 8728810 XT-PIC timer - 1: 895 XT-PIC keyboard - 2: 0 XT-PIC cascade - 3: 531695 XT-PIC aha152x - 4: 2014133 XT-PIC serial - 5: 44401 XT-PIC pcnet_cs - 8: 2 XT-PIC rtc - 11: 8 XT-PIC i82365 - 12: 182918 XT-PIC PS/2 Mouse - 13: 1 XT-PIC fpu - 14: 1232265 XT-PIC ide0 - 15: 7 XT-PIC ide1 - NMI: 0 - -In 2.4.* a couple of lines where added to this file LOC & ERR (this time is the -output of a SMP machine): - - > cat /proc/interrupts - - CPU0 CPU1 - 0: 1243498 1214548 IO-APIC-edge timer - 1: 8949 8958 IO-APIC-edge keyboard - 2: 0 0 XT-PIC cascade - 5: 11286 10161 IO-APIC-edge soundblaster - 8: 1 0 IO-APIC-edge rtc - 9: 27422 27407 IO-APIC-edge 3c503 - 12: 113645 113873 IO-APIC-edge PS/2 Mouse - 13: 0 0 XT-PIC fpu - 14: 22491 24012 IO-APIC-edge ide0 - 15: 2183 2415 IO-APIC-edge ide1 - 17: 30564 30414 IO-APIC-level eth0 - 18: 177 164 IO-APIC-level bttv - NMI: 2457961 2457959 - LOC: 2457882 2457881 - ERR: 2155 - -NMI is incremented in this case because every timer interrupt generates a NMI -(Non Maskable Interrupt) which is used by the NMI Watchdog to detect lockups. - -LOC is the local interrupt counter of the internal APIC of every CPU. - -ERR is incremented in the case of errors in the IO-APIC bus (the bus that -connects the CPUs in a SMP system. This means that an error has been detected, -the IO-APIC automatically retry the transmission, so it should not be a big -problem, but you should read the SMP-FAQ. - -In 2.6.2* /proc/interrupts was expanded again. This time the goal was for -/proc/interrupts to display every IRQ vector in use by the system, not -just those considered 'most important'. The new vectors are: - - THR -- interrupt raised when a machine check threshold counter - (typically counting ECC corrected errors of memory or cache) exceeds - a configurable threshold. Only available on some systems. - - TRM -- a thermal event interrupt occurs when a temperature threshold - has been exceeded for the CPU. This interrupt may also be generated - when the temperature drops back to normal. - - SPU -- a spurious interrupt is some interrupt that was raised then lowered - by some IO device before it could be fully processed by the APIC. Hence - the APIC sees the interrupt but does not know what device it came from. - For this case the APIC will generate the interrupt with a IRQ vector - of 0xff. This might also be generated by chipset bugs. - - RES, CAL, TLB -- rescheduling, call and TLB flush interrupts are - sent from one CPU to another per the needs of the OS. Typically, - their statistics are used by kernel developers and interested users to - determine the occurrence of interrupts of the given type. - -The above IRQ vectors are displayed only when relevant. For example, -the threshold vector does not exist on x86_64 platforms. Others are -suppressed when the system is a uniprocessor. As of this writing, only -i386 and x86_64 platforms support the new IRQ vector displays. - -Of some interest is the introduction of the /proc/irq directory to 2.4. -It could be used to set IRQ to CPU affinity, this means that you can "hook" an -IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the -irq subdir is one subdir for each IRQ, and two files; default_smp_affinity and -prof_cpu_mask. - -For example - > ls /proc/irq/ - 0 10 12 14 16 18 2 4 6 8 prof_cpu_mask - 1 11 13 15 17 19 3 5 7 9 default_smp_affinity - > ls /proc/irq/0/ - smp_affinity - -smp_affinity is a bitmask, in which you can specify which CPUs can handle the -IRQ, you can set it by doing: - - > echo 1 > /proc/irq/10/smp_affinity - -This means that only the first CPU will handle the IRQ, but you can also echo -5 which means that only the first and third CPU can handle the IRQ. - -The contents of each smp_affinity file is the same by default: - - > cat /proc/irq/0/smp_affinity - ffffffff - -There is an alternate interface, smp_affinity_list which allows specifying -a cpu range instead of a bitmask: - - > cat /proc/irq/0/smp_affinity_list - 1024-1031 - -The default_smp_affinity mask applies to all non-active IRQs, which are the -IRQs which have not yet been allocated/activated, and hence which lack a -/proc/irq/[0-9]* directory. - -The node file on an SMP system shows the node to which the device using the IRQ -reports itself as being attached. This hardware locality information does not -include information about any possible driver locality preference. - -prof_cpu_mask specifies which CPUs are to be profiled by the system wide -profiler. Default value is ffffffff (all cpus if there are only 32 of them). - -The way IRQs are routed is handled by the IO-APIC, and it's Round Robin -between all the CPUs which are allowed to handle it. As usual the kernel has -more info than you and does a better job than you, so the defaults are the -best choice for almost everyone. [Note this applies only to those IO-APIC's -that support "Round Robin" interrupt distribution.] - -There are three more important subdirectories in /proc: net, scsi, and sys. -The general rule is that the contents, or even the existence of these -directories, depend on your kernel configuration. If SCSI is not enabled, the -directory scsi may not exist. The same is true with the net, which is there -only when networking support is present in the running kernel. - -The slabinfo file gives information about memory usage at the slab level. -Linux uses slab pools for memory management above page level in version 2.2. -Commonly used objects have their own slab pool (such as network buffers, -directory cache, and so on). - -.............................................................................. - -> cat /proc/buddyinfo - -Node 0, zone DMA 0 4 5 4 4 3 ... -Node 0, zone Normal 1 0 0 1 101 8 ... -Node 0, zone HighMem 2 0 0 1 1 0 ... - -External fragmentation is a problem under some workloads, and buddyinfo is a -useful tool for helping diagnose these problems. Buddyinfo will give you a -clue as to how big an area you can safely allocate, or why a previous -allocation failed. - -Each column represents the number of pages of a certain order which are -available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in -ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE -available in ZONE_NORMAL, etc... - -More information relevant to external fragmentation can be found in -pagetypeinfo. - -> cat /proc/pagetypeinfo -Page block order: 9 -Pages per block: 512 - -Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 -Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0 -Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 -Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2 -Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0 -Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0 -Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9 -Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0 -Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452 -Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0 -Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0 - -Number of blocks type Unmovable Reclaimable Movable Reserve Isolate -Node 0, zone DMA 2 0 5 1 0 -Node 0, zone DMA32 41 6 967 2 0 - -Fragmentation avoidance in the kernel works by grouping pages of different -migrate types into the same contiguous regions of memory called page blocks. -A page block is typically the size of the default hugepage size e.g. 2MB on -X86-64. By keeping pages grouped based on their ability to move, the kernel -can reclaim pages within a page block to satisfy a high-order allocation. - -The pagetypinfo begins with information on the size of a page block. It -then gives the same type of information as buddyinfo except broken down -by migrate-type and finishes with details on how many page blocks of each -type exist. - -If min_free_kbytes has been tuned correctly (recommendations made by hugeadm -from libhugetlbfs https://github.com/libhugetlbfs/libhugetlbfs/), one can -make an estimate of the likely number of huge pages that can be allocated -at a given point in time. All the "Movable" blocks should be allocatable -unless memory has been mlock()'d. Some of the Reclaimable blocks should -also be allocatable although a lot of filesystem metadata may have to be -reclaimed to achieve this. - -.............................................................................. - -meminfo: - -Provides information about distribution and utilization of memory. This -varies by architecture and compile options. The following is from a -16GB PIII, which has highmem enabled. You may not have all of these fields. - -> cat /proc/meminfo - -MemTotal: 16344972 kB -MemFree: 13634064 kB -MemAvailable: 14836172 kB -Buffers: 3656 kB -Cached: 1195708 kB -SwapCached: 0 kB -Active: 891636 kB -Inactive: 1077224 kB -HighTotal: 15597528 kB -HighFree: 13629632 kB -LowTotal: 747444 kB -LowFree: 4432 kB -SwapTotal: 0 kB -SwapFree: 0 kB -Dirty: 968 kB -Writeback: 0 kB -AnonPages: 861800 kB -Mapped: 280372 kB -Shmem: 644 kB -KReclaimable: 168048 kB -Slab: 284364 kB -SReclaimable: 159856 kB -SUnreclaim: 124508 kB -PageTables: 24448 kB -NFS_Unstable: 0 kB -Bounce: 0 kB -WritebackTmp: 0 kB -CommitLimit: 7669796 kB -Committed_AS: 100056 kB -VmallocTotal: 112216 kB -VmallocUsed: 428 kB -VmallocChunk: 111088 kB -Percpu: 62080 kB -HardwareCorrupted: 0 kB -AnonHugePages: 49152 kB -ShmemHugePages: 0 kB -ShmemPmdMapped: 0 kB - - - MemTotal: Total usable ram (i.e. physical ram minus a few reserved - bits and the kernel binary code) - MemFree: The sum of LowFree+HighFree -MemAvailable: An estimate of how much memory is available for starting new - applications, without swapping. Calculated from MemFree, - SReclaimable, the size of the file LRU lists, and the low - watermarks in each zone. - The estimate takes into account that the system needs some - page cache to function well, and that not all reclaimable - slab will be reclaimable, due to items being in use. The - impact of those factors will vary from system to system. - Buffers: Relatively temporary storage for raw disk blocks - shouldn't get tremendously large (20MB or so) - Cached: in-memory cache for files read from the disk (the - pagecache). Doesn't include SwapCached - SwapCached: Memory that once was swapped out, is swapped back in but - still also is in the swapfile (if memory is needed it - doesn't need to be swapped out AGAIN because it is already - in the swapfile. This saves I/O) - Active: Memory that has been used more recently and usually not - reclaimed unless absolutely necessary. - Inactive: Memory which has been less recently used. It is more - eligible to be reclaimed for other purposes - HighTotal: - HighFree: Highmem is all memory above ~860MB of physical memory - Highmem areas are for use by userspace programs, or - for the pagecache. The kernel must use tricks to access - this memory, making it slower to access than lowmem. - LowTotal: - LowFree: Lowmem is memory which can be used for everything that - highmem can be used for, but it is also available for the - kernel's use for its own data structures. Among many - other things, it is where everything from the Slab is - allocated. Bad things happen when you're out of lowmem. - SwapTotal: total amount of swap space available - SwapFree: Memory which has been evicted from RAM, and is temporarily - on the disk - Dirty: Memory which is waiting to get written back to the disk - Writeback: Memory which is actively being written back to the disk - AnonPages: Non-file backed pages mapped into userspace page tables -HardwareCorrupted: The amount of RAM/memory in KB, the kernel identifies as - corrupted. -AnonHugePages: Non-file backed huge pages mapped into userspace page tables - Mapped: files which have been mmaped, such as libraries - Shmem: Total memory used by shared memory (shmem) and tmpfs -ShmemHugePages: Memory used by shared memory (shmem) and tmpfs allocated - with huge pages -ShmemPmdMapped: Shared memory mapped into userspace with huge pages -KReclaimable: Kernel allocations that the kernel will attempt to reclaim - under memory pressure. Includes SReclaimable (below), and other - direct allocations with a shrinker. - Slab: in-kernel data structures cache -SReclaimable: Part of Slab, that might be reclaimed, such as caches - SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure - PageTables: amount of memory dedicated to the lowest level of page - tables. -NFS_Unstable: NFS pages sent to the server, but not yet committed to stable - storage - Bounce: Memory used for block device "bounce buffers" -WritebackTmp: Memory used by FUSE for temporary writeback buffers - CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'), - this is the total amount of memory currently available to - be allocated on the system. This limit is only adhered to - if strict overcommit accounting is enabled (mode 2 in - 'vm.overcommit_memory'). - The CommitLimit is calculated with the following formula: - CommitLimit = ([total RAM pages] - [total huge TLB pages]) * - overcommit_ratio / 100 + [total swap pages] - For example, on a system with 1G of physical RAM and 7G - of swap with a `vm.overcommit_ratio` of 30 it would - yield a CommitLimit of 7.3G. - For more details, see the memory overcommit documentation - in vm/overcommit-accounting. -Committed_AS: The amount of memory presently allocated on the system. - The committed memory is a sum of all of the memory which - has been allocated by processes, even if it has not been - "used" by them as of yet. A process which malloc()'s 1G - of memory, but only touches 300M of it will show up as - using 1G. This 1G is memory which has been "committed" to - by the VM and can be used at any time by the allocating - application. With strict overcommit enabled on the system - (mode 2 in 'vm.overcommit_memory'),allocations which would - exceed the CommitLimit (detailed above) will not be permitted. - This is useful if one needs to guarantee that processes will - not fail due to lack of memory once that memory has been - successfully allocated. -VmallocTotal: total size of vmalloc memory area - VmallocUsed: amount of vmalloc area which is used -VmallocChunk: largest contiguous block of vmalloc area which is free - Percpu: Memory allocated to the percpu allocator used to back percpu - allocations. This stat excludes the cost of metadata. - -.............................................................................. - -vmallocinfo: - -Provides information about vmalloced/vmaped areas. One line per area, -containing the virtual address range of the area, size in bytes, -caller information of the creator, and optional information depending -on the kind of area : - - pages=nr number of pages - phys=addr if a physical address was specified - ioremap I/O mapping (ioremap() and friends) - vmalloc vmalloc() area - vmap vmap()ed pages - user VM_USERMAP area - vpages buffer for pages pointers was vmalloced (huge area) - N=nr (Only on NUMA kernels) - Number of pages allocated on memory node - -> cat /proc/vmallocinfo -0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204 ... - /0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128 -0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204 ... - /0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64 -0xffffc20000302000-0xffffc20000304000 8192 acpi_tb_verify_table+0x21/0x4f... - phys=7fee8000 ioremap -0xffffc20000304000-0xffffc20000307000 12288 acpi_tb_verify_table+0x21/0x4f... - phys=7fee7000 ioremap -0xffffc2000031d000-0xffffc2000031f000 8192 init_vdso_vars+0x112/0x210 -0xffffc2000031f000-0xffffc2000032b000 49152 cramfs_uncompress_init+0x2e ... - /0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3 -0xffffc2000033a000-0xffffc2000033d000 12288 sys_swapon+0x640/0xac0 ... - pages=2 vmalloc N1=2 -0xffffc20000347000-0xffffc2000034c000 20480 xt_alloc_table_info+0xfe ... - /0x130 [x_tables] pages=4 vmalloc N0=4 -0xffffffffa0000000-0xffffffffa000f000 61440 sys_init_module+0xc27/0x1d00 ... - pages=14 vmalloc N2=14 -0xffffffffa000f000-0xffffffffa0014000 20480 sys_init_module+0xc27/0x1d00 ... - pages=4 vmalloc N1=4 -0xffffffffa0014000-0xffffffffa0017000 12288 sys_init_module+0xc27/0x1d00 ... - pages=2 vmalloc N1=2 -0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 ... - pages=10 vmalloc N0=10 - -.............................................................................. - -softirqs: - -Provides counts of softirq handlers serviced since boot time, for each cpu. - -> cat /proc/softirqs - CPU0 CPU1 CPU2 CPU3 - HI: 0 0 0 0 - TIMER: 27166 27120 27097 27034 - NET_TX: 0 0 0 17 - NET_RX: 42 0 0 39 - BLOCK: 0 0 107 1121 - TASKLET: 0 0 0 290 - SCHED: 27035 26983 26971 26746 - HRTIMER: 0 0 0 0 - RCU: 1678 1769 2178 2250 - - -1.3 IDE devices in /proc/ide ----------------------------- - -The subdirectory /proc/ide contains information about all IDE devices of which -the kernel is aware. There is one subdirectory for each IDE controller, the -file drivers and a link for each IDE device, pointing to the device directory -in the controller specific subtree. - -The file drivers contains general information about the drivers used for the -IDE devices: - - > cat /proc/ide/drivers - ide-cdrom version 4.53 - ide-disk version 1.08 - -More detailed information can be found in the controller specific -subdirectories. These are named ide0, ide1 and so on. Each of these -directories contains the files shown in table 1-6. - - -Table 1-6: IDE controller info in /proc/ide/ide? -.............................................................................. - File Content - channel IDE channel (0 or 1) - config Configuration (only for PCI/IDE bridge) - mate Mate name - model Type/Chipset of IDE controller -.............................................................................. - -Each device connected to a controller has a separate subdirectory in the -controllers directory. The files listed in table 1-7 are contained in these -directories. - - -Table 1-7: IDE device information -.............................................................................. - File Content - cache The cache - capacity Capacity of the medium (in 512Byte blocks) - driver driver and version - geometry physical and logical geometry - identify device identify block - media media type - model device identifier - settings device setup - smart_thresholds IDE disk management thresholds - smart_values IDE disk management values -.............................................................................. - -The most interesting file is settings. This file contains a nice overview of -the drive parameters: - - # cat /proc/ide/ide0/hda/settings - name value min max mode - ---- ----- --- --- ---- - bios_cyl 526 0 65535 rw - bios_head 255 0 255 rw - bios_sect 63 0 63 rw - breada_readahead 4 0 127 rw - bswap 0 0 1 r - file_readahead 72 0 2097151 rw - io_32bit 0 0 3 rw - keepsettings 0 0 1 rw - max_kb_per_request 122 1 127 rw - multcount 0 0 8 rw - nice1 1 0 1 rw - nowerr 0 0 1 rw - pio_mode write-only 0 255 w - slow 0 0 1 rw - unmaskirq 0 0 1 rw - using_dma 0 0 1 rw - - -1.4 Networking info in /proc/net --------------------------------- - -The subdirectory /proc/net follows the usual pattern. Table 1-8 shows the -additional values you get for IP version 6 if you configure the kernel to -support this. Table 1-9 lists the files and their meaning. - - -Table 1-8: IPv6 info in /proc/net -.............................................................................. - File Content - udp6 UDP sockets (IPv6) - tcp6 TCP sockets (IPv6) - raw6 Raw device statistics (IPv6) - igmp6 IP multicast addresses, which this host joined (IPv6) - if_inet6 List of IPv6 interface addresses - ipv6_route Kernel routing table for IPv6 - rt6_stats Global IPv6 routing tables statistics - sockstat6 Socket statistics (IPv6) - snmp6 Snmp data (IPv6) -.............................................................................. - - -Table 1-9: Network info in /proc/net -.............................................................................. - File Content - arp Kernel ARP table - dev network devices with statistics - dev_mcast the Layer2 multicast groups a device is listening too - (interface index, label, number of references, number of bound - addresses). - dev_stat network device status - ip_fwchains Firewall chain linkage - ip_fwnames Firewall chain names - ip_masq Directory containing the masquerading tables - ip_masquerade Major masquerading table - netstat Network statistics - raw raw device statistics - route Kernel routing table - rpc Directory containing rpc info - rt_cache Routing cache - snmp SNMP data - sockstat Socket statistics - tcp TCP sockets - udp UDP sockets - unix UNIX domain sockets - wireless Wireless interface data (Wavelan etc) - igmp IP multicast addresses, which this host joined - psched Global packet scheduler parameters. - netlink List of PF_NETLINK sockets - ip_mr_vifs List of multicast virtual interfaces - ip_mr_cache List of multicast routing cache -.............................................................................. - -You can use this information to see which network devices are available in -your system and how much traffic was routed over those devices: - - > cat /proc/net/dev - Inter-|Receive |[... - face |bytes packets errs drop fifo frame compressed multicast|[... - lo: 908188 5596 0 0 0 0 0 0 [... - ppp0:15475140 20721 410 0 0 410 0 0 [... - eth0: 614530 7085 0 0 0 0 0 1 [... - - ...] Transmit - ...] bytes packets errs drop fifo colls carrier compressed - ...] 908188 5596 0 0 0 0 0 0 - ...] 1375103 17405 0 0 0 0 0 0 - ...] 1703981 5535 0 0 0 3 0 0 - -In addition, each Channel Bond interface has its own directory. For -example, the bond0 device will have a directory called /proc/net/bond0/. -It will contain information that is specific to that bond, such as the -current slaves of the bond, the link status of the slaves, and how -many times the slaves link has failed. - -1.5 SCSI info -------------- - -If you have a SCSI host adapter in your system, you'll find a subdirectory -named after the driver for this adapter in /proc/scsi. You'll also see a list -of all recognized SCSI devices in /proc/scsi: - - >cat /proc/scsi/scsi - Attached devices: - Host: scsi0 Channel: 00 Id: 00 Lun: 00 - Vendor: IBM Model: DGHS09U Rev: 03E0 - Type: Direct-Access ANSI SCSI revision: 03 - Host: scsi0 Channel: 00 Id: 06 Lun: 00 - Vendor: PIONEER Model: CD-ROM DR-U06S Rev: 1.04 - Type: CD-ROM ANSI SCSI revision: 02 - - -The directory named after the driver has one file for each adapter found in -the system. These files contain information about the controller, including -the used IRQ and the IO address range. The amount of information shown is -dependent on the adapter you use. The example shows the output for an Adaptec -AHA-2940 SCSI adapter: - - > cat /proc/scsi/aic7xxx/0 - - Adaptec AIC7xxx driver version: 5.1.19/3.2.4 - Compile Options: - TCQ Enabled By Default : Disabled - AIC7XXX_PROC_STATS : Disabled - AIC7XXX_RESET_DELAY : 5 - Adapter Configuration: - SCSI Adapter: Adaptec AHA-294X Ultra SCSI host adapter - Ultra Wide Controller - PCI MMAPed I/O Base: 0xeb001000 - Adapter SEEPROM Config: SEEPROM found and used. - Adaptec SCSI BIOS: Enabled - IRQ: 10 - SCBs: Active 0, Max Active 2, - Allocated 15, HW 16, Page 255 - Interrupts: 160328 - BIOS Control Word: 0x18b6 - Adapter Control Word: 0x005b - Extended Translation: Enabled - Disconnect Enable Flags: 0xffff - Ultra Enable Flags: 0x0001 - Tag Queue Enable Flags: 0x0000 - Ordered Queue Tag Flags: 0x0000 - Default Tag Queue Depth: 8 - Tagged Queue By Device array for aic7xxx host instance 0: - {255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255} - Actual queue depth per device for aic7xxx host instance 0: - {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} - Statistics: - (scsi0:0:0:0) - Device using Wide/Sync transfers at 40.0 MByte/sec, offset 8 - Transinfo settings: current(12/8/1/0), goal(12/8/1/0), user(12/15/1/0) - Total transfers 160151 (74577 reads and 85574 writes) - (scsi0:0:6:0) - Device using Narrow/Sync transfers at 5.0 MByte/sec, offset 15 - Transinfo settings: current(50/15/0/0), goal(50/15/0/0), user(50/15/0/0) - Total transfers 0 (0 reads and 0 writes) - - -1.6 Parallel port info in /proc/parport ---------------------------------------- - -The directory /proc/parport contains information about the parallel ports of -your system. It has one subdirectory for each port, named after the port -number (0,1,2,...). - -These directories contain the four files shown in Table 1-10. - - -Table 1-10: Files in /proc/parport -.............................................................................. - File Content - autoprobe Any IEEE-1284 device ID information that has been acquired. - devices list of the device drivers using that port. A + will appear by the - name of the device currently using the port (it might not appear - against any). - hardware Parallel port's base address, IRQ line and DMA channel. - irq IRQ that parport is using for that port. This is in a separate - file to allow you to alter it by writing a new value in (IRQ - number or none). -.............................................................................. - -1.7 TTY info in /proc/tty -------------------------- - -Information about the available and actually used tty's can be found in the -directory /proc/tty.You'll find entries for drivers and line disciplines in -this directory, as shown in Table 1-11. - - -Table 1-11: Files in /proc/tty -.............................................................................. - File Content - drivers list of drivers and their usage - ldiscs registered line disciplines - driver/serial usage statistic and status of single tty lines -.............................................................................. - -To see which tty's are currently in use, you can simply look into the file -/proc/tty/drivers: - - > cat /proc/tty/drivers - pty_slave /dev/pts 136 0-255 pty:slave - pty_master /dev/ptm 128 0-255 pty:master - pty_slave /dev/ttyp 3 0-255 pty:slave - pty_master /dev/pty 2 0-255 pty:master - serial /dev/cua 5 64-67 serial:callout - serial /dev/ttyS 4 64-67 serial - /dev/tty0 /dev/tty0 4 0 system:vtmaster - /dev/ptmx /dev/ptmx 5 2 system - /dev/console /dev/console 5 1 system:console - /dev/tty /dev/tty 5 0 system:/dev/tty - unknown /dev/tty 4 1-63 console - - -1.8 Miscellaneous kernel statistics in /proc/stat -------------------------------------------------- - -Various pieces of information about kernel activity are available in the -/proc/stat file. All of the numbers reported in this file are aggregates -since the system first booted. For a quick look, simply cat the file: - - > cat /proc/stat - cpu 2255 34 2290 22625563 6290 127 456 0 0 0 - cpu0 1132 34 1441 11311718 3675 127 438 0 0 0 - cpu1 1123 0 849 11313845 2614 0 18 0 0 0 - intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...] - ctxt 1990473 - btime 1062191376 - processes 2915 - procs_running 1 - procs_blocked 0 - softirq 183433 0 21755 12 39 1137 231 21459 2263 - -The very first "cpu" line aggregates the numbers in all of the other "cpuN" -lines. These numbers identify the amount of time the CPU has spent performing -different kinds of work. Time units are in USER_HZ (typically hundredths of a -second). The meanings of the columns are as follows, from left to right: - -- user: normal processes executing in user mode -- nice: niced processes executing in user mode -- system: processes executing in kernel mode -- idle: twiddling thumbs -- iowait: In a word, iowait stands for waiting for I/O to complete. But there - are several problems: - 1. Cpu will not wait for I/O to complete, iowait is the time that a task is - waiting for I/O to complete. When cpu goes into idle state for - outstanding task io, another task will be scheduled on this CPU. - 2. In a multi-core CPU, the task waiting for I/O to complete is not running - on any CPU, so the iowait of each CPU is difficult to calculate. - 3. The value of iowait field in /proc/stat will decrease in certain - conditions. - So, the iowait is not reliable by reading from /proc/stat. -- irq: servicing interrupts -- softirq: servicing softirqs -- steal: involuntary wait -- guest: running a normal guest -- guest_nice: running a niced guest - -The "intr" line gives counts of interrupts serviced since boot time, for each -of the possible system interrupts. The first column is the total of all -interrupts serviced including unnumbered architecture specific interrupts; -each subsequent column is the total for that particular numbered interrupt. -Unnumbered interrupts are not shown, only summed into the total. - -The "ctxt" line gives the total number of context switches across all CPUs. - -The "btime" line gives the time at which the system booted, in seconds since -the Unix epoch. - -The "processes" line gives the number of processes and threads created, which -includes (but is not limited to) those created by calls to the fork() and -clone() system calls. - -The "procs_running" line gives the total number of threads that are -running or ready to run (i.e., the total number of runnable threads). - -The "procs_blocked" line gives the number of processes currently blocked, -waiting for I/O to complete. - -The "softirq" line gives counts of softirqs serviced since boot time, for each -of the possible system softirqs. The first column is the total of all -softirqs serviced; each subsequent column is the total for that particular -softirq. - - -1.9 Ext4 file system parameters -------------------------------- - -Information about mounted ext4 file systems can be found in -/proc/fs/ext4. Each mounted filesystem will have a directory in -/proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or -/proc/fs/ext4/dm-0). The files in each per-device directory are shown -in Table 1-12, below. - -Table 1-12: Files in /proc/fs/ext4/ -.............................................................................. - File Content - mb_groups details of multiblock allocator buddy cache of free blocks -.............................................................................. - -2.0 /proc/consoles ------------------- -Shows registered system console lines. - -To see which character device lines are currently used for the system console -/dev/console, you may simply look into the file /proc/consoles: - - > cat /proc/consoles - tty0 -WU (ECp) 4:7 - ttyS0 -W- (Ep) 4:64 - -The columns are: - - device name of the device - operations R = can do read operations - W = can do write operations - U = can do unblank - flags E = it is enabled - C = it is preferred console - B = it is primary boot console - p = it is used for printk buffer - b = it is not a TTY but a Braille device - a = it is safe to use when cpu is offline - major:minor major and minor number of the device separated by a colon - ------------------------------------------------------------------------------- -Summary ------------------------------------------------------------------------------- -The /proc file system serves information about the running system. It not only -allows access to process data but also allows you to request the kernel status -by reading files in the hierarchy. - -The directory structure of /proc reflects the types of information and makes -it easy, if not obvious, where to look for specific data. ------------------------------------------------------------------------------- - ------------------------------------------------------------------------------- -CHAPTER 2: MODIFYING SYSTEM PARAMETERS ------------------------------------------------------------------------------- - ------------------------------------------------------------------------------- -In This Chapter ------------------------------------------------------------------------------- -* Modifying kernel parameters by writing into files found in /proc/sys -* Exploring the files which modify certain parameters -* Review of the /proc/sys file tree ------------------------------------------------------------------------------- - - -A very interesting part of /proc is the directory /proc/sys. This is not only -a source of information, it also allows you to change parameters within the -kernel. Be very careful when attempting this. You can optimize your system, -but you can also cause it to crash. Never alter kernel parameters on a -production system. Set up a development machine and test to make sure that -everything works the way you want it to. You may have no alternative but to -reboot the machine once an error has been made. - -To change a value, simply echo the new value into the file. An example is -given below in the section on the file system data. You need to be root to do -this. You can create your own boot script to perform this every time your -system boots. - -The files in /proc/sys can be used to fine tune and monitor miscellaneous and -general things in the operation of the Linux kernel. Since some of the files -can inadvertently disrupt your system, it is advisable to read both -documentation and source before actually making adjustments. In any case, be -very careful when writing to any of these files. The entries in /proc may -change slightly between the 2.1.* and the 2.2 kernel, so if there is any doubt -review the kernel documentation in the directory /usr/src/linux/Documentation. -This chapter is heavily based on the documentation included in the pre 2.2 -kernels, and became part of it in version 2.2.1 of the Linux kernel. - -Please see: Documentation/admin-guide/sysctl/ directory for descriptions of these -entries. - ------------------------------------------------------------------------------- -Summary ------------------------------------------------------------------------------- -Certain aspects of kernel behavior can be modified at runtime, without the -need to recompile the kernel, or even to reboot the system. The files in the -/proc/sys tree can not only be read, but also modified. You can use the echo -command to write value into these files, thereby changing the default settings -of the kernel. ------------------------------------------------------------------------------- - ------------------------------------------------------------------------------- -CHAPTER 3: PER-PROCESS PARAMETERS ------------------------------------------------------------------------------- - -3.1 /proc//oom_adj & /proc//oom_score_adj- Adjust the oom-killer score --------------------------------------------------------------------------------- - -These file can be used to adjust the badness heuristic used to select which -process gets killed in out of memory conditions. - -The badness heuristic assigns a value to each candidate task ranging from 0 -(never kill) to 1000 (always kill) to determine which process is targeted. The -units are roughly a proportion along that range of allowed memory the process -may allocate from based on an estimation of its current memory and swap use. -For example, if a task is using all allowed memory, its badness score will be -1000. If it is using half of its allowed memory, its score will be 500. - -There is an additional factor included in the badness score: the current memory -and swap usage is discounted by 3% for root processes. - -The amount of "allowed" memory depends on the context in which the oom killer -was called. If it is due to the memory assigned to the allocating task's cpuset -being exhausted, the allowed memory represents the set of mems assigned to that -cpuset. If it is due to a mempolicy's node(s) being exhausted, the allowed -memory represents the set of mempolicy nodes. If it is due to a memory -limit (or swap limit) being reached, the allowed memory is that configured -limit. Finally, if it is due to the entire system being out of memory, the -allowed memory represents all allocatable resources. - -The value of /proc//oom_score_adj is added to the badness score before it -is used to determine which task to kill. Acceptable values range from -1000 -(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows userspace to -polarize the preference for oom killing either by always preferring a certain -task or completely disabling it. The lowest possible value, -1000, is -equivalent to disabling oom killing entirely for that task since it will always -report a badness score of 0. - -Consequently, it is very simple for userspace to define the amount of memory to -consider for each task. Setting a /proc//oom_score_adj value of +500, for -example, is roughly equivalent to allowing the remainder of tasks sharing the -same system, cpuset, mempolicy, or memory controller resources to use at least -50% more memory. A value of -500, on the other hand, would be roughly -equivalent to discounting 50% of the task's allowed memory from being considered -as scoring against the task. - -For backwards compatibility with previous kernels, /proc//oom_adj may also -be used to tune the badness score. Its acceptable values range from -16 -(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17 -(OOM_DISABLE) to disable oom killing entirely for that task. Its value is -scaled linearly with /proc//oom_score_adj. - -The value of /proc//oom_score_adj may be reduced no lower than the last -value set by a CAP_SYS_RESOURCE process. To reduce the value any lower -requires CAP_SYS_RESOURCE. - -Caveat: when a parent task is selected, the oom killer will sacrifice any first -generation children with separate address spaces instead, if possible. This -avoids servers and important system daemons from being killed and loses the -minimal amount of work. - - -3.2 /proc//oom_score - Display current oom-killer score -------------------------------------------------------------- - -This file can be used to check the current score used by the oom-killer is for -any given . Use it together with /proc//oom_score_adj to tune which -process should be killed in an out-of-memory situation. - - -3.3 /proc//io - Display the IO accounting fields -------------------------------------------------------- - -This file contains IO statistics for each running process - -Example -------- - -test:/tmp # dd if=/dev/zero of=/tmp/test.dat & -[1] 3828 - -test:/tmp # cat /proc/3828/io -rchar: 323934931 -wchar: 323929600 -syscr: 632687 -syscw: 632675 -read_bytes: 0 -write_bytes: 323932160 -cancelled_write_bytes: 0 - - -Description ------------ - -rchar ------ - -I/O counter: chars read -The number of bytes which this task has caused to be read from storage. This -is simply the sum of bytes which this process passed to read() and pread(). -It includes things like tty IO and it is unaffected by whether or not actual -physical disk IO was required (the read might have been satisfied from -pagecache) - - -wchar ------ - -I/O counter: chars written -The number of bytes which this task has caused, or shall cause to be written -to disk. Similar caveats apply here as with rchar. - - -syscr ------ - -I/O counter: read syscalls -Attempt to count the number of read I/O operations, i.e. syscalls like read() -and pread(). - - -syscw ------ - -I/O counter: write syscalls -Attempt to count the number of write I/O operations, i.e. syscalls like -write() and pwrite(). - - -read_bytes ----------- - -I/O counter: bytes read -Attempt to count the number of bytes which this process really did cause to -be fetched from the storage layer. Done at the submit_bio() level, so it is -accurate for block-backed filesystems. - - -write_bytes ------------ - -I/O counter: bytes written -Attempt to count the number of bytes which this process caused to be sent to -the storage layer. This is done at page-dirtying time. - - -cancelled_write_bytes ---------------------- - -The big inaccuracy here is truncate. If a process writes 1MB to a file and -then deletes the file, it will in fact perform no writeout. But it will have -been accounted as having caused 1MB of write. -In other words: The number of bytes which this process caused to not happen, -by truncating pagecache. A task can cause "negative" IO too. If this task -truncates some dirty pagecache, some IO which another task has been accounted -for (in its write_bytes) will not be happening. We _could_ just subtract that -from the truncating task's write_bytes, but there is information loss in doing -that. - - -Note ----- - -At its current implementation state, this is a bit racy on 32-bit machines: if -process A reads process B's /proc/pid/io while process B is updating one of -those 64-bit counters, process A could see an intermediate result. - - -More information about this can be found within the taskstats documentation in -Documentation/accounting. - -3.4 /proc//coredump_filter - Core dump filtering settings ---------------------------------------------------------------- -When a process is dumped, all anonymous memory is written to a core file as -long as the size of the core file isn't limited. But sometimes we don't want -to dump some memory segments, for example, huge shared memory or DAX. -Conversely, sometimes we want to save file-backed memory segments into a core -file, not only the individual files. - -/proc//coredump_filter allows you to customize which memory segments -will be dumped when the process is dumped. coredump_filter is a bitmask -of memory types. If a bit of the bitmask is set, memory segments of the -corresponding memory type are dumped, otherwise they are not dumped. - -The following 9 memory types are supported: - - (bit 0) anonymous private memory - - (bit 1) anonymous shared memory - - (bit 2) file-backed private memory - - (bit 3) file-backed shared memory - - (bit 4) ELF header pages in file-backed private memory areas (it is - effective only if the bit 2 is cleared) - - (bit 5) hugetlb private memory - - (bit 6) hugetlb shared memory - - (bit 7) DAX private memory - - (bit 8) DAX shared memory - - Note that MMIO pages such as frame buffer are never dumped and vDSO pages - are always dumped regardless of the bitmask status. - - Note that bits 0-4 don't affect hugetlb or DAX memory. hugetlb memory is - only affected by bit 5-6, and DAX is only affected by bits 7-8. - -The default value of coredump_filter is 0x33; this means all anonymous memory -segments, ELF header pages and hugetlb private memory are dumped. - -If you don't want to dump all shared memory segments attached to pid 1234, -write 0x31 to the process's proc file. - - $ echo 0x31 > /proc/1234/coredump_filter - -When a new process is created, the process inherits the bitmask status from its -parent. It is useful to set up coredump_filter before the program runs. -For example: - - $ echo 0x7 > /proc/self/coredump_filter - $ ./some_program - -3.5 /proc//mountinfo - Information about mounts --------------------------------------------------------- - -This file contains lines of the form: - -36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue -(1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11) - -(1) mount ID: unique identifier of the mount (may be reused after umount) -(2) parent ID: ID of parent (or of self for the top of the mount tree) -(3) major:minor: value of st_dev for files on filesystem -(4) root: root of the mount within the filesystem -(5) mount point: mount point relative to the process's root -(6) mount options: per mount options -(7) optional fields: zero or more fields of the form "tag[:value]" -(8) separator: marks the end of the optional fields -(9) filesystem type: name of filesystem of the form "type[.subtype]" -(10) mount source: filesystem specific information or "none" -(11) super options: per super block options - -Parsers should ignore all unrecognised optional fields. Currently the -possible optional fields are: - -shared:X mount is shared in peer group X -master:X mount is slave to peer group X -propagate_from:X mount is slave and receives propagation from peer group X (*) -unbindable mount is unbindable - -(*) X is the closest dominant peer group under the process's root. If -X is the immediate master of the mount, or if there's no dominant peer -group under the same root, then only the "master:X" field is present -and not the "propagate_from:X" field. - -For more information on mount propagation see: - - Documentation/filesystems/sharedsubtree.txt - - -3.6 /proc//comm & /proc//task//comm --------------------------------------------------------- -These files provide a method to access a tasks comm value. It also allows for -a task to set its own or one of its thread siblings comm value. The comm value -is limited in size compared to the cmdline value, so writing anything longer -then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated -comm value. - - -3.7 /proc//task//children - Information about task children -------------------------------------------------------------------------- -This file provides a fast way to retrieve first level children pids -of a task pointed by / pair. The format is a space separated -stream of pids. - -Note the "first level" here -- if a child has own children they will -not be listed here, one needs to read /proc//task//children -to obtain the descendants. - -Since this interface is intended to be fast and cheap it doesn't -guarantee to provide precise results and some children might be -skipped, especially if they've exited right after we printed their -pids, so one need to either stop or freeze processes being inspected -if precise results are needed. - - -3.8 /proc//fdinfo/ - Information about opened file ---------------------------------------------------------------- -This file provides information associated with an opened file. The regular -files have at least three fields -- 'pos', 'flags' and mnt_id. The 'pos' -represents the current offset of the opened file in decimal form [see lseek(2) -for details], 'flags' denotes the octal O_xxx mask the file has been -created with [see open(2) for details] and 'mnt_id' represents mount ID of -the file system containing the opened file [see 3.5 /proc//mountinfo -for details]. - -A typical output is - - pos: 0 - flags: 0100002 - mnt_id: 19 - -All locks associated with a file descriptor are shown in its fdinfo too. - -lock: 1: FLOCK ADVISORY WRITE 359 00:13:11691 0 EOF - -The files such as eventfd, fsnotify, signalfd, epoll among the regular pos/flags -pair provide additional information particular to the objects they represent. - - Eventfd files - ~~~~~~~~~~~~~ - pos: 0 - flags: 04002 - mnt_id: 9 - eventfd-count: 5a - - where 'eventfd-count' is hex value of a counter. - - Signalfd files - ~~~~~~~~~~~~~~ - pos: 0 - flags: 04002 - mnt_id: 9 - sigmask: 0000000000000200 - - where 'sigmask' is hex value of the signal mask associated - with a file. - - Epoll files - ~~~~~~~~~~~ - pos: 0 - flags: 02 - mnt_id: 9 - tfd: 5 events: 1d data: ffffffffffffffff pos:0 ino:61af sdev:7 - - where 'tfd' is a target file descriptor number in decimal form, - 'events' is events mask being watched and the 'data' is data - associated with a target [see epoll(7) for more details]. - - The 'pos' is current offset of the target file in decimal form - [see lseek(2)], 'ino' and 'sdev' are inode and device numbers - where target file resides, all in hex format. - - Fsnotify files - ~~~~~~~~~~~~~~ - For inotify files the format is the following - - pos: 0 - flags: 02000000 - inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d - - where 'wd' is a watch descriptor in decimal form, ie a target file - descriptor number, 'ino' and 'sdev' are inode and device where the - target file resides and the 'mask' is the mask of events, all in hex - form [see inotify(7) for more details]. - - If the kernel was built with exportfs support, the path to the target - file is encoded as a file handle. The file handle is provided by three - fields 'fhandle-bytes', 'fhandle-type' and 'f_handle', all in hex - format. - - If the kernel is built without exportfs support the file handle won't be - printed out. - - If there is no inotify mark attached yet the 'inotify' line will be omitted. - - For fanotify files the format is - - pos: 0 - flags: 02 - mnt_id: 9 - fanotify flags:10 event-flags:0 - fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003 - fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4 - - where fanotify 'flags' and 'event-flags' are values used in fanotify_init - call, 'mnt_id' is the mount point identifier, 'mflags' is the value of - flags associated with mark which are tracked separately from events - mask. 'ino', 'sdev' are target inode and device, 'mask' is the events - mask and 'ignored_mask' is the mask of events which are to be ignored. - All in hex format. Incorporation of 'mflags', 'mask' and 'ignored_mask' - does provide information about flags and mask used in fanotify_mark - call [see fsnotify manpage for details]. - - While the first three lines are mandatory and always printed, the rest is - optional and may be omitted if no marks created yet. - - Timerfd files - ~~~~~~~~~~~~~ - - pos: 0 - flags: 02 - mnt_id: 9 - clockid: 0 - ticks: 0 - settime flags: 01 - it_value: (0, 49406829) - it_interval: (1, 0) - - where 'clockid' is the clock type and 'ticks' is the number of the timer expirations - that have occurred [see timerfd_create(2) for details]. 'settime flags' are - flags in octal form been used to setup the timer [see timerfd_settime(2) for - details]. 'it_value' is remaining time until the timer exiration. - 'it_interval' is the interval for the timer. Note the timer might be set up - with TIMER_ABSTIME option which will be shown in 'settime flags', but 'it_value' - still exhibits timer's remaining time. - -3.9 /proc//map_files - Information about memory mapped files ---------------------------------------------------------------------- -This directory contains symbolic links which represent memory mapped files -the process is maintaining. Example output: - - | lr-------- 1 root root 64 Jan 27 11:24 333c600000-333c620000 -> /usr/lib64/ld-2.18.so - | lr-------- 1 root root 64 Jan 27 11:24 333c81f000-333c820000 -> /usr/lib64/ld-2.18.so - | lr-------- 1 root root 64 Jan 27 11:24 333c820000-333c821000 -> /usr/lib64/ld-2.18.so - | ... - | lr-------- 1 root root 64 Jan 27 11:24 35d0421000-35d0422000 -> /usr/lib64/libselinux.so.1 - | lr-------- 1 root root 64 Jan 27 11:24 400000-41a000 -> /usr/bin/ls - -The name of a link represents the virtual memory bounds of a mapping, i.e. -vm_area_struct::vm_start-vm_area_struct::vm_end. - -The main purpose of the map_files is to retrieve a set of memory mapped -files in a fast way instead of parsing /proc//maps or -/proc//smaps, both of which contain many more records. At the same -time one can open(2) mappings from the listings of two processes and -comparing their inode numbers to figure out which anonymous memory areas -are actually shared. - -3.10 /proc//timerslack_ns - Task timerslack value ---------------------------------------------------------- -This file provides the value of the task's timerslack value in nanoseconds. -This value specifies a amount of time that normal timers may be deferred -in order to coalesce timers and avoid unnecessary wakeups. - -This allows a task's interactivity vs power consumption trade off to be -adjusted. - -Writing 0 to the file will set the tasks timerslack to the default value. - -Valid values are from 0 - ULLONG_MAX - -An application setting the value must have PTRACE_MODE_ATTACH_FSCREDS level -permissions on the task specified to change its timerslack_ns value. - -3.11 /proc//patch_state - Livepatch patch operation state ------------------------------------------------------------------ -When CONFIG_LIVEPATCH is enabled, this file displays the value of the -patch state for the task. - -A value of '-1' indicates that no patch is in transition. - -A value of '0' indicates that a patch is in transition and the task is -unpatched. If the patch is being enabled, then the task hasn't been -patched yet. If the patch is being disabled, then the task has already -been unpatched. - -A value of '1' indicates that a patch is in transition and the task is -patched. If the patch is being enabled, then the task has already been -patched. If the patch is being disabled, then the task hasn't been -unpatched yet. - -3.12 /proc//arch_status - task architecture specific status -------------------------------------------------------------------- -When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the -architecture specific status of the task. - -Example -------- - $ cat /proc/6753/arch_status - AVX512_elapsed_ms: 8 - -Description ------------ - -x86 specific entries: ---------------------- - AVX512_elapsed_ms: - ------------------ - If AVX512 is supported on the machine, this entry shows the milliseconds - elapsed since the last time AVX512 usage was recorded. The recording - happens on a best effort basis when a task is scheduled out. This means - that the value depends on two factors: - - 1) The time which the task spent on the CPU without being scheduled - out. With CPU isolation and a single runnable task this can take - several seconds. - - 2) The time since the task was scheduled out last. Depending on the - reason for being scheduled out (time slice exhausted, syscall ...) - this can be arbitrary long time. - - As a consequence the value cannot be considered precise and authoritative - information. The application which uses this information has to be aware - of the overall scenario on the system in order to determine whether a - task is a real AVX512 user or not. Precise information can be obtained - with performance counters. - - A special value of '-1' indicates that no AVX512 usage was recorded, thus - the task is unlikely an AVX512 user, but depends on the workload and the - scheduling scenario, it also could be a false negative mentioned above. - ------------------------------------------------------------------------------- -Configuring procfs ------------------------------------------------------------------------------- - -4.1 Mount options ---------------------- - -The following mount options are supported: - - hidepid= Set /proc// access mode. - gid= Set the group authorized to learn processes information. - -hidepid=0 means classic mode - everybody may access all /proc// directories -(default). - -hidepid=1 means users may not access any /proc// directories but their -own. Sensitive files like cmdline, sched*, status are now protected against -other users. This makes it impossible to learn whether any user runs -specific program (given the program doesn't reveal itself by its behaviour). -As an additional bonus, as /proc//cmdline is unaccessible for other users, -poorly written programs passing sensitive information via program arguments are -now protected against local eavesdroppers. - -hidepid=2 means hidepid=1 plus all /proc// will be fully invisible to other -users. It doesn't mean that it hides a fact whether a process with a specific -pid value exists (it can be learned by other means, e.g. by "kill -0 $PID"), -but it hides process' uid and gid, which may be learned by stat()'ing -/proc// otherwise. It greatly complicates an intruder's task of gathering -information about running processes, whether some daemon runs with elevated -privileges, whether other user runs some sensitive program, whether other users -run any program at all, etc. - -gid= defines a group authorized to learn processes information otherwise -prohibited by hidepid=. If you use some daemon like identd which needs to learn -information about processes information, just add identd to this group. -- cgit From d5eefa2c5e567751df74d38d5b8cec7ed6e7a08c Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:19 +0100 Subject: docs: filesystems: convert qnx6.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/ccd22c1e1426ce4cb30ece9a71c39ebb41844762.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/qnx6.rst | 196 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/qnx6.txt | 174 -------------------------------- 3 files changed, 197 insertions(+), 174 deletions(-) create mode 100644 Documentation/filesystems/qnx6.rst delete mode 100644 Documentation/filesystems/qnx6.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 671906e2fee6..08883a481a76 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -82,5 +82,6 @@ Documentation for filesystem implementations. orangefs overlayfs proc + qnx6 virtiofs vfat diff --git a/Documentation/filesystems/qnx6.rst b/Documentation/filesystems/qnx6.rst new file mode 100644 index 000000000000..b71308314070 --- /dev/null +++ b/Documentation/filesystems/qnx6.rst @@ -0,0 +1,196 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================== +The QNX6 Filesystem +=================== + +The qnx6fs is used by newer QNX operating system versions. (e.g. Neutrino) +It got introduced in QNX 6.4.0 and is used default since 6.4.1. + +Option +====== + +mmi_fs Mount filesystem as used for example by Audi MMI 3G system + +Specification +============= + +qnx6fs shares many properties with traditional Unix filesystems. It has the +concepts of blocks, inodes and directories. + +On QNX it is possible to create little endian and big endian qnx6 filesystems. +This feature makes it possible to create and use a different endianness fs +for the target (QNX is used on quite a range of embedded systems) platform +running on a different endianness. + +The Linux driver handles endianness transparently. (LE and BE) + +Blocks +------ + +The space in the device or file is split up into blocks. These are a fixed +size of 512, 1024, 2048 or 4096, which is decided when the filesystem is +created. + +Blockpointers are 32bit, so the maximum space that can be addressed is +2^32 * 4096 bytes or 16TB + +The superblocks +--------------- + +The superblock contains all global information about the filesystem. +Each qnx6fs got two superblocks, each one having a 64bit serial number. +That serial number is used to identify the "active" superblock. +In write mode with reach new snapshot (after each synchronous write), the +serial of the new master superblock is increased (old superblock serial + 1) + +So basically the snapshot functionality is realized by an atomic final +update of the serial number. Before updating that serial, all modifications +are done by copying all modified blocks during that specific write request +(or period) and building up a new (stable) filesystem structure under the +inactive superblock. + +Each superblock holds a set of root inodes for the different filesystem +parts. (Inode, Bitmap and Longfilenames) +Each of these root nodes holds information like total size of the stored +data and the addressing levels in that specific tree. +If the level value is 0, up to 16 direct blocks can be addressed by each +node. + +Level 1 adds an additional indirect addressing level where each indirect +addressing block holds up to blocksize / 4 bytes pointers to data blocks. +Level 2 adds an additional indirect addressing block level (so, already up +to 16 * 256 * 256 = 1048576 blocks that can be addressed by such a tree). + +Unused block pointers are always set to ~0 - regardless of root node, +indirect addressing blocks or inodes. + +Data leaves are always on the lowest level. So no data is stored on upper +tree levels. + +The first Superblock is located at 0x2000. (0x2000 is the bootblock size) +The Audi MMI 3G first superblock directly starts at byte 0. + +Second superblock position can either be calculated from the superblock +information (total number of filesystem blocks) or by taking the highest +device address, zeroing the last 3 bytes and then subtracting 0x1000 from +that address. + +0x1000 is the size reserved for each superblock - regardless of the +blocksize of the filesystem. + +Inodes +------ + +Each object in the filesystem is represented by an inode. (index node) +The inode structure contains pointers to the filesystem blocks which contain +the data held in the object and all of the metadata about an object except +its longname. (filenames longer than 27 characters) +The metadata about an object includes the permissions, owner, group, flags, +size, number of blocks used, access time, change time and modification time. + +Object mode field is POSIX format. (which makes things easier) + +There are also pointers to the first 16 blocks, if the object data can be +addressed with 16 direct blocks. + +For more than 16 blocks an indirect addressing in form of another tree is +used. (scheme is the same as the one used for the superblock root nodes) + +The filesize is stored 64bit. Inode counting starts with 1. (while long +filename inodes start with 0) + +Directories +----------- + +A directory is a filesystem object and has an inode just like a file. +It is a specially formatted file containing records which associate each +name with an inode number. + +'.' inode number points to the directory inode + +'..' inode number points to the parent directory inode + +Eeach filename record additionally got a filename length field. + +One special case are long filenames or subdirectory names. + +These got set a filename length field of 0xff in the corresponding directory +record plus the longfile inode number also stored in that record. + +With that longfilename inode number, the longfilename tree can be walked +starting with the superblock longfilename root node pointers. + +Special files +------------- + +Symbolic links are also filesystem objects with inodes. They got a specific +bit in the inode mode field identifying them as symbolic link. + +The directory entry file inode pointer points to the target file inode. + +Hard links got an inode, a directory entry, but a specific mode bit set, +no block pointers and the directory file record pointing to the target file +inode. + +Character and block special devices do not exist in QNX as those files +are handled by the QNX kernel/drivers and created in /dev independent of the +underlaying filesystem. + +Long filenames +-------------- + +Long filenames are stored in a separate addressing tree. The staring point +is the longfilename root node in the active superblock. + +Each data block (tree leaves) holds one long filename. That filename is +limited to 510 bytes. The first two starting bytes are used as length field +for the actual filename. + +If that structure shall fit for all allowed blocksizes, it is clear why there +is a limit of 510 bytes for the actual filename stored. + +Bitmap +------ + +The qnx6fs filesystem allocation bitmap is stored in a tree under bitmap +root node in the superblock and each bit in the bitmap represents one +filesystem block. + +The first block is block 0, which starts 0x1000 after superblock start. +So for a normal qnx6fs 0x3000 (bootblock + superblock) is the physical +address at which block 0 is located. + +Bits at the end of the last bitmap block are set to 1, if the device is +smaller than addressing space in the bitmap. + +Bitmap system area +------------------ + +The bitmap itself is divided into three parts. + +First the system area, that is split into two halves. + +Then userspace. + +The requirement for a static, fixed preallocated system area comes from how +qnx6fs deals with writes. + +Each superblock got it's own half of the system area. So superblock #1 +always uses blocks from the lower half while superblock #2 just writes to +blocks represented by the upper half bitmap system area bits. + +Bitmap blocks, Inode blocks and indirect addressing blocks for those two +tree structures are treated as system blocks. + +The rational behind that is that a write request can work on a new snapshot +(system area of the inactive - resp. lower serial numbered superblock) while +at the same time there is still a complete stable filesystem structer in the +other half of the system area. + +When finished with writing (a sync write is completed, the maximum sync leap +time or a filesystem sync is requested), serial of the previously inactive +superblock atomically is increased and the fs switches over to that - then +stable declared - superblock. + +For all data outside the system area, blocks are just copied while writing. diff --git a/Documentation/filesystems/qnx6.txt b/Documentation/filesystems/qnx6.txt deleted file mode 100644 index 48ea68f15845..000000000000 --- a/Documentation/filesystems/qnx6.txt +++ /dev/null @@ -1,174 +0,0 @@ -The QNX6 Filesystem -=================== - -The qnx6fs is used by newer QNX operating system versions. (e.g. Neutrino) -It got introduced in QNX 6.4.0 and is used default since 6.4.1. - -Option -====== - -mmi_fs Mount filesystem as used for example by Audi MMI 3G system - -Specification -============= - -qnx6fs shares many properties with traditional Unix filesystems. It has the -concepts of blocks, inodes and directories. -On QNX it is possible to create little endian and big endian qnx6 filesystems. -This feature makes it possible to create and use a different endianness fs -for the target (QNX is used on quite a range of embedded systems) platform -running on a different endianness. -The Linux driver handles endianness transparently. (LE and BE) - -Blocks ------- - -The space in the device or file is split up into blocks. These are a fixed -size of 512, 1024, 2048 or 4096, which is decided when the filesystem is -created. -Blockpointers are 32bit, so the maximum space that can be addressed is -2^32 * 4096 bytes or 16TB - -The superblocks ---------------- - -The superblock contains all global information about the filesystem. -Each qnx6fs got two superblocks, each one having a 64bit serial number. -That serial number is used to identify the "active" superblock. -In write mode with reach new snapshot (after each synchronous write), the -serial of the new master superblock is increased (old superblock serial + 1) - -So basically the snapshot functionality is realized by an atomic final -update of the serial number. Before updating that serial, all modifications -are done by copying all modified blocks during that specific write request -(or period) and building up a new (stable) filesystem structure under the -inactive superblock. - -Each superblock holds a set of root inodes for the different filesystem -parts. (Inode, Bitmap and Longfilenames) -Each of these root nodes holds information like total size of the stored -data and the addressing levels in that specific tree. -If the level value is 0, up to 16 direct blocks can be addressed by each -node. -Level 1 adds an additional indirect addressing level where each indirect -addressing block holds up to blocksize / 4 bytes pointers to data blocks. -Level 2 adds an additional indirect addressing block level (so, already up -to 16 * 256 * 256 = 1048576 blocks that can be addressed by such a tree). - -Unused block pointers are always set to ~0 - regardless of root node, -indirect addressing blocks or inodes. -Data leaves are always on the lowest level. So no data is stored on upper -tree levels. - -The first Superblock is located at 0x2000. (0x2000 is the bootblock size) -The Audi MMI 3G first superblock directly starts at byte 0. -Second superblock position can either be calculated from the superblock -information (total number of filesystem blocks) or by taking the highest -device address, zeroing the last 3 bytes and then subtracting 0x1000 from -that address. - -0x1000 is the size reserved for each superblock - regardless of the -blocksize of the filesystem. - -Inodes ------- - -Each object in the filesystem is represented by an inode. (index node) -The inode structure contains pointers to the filesystem blocks which contain -the data held in the object and all of the metadata about an object except -its longname. (filenames longer than 27 characters) -The metadata about an object includes the permissions, owner, group, flags, -size, number of blocks used, access time, change time and modification time. - -Object mode field is POSIX format. (which makes things easier) - -There are also pointers to the first 16 blocks, if the object data can be -addressed with 16 direct blocks. -For more than 16 blocks an indirect addressing in form of another tree is -used. (scheme is the same as the one used for the superblock root nodes) - -The filesize is stored 64bit. Inode counting starts with 1. (while long -filename inodes start with 0) - -Directories ------------ - -A directory is a filesystem object and has an inode just like a file. -It is a specially formatted file containing records which associate each -name with an inode number. -'.' inode number points to the directory inode -'..' inode number points to the parent directory inode -Eeach filename record additionally got a filename length field. - -One special case are long filenames or subdirectory names. -These got set a filename length field of 0xff in the corresponding directory -record plus the longfile inode number also stored in that record. -With that longfilename inode number, the longfilename tree can be walked -starting with the superblock longfilename root node pointers. - -Special files -------------- - -Symbolic links are also filesystem objects with inodes. They got a specific -bit in the inode mode field identifying them as symbolic link. -The directory entry file inode pointer points to the target file inode. - -Hard links got an inode, a directory entry, but a specific mode bit set, -no block pointers and the directory file record pointing to the target file -inode. - -Character and block special devices do not exist in QNX as those files -are handled by the QNX kernel/drivers and created in /dev independent of the -underlaying filesystem. - -Long filenames --------------- - -Long filenames are stored in a separate addressing tree. The staring point -is the longfilename root node in the active superblock. -Each data block (tree leaves) holds one long filename. That filename is -limited to 510 bytes. The first two starting bytes are used as length field -for the actual filename. -If that structure shall fit for all allowed blocksizes, it is clear why there -is a limit of 510 bytes for the actual filename stored. - -Bitmap ------- - -The qnx6fs filesystem allocation bitmap is stored in a tree under bitmap -root node in the superblock and each bit in the bitmap represents one -filesystem block. -The first block is block 0, which starts 0x1000 after superblock start. -So for a normal qnx6fs 0x3000 (bootblock + superblock) is the physical -address at which block 0 is located. - -Bits at the end of the last bitmap block are set to 1, if the device is -smaller than addressing space in the bitmap. - -Bitmap system area ------------------- - -The bitmap itself is divided into three parts. -First the system area, that is split into two halves. -Then userspace. - -The requirement for a static, fixed preallocated system area comes from how -qnx6fs deals with writes. -Each superblock got it's own half of the system area. So superblock #1 -always uses blocks from the lower half while superblock #2 just writes to -blocks represented by the upper half bitmap system area bits. - -Bitmap blocks, Inode blocks and indirect addressing blocks for those two -tree structures are treated as system blocks. - -The rational behind that is that a write request can work on a new snapshot -(system area of the inactive - resp. lower serial numbered superblock) while -at the same time there is still a complete stable filesystem structer in the -other half of the system area. - -When finished with writing (a sync write is completed, the maximum sync leap -time or a filesystem sync is requested), serial of the previously inactive -superblock atomically is increased and the fs switches over to that - then -stable declared - superblock. - -For all data outside the system area, blocks are just copied while writing. -- cgit From 8979fc9a282441d086ead589528c711d9df3d94a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:20 +0100 Subject: docs: filesystems: convert ramfs-rootfs-initramfs.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Use notes markups; - Add lists markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/89cbcc99a6371f3bff3ea1668fe497e8a15c226b.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + .../filesystems/ramfs-rootfs-initramfs.rst | 369 +++++++++++++++++++++ .../filesystems/ramfs-rootfs-initramfs.txt | 359 -------------------- 3 files changed, 370 insertions(+), 359 deletions(-) create mode 100644 Documentation/filesystems/ramfs-rootfs-initramfs.rst delete mode 100644 Documentation/filesystems/ramfs-rootfs-initramfs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 08883a481a76..b8689d082911 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -83,5 +83,6 @@ Documentation for filesystem implementations. overlayfs proc qnx6 + ramfs-rootfs-initramfs virtiofs vfat diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.rst b/Documentation/filesystems/ramfs-rootfs-initramfs.rst new file mode 100644 index 000000000000..6c576e241d86 --- /dev/null +++ b/Documentation/filesystems/ramfs-rootfs-initramfs.rst @@ -0,0 +1,369 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +Ramfs, rootfs and initramfs +=========================== + +October 17, 2005 + +Rob Landley +============================= + +What is ramfs? +-------------- + +Ramfs is a very simple filesystem that exports Linux's disk caching +mechanisms (the page cache and dentry cache) as a dynamically resizable +RAM-based filesystem. + +Normally all files are cached in memory by Linux. Pages of data read from +backing store (usually the block device the filesystem is mounted on) are kept +around in case it's needed again, but marked as clean (freeable) in case the +Virtual Memory system needs the memory for something else. Similarly, data +written to files is marked clean as soon as it has been written to backing +store, but kept around for caching purposes until the VM reallocates the +memory. A similar mechanism (the dentry cache) greatly speeds up access to +directories. + +With ramfs, there is no backing store. Files written into ramfs allocate +dentries and page cache as usual, but there's nowhere to write them to. +This means the pages are never marked clean, so they can't be freed by the +VM when it's looking to recycle memory. + +The amount of code required to implement ramfs is tiny, because all the +work is done by the existing Linux caching infrastructure. Basically, +you're mounting the disk cache as a filesystem. Because of this, ramfs is not +an optional component removable via menuconfig, since there would be negligible +space savings. + +ramfs and ramdisk: +------------------ + +The older "ram disk" mechanism created a synthetic block device out of +an area of RAM and used it as backing store for a filesystem. This block +device was of fixed size, so the filesystem mounted on it was of fixed +size. Using a ram disk also required unnecessarily copying memory from the +fake block device into the page cache (and copying changes back out), as well +as creating and destroying dentries. Plus it needed a filesystem driver +(such as ext2) to format and interpret this data. + +Compared to ramfs, this wastes memory (and memory bus bandwidth), creates +unnecessary work for the CPU, and pollutes the CPU caches. (There are tricks +to avoid this copying by playing with the page tables, but they're unpleasantly +complicated and turn out to be about as expensive as the copying anyway.) +More to the point, all the work ramfs is doing has to happen _anyway_, +since all file access goes through the page and dentry caches. The RAM +disk is simply unnecessary; ramfs is internally much simpler. + +Another reason ramdisks are semi-obsolete is that the introduction of +loopback devices offered a more flexible and convenient way to create +synthetic block devices, now from files instead of from chunks of memory. +See losetup (8) for details. + +ramfs and tmpfs: +---------------- + +One downside of ramfs is you can keep writing data into it until you fill +up all memory, and the VM can't free it because the VM thinks that files +should get written to backing store (rather than swap space), but ramfs hasn't +got any backing store. Because of this, only root (or a trusted user) should +be allowed write access to a ramfs mount. + +A ramfs derivative called tmpfs was created to add size limits, and the ability +to write the data to swap space. Normal users can be allowed write access to +tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information. + +What is rootfs? +--------------- + +Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is +always present in 2.6 systems. You can't unmount rootfs for approximately the +same reason you can't kill the init process; rather than having special code +to check for and handle an empty list, it's smaller and simpler for the kernel +to just make sure certain lists can't become empty. + +Most systems just mount another filesystem over rootfs and ignore it. The +amount of space an empty instance of ramfs takes up is tiny. + +If CONFIG_TMPFS is enabled, rootfs will use tmpfs instead of ramfs by +default. To force ramfs, add "rootfstype=ramfs" to the kernel command +line. + +What is initramfs? +------------------ + +All 2.6 Linux kernels contain a gzipped "cpio" format archive, which is +extracted into rootfs when the kernel boots up. After extracting, the kernel +checks to see if rootfs contains a file "init", and if so it executes it as PID +1. If found, this init process is responsible for bringing the system the +rest of the way up, including locating and mounting the real root device (if +any). If rootfs does not contain an init program after the embedded cpio +archive is extracted into it, the kernel will fall through to the older code +to locate and mount a root partition, then exec some variant of /sbin/init +out of that. + +All this differs from the old initrd in several ways: + + - The old initrd was always a separate file, while the initramfs archive is + linked into the linux kernel image. (The directory ``linux-*/usr`` is + devoted to generating this archive during the build.) + + - The old initrd file was a gzipped filesystem image (in some file format, + such as ext2, that needed a driver built into the kernel), while the new + initramfs archive is a gzipped cpio archive (like tar only simpler, + see cpio(1) and Documentation/driver-api/early-userspace/buffer-format.rst). + The kernel's cpio extraction code is not only extremely small, it's also + __init text and data that can be discarded during the boot process. + + - The program run by the old initrd (which was called /initrd, not /init) did + some setup and then returned to the kernel, while the init program from + initramfs is not expected to return to the kernel. (If /init needs to hand + off control it can overmount / with a new root device and exec another init + program. See the switch_root utility, below.) + + - When switching another root device, initrd would pivot_root and then + umount the ramdisk. But initramfs is rootfs: you can neither pivot_root + rootfs, nor unmount it. Instead delete everything out of rootfs to + free up the space (find -xdev / -exec rm '{}' ';'), overmount rootfs + with the new root (cd /newmount; mount --move . /; chroot .), attach + stdin/stdout/stderr to the new /dev/console, and exec the new init. + + Since this is a remarkably persnickety process (and involves deleting + commands before you can run them), the klibc package introduced a helper + program (utils/run_init.c) to do all this for you. Most other packages + (such as busybox) have named this command "switch_root". + +Populating initramfs: +--------------------- + +The 2.6 kernel build process always creates a gzipped cpio format initramfs +archive and links it into the resulting kernel binary. By default, this +archive is empty (consuming 134 bytes on x86). + +The config option CONFIG_INITRAMFS_SOURCE (in General Setup in menuconfig, +and living in usr/Kconfig) can be used to specify a source for the +initramfs archive, which will automatically be incorporated into the +resulting binary. This option can point to an existing gzipped cpio +archive, a directory containing files to be archived, or a text file +specification such as the following example:: + + dir /dev 755 0 0 + nod /dev/console 644 0 0 c 5 1 + nod /dev/loop0 644 0 0 b 7 0 + dir /bin 755 1000 1000 + slink /bin/sh busybox 777 0 0 + file /bin/busybox initramfs/busybox 755 0 0 + dir /proc 755 0 0 + dir /sys 755 0 0 + dir /mnt 755 0 0 + file /init initramfs/init.sh 755 0 0 + +Run "usr/gen_init_cpio" (after the kernel build) to get a usage message +documenting the above file format. + +One advantage of the configuration file is that root access is not required to +set permissions or create device nodes in the new archive. (Note that those +two example "file" entries expect to find files named "init.sh" and "busybox" in +a directory called "initramfs", under the linux-2.6.* directory. See +Documentation/driver-api/early-userspace/early_userspace_support.rst for more details.) + +The kernel does not depend on external cpio tools. If you specify a +directory instead of a configuration file, the kernel's build infrastructure +creates a configuration file from that directory (usr/Makefile calls +usr/gen_initramfs_list.sh), and proceeds to package up that directory +using the config file (by feeding it to usr/gen_init_cpio, which is created +from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is +entirely self-contained, and the kernel's boot-time extractor is also +(obviously) self-contained. + +The one thing you might need external cpio utilities installed for is creating +or extracting your own preprepared cpio files to feed to the kernel build +(instead of a config file or directory). + +The following command line can extract a cpio image (either by the above script +or by the kernel build) back into its component files:: + + cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames + +The following shell script can create a prebuilt cpio archive you can +use in place of the above config file:: + + #!/bin/sh + + # Copyright 2006 Rob Landley and TimeSys Corporation. + # Licensed under GPL version 2 + + if [ $# -ne 2 ] + then + echo "usage: mkinitramfs directory imagename.cpio.gz" + exit 1 + fi + + if [ -d "$1" ] + then + echo "creating $2 from $1" + (cd "$1"; find . | cpio -o -H newc | gzip) > "$2" + else + echo "First argument must be a directory" + exit 1 + fi + +.. Note:: + + The cpio man page contains some bad advice that will break your initramfs + archive if you follow it. It says "A typical way to generate the list + of filenames is with the find command; you should give find the -depth + option to minimize problems with permissions on directories that are + unwritable or not searchable." Don't do this when creating + initramfs.cpio.gz images, it won't work. The Linux kernel cpio extractor + won't create files in a directory that doesn't exist, so the directory + entries must go before the files that go in those directories. + The above script gets them in the right order. + +External initramfs images: +-------------------------- + +If the kernel has initrd support enabled, an external cpio.gz archive can also +be passed into a 2.6 kernel in place of an initrd. In this case, the kernel +will autodetect the type (initramfs, not initrd) and extract the external cpio +archive into rootfs before trying to run /init. + +This has the memory efficiency advantages of initramfs (no ramdisk block +device) but the separate packaging of initrd (which is nice if you have +non-GPL code you'd like to run from initramfs, without conflating it with +the GPL licensed Linux kernel binary). + +It can also be used to supplement the kernel's built-in initramfs image. The +files in the external archive will overwrite any conflicting files in +the built-in initramfs archive. Some distributors also prefer to customize +a single kernel image with task-specific initramfs images, without recompiling. + +Contents of initramfs: +---------------------- + +An initramfs archive is a complete self-contained root filesystem for Linux. +If you don't already understand what shared libraries, devices, and paths +you need to get a minimal root filesystem up and running, here are some +references: + +- http://www.tldp.org/HOWTO/Bootdisk-HOWTO/ +- http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html +- http://www.linuxfromscratch.org/lfs/view/stable/ + +The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is +designed to be a tiny C library to statically link early userspace +code against, along with some related utilities. It is BSD licensed. + +I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) +myself. These are LGPL and GPL, respectively. (A self-contained initramfs +package is planned for the busybox 1.3 release.) + +In theory you could use glibc, but that's not well suited for small embedded +uses like this. (A "hello world" program statically linked against glibc is +over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do +name lookups, even when otherwise statically linked.) + +A good first step is to get initramfs to run a statically linked "hello world" +program as init, and test it under an emulator like qemu (www.qemu.org) or +User Mode Linux, like so:: + + cat > hello.c << EOF + #include + #include + + int main(int argc, char *argv[]) + { + printf("Hello world!\n"); + sleep(999999999); + } + EOF + gcc -static hello.c -o init + echo init | cpio -o -H newc | gzip > test.cpio.gz + # Testing external initramfs using the initrd loading mechanism. + qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero + +When debugging a normal root filesystem, it's nice to be able to boot with +"init=/bin/sh". The initramfs equivalent is "rdinit=/bin/sh", and it's +just as useful. + +Why cpio rather than tar? +------------------------- + +This decision was made back in December, 2001. The discussion started here: + + http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html + +And spawned a second thread (specifically on tar vs cpio), starting here: + + http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html + +The quick and dirty summary version (which is no substitute for reading +the above threads) is: + +1) cpio is a standard. It's decades old (from the AT&T days), and already + widely used on Linux (inside RPM, Red Hat's device driver disks). Here's + a Linux Journal article about it from 1996: + + http://www.linuxjournal.com/article/1213 + + It's not as popular as tar because the traditional cpio command line tools + require _truly_hideous_ command line arguments. But that says nothing + either way about the archive format, and there are alternative tools, + such as: + + http://freecode.com/projects/afio + +2) The cpio archive format chosen by the kernel is simpler and cleaner (and + thus easier to create and parse) than any of the (literally dozens of) + various tar archive formats. The complete initramfs archive format is + explained in buffer-format.txt, created in usr/gen_init_cpio.c, and + extracted in init/initramfs.c. All three together come to less than 26k + total of human-readable text. + +3) The GNU project standardizing on tar is approximately as relevant as + Windows standardizing on zip. Linux is not part of either, and is free + to make its own technical decisions. + +4) Since this is a kernel internal format, it could easily have been + something brand new. The kernel provides its own tools to create and + extract this format anyway. Using an existing standard was preferable, + but not essential. + +5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be + supported on the kernel side"): + + http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html + + explained his reasoning: + + - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html + - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html + + and, most importantly, designed and implemented the initramfs code. + +Future directions: +------------------ + +Today (2.6.16), initramfs is always compiled in, but not always used. The +kernel falls back to legacy boot code that is reached only if initramfs does +not contain an /init program. The fallback is legacy code, there to ensure a +smooth transition and allowing early boot functionality to gradually move to +"early userspace" (I.E. initramfs). + +The move to early userspace is necessary because finding and mounting the real +root device is complex. Root partitions can span multiple devices (raid or +separate journal). They can be out on the network (requiring dhcp, setting a +specific MAC address, logging into a server, etc). They can live on removable +media, with dynamically allocated major/minor numbers and persistent naming +issues requiring a full udev implementation to sort out. They can be +compressed, encrypted, copy-on-write, loopback mounted, strangely partitioned, +and so on. + +This kind of complexity (which inevitably includes policy) is rightly handled +in userspace. Both klibc and busybox/uClibc are working on simple initramfs +packages to drop into a kernel build. + +The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree. +The kernel's current early boot code (partition detection, etc) will probably +be migrated into a default initramfs, automatically created and used by the +kernel build. diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt deleted file mode 100644 index 97d42ccaa92d..000000000000 --- a/Documentation/filesystems/ramfs-rootfs-initramfs.txt +++ /dev/null @@ -1,359 +0,0 @@ -ramfs, rootfs and initramfs -October 17, 2005 -Rob Landley -============================= - -What is ramfs? --------------- - -Ramfs is a very simple filesystem that exports Linux's disk caching -mechanisms (the page cache and dentry cache) as a dynamically resizable -RAM-based filesystem. - -Normally all files are cached in memory by Linux. Pages of data read from -backing store (usually the block device the filesystem is mounted on) are kept -around in case it's needed again, but marked as clean (freeable) in case the -Virtual Memory system needs the memory for something else. Similarly, data -written to files is marked clean as soon as it has been written to backing -store, but kept around for caching purposes until the VM reallocates the -memory. A similar mechanism (the dentry cache) greatly speeds up access to -directories. - -With ramfs, there is no backing store. Files written into ramfs allocate -dentries and page cache as usual, but there's nowhere to write them to. -This means the pages are never marked clean, so they can't be freed by the -VM when it's looking to recycle memory. - -The amount of code required to implement ramfs is tiny, because all the -work is done by the existing Linux caching infrastructure. Basically, -you're mounting the disk cache as a filesystem. Because of this, ramfs is not -an optional component removable via menuconfig, since there would be negligible -space savings. - -ramfs and ramdisk: ------------------- - -The older "ram disk" mechanism created a synthetic block device out of -an area of RAM and used it as backing store for a filesystem. This block -device was of fixed size, so the filesystem mounted on it was of fixed -size. Using a ram disk also required unnecessarily copying memory from the -fake block device into the page cache (and copying changes back out), as well -as creating and destroying dentries. Plus it needed a filesystem driver -(such as ext2) to format and interpret this data. - -Compared to ramfs, this wastes memory (and memory bus bandwidth), creates -unnecessary work for the CPU, and pollutes the CPU caches. (There are tricks -to avoid this copying by playing with the page tables, but they're unpleasantly -complicated and turn out to be about as expensive as the copying anyway.) -More to the point, all the work ramfs is doing has to happen _anyway_, -since all file access goes through the page and dentry caches. The RAM -disk is simply unnecessary; ramfs is internally much simpler. - -Another reason ramdisks are semi-obsolete is that the introduction of -loopback devices offered a more flexible and convenient way to create -synthetic block devices, now from files instead of from chunks of memory. -See losetup (8) for details. - -ramfs and tmpfs: ----------------- - -One downside of ramfs is you can keep writing data into it until you fill -up all memory, and the VM can't free it because the VM thinks that files -should get written to backing store (rather than swap space), but ramfs hasn't -got any backing store. Because of this, only root (or a trusted user) should -be allowed write access to a ramfs mount. - -A ramfs derivative called tmpfs was created to add size limits, and the ability -to write the data to swap space. Normal users can be allowed write access to -tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information. - -What is rootfs? ---------------- - -Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is -always present in 2.6 systems. You can't unmount rootfs for approximately the -same reason you can't kill the init process; rather than having special code -to check for and handle an empty list, it's smaller and simpler for the kernel -to just make sure certain lists can't become empty. - -Most systems just mount another filesystem over rootfs and ignore it. The -amount of space an empty instance of ramfs takes up is tiny. - -If CONFIG_TMPFS is enabled, rootfs will use tmpfs instead of ramfs by -default. To force ramfs, add "rootfstype=ramfs" to the kernel command -line. - -What is initramfs? ------------------- - -All 2.6 Linux kernels contain a gzipped "cpio" format archive, which is -extracted into rootfs when the kernel boots up. After extracting, the kernel -checks to see if rootfs contains a file "init", and if so it executes it as PID -1. If found, this init process is responsible for bringing the system the -rest of the way up, including locating and mounting the real root device (if -any). If rootfs does not contain an init program after the embedded cpio -archive is extracted into it, the kernel will fall through to the older code -to locate and mount a root partition, then exec some variant of /sbin/init -out of that. - -All this differs from the old initrd in several ways: - - - The old initrd was always a separate file, while the initramfs archive is - linked into the linux kernel image. (The directory linux-*/usr is devoted - to generating this archive during the build.) - - - The old initrd file was a gzipped filesystem image (in some file format, - such as ext2, that needed a driver built into the kernel), while the new - initramfs archive is a gzipped cpio archive (like tar only simpler, - see cpio(1) and Documentation/driver-api/early-userspace/buffer-format.rst). The - kernel's cpio extraction code is not only extremely small, it's also - __init text and data that can be discarded during the boot process. - - - The program run by the old initrd (which was called /initrd, not /init) did - some setup and then returned to the kernel, while the init program from - initramfs is not expected to return to the kernel. (If /init needs to hand - off control it can overmount / with a new root device and exec another init - program. See the switch_root utility, below.) - - - When switching another root device, initrd would pivot_root and then - umount the ramdisk. But initramfs is rootfs: you can neither pivot_root - rootfs, nor unmount it. Instead delete everything out of rootfs to - free up the space (find -xdev / -exec rm '{}' ';'), overmount rootfs - with the new root (cd /newmount; mount --move . /; chroot .), attach - stdin/stdout/stderr to the new /dev/console, and exec the new init. - - Since this is a remarkably persnickety process (and involves deleting - commands before you can run them), the klibc package introduced a helper - program (utils/run_init.c) to do all this for you. Most other packages - (such as busybox) have named this command "switch_root". - -Populating initramfs: ---------------------- - -The 2.6 kernel build process always creates a gzipped cpio format initramfs -archive and links it into the resulting kernel binary. By default, this -archive is empty (consuming 134 bytes on x86). - -The config option CONFIG_INITRAMFS_SOURCE (in General Setup in menuconfig, -and living in usr/Kconfig) can be used to specify a source for the -initramfs archive, which will automatically be incorporated into the -resulting binary. This option can point to an existing gzipped cpio -archive, a directory containing files to be archived, or a text file -specification such as the following example: - - dir /dev 755 0 0 - nod /dev/console 644 0 0 c 5 1 - nod /dev/loop0 644 0 0 b 7 0 - dir /bin 755 1000 1000 - slink /bin/sh busybox 777 0 0 - file /bin/busybox initramfs/busybox 755 0 0 - dir /proc 755 0 0 - dir /sys 755 0 0 - dir /mnt 755 0 0 - file /init initramfs/init.sh 755 0 0 - -Run "usr/gen_init_cpio" (after the kernel build) to get a usage message -documenting the above file format. - -One advantage of the configuration file is that root access is not required to -set permissions or create device nodes in the new archive. (Note that those -two example "file" entries expect to find files named "init.sh" and "busybox" in -a directory called "initramfs", under the linux-2.6.* directory. See -Documentation/driver-api/early-userspace/early_userspace_support.rst for more details.) - -The kernel does not depend on external cpio tools. If you specify a -directory instead of a configuration file, the kernel's build infrastructure -creates a configuration file from that directory (usr/Makefile calls -usr/gen_initramfs_list.sh), and proceeds to package up that directory -using the config file (by feeding it to usr/gen_init_cpio, which is created -from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is -entirely self-contained, and the kernel's boot-time extractor is also -(obviously) self-contained. - -The one thing you might need external cpio utilities installed for is creating -or extracting your own preprepared cpio files to feed to the kernel build -(instead of a config file or directory). - -The following command line can extract a cpio image (either by the above script -or by the kernel build) back into its component files: - - cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames - -The following shell script can create a prebuilt cpio archive you can -use in place of the above config file: - - #!/bin/sh - - # Copyright 2006 Rob Landley and TimeSys Corporation. - # Licensed under GPL version 2 - - if [ $# -ne 2 ] - then - echo "usage: mkinitramfs directory imagename.cpio.gz" - exit 1 - fi - - if [ -d "$1" ] - then - echo "creating $2 from $1" - (cd "$1"; find . | cpio -o -H newc | gzip) > "$2" - else - echo "First argument must be a directory" - exit 1 - fi - -Note: The cpio man page contains some bad advice that will break your initramfs -archive if you follow it. It says "A typical way to generate the list -of filenames is with the find command; you should give find the -depth option -to minimize problems with permissions on directories that are unwritable or not -searchable." Don't do this when creating initramfs.cpio.gz images, it won't -work. The Linux kernel cpio extractor won't create files in a directory that -doesn't exist, so the directory entries must go before the files that go in -those directories. The above script gets them in the right order. - -External initramfs images: --------------------------- - -If the kernel has initrd support enabled, an external cpio.gz archive can also -be passed into a 2.6 kernel in place of an initrd. In this case, the kernel -will autodetect the type (initramfs, not initrd) and extract the external cpio -archive into rootfs before trying to run /init. - -This has the memory efficiency advantages of initramfs (no ramdisk block -device) but the separate packaging of initrd (which is nice if you have -non-GPL code you'd like to run from initramfs, without conflating it with -the GPL licensed Linux kernel binary). - -It can also be used to supplement the kernel's built-in initramfs image. The -files in the external archive will overwrite any conflicting files in -the built-in initramfs archive. Some distributors also prefer to customize -a single kernel image with task-specific initramfs images, without recompiling. - -Contents of initramfs: ----------------------- - -An initramfs archive is a complete self-contained root filesystem for Linux. -If you don't already understand what shared libraries, devices, and paths -you need to get a minimal root filesystem up and running, here are some -references: -http://www.tldp.org/HOWTO/Bootdisk-HOWTO/ -http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html -http://www.linuxfromscratch.org/lfs/view/stable/ - -The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is -designed to be a tiny C library to statically link early userspace -code against, along with some related utilities. It is BSD licensed. - -I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) -myself. These are LGPL and GPL, respectively. (A self-contained initramfs -package is planned for the busybox 1.3 release.) - -In theory you could use glibc, but that's not well suited for small embedded -uses like this. (A "hello world" program statically linked against glibc is -over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do -name lookups, even when otherwise statically linked.) - -A good first step is to get initramfs to run a statically linked "hello world" -program as init, and test it under an emulator like qemu (www.qemu.org) or -User Mode Linux, like so: - - cat > hello.c << EOF - #include - #include - - int main(int argc, char *argv[]) - { - printf("Hello world!\n"); - sleep(999999999); - } - EOF - gcc -static hello.c -o init - echo init | cpio -o -H newc | gzip > test.cpio.gz - # Testing external initramfs using the initrd loading mechanism. - qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero - -When debugging a normal root filesystem, it's nice to be able to boot with -"init=/bin/sh". The initramfs equivalent is "rdinit=/bin/sh", and it's -just as useful. - -Why cpio rather than tar? -------------------------- - -This decision was made back in December, 2001. The discussion started here: - - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html - -And spawned a second thread (specifically on tar vs cpio), starting here: - - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html - -The quick and dirty summary version (which is no substitute for reading -the above threads) is: - -1) cpio is a standard. It's decades old (from the AT&T days), and already - widely used on Linux (inside RPM, Red Hat's device driver disks). Here's - a Linux Journal article about it from 1996: - - http://www.linuxjournal.com/article/1213 - - It's not as popular as tar because the traditional cpio command line tools - require _truly_hideous_ command line arguments. But that says nothing - either way about the archive format, and there are alternative tools, - such as: - - http://freecode.com/projects/afio - -2) The cpio archive format chosen by the kernel is simpler and cleaner (and - thus easier to create and parse) than any of the (literally dozens of) - various tar archive formats. The complete initramfs archive format is - explained in buffer-format.txt, created in usr/gen_init_cpio.c, and - extracted in init/initramfs.c. All three together come to less than 26k - total of human-readable text. - -3) The GNU project standardizing on tar is approximately as relevant as - Windows standardizing on zip. Linux is not part of either, and is free - to make its own technical decisions. - -4) Since this is a kernel internal format, it could easily have been - something brand new. The kernel provides its own tools to create and - extract this format anyway. Using an existing standard was preferable, - but not essential. - -5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be - supported on the kernel side"): - - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html - - explained his reasoning: - - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html - - and, most importantly, designed and implemented the initramfs code. - -Future directions: ------------------- - -Today (2.6.16), initramfs is always compiled in, but not always used. The -kernel falls back to legacy boot code that is reached only if initramfs does -not contain an /init program. The fallback is legacy code, there to ensure a -smooth transition and allowing early boot functionality to gradually move to -"early userspace" (I.E. initramfs). - -The move to early userspace is necessary because finding and mounting the real -root device is complex. Root partitions can span multiple devices (raid or -separate journal). They can be out on the network (requiring dhcp, setting a -specific MAC address, logging into a server, etc). They can live on removable -media, with dynamically allocated major/minor numbers and persistent naming -issues requiring a full udev implementation to sort out. They can be -compressed, encrypted, copy-on-write, loopback mounted, strangely partitioned, -and so on. - -This kind of complexity (which inevitably includes policy) is rightly handled -in userspace. Both klibc and busybox/uClibc are working on simple initramfs -packages to drop into a kernel build. - -The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree. -The kernel's current early boot code (partition detection, etc) will probably -be migrated into a default initramfs, automatically created and used by the -kernel build. -- cgit From 56e6d5c0eb7b862b4c984107e665821722413008 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:21 +0100 Subject: docs: filesystems: convert relay.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Use notes markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/f48bb0fdf64d197f28c6f469adb61a7a091adb75.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/relay.rst | 501 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/relay.txt | 494 ----------------------------------- 3 files changed, 502 insertions(+), 494 deletions(-) create mode 100644 Documentation/filesystems/relay.rst delete mode 100644 Documentation/filesystems/relay.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index b8689d082911..0aade8146d4d 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -84,5 +84,6 @@ Documentation for filesystem implementations. proc qnx6 ramfs-rootfs-initramfs + relay virtiofs vfat diff --git a/Documentation/filesystems/relay.rst b/Documentation/filesystems/relay.rst new file mode 100644 index 000000000000..04ad083cfe62 --- /dev/null +++ b/Documentation/filesystems/relay.rst @@ -0,0 +1,501 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================== +relay interface (formerly relayfs) +================================== + +The relay interface provides a means for kernel applications to +efficiently log and transfer large quantities of data from the kernel +to userspace via user-defined 'relay channels'. + +A 'relay channel' is a kernel->user data relay mechanism implemented +as a set of per-cpu kernel buffers ('channel buffers'), each +represented as a regular file ('relay file') in user space. Kernel +clients write into the channel buffers using efficient write +functions; these automatically log into the current cpu's channel +buffer. User space applications mmap() or read() from the relay files +and retrieve the data as it becomes available. The relay files +themselves are files created in a host filesystem, e.g. debugfs, and +are associated with the channel buffers using the API described below. + +The format of the data logged into the channel buffers is completely +up to the kernel client; the relay interface does however provide +hooks which allow kernel clients to impose some structure on the +buffer data. The relay interface doesn't implement any form of data +filtering - this also is left to the kernel client. The purpose is to +keep things as simple as possible. + +This document provides an overview of the relay interface API. The +details of the function parameters are documented along with the +functions in the relay interface code - please see that for details. + +Semantics +========= + +Each relay channel has one buffer per CPU, each buffer has one or more +sub-buffers. Messages are written to the first sub-buffer until it is +too full to contain a new message, in which case it is written to +the next (if available). Messages are never split across sub-buffers. +At this point, userspace can be notified so it empties the first +sub-buffer, while the kernel continues writing to the next. + +When notified that a sub-buffer is full, the kernel knows how many +bytes of it are padding i.e. unused space occurring because a complete +message couldn't fit into a sub-buffer. Userspace can use this +knowledge to copy only valid data. + +After copying it, userspace can notify the kernel that a sub-buffer +has been consumed. + +A relay channel can operate in a mode where it will overwrite data not +yet collected by userspace, and not wait for it to be consumed. + +The relay channel itself does not provide for communication of such +data between userspace and kernel, allowing the kernel side to remain +simple and not impose a single interface on userspace. It does +provide a set of examples and a separate helper though, described +below. + +The read() interface both removes padding and internally consumes the +read sub-buffers; thus in cases where read(2) is being used to drain +the channel buffers, special-purpose communication between kernel and +user isn't necessary for basic operation. + +One of the major goals of the relay interface is to provide a low +overhead mechanism for conveying kernel data to userspace. While the +read() interface is easy to use, it's not as efficient as the mmap() +approach; the example code attempts to make the tradeoff between the +two approaches as small as possible. + +klog and relay-apps example code +================================ + +The relay interface itself is ready to use, but to make things easier, +a couple simple utility functions and a set of examples are provided. + +The relay-apps example tarball, available on the relay sourceforge +site, contains a set of self-contained examples, each consisting of a +pair of .c files containing boilerplate code for each of the user and +kernel sides of a relay application. When combined these two sets of +boilerplate code provide glue to easily stream data to disk, without +having to bother with mundane housekeeping chores. + +The 'klog debugging functions' patch (klog.patch in the relay-apps +tarball) provides a couple of high-level logging functions to the +kernel which allow writing formatted text or raw data to a channel, +regardless of whether a channel to write into exists or not, or even +whether the relay interface is compiled into the kernel or not. These +functions allow you to put unconditional 'trace' statements anywhere +in the kernel or kernel modules; only when there is a 'klog handler' +registered will data actually be logged (see the klog and kleak +examples for details). + +It is of course possible to use the relay interface from scratch, +i.e. without using any of the relay-apps example code or klog, but +you'll have to implement communication between userspace and kernel, +allowing both to convey the state of buffers (full, empty, amount of +padding). The read() interface both removes padding and internally +consumes the read sub-buffers; thus in cases where read(2) is being +used to drain the channel buffers, special-purpose communication +between kernel and user isn't necessary for basic operation. Things +such as buffer-full conditions would still need to be communicated via +some channel though. + +klog and the relay-apps examples can be found in the relay-apps +tarball on http://relayfs.sourceforge.net + +The relay interface user space API +================================== + +The relay interface implements basic file operations for user space +access to relay channel buffer data. Here are the file operations +that are available and some comments regarding their behavior: + +=========== ============================================================ +open() enables user to open an _existing_ channel buffer. + +mmap() results in channel buffer being mapped into the caller's + memory space. Note that you can't do a partial mmap - you + must map the entire file, which is NRBUF * SUBBUFSIZE. + +read() read the contents of a channel buffer. The bytes read are + 'consumed' by the reader, i.e. they won't be available + again to subsequent reads. If the channel is being used + in no-overwrite mode (the default), it can be read at any + time even if there's an active kernel writer. If the + channel is being used in overwrite mode and there are + active channel writers, results may be unpredictable - + users should make sure that all logging to the channel has + ended before using read() with overwrite mode. Sub-buffer + padding is automatically removed and will not be seen by + the reader. + +sendfile() transfer data from a channel buffer to an output file + descriptor. Sub-buffer padding is automatically removed + and will not be seen by the reader. + +poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are + notified when sub-buffer boundaries are crossed. + +close() decrements the channel buffer's refcount. When the refcount + reaches 0, i.e. when no process or kernel client has the + buffer open, the channel buffer is freed. +=========== ============================================================ + +In order for a user application to make use of relay files, the +host filesystem must be mounted. For example:: + + mount -t debugfs debugfs /sys/kernel/debug + +.. Note:: + + the host filesystem doesn't need to be mounted for kernel + clients to create or use channels - it only needs to be + mounted when user space applications need access to the buffer + data. + + +The relay interface kernel API +============================== + +Here's a summary of the API the relay interface provides to in-kernel clients: + +TBD(curr. line MT:/API/) + channel management functions:: + + relay_open(base_filename, parent, subbuf_size, n_subbufs, + callbacks, private_data) + relay_close(chan) + relay_flush(chan) + relay_reset(chan) + + channel management typically called on instigation of userspace:: + + relay_subbufs_consumed(chan, cpu, subbufs_consumed) + + write functions:: + + relay_write(chan, data, length) + __relay_write(chan, data, length) + relay_reserve(chan, length) + + callbacks:: + + subbuf_start(buf, subbuf, prev_subbuf, prev_padding) + buf_mapped(buf, filp) + buf_unmapped(buf, filp) + create_buf_file(filename, parent, mode, buf, is_global) + remove_buf_file(dentry) + + helper functions:: + + relay_buf_full(buf) + subbuf_start_reserve(buf, length) + + +Creating a channel +------------------ + +relay_open() is used to create a channel, along with its per-cpu +channel buffers. Each channel buffer will have an associated file +created for it in the host filesystem, which can be and mmapped or +read from in user space. The files are named basename0...basenameN-1 +where N is the number of online cpus, and by default will be created +in the root of the filesystem (if the parent param is NULL). If you +want a directory structure to contain your relay files, you should +create it using the host filesystem's directory creation function, +e.g. debugfs_create_dir(), and pass the parent directory to +relay_open(). Users are responsible for cleaning up any directory +structure they create, when the channel is closed - again the host +filesystem's directory removal functions should be used for that, +e.g. debugfs_remove(). + +In order for a channel to be created and the host filesystem's files +associated with its channel buffers, the user must provide definitions +for two callback functions, create_buf_file() and remove_buf_file(). +create_buf_file() is called once for each per-cpu buffer from +relay_open() and allows the user to create the file which will be used +to represent the corresponding channel buffer. The callback should +return the dentry of the file created to represent the channel buffer. +remove_buf_file() must also be defined; it's responsible for deleting +the file(s) created in create_buf_file() and is called during +relay_close(). + +Here are some typical definitions for these callbacks, in this case +using debugfs:: + + /* + * create_buf_file() callback. Creates relay file in debugfs. + */ + static struct dentry *create_buf_file_handler(const char *filename, + struct dentry *parent, + umode_t mode, + struct rchan_buf *buf, + int *is_global) + { + return debugfs_create_file(filename, mode, parent, buf, + &relay_file_operations); + } + + /* + * remove_buf_file() callback. Removes relay file from debugfs. + */ + static int remove_buf_file_handler(struct dentry *dentry) + { + debugfs_remove(dentry); + + return 0; + } + + /* + * relay interface callbacks + */ + static struct rchan_callbacks relay_callbacks = + { + .create_buf_file = create_buf_file_handler, + .remove_buf_file = remove_buf_file_handler, + }; + +And an example relay_open() invocation using them:: + + chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks, NULL); + +If the create_buf_file() callback fails, or isn't defined, channel +creation and thus relay_open() will fail. + +The total size of each per-cpu buffer is calculated by multiplying the +number of sub-buffers by the sub-buffer size passed into relay_open(). +The idea behind sub-buffers is that they're basically an extension of +double-buffering to N buffers, and they also allow applications to +easily implement random-access-on-buffer-boundary schemes, which can +be important for some high-volume applications. The number and size +of sub-buffers is completely dependent on the application and even for +the same application, different conditions will warrant different +values for these parameters at different times. Typically, the right +values to use are best decided after some experimentation; in general, +though, it's safe to assume that having only 1 sub-buffer is a bad +idea - you're guaranteed to either overwrite data or lose events +depending on the channel mode being used. + +The create_buf_file() implementation can also be defined in such a way +as to allow the creation of a single 'global' buffer instead of the +default per-cpu set. This can be useful for applications interested +mainly in seeing the relative ordering of system-wide events without +the need to bother with saving explicit timestamps for the purpose of +merging/sorting per-cpu files in a postprocessing step. + +To have relay_open() create a global buffer, the create_buf_file() +implementation should set the value of the is_global outparam to a +non-zero value in addition to creating the file that will be used to +represent the single buffer. In the case of a global buffer, +create_buf_file() and remove_buf_file() will be called only once. The +normal channel-writing functions, e.g. relay_write(), can still be +used - writes from any cpu will transparently end up in the global +buffer - but since it is a global buffer, callers should make sure +they use the proper locking for such a buffer, either by wrapping +writes in a spinlock, or by copying a write function from relay.h and +creating a local version that internally does the proper locking. + +The private_data passed into relay_open() allows clients to associate +user-defined data with a channel, and is immediately available +(including in create_buf_file()) via chan->private_data or +buf->chan->private_data. + +Buffer-only channels +-------------------- + +These channels have no files associated and can be created with +relay_open(NULL, NULL, ...). Such channels are useful in scenarios such +as when doing early tracing in the kernel, before the VFS is up. In these +cases, one may open a buffer-only channel and then call +relay_late_setup_files() when the kernel is ready to handle files, +to expose the buffered data to the userspace. + +Channel 'modes' +--------------- + +relay channels can be used in either of two modes - 'overwrite' or +'no-overwrite'. The mode is entirely determined by the implementation +of the subbuf_start() callback, as described below. The default if no +subbuf_start() callback is defined is 'no-overwrite' mode. If the +default mode suits your needs, and you plan to use the read() +interface to retrieve channel data, you can ignore the details of this +section, as it pertains mainly to mmap() implementations. + +In 'overwrite' mode, also known as 'flight recorder' mode, writes +continuously cycle around the buffer and will never fail, but will +unconditionally overwrite old data regardless of whether it's actually +been consumed. In no-overwrite mode, writes will fail, i.e. data will +be lost, if the number of unconsumed sub-buffers equals the total +number of sub-buffers in the channel. It should be clear that if +there is no consumer or if the consumer can't consume sub-buffers fast +enough, data will be lost in either case; the only difference is +whether data is lost from the beginning or the end of a buffer. + +As explained above, a relay channel is made of up one or more +per-cpu channel buffers, each implemented as a circular buffer +subdivided into one or more sub-buffers. Messages are written into +the current sub-buffer of the channel's current per-cpu buffer via the +write functions described below. Whenever a message can't fit into +the current sub-buffer, because there's no room left for it, the +client is notified via the subbuf_start() callback that a switch to a +new sub-buffer is about to occur. The client uses this callback to 1) +initialize the next sub-buffer if appropriate 2) finalize the previous +sub-buffer if appropriate and 3) return a boolean value indicating +whether or not to actually move on to the next sub-buffer. + +To implement 'no-overwrite' mode, the userspace client would provide +an implementation of the subbuf_start() callback something like the +following:: + + static int subbuf_start(struct rchan_buf *buf, + void *subbuf, + void *prev_subbuf, + unsigned int prev_padding) + { + if (prev_subbuf) + *((unsigned *)prev_subbuf) = prev_padding; + + if (relay_buf_full(buf)) + return 0; + + subbuf_start_reserve(buf, sizeof(unsigned int)); + + return 1; + } + +If the current buffer is full, i.e. all sub-buffers remain unconsumed, +the callback returns 0 to indicate that the buffer switch should not +occur yet, i.e. until the consumer has had a chance to read the +current set of ready sub-buffers. For the relay_buf_full() function +to make sense, the consumer is responsible for notifying the relay +interface when sub-buffers have been consumed via +relay_subbufs_consumed(). Any subsequent attempts to write into the +buffer will again invoke the subbuf_start() callback with the same +parameters; only when the consumer has consumed one or more of the +ready sub-buffers will relay_buf_full() return 0, in which case the +buffer switch can continue. + +The implementation of the subbuf_start() callback for 'overwrite' mode +would be very similar:: + + static int subbuf_start(struct rchan_buf *buf, + void *subbuf, + void *prev_subbuf, + size_t prev_padding) + { + if (prev_subbuf) + *((unsigned *)prev_subbuf) = prev_padding; + + subbuf_start_reserve(buf, sizeof(unsigned int)); + + return 1; + } + +In this case, the relay_buf_full() check is meaningless and the +callback always returns 1, causing the buffer switch to occur +unconditionally. It's also meaningless for the client to use the +relay_subbufs_consumed() function in this mode, as it's never +consulted. + +The default subbuf_start() implementation, used if the client doesn't +define any callbacks, or doesn't define the subbuf_start() callback, +implements the simplest possible 'no-overwrite' mode, i.e. it does +nothing but return 0. + +Header information can be reserved at the beginning of each sub-buffer +by calling the subbuf_start_reserve() helper function from within the +subbuf_start() callback. This reserved area can be used to store +whatever information the client wants. In the example above, room is +reserved in each sub-buffer to store the padding count for that +sub-buffer. This is filled in for the previous sub-buffer in the +subbuf_start() implementation; the padding value for the previous +sub-buffer is passed into the subbuf_start() callback along with a +pointer to the previous sub-buffer, since the padding value isn't +known until a sub-buffer is filled. The subbuf_start() callback is +also called for the first sub-buffer when the channel is opened, to +give the client a chance to reserve space in it. In this case the +previous sub-buffer pointer passed into the callback will be NULL, so +the client should check the value of the prev_subbuf pointer before +writing into the previous sub-buffer. + +Writing to a channel +-------------------- + +Kernel clients write data into the current cpu's channel buffer using +relay_write() or __relay_write(). relay_write() is the main logging +function - it uses local_irqsave() to protect the buffer and should be +used if you might be logging from interrupt context. If you know +you'll never be logging from interrupt context, you can use +__relay_write(), which only disables preemption. These functions +don't return a value, so you can't determine whether or not they +failed - the assumption is that you wouldn't want to check a return +value in the fast logging path anyway, and that they'll always succeed +unless the buffer is full and no-overwrite mode is being used, in +which case you can detect a failed write in the subbuf_start() +callback by calling the relay_buf_full() helper function. + +relay_reserve() is used to reserve a slot in a channel buffer which +can be written to later. This would typically be used in applications +that need to write directly into a channel buffer without having to +stage data in a temporary buffer beforehand. Because the actual write +may not happen immediately after the slot is reserved, applications +using relay_reserve() can keep a count of the number of bytes actually +written, either in space reserved in the sub-buffers themselves or as +a separate array. See the 'reserve' example in the relay-apps tarball +at http://relayfs.sourceforge.net for an example of how this can be +done. Because the write is under control of the client and is +separated from the reserve, relay_reserve() doesn't protect the buffer +at all - it's up to the client to provide the appropriate +synchronization when using relay_reserve(). + +Closing a channel +----------------- + +The client calls relay_close() when it's finished using the channel. +The channel and its associated buffers are destroyed when there are no +longer any references to any of the channel buffers. relay_flush() +forces a sub-buffer switch on all the channel buffers, and can be used +to finalize and process the last sub-buffers before the channel is +closed. + +Misc +---- + +Some applications may want to keep a channel around and re-use it +rather than open and close a new channel for each use. relay_reset() +can be used for this purpose - it resets a channel to its initial +state without reallocating channel buffer memory or destroying +existing mappings. It should however only be called when it's safe to +do so, i.e. when the channel isn't currently being written to. + +Finally, there are a couple of utility callbacks that can be used for +different purposes. buf_mapped() is called whenever a channel buffer +is mmapped from user space and buf_unmapped() is called when it's +unmapped. The client can use this notification to trigger actions +within the kernel application, such as enabling/disabling logging to +the channel. + + +Resources +========= + +For news, example code, mailing list, etc. see the relay interface homepage: + + http://relayfs.sourceforge.net + + +Credits +======= + +The ideas and specs for the relay interface came about as a result of +discussions on tracing involving the following: + +Michel Dagenais +Richard Moore +Bob Wisniewski +Karim Yaghmour +Tom Zanussi + +Also thanks to Hubertus Franke for a lot of useful suggestions and bug +reports. diff --git a/Documentation/filesystems/relay.txt b/Documentation/filesystems/relay.txt deleted file mode 100644 index cd709a94d054..000000000000 --- a/Documentation/filesystems/relay.txt +++ /dev/null @@ -1,494 +0,0 @@ -relay interface (formerly relayfs) -================================== - -The relay interface provides a means for kernel applications to -efficiently log and transfer large quantities of data from the kernel -to userspace via user-defined 'relay channels'. - -A 'relay channel' is a kernel->user data relay mechanism implemented -as a set of per-cpu kernel buffers ('channel buffers'), each -represented as a regular file ('relay file') in user space. Kernel -clients write into the channel buffers using efficient write -functions; these automatically log into the current cpu's channel -buffer. User space applications mmap() or read() from the relay files -and retrieve the data as it becomes available. The relay files -themselves are files created in a host filesystem, e.g. debugfs, and -are associated with the channel buffers using the API described below. - -The format of the data logged into the channel buffers is completely -up to the kernel client; the relay interface does however provide -hooks which allow kernel clients to impose some structure on the -buffer data. The relay interface doesn't implement any form of data -filtering - this also is left to the kernel client. The purpose is to -keep things as simple as possible. - -This document provides an overview of the relay interface API. The -details of the function parameters are documented along with the -functions in the relay interface code - please see that for details. - -Semantics -========= - -Each relay channel has one buffer per CPU, each buffer has one or more -sub-buffers. Messages are written to the first sub-buffer until it is -too full to contain a new message, in which case it is written to -the next (if available). Messages are never split across sub-buffers. -At this point, userspace can be notified so it empties the first -sub-buffer, while the kernel continues writing to the next. - -When notified that a sub-buffer is full, the kernel knows how many -bytes of it are padding i.e. unused space occurring because a complete -message couldn't fit into a sub-buffer. Userspace can use this -knowledge to copy only valid data. - -After copying it, userspace can notify the kernel that a sub-buffer -has been consumed. - -A relay channel can operate in a mode where it will overwrite data not -yet collected by userspace, and not wait for it to be consumed. - -The relay channel itself does not provide for communication of such -data between userspace and kernel, allowing the kernel side to remain -simple and not impose a single interface on userspace. It does -provide a set of examples and a separate helper though, described -below. - -The read() interface both removes padding and internally consumes the -read sub-buffers; thus in cases where read(2) is being used to drain -the channel buffers, special-purpose communication between kernel and -user isn't necessary for basic operation. - -One of the major goals of the relay interface is to provide a low -overhead mechanism for conveying kernel data to userspace. While the -read() interface is easy to use, it's not as efficient as the mmap() -approach; the example code attempts to make the tradeoff between the -two approaches as small as possible. - -klog and relay-apps example code -================================ - -The relay interface itself is ready to use, but to make things easier, -a couple simple utility functions and a set of examples are provided. - -The relay-apps example tarball, available on the relay sourceforge -site, contains a set of self-contained examples, each consisting of a -pair of .c files containing boilerplate code for each of the user and -kernel sides of a relay application. When combined these two sets of -boilerplate code provide glue to easily stream data to disk, without -having to bother with mundane housekeeping chores. - -The 'klog debugging functions' patch (klog.patch in the relay-apps -tarball) provides a couple of high-level logging functions to the -kernel which allow writing formatted text or raw data to a channel, -regardless of whether a channel to write into exists or not, or even -whether the relay interface is compiled into the kernel or not. These -functions allow you to put unconditional 'trace' statements anywhere -in the kernel or kernel modules; only when there is a 'klog handler' -registered will data actually be logged (see the klog and kleak -examples for details). - -It is of course possible to use the relay interface from scratch, -i.e. without using any of the relay-apps example code or klog, but -you'll have to implement communication between userspace and kernel, -allowing both to convey the state of buffers (full, empty, amount of -padding). The read() interface both removes padding and internally -consumes the read sub-buffers; thus in cases where read(2) is being -used to drain the channel buffers, special-purpose communication -between kernel and user isn't necessary for basic operation. Things -such as buffer-full conditions would still need to be communicated via -some channel though. - -klog and the relay-apps examples can be found in the relay-apps -tarball on http://relayfs.sourceforge.net - -The relay interface user space API -================================== - -The relay interface implements basic file operations for user space -access to relay channel buffer data. Here are the file operations -that are available and some comments regarding their behavior: - -open() enables user to open an _existing_ channel buffer. - -mmap() results in channel buffer being mapped into the caller's - memory space. Note that you can't do a partial mmap - you - must map the entire file, which is NRBUF * SUBBUFSIZE. - -read() read the contents of a channel buffer. The bytes read are - 'consumed' by the reader, i.e. they won't be available - again to subsequent reads. If the channel is being used - in no-overwrite mode (the default), it can be read at any - time even if there's an active kernel writer. If the - channel is being used in overwrite mode and there are - active channel writers, results may be unpredictable - - users should make sure that all logging to the channel has - ended before using read() with overwrite mode. Sub-buffer - padding is automatically removed and will not be seen by - the reader. - -sendfile() transfer data from a channel buffer to an output file - descriptor. Sub-buffer padding is automatically removed - and will not be seen by the reader. - -poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are - notified when sub-buffer boundaries are crossed. - -close() decrements the channel buffer's refcount. When the refcount - reaches 0, i.e. when no process or kernel client has the - buffer open, the channel buffer is freed. - -In order for a user application to make use of relay files, the -host filesystem must be mounted. For example, - - mount -t debugfs debugfs /sys/kernel/debug - -NOTE: the host filesystem doesn't need to be mounted for kernel - clients to create or use channels - it only needs to be - mounted when user space applications need access to the buffer - data. - - -The relay interface kernel API -============================== - -Here's a summary of the API the relay interface provides to in-kernel clients: - -TBD(curr. line MT:/API/) - channel management functions: - - relay_open(base_filename, parent, subbuf_size, n_subbufs, - callbacks, private_data) - relay_close(chan) - relay_flush(chan) - relay_reset(chan) - - channel management typically called on instigation of userspace: - - relay_subbufs_consumed(chan, cpu, subbufs_consumed) - - write functions: - - relay_write(chan, data, length) - __relay_write(chan, data, length) - relay_reserve(chan, length) - - callbacks: - - subbuf_start(buf, subbuf, prev_subbuf, prev_padding) - buf_mapped(buf, filp) - buf_unmapped(buf, filp) - create_buf_file(filename, parent, mode, buf, is_global) - remove_buf_file(dentry) - - helper functions: - - relay_buf_full(buf) - subbuf_start_reserve(buf, length) - - -Creating a channel ------------------- - -relay_open() is used to create a channel, along with its per-cpu -channel buffers. Each channel buffer will have an associated file -created for it in the host filesystem, which can be and mmapped or -read from in user space. The files are named basename0...basenameN-1 -where N is the number of online cpus, and by default will be created -in the root of the filesystem (if the parent param is NULL). If you -want a directory structure to contain your relay files, you should -create it using the host filesystem's directory creation function, -e.g. debugfs_create_dir(), and pass the parent directory to -relay_open(). Users are responsible for cleaning up any directory -structure they create, when the channel is closed - again the host -filesystem's directory removal functions should be used for that, -e.g. debugfs_remove(). - -In order for a channel to be created and the host filesystem's files -associated with its channel buffers, the user must provide definitions -for two callback functions, create_buf_file() and remove_buf_file(). -create_buf_file() is called once for each per-cpu buffer from -relay_open() and allows the user to create the file which will be used -to represent the corresponding channel buffer. The callback should -return the dentry of the file created to represent the channel buffer. -remove_buf_file() must also be defined; it's responsible for deleting -the file(s) created in create_buf_file() and is called during -relay_close(). - -Here are some typical definitions for these callbacks, in this case -using debugfs: - -/* - * create_buf_file() callback. Creates relay file in debugfs. - */ -static struct dentry *create_buf_file_handler(const char *filename, - struct dentry *parent, - umode_t mode, - struct rchan_buf *buf, - int *is_global) -{ - return debugfs_create_file(filename, mode, parent, buf, - &relay_file_operations); -} - -/* - * remove_buf_file() callback. Removes relay file from debugfs. - */ -static int remove_buf_file_handler(struct dentry *dentry) -{ - debugfs_remove(dentry); - - return 0; -} - -/* - * relay interface callbacks - */ -static struct rchan_callbacks relay_callbacks = -{ - .create_buf_file = create_buf_file_handler, - .remove_buf_file = remove_buf_file_handler, -}; - -And an example relay_open() invocation using them: - - chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks, NULL); - -If the create_buf_file() callback fails, or isn't defined, channel -creation and thus relay_open() will fail. - -The total size of each per-cpu buffer is calculated by multiplying the -number of sub-buffers by the sub-buffer size passed into relay_open(). -The idea behind sub-buffers is that they're basically an extension of -double-buffering to N buffers, and they also allow applications to -easily implement random-access-on-buffer-boundary schemes, which can -be important for some high-volume applications. The number and size -of sub-buffers is completely dependent on the application and even for -the same application, different conditions will warrant different -values for these parameters at different times. Typically, the right -values to use are best decided after some experimentation; in general, -though, it's safe to assume that having only 1 sub-buffer is a bad -idea - you're guaranteed to either overwrite data or lose events -depending on the channel mode being used. - -The create_buf_file() implementation can also be defined in such a way -as to allow the creation of a single 'global' buffer instead of the -default per-cpu set. This can be useful for applications interested -mainly in seeing the relative ordering of system-wide events without -the need to bother with saving explicit timestamps for the purpose of -merging/sorting per-cpu files in a postprocessing step. - -To have relay_open() create a global buffer, the create_buf_file() -implementation should set the value of the is_global outparam to a -non-zero value in addition to creating the file that will be used to -represent the single buffer. In the case of a global buffer, -create_buf_file() and remove_buf_file() will be called only once. The -normal channel-writing functions, e.g. relay_write(), can still be -used - writes from any cpu will transparently end up in the global -buffer - but since it is a global buffer, callers should make sure -they use the proper locking for such a buffer, either by wrapping -writes in a spinlock, or by copying a write function from relay.h and -creating a local version that internally does the proper locking. - -The private_data passed into relay_open() allows clients to associate -user-defined data with a channel, and is immediately available -(including in create_buf_file()) via chan->private_data or -buf->chan->private_data. - -Buffer-only channels --------------------- - -These channels have no files associated and can be created with -relay_open(NULL, NULL, ...). Such channels are useful in scenarios such -as when doing early tracing in the kernel, before the VFS is up. In these -cases, one may open a buffer-only channel and then call -relay_late_setup_files() when the kernel is ready to handle files, -to expose the buffered data to the userspace. - -Channel 'modes' ---------------- - -relay channels can be used in either of two modes - 'overwrite' or -'no-overwrite'. The mode is entirely determined by the implementation -of the subbuf_start() callback, as described below. The default if no -subbuf_start() callback is defined is 'no-overwrite' mode. If the -default mode suits your needs, and you plan to use the read() -interface to retrieve channel data, you can ignore the details of this -section, as it pertains mainly to mmap() implementations. - -In 'overwrite' mode, also known as 'flight recorder' mode, writes -continuously cycle around the buffer and will never fail, but will -unconditionally overwrite old data regardless of whether it's actually -been consumed. In no-overwrite mode, writes will fail, i.e. data will -be lost, if the number of unconsumed sub-buffers equals the total -number of sub-buffers in the channel. It should be clear that if -there is no consumer or if the consumer can't consume sub-buffers fast -enough, data will be lost in either case; the only difference is -whether data is lost from the beginning or the end of a buffer. - -As explained above, a relay channel is made of up one or more -per-cpu channel buffers, each implemented as a circular buffer -subdivided into one or more sub-buffers. Messages are written into -the current sub-buffer of the channel's current per-cpu buffer via the -write functions described below. Whenever a message can't fit into -the current sub-buffer, because there's no room left for it, the -client is notified via the subbuf_start() callback that a switch to a -new sub-buffer is about to occur. The client uses this callback to 1) -initialize the next sub-buffer if appropriate 2) finalize the previous -sub-buffer if appropriate and 3) return a boolean value indicating -whether or not to actually move on to the next sub-buffer. - -To implement 'no-overwrite' mode, the userspace client would provide -an implementation of the subbuf_start() callback something like the -following: - -static int subbuf_start(struct rchan_buf *buf, - void *subbuf, - void *prev_subbuf, - unsigned int prev_padding) -{ - if (prev_subbuf) - *((unsigned *)prev_subbuf) = prev_padding; - - if (relay_buf_full(buf)) - return 0; - - subbuf_start_reserve(buf, sizeof(unsigned int)); - - return 1; -} - -If the current buffer is full, i.e. all sub-buffers remain unconsumed, -the callback returns 0 to indicate that the buffer switch should not -occur yet, i.e. until the consumer has had a chance to read the -current set of ready sub-buffers. For the relay_buf_full() function -to make sense, the consumer is responsible for notifying the relay -interface when sub-buffers have been consumed via -relay_subbufs_consumed(). Any subsequent attempts to write into the -buffer will again invoke the subbuf_start() callback with the same -parameters; only when the consumer has consumed one or more of the -ready sub-buffers will relay_buf_full() return 0, in which case the -buffer switch can continue. - -The implementation of the subbuf_start() callback for 'overwrite' mode -would be very similar: - -static int subbuf_start(struct rchan_buf *buf, - void *subbuf, - void *prev_subbuf, - size_t prev_padding) -{ - if (prev_subbuf) - *((unsigned *)prev_subbuf) = prev_padding; - - subbuf_start_reserve(buf, sizeof(unsigned int)); - - return 1; -} - -In this case, the relay_buf_full() check is meaningless and the -callback always returns 1, causing the buffer switch to occur -unconditionally. It's also meaningless for the client to use the -relay_subbufs_consumed() function in this mode, as it's never -consulted. - -The default subbuf_start() implementation, used if the client doesn't -define any callbacks, or doesn't define the subbuf_start() callback, -implements the simplest possible 'no-overwrite' mode, i.e. it does -nothing but return 0. - -Header information can be reserved at the beginning of each sub-buffer -by calling the subbuf_start_reserve() helper function from within the -subbuf_start() callback. This reserved area can be used to store -whatever information the client wants. In the example above, room is -reserved in each sub-buffer to store the padding count for that -sub-buffer. This is filled in for the previous sub-buffer in the -subbuf_start() implementation; the padding value for the previous -sub-buffer is passed into the subbuf_start() callback along with a -pointer to the previous sub-buffer, since the padding value isn't -known until a sub-buffer is filled. The subbuf_start() callback is -also called for the first sub-buffer when the channel is opened, to -give the client a chance to reserve space in it. In this case the -previous sub-buffer pointer passed into the callback will be NULL, so -the client should check the value of the prev_subbuf pointer before -writing into the previous sub-buffer. - -Writing to a channel --------------------- - -Kernel clients write data into the current cpu's channel buffer using -relay_write() or __relay_write(). relay_write() is the main logging -function - it uses local_irqsave() to protect the buffer and should be -used if you might be logging from interrupt context. If you know -you'll never be logging from interrupt context, you can use -__relay_write(), which only disables preemption. These functions -don't return a value, so you can't determine whether or not they -failed - the assumption is that you wouldn't want to check a return -value in the fast logging path anyway, and that they'll always succeed -unless the buffer is full and no-overwrite mode is being used, in -which case you can detect a failed write in the subbuf_start() -callback by calling the relay_buf_full() helper function. - -relay_reserve() is used to reserve a slot in a channel buffer which -can be written to later. This would typically be used in applications -that need to write directly into a channel buffer without having to -stage data in a temporary buffer beforehand. Because the actual write -may not happen immediately after the slot is reserved, applications -using relay_reserve() can keep a count of the number of bytes actually -written, either in space reserved in the sub-buffers themselves or as -a separate array. See the 'reserve' example in the relay-apps tarball -at http://relayfs.sourceforge.net for an example of how this can be -done. Because the write is under control of the client and is -separated from the reserve, relay_reserve() doesn't protect the buffer -at all - it's up to the client to provide the appropriate -synchronization when using relay_reserve(). - -Closing a channel ------------------ - -The client calls relay_close() when it's finished using the channel. -The channel and its associated buffers are destroyed when there are no -longer any references to any of the channel buffers. relay_flush() -forces a sub-buffer switch on all the channel buffers, and can be used -to finalize and process the last sub-buffers before the channel is -closed. - -Misc ----- - -Some applications may want to keep a channel around and re-use it -rather than open and close a new channel for each use. relay_reset() -can be used for this purpose - it resets a channel to its initial -state without reallocating channel buffer memory or destroying -existing mappings. It should however only be called when it's safe to -do so, i.e. when the channel isn't currently being written to. - -Finally, there are a couple of utility callbacks that can be used for -different purposes. buf_mapped() is called whenever a channel buffer -is mmapped from user space and buf_unmapped() is called when it's -unmapped. The client can use this notification to trigger actions -within the kernel application, such as enabling/disabling logging to -the channel. - - -Resources -========= - -For news, example code, mailing list, etc. see the relay interface homepage: - - http://relayfs.sourceforge.net - - -Credits -======= - -The ideas and specs for the relay interface came about as a result of -discussions on tracing involving the following: - -Michel Dagenais -Richard Moore -Bob Wisniewski -Karim Yaghmour -Tom Zanussi - -Also thanks to Hubertus Franke for a lot of useful suggestions and bug -reports. -- cgit From 6db0a480aa07ab65b6c7d34d095c714359af3e87 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:22 +0100 Subject: docs: filesystems: convert romfs.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/d2cc83e7cd6de63c793ccd3f2588ea40f7f1e764.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/romfs.rst | 194 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/romfs.txt | 186 ---------------------------------- 3 files changed, 195 insertions(+), 186 deletions(-) create mode 100644 Documentation/filesystems/romfs.rst delete mode 100644 Documentation/filesystems/romfs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 0aade8146d4d..3b26639517af 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -85,5 +85,6 @@ Documentation for filesystem implementations. qnx6 ramfs-rootfs-initramfs relay + romfs virtiofs vfat diff --git a/Documentation/filesystems/romfs.rst b/Documentation/filesystems/romfs.rst new file mode 100644 index 000000000000..465b11efa9be --- /dev/null +++ b/Documentation/filesystems/romfs.rst @@ -0,0 +1,194 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +ROMFS - ROM File System +======================= + +This is a quite dumb, read only filesystem, mainly for initial RAM +disks of installation disks. It has grown up by the need of having +modules linked at boot time. Using this filesystem, you get a very +similar feature, and even the possibility of a small kernel, with a +file system which doesn't take up useful memory from the router +functions in the basement of your office. + +For comparison, both the older minix and xiafs (the latter is now +defunct) filesystems, compiled as module need more than 20000 bytes, +while romfs is less than a page, about 4000 bytes (assuming i586 +code). Under the same conditions, the msdos filesystem would need +about 30K (and does not support device nodes or symlinks), while the +nfs module with nfsroot is about 57K. Furthermore, as a bit unfair +comparison, an actual rescue disk used up 3202 blocks with ext2, while +with romfs, it needed 3079 blocks. + +To create such a file system, you'll need a user program named +genromfs. It is available on http://romfs.sourceforge.net/ + +As the name suggests, romfs could be also used (space-efficiently) on +various read-only media, like (E)EPROM disks if someone will have the +motivation.. :) + +However, the main purpose of romfs is to have a very small kernel, +which has only this filesystem linked in, and then can load any module +later, with the current module utilities. It can also be used to run +some program to decide if you need SCSI devices, and even IDE or +floppy drives can be loaded later if you use the "initrd"--initial +RAM disk--feature of the kernel. This would not be really news +flash, but with romfs, you can even spare off your ext2 or minix or +maybe even affs filesystem until you really know that you need it. + +For example, a distribution boot disk can contain only the cd disk +drivers (and possibly the SCSI drivers), and the ISO 9660 filesystem +module. The kernel can be small enough, since it doesn't have other +filesystems, like the quite large ext2fs module, which can then be +loaded off the CD at a later stage of the installation. Another use +would be for a recovery disk, when you are reinstalling a workstation +from the network, and you will have all the tools/modules available +from a nearby server, so you don't want to carry two disks for this +purpose, just because it won't fit into ext2. + +romfs operates on block devices as you can expect, and the underlying +structure is very simple. Every accessible structure begins on 16 +byte boundaries for fast access. The minimum space a file will take +is 32 bytes (this is an empty file, with a less than 16 character +name). The maximum overhead for any non-empty file is the header, and +the 16 byte padding for the name and the contents, also 16+14+15 = 45 +bytes. This is quite rare however, since most file names are longer +than 3 bytes, and shorter than 15 bytes. + +The layout of the filesystem is the following:: + + offset content + + +---+---+---+---+ + 0 | - | r | o | m | \ + +---+---+---+---+ The ASCII representation of those bytes + 4 | 1 | f | s | - | / (i.e. "-rom1fs-") + +---+---+---+---+ + 8 | full size | The number of accessible bytes in this fs. + +---+---+---+---+ + 12 | checksum | The checksum of the FIRST 512 BYTES. + +---+---+---+---+ + 16 | volume name | The zero terminated name of the volume, + : : padded to 16 byte boundary. + +---+---+---+---+ + xx | file | + : headers : + +Every multi byte value (32 bit words, I'll use the longwords term from +now on) must be in big endian order. + +The first eight bytes identify the filesystem, even for the casual +inspector. After that, in the 3rd longword, it contains the number of +bytes accessible from the start of this filesystem. The 4th longword +is the checksum of the first 512 bytes (or the number of bytes +accessible, whichever is smaller). The applied algorithm is the same +as in the AFFS filesystem, namely a simple sum of the longwords +(assuming bigendian quantities again). For details, please consult +the source. This algorithm was chosen because although it's not quite +reliable, it does not require any tables, and it is very simple. + +The following bytes are now part of the file system; each file header +must begin on a 16 byte boundary:: + + offset content + + +---+---+---+---+ + 0 | next filehdr|X| The offset of the next file header + +---+---+---+---+ (zero if no more files) + 4 | spec.info | Info for directories/hard links/devices + +---+---+---+---+ + 8 | size | The size of this file in bytes + +---+---+---+---+ + 12 | checksum | Covering the meta data, including the file + +---+---+---+---+ name, and padding + 16 | file name | The zero terminated name of the file, + : : padded to 16 byte boundary + +---+---+---+---+ + xx | file data | + : : + +Since the file headers begin always at a 16 byte boundary, the lowest +4 bits would be always zero in the next filehdr pointer. These four +bits are used for the mode information. Bits 0..2 specify the type of +the file; while bit 4 shows if the file is executable or not. The +permissions are assumed to be world readable, if this bit is not set, +and world executable if it is; except the character and block devices, +they are never accessible for other than owner. The owner of every +file is user and group 0, this should never be a problem for the +intended use. The mapping of the 8 possible values to file types is +the following: + +== =============== ============================================ + mapping spec.info means +== =============== ============================================ + 0 hard link link destination [file header] + 1 directory first file's header + 2 regular file unused, must be zero [MBZ] + 3 symbolic link unused, MBZ (file data is the link content) + 4 block device 16/16 bits major/minor number + 5 char device - " - + 6 socket unused, MBZ + 7 fifo unused, MBZ +== =============== ============================================ + +Note that hard links are specifically marked in this filesystem, but +they will behave as you can expect (i.e. share the inode number). +Note also that it is your responsibility to not create hard link +loops, and creating all the . and .. links for directories. This is +normally done correctly by the genromfs program. Please refrain from +using the executable bits for special purposes on the socket and fifo +special files, they may have other uses in the future. Additionally, +please remember that only regular files, and symlinks are supposed to +have a nonzero size field; they contain the number of bytes available +directly after the (padded) file name. + +Another thing to note is that romfs works on file headers and data +aligned to 16 byte boundaries, but most hardware devices and the block +device drivers are unable to cope with smaller than block-sized data. +To overcome this limitation, the whole size of the file system must be +padded to an 1024 byte boundary. + +If you have any problems or suggestions concerning this file system, +please contact me. However, think twice before wanting me to add +features and code, because the primary and most important advantage of +this file system is the small code. On the other hand, don't be +alarmed, I'm not getting that much romfs related mail. Now I can +understand why Avery wrote poems in the ARCnet docs to get some more +feedback. :) + +romfs has also a mailing list, and to date, it hasn't received any +traffic, so you are welcome to join it to discuss your ideas. :) + +It's run by ezmlm, so you can subscribe to it by sending a message +to romfs-subscribe@shadow.banki.hu, the content is irrelevant. + +Pending issues: + +- Permissions and owner information are pretty essential features of a + Un*x like system, but romfs does not provide the full possibilities. + I have never found this limiting, but others might. + +- The file system is read only, so it can be very small, but in case + one would want to write _anything_ to a file system, he still needs + a writable file system, thus negating the size advantages. Possible + solutions: implement write access as a compile-time option, or a new, + similarly small writable filesystem for RAM disks. + +- Since the files are only required to have alignment on a 16 byte + boundary, it is currently possibly suboptimal to read or execute files + from the filesystem. It might be resolved by reordering file data to + have most of it (i.e. except the start and the end) laying at "natural" + boundaries, thus it would be possible to directly map a big portion of + the file contents to the mm subsystem. + +- Compression might be an useful feature, but memory is quite a + limiting factor in my eyes. + +- Where it is used? + +- Does it work on other architectures than intel and motorola? + + +Have fun, + +Janos Farkas diff --git a/Documentation/filesystems/romfs.txt b/Documentation/filesystems/romfs.txt deleted file mode 100644 index e2b07cc9120a..000000000000 --- a/Documentation/filesystems/romfs.txt +++ /dev/null @@ -1,186 +0,0 @@ -ROMFS - ROM FILE SYSTEM - -This is a quite dumb, read only filesystem, mainly for initial RAM -disks of installation disks. It has grown up by the need of having -modules linked at boot time. Using this filesystem, you get a very -similar feature, and even the possibility of a small kernel, with a -file system which doesn't take up useful memory from the router -functions in the basement of your office. - -For comparison, both the older minix and xiafs (the latter is now -defunct) filesystems, compiled as module need more than 20000 bytes, -while romfs is less than a page, about 4000 bytes (assuming i586 -code). Under the same conditions, the msdos filesystem would need -about 30K (and does not support device nodes or symlinks), while the -nfs module with nfsroot is about 57K. Furthermore, as a bit unfair -comparison, an actual rescue disk used up 3202 blocks with ext2, while -with romfs, it needed 3079 blocks. - -To create such a file system, you'll need a user program named -genromfs. It is available on http://romfs.sourceforge.net/ - -As the name suggests, romfs could be also used (space-efficiently) on -various read-only media, like (E)EPROM disks if someone will have the -motivation.. :) - -However, the main purpose of romfs is to have a very small kernel, -which has only this filesystem linked in, and then can load any module -later, with the current module utilities. It can also be used to run -some program to decide if you need SCSI devices, and even IDE or -floppy drives can be loaded later if you use the "initrd"--initial -RAM disk--feature of the kernel. This would not be really news -flash, but with romfs, you can even spare off your ext2 or minix or -maybe even affs filesystem until you really know that you need it. - -For example, a distribution boot disk can contain only the cd disk -drivers (and possibly the SCSI drivers), and the ISO 9660 filesystem -module. The kernel can be small enough, since it doesn't have other -filesystems, like the quite large ext2fs module, which can then be -loaded off the CD at a later stage of the installation. Another use -would be for a recovery disk, when you are reinstalling a workstation -from the network, and you will have all the tools/modules available -from a nearby server, so you don't want to carry two disks for this -purpose, just because it won't fit into ext2. - -romfs operates on block devices as you can expect, and the underlying -structure is very simple. Every accessible structure begins on 16 -byte boundaries for fast access. The minimum space a file will take -is 32 bytes (this is an empty file, with a less than 16 character -name). The maximum overhead for any non-empty file is the header, and -the 16 byte padding for the name and the contents, also 16+14+15 = 45 -bytes. This is quite rare however, since most file names are longer -than 3 bytes, and shorter than 15 bytes. - -The layout of the filesystem is the following: - -offset content - - +---+---+---+---+ - 0 | - | r | o | m | \ - +---+---+---+---+ The ASCII representation of those bytes - 4 | 1 | f | s | - | / (i.e. "-rom1fs-") - +---+---+---+---+ - 8 | full size | The number of accessible bytes in this fs. - +---+---+---+---+ - 12 | checksum | The checksum of the FIRST 512 BYTES. - +---+---+---+---+ - 16 | volume name | The zero terminated name of the volume, - : : padded to 16 byte boundary. - +---+---+---+---+ - xx | file | - : headers : - -Every multi byte value (32 bit words, I'll use the longwords term from -now on) must be in big endian order. - -The first eight bytes identify the filesystem, even for the casual -inspector. After that, in the 3rd longword, it contains the number of -bytes accessible from the start of this filesystem. The 4th longword -is the checksum of the first 512 bytes (or the number of bytes -accessible, whichever is smaller). The applied algorithm is the same -as in the AFFS filesystem, namely a simple sum of the longwords -(assuming bigendian quantities again). For details, please consult -the source. This algorithm was chosen because although it's not quite -reliable, it does not require any tables, and it is very simple. - -The following bytes are now part of the file system; each file header -must begin on a 16 byte boundary. - -offset content - - +---+---+---+---+ - 0 | next filehdr|X| The offset of the next file header - +---+---+---+---+ (zero if no more files) - 4 | spec.info | Info for directories/hard links/devices - +---+---+---+---+ - 8 | size | The size of this file in bytes - +---+---+---+---+ - 12 | checksum | Covering the meta data, including the file - +---+---+---+---+ name, and padding - 16 | file name | The zero terminated name of the file, - : : padded to 16 byte boundary - +---+---+---+---+ - xx | file data | - : : - -Since the file headers begin always at a 16 byte boundary, the lowest -4 bits would be always zero in the next filehdr pointer. These four -bits are used for the mode information. Bits 0..2 specify the type of -the file; while bit 4 shows if the file is executable or not. The -permissions are assumed to be world readable, if this bit is not set, -and world executable if it is; except the character and block devices, -they are never accessible for other than owner. The owner of every -file is user and group 0, this should never be a problem for the -intended use. The mapping of the 8 possible values to file types is -the following: - - mapping spec.info means - 0 hard link link destination [file header] - 1 directory first file's header - 2 regular file unused, must be zero [MBZ] - 3 symbolic link unused, MBZ (file data is the link content) - 4 block device 16/16 bits major/minor number - 5 char device - " - - 6 socket unused, MBZ - 7 fifo unused, MBZ - -Note that hard links are specifically marked in this filesystem, but -they will behave as you can expect (i.e. share the inode number). -Note also that it is your responsibility to not create hard link -loops, and creating all the . and .. links for directories. This is -normally done correctly by the genromfs program. Please refrain from -using the executable bits for special purposes on the socket and fifo -special files, they may have other uses in the future. Additionally, -please remember that only regular files, and symlinks are supposed to -have a nonzero size field; they contain the number of bytes available -directly after the (padded) file name. - -Another thing to note is that romfs works on file headers and data -aligned to 16 byte boundaries, but most hardware devices and the block -device drivers are unable to cope with smaller than block-sized data. -To overcome this limitation, the whole size of the file system must be -padded to an 1024 byte boundary. - -If you have any problems or suggestions concerning this file system, -please contact me. However, think twice before wanting me to add -features and code, because the primary and most important advantage of -this file system is the small code. On the other hand, don't be -alarmed, I'm not getting that much romfs related mail. Now I can -understand why Avery wrote poems in the ARCnet docs to get some more -feedback. :) - -romfs has also a mailing list, and to date, it hasn't received any -traffic, so you are welcome to join it to discuss your ideas. :) - -It's run by ezmlm, so you can subscribe to it by sending a message -to romfs-subscribe@shadow.banki.hu, the content is irrelevant. - -Pending issues: - -- Permissions and owner information are pretty essential features of a -Un*x like system, but romfs does not provide the full possibilities. -I have never found this limiting, but others might. - -- The file system is read only, so it can be very small, but in case -one would want to write _anything_ to a file system, he still needs -a writable file system, thus negating the size advantages. Possible -solutions: implement write access as a compile-time option, or a new, -similarly small writable filesystem for RAM disks. - -- Since the files are only required to have alignment on a 16 byte -boundary, it is currently possibly suboptimal to read or execute files -from the filesystem. It might be resolved by reordering file data to -have most of it (i.e. except the start and the end) laying at "natural" -boundaries, thus it would be possible to directly map a big portion of -the file contents to the mm subsystem. - -- Compression might be an useful feature, but memory is quite a -limiting factor in my eyes. - -- Where it is used? - -- Does it work on other architectures than intel and motorola? - - -Have fun, -Janos Farkas -- cgit From 31771f45c8e46d9356f1a58329f5cd40ab331e1a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:23 +0100 Subject: docs: filesystems: convert squashfs.txt to ReST - Add a SPDX header; - Adjust document and section titles; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/cec30862c7ee7de7f9cd903e35e6c8bf74cc928a.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/squashfs.rst | 265 +++++++++++++++++++++++++++++++++ Documentation/filesystems/squashfs.txt | 259 -------------------------------- 3 files changed, 266 insertions(+), 259 deletions(-) create mode 100644 Documentation/filesystems/squashfs.rst delete mode 100644 Documentation/filesystems/squashfs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 3b26639517af..97a5f65ae509 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -86,5 +86,6 @@ Documentation for filesystem implementations. ramfs-rootfs-initramfs relay romfs + squashfs virtiofs vfat diff --git a/Documentation/filesystems/squashfs.rst b/Documentation/filesystems/squashfs.rst new file mode 100644 index 000000000000..df42106bae71 --- /dev/null +++ b/Documentation/filesystems/squashfs.rst @@ -0,0 +1,265 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +Squashfs 4.0 Filesystem +======================= + +Squashfs is a compressed read-only filesystem for Linux. + +It uses zlib, lz4, lzo, or xz compression to compress files, inodes and +directories. Inodes in the system are very small and all blocks are packed to +minimise data overhead. Block sizes greater than 4K are supported up to a +maximum of 1Mbytes (default block size 128K). + +Squashfs is intended for general read-only filesystem use, for archival +use (i.e. in cases where a .tar.gz file may be used), and in constrained +block device/memory systems (e.g. embedded systems) where low overhead is +needed. + +Mailing list: squashfs-devel@lists.sourceforge.net +Web site: www.squashfs.org + +1. Filesystem Features +---------------------- + +Squashfs filesystem features versus Cramfs: + +============================== ========= ========== + Squashfs Cramfs +============================== ========= ========== +Max filesystem size 2^64 256 MiB +Max file size ~ 2 TiB 16 MiB +Max files unlimited unlimited +Max directories unlimited unlimited +Max entries per directory unlimited unlimited +Max block size 1 MiB 4 KiB +Metadata compression yes no +Directory indexes yes no +Sparse file support yes no +Tail-end packing (fragments) yes no +Exportable (NFS etc.) yes no +Hard link support yes no +"." and ".." in readdir yes no +Real inode numbers yes no +32-bit uids/gids yes no +File creation time yes no +Xattr support yes no +ACL support no no +============================== ========= ========== + +Squashfs compresses data, inodes and directories. In addition, inode and +directory data are highly compacted, and packed on byte boundaries. Each +compressed inode is on average 8 bytes in length (the exact length varies on +file type, i.e. regular file, directory, symbolic link, and block/char device +inodes have different sizes). + +2. Using Squashfs +----------------- + +As squashfs is a read-only filesystem, the mksquashfs program must be used to +create populated squashfs filesystems. This and other squashfs utilities +can be obtained from http://www.squashfs.org. Usage instructions can be +obtained from this site also. + +The squashfs-tools development tree is now located on kernel.org + git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git + +3. Squashfs Filesystem Design +----------------------------- + +A squashfs filesystem consists of a maximum of nine parts, packed together on a +byte alignment:: + + --------------- + | superblock | + |---------------| + | compression | + | options | + |---------------| + | datablocks | + | & fragments | + |---------------| + | inode table | + |---------------| + | directory | + | table | + |---------------| + | fragment | + | table | + |---------------| + | export | + | table | + |---------------| + | uid/gid | + | lookup table | + |---------------| + | xattr | + | table | + --------------- + +Compressed data blocks are written to the filesystem as files are read from +the source directory, and checked for duplicates. Once all file data has been +written the completed inode, directory, fragment, export, uid/gid lookup and +xattr tables are written. + +3.1 Compression options +----------------------- + +Compressors can optionally support compression specific options (e.g. +dictionary size). If non-default compression options have been used, then +these are stored here. + +3.2 Inodes +---------- + +Metadata (inodes and directories) are compressed in 8Kbyte blocks. Each +compressed block is prefixed by a two byte length, the top bit is set if the +block is uncompressed. A block will be uncompressed if the -noI option is set, +or if the compressed block was larger than the uncompressed block. + +Inodes are packed into the metadata blocks, and are not aligned to block +boundaries, therefore inodes overlap compressed blocks. Inodes are identified +by a 48-bit number which encodes the location of the compressed metadata block +containing the inode, and the byte offset into that block where the inode is +placed (). + +To maximise compression there are different inodes for each file type +(regular file, directory, device, etc.), the inode contents and length +varying with the type. + +To further maximise compression, two types of regular file inode and +directory inode are defined: inodes optimised for frequently occurring +regular files and directories, and extended types where extra +information has to be stored. + +3.3 Directories +--------------- + +Like inodes, directories are packed into compressed metadata blocks, stored +in a directory table. Directories are accessed using the start address of +the metablock containing the directory and the offset into the +decompressed block (). + +Directories are organised in a slightly complex way, and are not simply +a list of file names. The organisation takes advantage of the +fact that (in most cases) the inodes of the files will be in the same +compressed metadata block, and therefore, can share the start block. +Directories are therefore organised in a two level list, a directory +header containing the shared start block value, and a sequence of directory +entries, each of which share the shared start block. A new directory header +is written once/if the inode start block changes. The directory +header/directory entry list is repeated as many times as necessary. + +Directories are sorted, and can contain a directory index to speed up +file lookup. Directory indexes store one entry per metablock, each entry +storing the index/filename mapping to the first directory header +in each metadata block. Directories are sorted in alphabetical order, +and at lookup the index is scanned linearly looking for the first filename +alphabetically larger than the filename being looked up. At this point the +location of the metadata block the filename is in has been found. +The general idea of the index is to ensure only one metadata block needs to be +decompressed to do a lookup irrespective of the length of the directory. +This scheme has the advantage that it doesn't require extra memory overhead +and doesn't require much extra storage on disk. + +3.4 File data +------------- + +Regular files consist of a sequence of contiguous compressed blocks, and/or a +compressed fragment block (tail-end packed block). The compressed size +of each datablock is stored in a block list contained within the +file inode. + +To speed up access to datablocks when reading 'large' files (256 Mbytes or +larger), the code implements an index cache that caches the mapping from +block index to datablock location on disk. + +The index cache allows Squashfs to handle large files (up to 1.75 TiB) while +retaining a simple and space-efficient block list on disk. The cache +is split into slots, caching up to eight 224 GiB files (128 KiB blocks). +Larger files use multiple slots, with 1.75 TiB files using all 8 slots. +The index cache is designed to be memory efficient, and by default uses +16 KiB. + +3.5 Fragment lookup table +------------------------- + +Regular files can contain a fragment index which is mapped to a fragment +location on disk and compressed size using a fragment lookup table. This +fragment lookup table is itself stored compressed into metadata blocks. +A second index table is used to locate these. This second index table for +speed of access (and because it is small) is read at mount time and cached +in memory. + +3.6 Uid/gid lookup table +------------------------ + +For space efficiency regular files store uid and gid indexes, which are +converted to 32-bit uids/gids using an id look up table. This table is +stored compressed into metadata blocks. A second index table is used to +locate these. This second index table for speed of access (and because it +is small) is read at mount time and cached in memory. + +3.7 Export table +---------------- + +To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems +can optionally (disabled with the -no-exports Mksquashfs option) contain +an inode number to inode disk location lookup table. This is required to +enable Squashfs to map inode numbers passed in filehandles to the inode +location on disk, which is necessary when the export code reinstantiates +expired/flushed inodes. + +This table is stored compressed into metadata blocks. A second index table is +used to locate these. This second index table for speed of access (and because +it is small) is read at mount time and cached in memory. + +3.8 Xattr table +--------------- + +The xattr table contains extended attributes for each inode. The xattrs +for each inode are stored in a list, each list entry containing a type, +name and value field. The type field encodes the xattr prefix +("user.", "trusted." etc) and it also encodes how the name/value fields +should be interpreted. Currently the type indicates whether the value +is stored inline (in which case the value field contains the xattr value), +or if it is stored out of line (in which case the value field stores a +reference to where the actual value is stored). This allows large values +to be stored out of line improving scanning and lookup performance and it +also allows values to be de-duplicated, the value being stored once, and +all other occurrences holding an out of line reference to that value. + +The xattr lists are packed into compressed 8K metadata blocks. +To reduce overhead in inodes, rather than storing the on-disk +location of the xattr list inside each inode, a 32-bit xattr id +is stored. This xattr id is mapped into the location of the xattr +list using a second xattr id lookup table. + +4. TODOs and Outstanding Issues +------------------------------- + +4.1 TODO list +------------- + +Implement ACL support. + +4.2 Squashfs Internal Cache +--------------------------- + +Blocks in Squashfs are compressed. To avoid repeatedly decompressing +recently accessed data Squashfs uses two small metadata and fragment caches. + +The cache is not used for file datablocks, these are decompressed and cached in +the page-cache in the normal way. The cache is used to temporarily cache +fragment and metadata blocks which have been read as a result of a metadata +(i.e. inode or directory) or fragment access. Because metadata and fragments +are packed together into blocks (to gain greater compression) the read of a +particular piece of metadata or fragment will retrieve other metadata/fragments +which have been packed with it, these because of locality-of-reference may be +read in the near future. Temporarily caching them ensures they are available +for near future access without requiring an additional read and decompress. + +In the future this internal cache may be replaced with an implementation which +uses the kernel page cache. Because the page cache operates on page sized +units this may introduce additional complexity in terms of locking and +associated race conditions. diff --git a/Documentation/filesystems/squashfs.txt b/Documentation/filesystems/squashfs.txt deleted file mode 100644 index e5274f84dc56..000000000000 --- a/Documentation/filesystems/squashfs.txt +++ /dev/null @@ -1,259 +0,0 @@ -SQUASHFS 4.0 FILESYSTEM -======================= - -Squashfs is a compressed read-only filesystem for Linux. -It uses zlib, lz4, lzo, or xz compression to compress files, inodes and -directories. Inodes in the system are very small and all blocks are packed to -minimise data overhead. Block sizes greater than 4K are supported up to a -maximum of 1Mbytes (default block size 128K). - -Squashfs is intended for general read-only filesystem use, for archival -use (i.e. in cases where a .tar.gz file may be used), and in constrained -block device/memory systems (e.g. embedded systems) where low overhead is -needed. - -Mailing list: squashfs-devel@lists.sourceforge.net -Web site: www.squashfs.org - -1. FILESYSTEM FEATURES ----------------------- - -Squashfs filesystem features versus Cramfs: - - Squashfs Cramfs - -Max filesystem size: 2^64 256 MiB -Max file size: ~ 2 TiB 16 MiB -Max files: unlimited unlimited -Max directories: unlimited unlimited -Max entries per directory: unlimited unlimited -Max block size: 1 MiB 4 KiB -Metadata compression: yes no -Directory indexes: yes no -Sparse file support: yes no -Tail-end packing (fragments): yes no -Exportable (NFS etc.): yes no -Hard link support: yes no -"." and ".." in readdir: yes no -Real inode numbers: yes no -32-bit uids/gids: yes no -File creation time: yes no -Xattr support: yes no -ACL support: no no - -Squashfs compresses data, inodes and directories. In addition, inode and -directory data are highly compacted, and packed on byte boundaries. Each -compressed inode is on average 8 bytes in length (the exact length varies on -file type, i.e. regular file, directory, symbolic link, and block/char device -inodes have different sizes). - -2. USING SQUASHFS ------------------ - -As squashfs is a read-only filesystem, the mksquashfs program must be used to -create populated squashfs filesystems. This and other squashfs utilities -can be obtained from http://www.squashfs.org. Usage instructions can be -obtained from this site also. - -The squashfs-tools development tree is now located on kernel.org - git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git - -3. SQUASHFS FILESYSTEM DESIGN ------------------------------ - -A squashfs filesystem consists of a maximum of nine parts, packed together on a -byte alignment: - - --------------- - | superblock | - |---------------| - | compression | - | options | - |---------------| - | datablocks | - | & fragments | - |---------------| - | inode table | - |---------------| - | directory | - | table | - |---------------| - | fragment | - | table | - |---------------| - | export | - | table | - |---------------| - | uid/gid | - | lookup table | - |---------------| - | xattr | - | table | - --------------- - -Compressed data blocks are written to the filesystem as files are read from -the source directory, and checked for duplicates. Once all file data has been -written the completed inode, directory, fragment, export, uid/gid lookup and -xattr tables are written. - -3.1 Compression options ------------------------ - -Compressors can optionally support compression specific options (e.g. -dictionary size). If non-default compression options have been used, then -these are stored here. - -3.2 Inodes ----------- - -Metadata (inodes and directories) are compressed in 8Kbyte blocks. Each -compressed block is prefixed by a two byte length, the top bit is set if the -block is uncompressed. A block will be uncompressed if the -noI option is set, -or if the compressed block was larger than the uncompressed block. - -Inodes are packed into the metadata blocks, and are not aligned to block -boundaries, therefore inodes overlap compressed blocks. Inodes are identified -by a 48-bit number which encodes the location of the compressed metadata block -containing the inode, and the byte offset into that block where the inode is -placed (). - -To maximise compression there are different inodes for each file type -(regular file, directory, device, etc.), the inode contents and length -varying with the type. - -To further maximise compression, two types of regular file inode and -directory inode are defined: inodes optimised for frequently occurring -regular files and directories, and extended types where extra -information has to be stored. - -3.3 Directories ---------------- - -Like inodes, directories are packed into compressed metadata blocks, stored -in a directory table. Directories are accessed using the start address of -the metablock containing the directory and the offset into the -decompressed block (). - -Directories are organised in a slightly complex way, and are not simply -a list of file names. The organisation takes advantage of the -fact that (in most cases) the inodes of the files will be in the same -compressed metadata block, and therefore, can share the start block. -Directories are therefore organised in a two level list, a directory -header containing the shared start block value, and a sequence of directory -entries, each of which share the shared start block. A new directory header -is written once/if the inode start block changes. The directory -header/directory entry list is repeated as many times as necessary. - -Directories are sorted, and can contain a directory index to speed up -file lookup. Directory indexes store one entry per metablock, each entry -storing the index/filename mapping to the first directory header -in each metadata block. Directories are sorted in alphabetical order, -and at lookup the index is scanned linearly looking for the first filename -alphabetically larger than the filename being looked up. At this point the -location of the metadata block the filename is in has been found. -The general idea of the index is to ensure only one metadata block needs to be -decompressed to do a lookup irrespective of the length of the directory. -This scheme has the advantage that it doesn't require extra memory overhead -and doesn't require much extra storage on disk. - -3.4 File data -------------- - -Regular files consist of a sequence of contiguous compressed blocks, and/or a -compressed fragment block (tail-end packed block). The compressed size -of each datablock is stored in a block list contained within the -file inode. - -To speed up access to datablocks when reading 'large' files (256 Mbytes or -larger), the code implements an index cache that caches the mapping from -block index to datablock location on disk. - -The index cache allows Squashfs to handle large files (up to 1.75 TiB) while -retaining a simple and space-efficient block list on disk. The cache -is split into slots, caching up to eight 224 GiB files (128 KiB blocks). -Larger files use multiple slots, with 1.75 TiB files using all 8 slots. -The index cache is designed to be memory efficient, and by default uses -16 KiB. - -3.5 Fragment lookup table -------------------------- - -Regular files can contain a fragment index which is mapped to a fragment -location on disk and compressed size using a fragment lookup table. This -fragment lookup table is itself stored compressed into metadata blocks. -A second index table is used to locate these. This second index table for -speed of access (and because it is small) is read at mount time and cached -in memory. - -3.6 Uid/gid lookup table ------------------------- - -For space efficiency regular files store uid and gid indexes, which are -converted to 32-bit uids/gids using an id look up table. This table is -stored compressed into metadata blocks. A second index table is used to -locate these. This second index table for speed of access (and because it -is small) is read at mount time and cached in memory. - -3.7 Export table ----------------- - -To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems -can optionally (disabled with the -no-exports Mksquashfs option) contain -an inode number to inode disk location lookup table. This is required to -enable Squashfs to map inode numbers passed in filehandles to the inode -location on disk, which is necessary when the export code reinstantiates -expired/flushed inodes. - -This table is stored compressed into metadata blocks. A second index table is -used to locate these. This second index table for speed of access (and because -it is small) is read at mount time and cached in memory. - -3.8 Xattr table ---------------- - -The xattr table contains extended attributes for each inode. The xattrs -for each inode are stored in a list, each list entry containing a type, -name and value field. The type field encodes the xattr prefix -("user.", "trusted." etc) and it also encodes how the name/value fields -should be interpreted. Currently the type indicates whether the value -is stored inline (in which case the value field contains the xattr value), -or if it is stored out of line (in which case the value field stores a -reference to where the actual value is stored). This allows large values -to be stored out of line improving scanning and lookup performance and it -also allows values to be de-duplicated, the value being stored once, and -all other occurrences holding an out of line reference to that value. - -The xattr lists are packed into compressed 8K metadata blocks. -To reduce overhead in inodes, rather than storing the on-disk -location of the xattr list inside each inode, a 32-bit xattr id -is stored. This xattr id is mapped into the location of the xattr -list using a second xattr id lookup table. - -4. TODOS AND OUTSTANDING ISSUES -------------------------------- - -4.1 Todo list -------------- - -Implement ACL support. - -4.2 Squashfs internal cache ---------------------------- - -Blocks in Squashfs are compressed. To avoid repeatedly decompressing -recently accessed data Squashfs uses two small metadata and fragment caches. - -The cache is not used for file datablocks, these are decompressed and cached in -the page-cache in the normal way. The cache is used to temporarily cache -fragment and metadata blocks which have been read as a result of a metadata -(i.e. inode or directory) or fragment access. Because metadata and fragments -are packed together into blocks (to gain greater compression) the read of a -particular piece of metadata or fragment will retrieve other metadata/fragments -which have been packed with it, these because of locality-of-reference may be -read in the near future. Temporarily caching them ensures they are available -for near future access without requiring an additional read and decompress. - -In the future this internal cache may be replaced with an implementation which -uses the kernel page cache. Because the page cache operates on page sized -units this may introduce additional complexity in terms of locking and -associated race conditions. -- cgit From 86beb976700b26576fe522a94a0b3a4e3d5ce424 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:24 +0100 Subject: docs: filesystems: convert sysfs.txt to ReST - Add a SPDX header; - Add a document title; - Adjust document and section titles; - use :field: markup; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/5c480dcb467315b5df6e25372a65e473b585c36d.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/sysfs.rst | 418 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/sysfs.txt | 408 ----------------------------------- 3 files changed, 419 insertions(+), 408 deletions(-) create mode 100644 Documentation/filesystems/sysfs.rst delete mode 100644 Documentation/filesystems/sysfs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 97a5f65ae509..bafe92c72433 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -87,5 +87,6 @@ Documentation for filesystem implementations. relay romfs squashfs + sysfs virtiofs vfat diff --git a/Documentation/filesystems/sysfs.rst b/Documentation/filesystems/sysfs.rst new file mode 100644 index 000000000000..290891c3fecb --- /dev/null +++ b/Documentation/filesystems/sysfs.rst @@ -0,0 +1,418 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================================== +sysfs - _The_ filesystem for exporting kernel objects +===================================================== + +Patrick Mochel + +Mike Murphy + +:Revised: 16 August 2011 +:Original: 10 January 2003 + + +What it is: +~~~~~~~~~~~ + +sysfs is a ram-based filesystem initially based on ramfs. It provides +a means to export kernel data structures, their attributes, and the +linkages between them to userspace. + +sysfs is tied inherently to the kobject infrastructure. Please read +Documentation/kobject.txt for more information concerning the kobject +interface. + + +Using sysfs +~~~~~~~~~~~ + +sysfs is always compiled in if CONFIG_SYSFS is defined. You can access +it by doing:: + + mount -t sysfs sysfs /sys + + +Directory Creation +~~~~~~~~~~~~~~~~~~ + +For every kobject that is registered with the system, a directory is +created for it in sysfs. That directory is created as a subdirectory +of the kobject's parent, expressing internal object hierarchies to +userspace. Top-level directories in sysfs represent the common +ancestors of object hierarchies; i.e. the subsystems the objects +belong to. + +Sysfs internally stores a pointer to the kobject that implements a +directory in the kernfs_node object associated with the directory. In +the past this kobject pointer has been used by sysfs to do reference +counting directly on the kobject whenever the file is opened or closed. +With the current sysfs implementation the kobject reference count is +only modified directly by the function sysfs_schedule_callback(). + + +Attributes +~~~~~~~~~~ + +Attributes can be exported for kobjects in the form of regular files in +the filesystem. Sysfs forwards file I/O operations to methods defined +for the attributes, providing a means to read and write kernel +attributes. + +Attributes should be ASCII text files, preferably with only one value +per file. It is noted that it may not be efficient to contain only one +value per file, so it is socially acceptable to express an array of +values of the same type. + +Mixing types, expressing multiple lines of data, and doing fancy +formatting of data is heavily frowned upon. Doing these things may get +you publicly humiliated and your code rewritten without notice. + + +An attribute definition is simply:: + + struct attribute { + char * name; + struct module *owner; + umode_t mode; + }; + + + int sysfs_create_file(struct kobject * kobj, const struct attribute * attr); + void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr); + + +A bare attribute contains no means to read or write the value of the +attribute. Subsystems are encouraged to define their own attribute +structure and wrapper functions for adding and removing attributes for +a specific object type. + +For example, the driver model defines struct device_attribute like:: + + struct device_attribute { + struct attribute attr; + ssize_t (*show)(struct device *dev, struct device_attribute *attr, + char *buf); + ssize_t (*store)(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count); + }; + + int device_create_file(struct device *, const struct device_attribute *); + void device_remove_file(struct device *, const struct device_attribute *); + +It also defines this helper for defining device attributes:: + + #define DEVICE_ATTR(_name, _mode, _show, _store) \ + struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store) + +For example, declaring:: + + static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo); + +is equivalent to doing:: + + static struct device_attribute dev_attr_foo = { + .attr = { + .name = "foo", + .mode = S_IWUSR | S_IRUGO, + }, + .show = show_foo, + .store = store_foo, + }; + +Note as stated in include/linux/kernel.h "OTHER_WRITABLE? Generally +considered a bad idea." so trying to set a sysfs file writable for +everyone will fail reverting to RO mode for "Others". + +For the common cases sysfs.h provides convenience macros to make +defining attributes easier as well as making code more concise and +readable. The above case could be shortened to: + +static struct device_attribute dev_attr_foo = __ATTR_RW(foo); + +the list of helpers available to define your wrapper function is: + +__ATTR_RO(name): + assumes default name_show and mode 0444 +__ATTR_WO(name): + assumes a name_store only and is restricted to mode + 0200 that is root write access only. +__ATTR_RO_MODE(name, mode): + fore more restrictive RO access currently + only use case is the EFI System Resource Table + (see drivers/firmware/efi/esrt.c) +__ATTR_RW(name): + assumes default name_show, name_store and setting + mode to 0644. +__ATTR_NULL: + which sets the name to NULL and is used as end of list + indicator (see: kernel/workqueue.c) + +Subsystem-Specific Callbacks +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When a subsystem defines a new attribute type, it must implement a +set of sysfs operations for forwarding read and write calls to the +show and store methods of the attribute owners:: + + struct sysfs_ops { + ssize_t (*show)(struct kobject *, struct attribute *, char *); + ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t); + }; + +[ Subsystems should have already defined a struct kobj_type as a +descriptor for this type, which is where the sysfs_ops pointer is +stored. See the kobject documentation for more information. ] + +When a file is read or written, sysfs calls the appropriate method +for the type. The method then translates the generic struct kobject +and struct attribute pointers to the appropriate pointer types, and +calls the associated methods. + + +To illustrate:: + + #define to_dev(obj) container_of(obj, struct device, kobj) + #define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr) + + static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, + char *buf) + { + struct device_attribute *dev_attr = to_dev_attr(attr); + struct device *dev = to_dev(kobj); + ssize_t ret = -EIO; + + if (dev_attr->show) + ret = dev_attr->show(dev, dev_attr, buf); + if (ret >= (ssize_t)PAGE_SIZE) { + printk("dev_attr_show: %pS returned bad count\n", + dev_attr->show); + } + return ret; + } + + + +Reading/Writing Attribute Data +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To read or write attributes, show() or store() methods must be +specified when declaring the attribute. The method types should be as +simple as those defined for device attributes:: + + ssize_t (*show)(struct device *dev, struct device_attribute *attr, char *buf); + ssize_t (*store)(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count); + +IOW, they should take only an object, an attribute, and a buffer as parameters. + + +sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the +method. Sysfs will call the method exactly once for each read or +write. This forces the following behavior on the method +implementations: + +- On read(2), the show() method should fill the entire buffer. + Recall that an attribute should only be exporting one value, or an + array of similar values, so this shouldn't be that expensive. + + This allows userspace to do partial reads and forward seeks + arbitrarily over the entire file at will. If userspace seeks back to + zero or does a pread(2) with an offset of '0' the show() method will + be called again, rearmed, to fill the buffer. + +- On write(2), sysfs expects the entire buffer to be passed during the + first write. Sysfs then passes the entire buffer to the store() method. + A terminating null is added after the data on stores. This makes + functions like sysfs_streq() safe to use. + + When writing sysfs files, userspace processes should first read the + entire file, modify the values it wishes to change, then write the + entire buffer back. + + Attribute method implementations should operate on an identical + buffer when reading and writing values. + +Other notes: + +- Writing causes the show() method to be rearmed regardless of current + file position. + +- The buffer will always be PAGE_SIZE bytes in length. On i386, this + is 4096. + +- show() methods should return the number of bytes printed into the + buffer. This is the return value of scnprintf(). + +- show() must not use snprintf() when formatting the value to be + returned to user space. If you can guarantee that an overflow + will never happen you can use sprintf() otherwise you must use + scnprintf(). + +- store() should return the number of bytes used from the buffer. If the + entire buffer has been used, just return the count argument. + +- show() or store() can always return errors. If a bad value comes + through, be sure to return an error. + +- The object passed to the methods will be pinned in memory via sysfs + referencing counting its embedded object. However, the physical + entity (e.g. device) the object represents may not be present. Be + sure to have a way to check this, if necessary. + + +A very simple (and naive) implementation of a device attribute is:: + + static ssize_t show_name(struct device *dev, struct device_attribute *attr, + char *buf) + { + return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name); + } + + static ssize_t store_name(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) + { + snprintf(dev->name, sizeof(dev->name), "%.*s", + (int)min(count, sizeof(dev->name) - 1), buf); + return count; + } + + static DEVICE_ATTR(name, S_IRUGO, show_name, store_name); + + +(Note that the real implementation doesn't allow userspace to set the +name for a device.) + + +Top Level Directory Layout +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The sysfs directory arrangement exposes the relationship of kernel +data structures. + +The top level sysfs directory looks like:: + + block/ + bus/ + class/ + dev/ + devices/ + firmware/ + net/ + fs/ + +devices/ contains a filesystem representation of the device tree. It maps +directly to the internal kernel device tree, which is a hierarchy of +struct device. + +bus/ contains flat directory layout of the various bus types in the +kernel. Each bus's directory contains two subdirectories:: + + devices/ + drivers/ + +devices/ contains symlinks for each device discovered in the system +that point to the device's directory under root/. + +drivers/ contains a directory for each device driver that is loaded +for devices on that particular bus (this assumes that drivers do not +span multiple bus types). + +fs/ contains a directory for some filesystems. Currently each +filesystem wanting to export attributes must create its own hierarchy +below fs/ (see ./fuse.txt for an example). + +dev/ contains two directories char/ and block/. Inside these two +directories there are symlinks named :. These symlinks +point to the sysfs directory for the given device. /sys/dev provides a +quick way to lookup the sysfs interface for a device from the result of +a stat(2) operation. + +More information can driver-model specific features can be found in +Documentation/driver-api/driver-model/. + + +TODO: Finish this section. + + +Current Interfaces +~~~~~~~~~~~~~~~~~~ + +The following interface layers currently exist in sysfs: + + +devices (include/linux/device.h) +-------------------------------- +Structure:: + + struct device_attribute { + struct attribute attr; + ssize_t (*show)(struct device *dev, struct device_attribute *attr, + char *buf); + ssize_t (*store)(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count); + }; + +Declaring:: + + DEVICE_ATTR(_name, _mode, _show, _store); + +Creation/Removal:: + + int device_create_file(struct device *dev, const struct device_attribute * attr); + void device_remove_file(struct device *dev, const struct device_attribute * attr); + + +bus drivers (include/linux/device.h) +------------------------------------ +Structure:: + + struct bus_attribute { + struct attribute attr; + ssize_t (*show)(struct bus_type *, char * buf); + ssize_t (*store)(struct bus_type *, const char * buf, size_t count); + }; + +Declaring:: + + static BUS_ATTR_RW(name); + static BUS_ATTR_RO(name); + static BUS_ATTR_WO(name); + +Creation/Removal:: + + int bus_create_file(struct bus_type *, struct bus_attribute *); + void bus_remove_file(struct bus_type *, struct bus_attribute *); + + +device drivers (include/linux/device.h) +--------------------------------------- + +Structure:: + + struct driver_attribute { + struct attribute attr; + ssize_t (*show)(struct device_driver *, char * buf); + ssize_t (*store)(struct device_driver *, const char * buf, + size_t count); + }; + +Declaring:: + + DRIVER_ATTR_RO(_name) + DRIVER_ATTR_RW(_name) + +Creation/Removal:: + + int driver_create_file(struct device_driver *, const struct driver_attribute *); + void driver_remove_file(struct device_driver *, const struct driver_attribute *); + + +Documentation +~~~~~~~~~~~~~ + +The sysfs directory structure and the attributes in each directory define an +ABI between the kernel and user space. As for any ABI, it is important that +this ABI is stable and properly documented. All new sysfs attributes must be +documented in Documentation/ABI. See also Documentation/ABI/README for more +information. diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt deleted file mode 100644 index ddf15b1b0d5a..000000000000 --- a/Documentation/filesystems/sysfs.txt +++ /dev/null @@ -1,408 +0,0 @@ - -sysfs - _The_ filesystem for exporting kernel objects. - -Patrick Mochel -Mike Murphy - -Revised: 16 August 2011 -Original: 10 January 2003 - - -What it is: -~~~~~~~~~~~ - -sysfs is a ram-based filesystem initially based on ramfs. It provides -a means to export kernel data structures, their attributes, and the -linkages between them to userspace. - -sysfs is tied inherently to the kobject infrastructure. Please read -Documentation/kobject.txt for more information concerning the kobject -interface. - - -Using sysfs -~~~~~~~~~~~ - -sysfs is always compiled in if CONFIG_SYSFS is defined. You can access -it by doing: - - mount -t sysfs sysfs /sys - - -Directory Creation -~~~~~~~~~~~~~~~~~~ - -For every kobject that is registered with the system, a directory is -created for it in sysfs. That directory is created as a subdirectory -of the kobject's parent, expressing internal object hierarchies to -userspace. Top-level directories in sysfs represent the common -ancestors of object hierarchies; i.e. the subsystems the objects -belong to. - -Sysfs internally stores a pointer to the kobject that implements a -directory in the kernfs_node object associated with the directory. In -the past this kobject pointer has been used by sysfs to do reference -counting directly on the kobject whenever the file is opened or closed. -With the current sysfs implementation the kobject reference count is -only modified directly by the function sysfs_schedule_callback(). - - -Attributes -~~~~~~~~~~ - -Attributes can be exported for kobjects in the form of regular files in -the filesystem. Sysfs forwards file I/O operations to methods defined -for the attributes, providing a means to read and write kernel -attributes. - -Attributes should be ASCII text files, preferably with only one value -per file. It is noted that it may not be efficient to contain only one -value per file, so it is socially acceptable to express an array of -values of the same type. - -Mixing types, expressing multiple lines of data, and doing fancy -formatting of data is heavily frowned upon. Doing these things may get -you publicly humiliated and your code rewritten without notice. - - -An attribute definition is simply: - -struct attribute { - char * name; - struct module *owner; - umode_t mode; -}; - - -int sysfs_create_file(struct kobject * kobj, const struct attribute * attr); -void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr); - - -A bare attribute contains no means to read or write the value of the -attribute. Subsystems are encouraged to define their own attribute -structure and wrapper functions for adding and removing attributes for -a specific object type. - -For example, the driver model defines struct device_attribute like: - -struct device_attribute { - struct attribute attr; - ssize_t (*show)(struct device *dev, struct device_attribute *attr, - char *buf); - ssize_t (*store)(struct device *dev, struct device_attribute *attr, - const char *buf, size_t count); -}; - -int device_create_file(struct device *, const struct device_attribute *); -void device_remove_file(struct device *, const struct device_attribute *); - -It also defines this helper for defining device attributes: - -#define DEVICE_ATTR(_name, _mode, _show, _store) \ -struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store) - -For example, declaring - -static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo); - -is equivalent to doing: - -static struct device_attribute dev_attr_foo = { - .attr = { - .name = "foo", - .mode = S_IWUSR | S_IRUGO, - }, - .show = show_foo, - .store = store_foo, -}; - -Note as stated in include/linux/kernel.h "OTHER_WRITABLE? Generally -considered a bad idea." so trying to set a sysfs file writable for -everyone will fail reverting to RO mode for "Others". - -For the common cases sysfs.h provides convenience macros to make -defining attributes easier as well as making code more concise and -readable. The above case could be shortened to: - -static struct device_attribute dev_attr_foo = __ATTR_RW(foo); - -the list of helpers available to define your wrapper function is: -__ATTR_RO(name): assumes default name_show and mode 0444 -__ATTR_WO(name): assumes a name_store only and is restricted to mode - 0200 that is root write access only. -__ATTR_RO_MODE(name, mode): fore more restrictive RO access currently - only use case is the EFI System Resource Table - (see drivers/firmware/efi/esrt.c) -__ATTR_RW(name): assumes default name_show, name_store and setting - mode to 0644. -__ATTR_NULL: which sets the name to NULL and is used as end of list - indicator (see: kernel/workqueue.c) - -Subsystem-Specific Callbacks -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -When a subsystem defines a new attribute type, it must implement a -set of sysfs operations for forwarding read and write calls to the -show and store methods of the attribute owners. - -struct sysfs_ops { - ssize_t (*show)(struct kobject *, struct attribute *, char *); - ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t); -}; - -[ Subsystems should have already defined a struct kobj_type as a -descriptor for this type, which is where the sysfs_ops pointer is -stored. See the kobject documentation for more information. ] - -When a file is read or written, sysfs calls the appropriate method -for the type. The method then translates the generic struct kobject -and struct attribute pointers to the appropriate pointer types, and -calls the associated methods. - - -To illustrate: - -#define to_dev(obj) container_of(obj, struct device, kobj) -#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr) - -static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, - char *buf) -{ - struct device_attribute *dev_attr = to_dev_attr(attr); - struct device *dev = to_dev(kobj); - ssize_t ret = -EIO; - - if (dev_attr->show) - ret = dev_attr->show(dev, dev_attr, buf); - if (ret >= (ssize_t)PAGE_SIZE) { - printk("dev_attr_show: %pS returned bad count\n", - dev_attr->show); - } - return ret; -} - - - -Reading/Writing Attribute Data -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -To read or write attributes, show() or store() methods must be -specified when declaring the attribute. The method types should be as -simple as those defined for device attributes: - -ssize_t (*show)(struct device *dev, struct device_attribute *attr, char *buf); -ssize_t (*store)(struct device *dev, struct device_attribute *attr, - const char *buf, size_t count); - -IOW, they should take only an object, an attribute, and a buffer as parameters. - - -sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the -method. Sysfs will call the method exactly once for each read or -write. This forces the following behavior on the method -implementations: - -- On read(2), the show() method should fill the entire buffer. - Recall that an attribute should only be exporting one value, or an - array of similar values, so this shouldn't be that expensive. - - This allows userspace to do partial reads and forward seeks - arbitrarily over the entire file at will. If userspace seeks back to - zero or does a pread(2) with an offset of '0' the show() method will - be called again, rearmed, to fill the buffer. - -- On write(2), sysfs expects the entire buffer to be passed during the - first write. Sysfs then passes the entire buffer to the store() method. - A terminating null is added after the data on stores. This makes - functions like sysfs_streq() safe to use. - - When writing sysfs files, userspace processes should first read the - entire file, modify the values it wishes to change, then write the - entire buffer back. - - Attribute method implementations should operate on an identical - buffer when reading and writing values. - -Other notes: - -- Writing causes the show() method to be rearmed regardless of current - file position. - -- The buffer will always be PAGE_SIZE bytes in length. On i386, this - is 4096. - -- show() methods should return the number of bytes printed into the - buffer. This is the return value of scnprintf(). - -- show() must not use snprintf() when formatting the value to be - returned to user space. If you can guarantee that an overflow - will never happen you can use sprintf() otherwise you must use - scnprintf(). - -- store() should return the number of bytes used from the buffer. If the - entire buffer has been used, just return the count argument. - -- show() or store() can always return errors. If a bad value comes - through, be sure to return an error. - -- The object passed to the methods will be pinned in memory via sysfs - referencing counting its embedded object. However, the physical - entity (e.g. device) the object represents may not be present. Be - sure to have a way to check this, if necessary. - - -A very simple (and naive) implementation of a device attribute is: - -static ssize_t show_name(struct device *dev, struct device_attribute *attr, - char *buf) -{ - return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name); -} - -static ssize_t store_name(struct device *dev, struct device_attribute *attr, - const char *buf, size_t count) -{ - snprintf(dev->name, sizeof(dev->name), "%.*s", - (int)min(count, sizeof(dev->name) - 1), buf); - return count; -} - -static DEVICE_ATTR(name, S_IRUGO, show_name, store_name); - - -(Note that the real implementation doesn't allow userspace to set the -name for a device.) - - -Top Level Directory Layout -~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The sysfs directory arrangement exposes the relationship of kernel -data structures. - -The top level sysfs directory looks like: - -block/ -bus/ -class/ -dev/ -devices/ -firmware/ -net/ -fs/ - -devices/ contains a filesystem representation of the device tree. It maps -directly to the internal kernel device tree, which is a hierarchy of -struct device. - -bus/ contains flat directory layout of the various bus types in the -kernel. Each bus's directory contains two subdirectories: - - devices/ - drivers/ - -devices/ contains symlinks for each device discovered in the system -that point to the device's directory under root/. - -drivers/ contains a directory for each device driver that is loaded -for devices on that particular bus (this assumes that drivers do not -span multiple bus types). - -fs/ contains a directory for some filesystems. Currently each -filesystem wanting to export attributes must create its own hierarchy -below fs/ (see ./fuse.txt for an example). - -dev/ contains two directories char/ and block/. Inside these two -directories there are symlinks named :. These symlinks -point to the sysfs directory for the given device. /sys/dev provides a -quick way to lookup the sysfs interface for a device from the result of -a stat(2) operation. - -More information can driver-model specific features can be found in -Documentation/driver-api/driver-model/. - - -TODO: Finish this section. - - -Current Interfaces -~~~~~~~~~~~~~~~~~~ - -The following interface layers currently exist in sysfs: - - -- devices (include/linux/device.h) ----------------------------------- -Structure: - -struct device_attribute { - struct attribute attr; - ssize_t (*show)(struct device *dev, struct device_attribute *attr, - char *buf); - ssize_t (*store)(struct device *dev, struct device_attribute *attr, - const char *buf, size_t count); -}; - -Declaring: - -DEVICE_ATTR(_name, _mode, _show, _store); - -Creation/Removal: - -int device_create_file(struct device *dev, const struct device_attribute * attr); -void device_remove_file(struct device *dev, const struct device_attribute * attr); - - -- bus drivers (include/linux/device.h) --------------------------------------- -Structure: - -struct bus_attribute { - struct attribute attr; - ssize_t (*show)(struct bus_type *, char * buf); - ssize_t (*store)(struct bus_type *, const char * buf, size_t count); -}; - -Declaring: - -static BUS_ATTR_RW(name); -static BUS_ATTR_RO(name); -static BUS_ATTR_WO(name); - -Creation/Removal: - -int bus_create_file(struct bus_type *, struct bus_attribute *); -void bus_remove_file(struct bus_type *, struct bus_attribute *); - - -- device drivers (include/linux/device.h) ------------------------------------------ - -Structure: - -struct driver_attribute { - struct attribute attr; - ssize_t (*show)(struct device_driver *, char * buf); - ssize_t (*store)(struct device_driver *, const char * buf, - size_t count); -}; - -Declaring: - -DRIVER_ATTR_RO(_name) -DRIVER_ATTR_RW(_name) - -Creation/Removal: - -int driver_create_file(struct device_driver *, const struct driver_attribute *); -void driver_remove_file(struct device_driver *, const struct driver_attribute *); - - -Documentation -~~~~~~~~~~~~~ - -The sysfs directory structure and the attributes in each directory define an -ABI between the kernel and user space. As for any ABI, it is important that -this ABI is stable and properly documented. All new sysfs attributes must be -documented in Documentation/ABI. See also Documentation/ABI/README for more -information. -- cgit From 826a613d3f81695022f324a5cb84fe73ec09e51d Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:25 +0100 Subject: docs: filesystems: convert sysv-fs.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/5b96a6efba95773af439ab25a7dbe4d0edf8c867.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/sysv-fs.rst | 264 ++++++++++++++++++++++++++++++++++ Documentation/filesystems/sysv-fs.txt | 197 ------------------------- 3 files changed, 265 insertions(+), 197 deletions(-) create mode 100644 Documentation/filesystems/sysv-fs.rst delete mode 100644 Documentation/filesystems/sysv-fs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index bafe92c72433..d583b8b35196 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -88,5 +88,6 @@ Documentation for filesystem implementations. romfs squashfs sysfs + sysv-fs virtiofs vfat diff --git a/Documentation/filesystems/sysv-fs.rst b/Documentation/filesystems/sysv-fs.rst new file mode 100644 index 000000000000..89e40911ad7c --- /dev/null +++ b/Documentation/filesystems/sysv-fs.rst @@ -0,0 +1,264 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================== +SystemV Filesystem +================== + +It implements all of + - Xenix FS, + - SystemV/386 FS, + - Coherent FS. + +To install: + +* Answer the 'System V and Coherent filesystem support' question with 'y' + when configuring the kernel. +* To mount a disk or a partition, use:: + + mount [-r] -t sysv device mountpoint + + The file system type names:: + + -t sysv + -t xenix + -t coherent + + may be used interchangeably, but the last two will eventually disappear. + +Bugs in the present implementation: + +- Coherent FS: + + - The "free list interleave" n:m is currently ignored. + - Only file systems with no filesystem name and no pack name are recognized. + (See Coherent "man mkfs" for a description of these features.) + +- SystemV Release 2 FS: + + The superblock is only searched in the blocks 9, 15, 18, which + corresponds to the beginning of track 1 on floppy disks. No support + for this FS on hard disk yet. + + +These filesystems are rather similar. Here is a comparison with Minix FS: + +* Linux fdisk reports on partitions + + - Minix FS 0x81 Linux/Minix + - Xenix FS ?? + - SystemV FS ?? + - Coherent FS 0x08 AIX bootable + +* Size of a block or zone (data allocation unit on disk) + + - Minix FS 1024 + - Xenix FS 1024 (also 512 ??) + - SystemV FS 1024 (also 512 and 2048) + - Coherent FS 512 + +* General layout: all have one boot block, one super block and + separate areas for inodes and for directories/data. + On SystemV Release 2 FS (e.g. Microport) the first track is reserved and + all the block numbers (including the super block) are offset by one track. + +* Byte ordering of "short" (16 bit entities) on disk: + + - Minix FS little endian 0 1 + - Xenix FS little endian 0 1 + - SystemV FS little endian 0 1 + - Coherent FS little endian 0 1 + + Of course, this affects only the file system, not the data of files on it! + +* Byte ordering of "long" (32 bit entities) on disk: + + - Minix FS little endian 0 1 2 3 + - Xenix FS little endian 0 1 2 3 + - SystemV FS little endian 0 1 2 3 + - Coherent FS PDP-11 2 3 0 1 + + Of course, this affects only the file system, not the data of files on it! + +* Inode on disk: "short", 0 means non-existent, the root dir ino is: + + ================================= == + Minix FS 1 + Xenix FS, SystemV FS, Coherent FS 2 + ================================= == + +* Maximum number of hard links to a file: + + =========== ========= + Minix FS 250 + Xenix FS ?? + SystemV FS ?? + Coherent FS >=10000 + =========== ========= + +* Free inode management: + + - Minix FS + a bitmap + - Xenix FS, SystemV FS, Coherent FS + There is a cache of a certain number of free inodes in the super-block. + When it is exhausted, new free inodes are found using a linear search. + +* Free block management: + + - Minix FS + a bitmap + - Xenix FS, SystemV FS, Coherent FS + Free blocks are organized in a "free list". Maybe a misleading term, + since it is not true that every free block contains a pointer to + the next free block. Rather, the free blocks are organized in chunks + of limited size, and every now and then a free block contains pointers + to the free blocks pertaining to the next chunk; the first of these + contains pointers and so on. The list terminates with a "block number" + 0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS. + +* Super-block location: + + =========== ========================== + Minix FS block 1 = bytes 1024..2047 + Xenix FS block 1 = bytes 1024..2047 + SystemV FS bytes 512..1023 + Coherent FS block 1 = bytes 512..1023 + =========== ========================== + +* Super-block layout: + + - Minix FS:: + + unsigned short s_ninodes; + unsigned short s_nzones; + unsigned short s_imap_blocks; + unsigned short s_zmap_blocks; + unsigned short s_firstdatazone; + unsigned short s_log_zone_size; + unsigned long s_max_size; + unsigned short s_magic; + + - Xenix FS, SystemV FS, Coherent FS:: + + unsigned short s_firstdatazone; + unsigned long s_nzones; + unsigned short s_fzone_count; + unsigned long s_fzones[NICFREE]; + unsigned short s_finode_count; + unsigned short s_finodes[NICINOD]; + char s_flock; + char s_ilock; + char s_modified; + char s_rdonly; + unsigned long s_time; + short s_dinfo[4]; -- SystemV FS only + unsigned long s_free_zones; + unsigned short s_free_inodes; + short s_dinfo[4]; -- Xenix FS only + unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only + char s_fname[6]; + char s_fpack[6]; + + then they differ considerably: + + Xenix FS:: + + char s_clean; + char s_fill[371]; + long s_magic; + long s_type; + + SystemV FS:: + + long s_fill[12 or 14]; + long s_state; + long s_magic; + long s_type; + + Coherent FS:: + + unsigned long s_unique; + + Note that Coherent FS has no magic. + +* Inode layout: + + - Minix FS:: + + unsigned short i_mode; + unsigned short i_uid; + unsigned long i_size; + unsigned long i_time; + unsigned char i_gid; + unsigned char i_nlinks; + unsigned short i_zone[7+1+1]; + + - Xenix FS, SystemV FS, Coherent FS:: + + unsigned short i_mode; + unsigned short i_nlink; + unsigned short i_uid; + unsigned short i_gid; + unsigned long i_size; + unsigned char i_zone[3*(10+1+1+1)]; + unsigned long i_atime; + unsigned long i_mtime; + unsigned long i_ctime; + + +* Regular file data blocks are organized as + + - Minix FS: + + - 7 direct blocks + - 1 indirect block (pointers to blocks) + - 1 double-indirect block (pointer to pointers to blocks) + + - Xenix FS, SystemV FS, Coherent FS: + + - 10 direct blocks + - 1 indirect block (pointers to blocks) + - 1 double-indirect block (pointer to pointers to blocks) + - 1 triple-indirect block (pointer to pointers to pointers to blocks) + + + =========== ========== ================ + Inode size inodes per block + =========== ========== ================ + Minix FS 32 32 + Xenix FS 64 16 + SystemV FS 64 16 + Coherent FS 64 8 + =========== ========== ================ + +* Directory entry on disk + + - Minix FS:: + + unsigned short inode; + char name[14/30]; + + - Xenix FS, SystemV FS, Coherent FS:: + + unsigned short inode; + char name[14]; + + =========== ============== ===================== + Dir entry size dir entries per block + =========== ============== ===================== + Minix FS 16/32 64/32 + Xenix FS 16 64 + SystemV FS 16 64 + Coherent FS 16 32 + =========== ============== ===================== + +* How to implement symbolic links such that the host fsck doesn't scream: + + - Minix FS normal + - Xenix FS kludge: as regular files with chmod 1000 + - SystemV FS ?? + - Coherent FS kludge: as regular files with chmod 1000 + + +Notation: We often speak of a "block" but mean a zone (the allocation unit) +and not the disk driver's notion of "block". diff --git a/Documentation/filesystems/sysv-fs.txt b/Documentation/filesystems/sysv-fs.txt deleted file mode 100644 index 253b50d1328e..000000000000 --- a/Documentation/filesystems/sysv-fs.txt +++ /dev/null @@ -1,197 +0,0 @@ -It implements all of - - Xenix FS, - - SystemV/386 FS, - - Coherent FS. - -To install: -* Answer the 'System V and Coherent filesystem support' question with 'y' - when configuring the kernel. -* To mount a disk or a partition, use - mount [-r] -t sysv device mountpoint - The file system type names - -t sysv - -t xenix - -t coherent - may be used interchangeably, but the last two will eventually disappear. - -Bugs in the present implementation: -- Coherent FS: - - The "free list interleave" n:m is currently ignored. - - Only file systems with no filesystem name and no pack name are recognized. - (See Coherent "man mkfs" for a description of these features.) -- SystemV Release 2 FS: - The superblock is only searched in the blocks 9, 15, 18, which - corresponds to the beginning of track 1 on floppy disks. No support - for this FS on hard disk yet. - - -These filesystems are rather similar. Here is a comparison with Minix FS: - -* Linux fdisk reports on partitions - - Minix FS 0x81 Linux/Minix - - Xenix FS ?? - - SystemV FS ?? - - Coherent FS 0x08 AIX bootable - -* Size of a block or zone (data allocation unit on disk) - - Minix FS 1024 - - Xenix FS 1024 (also 512 ??) - - SystemV FS 1024 (also 512 and 2048) - - Coherent FS 512 - -* General layout: all have one boot block, one super block and - separate areas for inodes and for directories/data. - On SystemV Release 2 FS (e.g. Microport) the first track is reserved and - all the block numbers (including the super block) are offset by one track. - -* Byte ordering of "short" (16 bit entities) on disk: - - Minix FS little endian 0 1 - - Xenix FS little endian 0 1 - - SystemV FS little endian 0 1 - - Coherent FS little endian 0 1 - Of course, this affects only the file system, not the data of files on it! - -* Byte ordering of "long" (32 bit entities) on disk: - - Minix FS little endian 0 1 2 3 - - Xenix FS little endian 0 1 2 3 - - SystemV FS little endian 0 1 2 3 - - Coherent FS PDP-11 2 3 0 1 - Of course, this affects only the file system, not the data of files on it! - -* Inode on disk: "short", 0 means non-existent, the root dir ino is: - - Minix FS 1 - - Xenix FS, SystemV FS, Coherent FS 2 - -* Maximum number of hard links to a file: - - Minix FS 250 - - Xenix FS ?? - - SystemV FS ?? - - Coherent FS >=10000 - -* Free inode management: - - Minix FS a bitmap - - Xenix FS, SystemV FS, Coherent FS - There is a cache of a certain number of free inodes in the super-block. - When it is exhausted, new free inodes are found using a linear search. - -* Free block management: - - Minix FS a bitmap - - Xenix FS, SystemV FS, Coherent FS - Free blocks are organized in a "free list". Maybe a misleading term, - since it is not true that every free block contains a pointer to - the next free block. Rather, the free blocks are organized in chunks - of limited size, and every now and then a free block contains pointers - to the free blocks pertaining to the next chunk; the first of these - contains pointers and so on. The list terminates with a "block number" - 0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS. - -* Super-block location: - - Minix FS block 1 = bytes 1024..2047 - - Xenix FS block 1 = bytes 1024..2047 - - SystemV FS bytes 512..1023 - - Coherent FS block 1 = bytes 512..1023 - -* Super-block layout: - - Minix FS - unsigned short s_ninodes; - unsigned short s_nzones; - unsigned short s_imap_blocks; - unsigned short s_zmap_blocks; - unsigned short s_firstdatazone; - unsigned short s_log_zone_size; - unsigned long s_max_size; - unsigned short s_magic; - - Xenix FS, SystemV FS, Coherent FS - unsigned short s_firstdatazone; - unsigned long s_nzones; - unsigned short s_fzone_count; - unsigned long s_fzones[NICFREE]; - unsigned short s_finode_count; - unsigned short s_finodes[NICINOD]; - char s_flock; - char s_ilock; - char s_modified; - char s_rdonly; - unsigned long s_time; - short s_dinfo[4]; -- SystemV FS only - unsigned long s_free_zones; - unsigned short s_free_inodes; - short s_dinfo[4]; -- Xenix FS only - unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only - char s_fname[6]; - char s_fpack[6]; - then they differ considerably: - Xenix FS - char s_clean; - char s_fill[371]; - long s_magic; - long s_type; - SystemV FS - long s_fill[12 or 14]; - long s_state; - long s_magic; - long s_type; - Coherent FS - unsigned long s_unique; - Note that Coherent FS has no magic. - -* Inode layout: - - Minix FS - unsigned short i_mode; - unsigned short i_uid; - unsigned long i_size; - unsigned long i_time; - unsigned char i_gid; - unsigned char i_nlinks; - unsigned short i_zone[7+1+1]; - - Xenix FS, SystemV FS, Coherent FS - unsigned short i_mode; - unsigned short i_nlink; - unsigned short i_uid; - unsigned short i_gid; - unsigned long i_size; - unsigned char i_zone[3*(10+1+1+1)]; - unsigned long i_atime; - unsigned long i_mtime; - unsigned long i_ctime; - -* Regular file data blocks are organized as - - Minix FS - 7 direct blocks - 1 indirect block (pointers to blocks) - 1 double-indirect block (pointer to pointers to blocks) - - Xenix FS, SystemV FS, Coherent FS - 10 direct blocks - 1 indirect block (pointers to blocks) - 1 double-indirect block (pointer to pointers to blocks) - 1 triple-indirect block (pointer to pointers to pointers to blocks) - -* Inode size, inodes per block - - Minix FS 32 32 - - Xenix FS 64 16 - - SystemV FS 64 16 - - Coherent FS 64 8 - -* Directory entry on disk - - Minix FS - unsigned short inode; - char name[14/30]; - - Xenix FS, SystemV FS, Coherent FS - unsigned short inode; - char name[14]; - -* Dir entry size, dir entries per block - - Minix FS 16/32 64/32 - - Xenix FS 16 64 - - SystemV FS 16 64 - - Coherent FS 16 32 - -* How to implement symbolic links such that the host fsck doesn't scream: - - Minix FS normal - - Xenix FS kludge: as regular files with chmod 1000 - - SystemV FS ?? - - Coherent FS kludge: as regular files with chmod 1000 - - -Notation: We often speak of a "block" but mean a zone (the allocation unit) -and not the disk driver's notion of "block". -- cgit From 7e7cd458b8105b02e69e3af2ef4cd186326d7f84 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:26 +0100 Subject: docs: filesystems: convert tmpfs.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Use :field: markup; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/30397a47a78ca59760fbc0fc5f50c5f1002d487a.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/tmpfs.rst | 163 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/tmpfs.txt | 149 -------------------------------- 3 files changed, 164 insertions(+), 149 deletions(-) create mode 100644 Documentation/filesystems/tmpfs.rst delete mode 100644 Documentation/filesystems/tmpfs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index d583b8b35196..27d37e7712da 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -89,5 +89,6 @@ Documentation for filesystem implementations. squashfs sysfs sysv-fs + tmpfs virtiofs vfat diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst new file mode 100644 index 000000000000..4e95929301a5 --- /dev/null +++ b/Documentation/filesystems/tmpfs.rst @@ -0,0 +1,163 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===== +Tmpfs +===== + +Tmpfs is a file system which keeps all files in virtual memory. + + +Everything in tmpfs is temporary in the sense that no files will be +created on your hard drive. If you unmount a tmpfs instance, +everything stored therein is lost. + +tmpfs puts everything into the kernel internal caches and grows and +shrinks to accommodate the files it contains and is able to swap +unneeded pages out to swap space. It has maximum size limits which can +be adjusted on the fly via 'mount -o remount ...' + +If you compare it to ramfs (which was the template to create tmpfs) +you gain swapping and limit checking. Another similar thing is the RAM +disk (/dev/ram*), which simulates a fixed size hard disk in physical +RAM, where you have to create an ordinary filesystem on top. Ramdisks +cannot swap and you do not have the possibility to resize them. + +Since tmpfs lives completely in the page cache and on swap, all tmpfs +pages will be shown as "Shmem" in /proc/meminfo and "Shared" in +free(1). Notice that these counters also include shared memory +(shmem, see ipcs(1)). The most reliable way to get the count is +using df(1) and du(1). + +tmpfs has the following uses: + +1) There is always a kernel internal mount which you will not see at + all. This is used for shared anonymous mappings and SYSV shared + memory. + + This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not + set, the user visible part of tmpfs is not build. But the internal + mechanisms are always present. + +2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for + POSIX shared memory (shm_open, shm_unlink). Adding the following + line to /etc/fstab should take care of this:: + + tmpfs /dev/shm tmpfs defaults 0 0 + + Remember to create the directory that you intend to mount tmpfs on + if necessary. + + This mount is _not_ needed for SYSV shared memory. The internal + mount is used for that. (In the 2.3 kernel versions it was + necessary to mount the predecessor of tmpfs (shm fs) to use SYSV + shared memory) + +3) Some people (including me) find it very convenient to mount it + e.g. on /tmp and /var/tmp and have a big swap partition. And now + loop mounts of tmpfs files do work, so mkinitrd shipped by most + distributions should succeed with a tmpfs /tmp. + +4) And probably a lot more I do not know about :-) + + +tmpfs has three mount options for sizing: + +========= ============================================================ +size The limit of allocated bytes for this tmpfs instance. The + default is half of your physical RAM without swap. If you + oversize your tmpfs instances the machine will deadlock + since the OOM handler will not be able to free that memory. +nr_blocks The same as size, but in blocks of PAGE_SIZE. +nr_inodes The maximum number of inodes for this instance. The default + is half of the number of your physical RAM pages, or (on a + machine with highmem) the number of lowmem RAM pages, + whichever is the lower. +========= ============================================================ + +These parameters accept a suffix k, m or g for kilo, mega and giga and +can be changed on remount. The size parameter also accepts a suffix % +to limit this tmpfs instance to that percentage of your physical RAM: +the default, when neither size nor nr_blocks is specified, is size=50% + +If nr_blocks=0 (or size=0), blocks will not be limited in that instance; +if nr_inodes=0, inodes will not be limited. It is generally unwise to +mount with such options, since it allows any user with write access to +use up all the memory on the machine; but enhances the scalability of +that instance in a system with many cpus making intensive use of it. + + +tmpfs has a mount option to set the NUMA memory allocation policy for +all files in that instance (if CONFIG_NUMA is enabled) - which can be +adjusted on the fly via 'mount -o remount ...' + +======================== ============================================== +mpol=default use the process allocation policy + (see set_mempolicy(2)) +mpol=prefer:Node prefers to allocate memory from the given Node +mpol=bind:NodeList allocates memory only from nodes in NodeList +mpol=interleave prefers to allocate from each node in turn +mpol=interleave:NodeList allocates from each node of NodeList in turn +mpol=local prefers to allocate memory from the local node +======================== ============================================== + +NodeList format is a comma-separated list of decimal numbers and ranges, +a range being two hyphen-separated decimal numbers, the smallest and +largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15 + +A memory policy with a valid NodeList will be saved, as specified, for +use at file creation time. When a task allocates a file in the file +system, the mount option memory policy will be applied with a NodeList, +if any, modified by the calling task's cpuset constraints +[See Documentation/admin-guide/cgroup-v1/cpusets.rst] and any optional flags, +listed below. If the resulting NodeLists is the empty set, the effective +memory policy for the file will revert to "default" policy. + +NUMA memory allocation policies have optional flags that can be used in +conjunction with their modes. These optional flags can be specified +when tmpfs is mounted by appending them to the mode before the NodeList. +See Documentation/admin-guide/mm/numa_memory_policy.rst for a list of +all available memory allocation policy mode flags and their effect on +memory policy. + +:: + + =static is equivalent to MPOL_F_STATIC_NODES + =relative is equivalent to MPOL_F_RELATIVE_NODES + +For example, mpol=bind=static:NodeList, is the equivalent of an +allocation policy of MPOL_BIND | MPOL_F_STATIC_NODES. + +Note that trying to mount a tmpfs with an mpol option will fail if the +running kernel does not support NUMA; and will fail if its nodelist +specifies a node which is not online. If your system relies on that +tmpfs being mounted, but from time to time runs a kernel built without +NUMA capability (perhaps a safe recovery kernel), or with fewer nodes +online, then it is advisable to omit the mpol option from automatic +mount options. It can be added later, when the tmpfs is already mounted +on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'. + + +To specify the initial root directory you can use the following mount +options: + +==== ================================== +mode The permissions as an octal number +uid The user id +gid The group id +==== ================================== + +These options do not have any effect on remount. You can change these +parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem. + + +So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs' +will give you tmpfs instance on /mytmpfs which can allocate 10GB +RAM/SWAP in 10240 inodes and it is only accessible by root. + + +:Author: + Christoph Rohland , 1.12.01 +:Updated: + Hugh Dickins, 4 June 2007 +:Updated: + KOSAKI Motohiro, 16 Mar 2010 diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt deleted file mode 100644 index 5ecbc03e6b2f..000000000000 --- a/Documentation/filesystems/tmpfs.txt +++ /dev/null @@ -1,149 +0,0 @@ -Tmpfs is a file system which keeps all files in virtual memory. - - -Everything in tmpfs is temporary in the sense that no files will be -created on your hard drive. If you unmount a tmpfs instance, -everything stored therein is lost. - -tmpfs puts everything into the kernel internal caches and grows and -shrinks to accommodate the files it contains and is able to swap -unneeded pages out to swap space. It has maximum size limits which can -be adjusted on the fly via 'mount -o remount ...' - -If you compare it to ramfs (which was the template to create tmpfs) -you gain swapping and limit checking. Another similar thing is the RAM -disk (/dev/ram*), which simulates a fixed size hard disk in physical -RAM, where you have to create an ordinary filesystem on top. Ramdisks -cannot swap and you do not have the possibility to resize them. - -Since tmpfs lives completely in the page cache and on swap, all tmpfs -pages will be shown as "Shmem" in /proc/meminfo and "Shared" in -free(1). Notice that these counters also include shared memory -(shmem, see ipcs(1)). The most reliable way to get the count is -using df(1) and du(1). - -tmpfs has the following uses: - -1) There is always a kernel internal mount which you will not see at - all. This is used for shared anonymous mappings and SYSV shared - memory. - - This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not - set, the user visible part of tmpfs is not build. But the internal - mechanisms are always present. - -2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for - POSIX shared memory (shm_open, shm_unlink). Adding the following - line to /etc/fstab should take care of this: - - tmpfs /dev/shm tmpfs defaults 0 0 - - Remember to create the directory that you intend to mount tmpfs on - if necessary. - - This mount is _not_ needed for SYSV shared memory. The internal - mount is used for that. (In the 2.3 kernel versions it was - necessary to mount the predecessor of tmpfs (shm fs) to use SYSV - shared memory) - -3) Some people (including me) find it very convenient to mount it - e.g. on /tmp and /var/tmp and have a big swap partition. And now - loop mounts of tmpfs files do work, so mkinitrd shipped by most - distributions should succeed with a tmpfs /tmp. - -4) And probably a lot more I do not know about :-) - - -tmpfs has three mount options for sizing: - -size: The limit of allocated bytes for this tmpfs instance. The - default is half of your physical RAM without swap. If you - oversize your tmpfs instances the machine will deadlock - since the OOM handler will not be able to free that memory. -nr_blocks: The same as size, but in blocks of PAGE_SIZE. -nr_inodes: The maximum number of inodes for this instance. The default - is half of the number of your physical RAM pages, or (on a - machine with highmem) the number of lowmem RAM pages, - whichever is the lower. - -These parameters accept a suffix k, m or g for kilo, mega and giga and -can be changed on remount. The size parameter also accepts a suffix % -to limit this tmpfs instance to that percentage of your physical RAM: -the default, when neither size nor nr_blocks is specified, is size=50% - -If nr_blocks=0 (or size=0), blocks will not be limited in that instance; -if nr_inodes=0, inodes will not be limited. It is generally unwise to -mount with such options, since it allows any user with write access to -use up all the memory on the machine; but enhances the scalability of -that instance in a system with many cpus making intensive use of it. - - -tmpfs has a mount option to set the NUMA memory allocation policy for -all files in that instance (if CONFIG_NUMA is enabled) - which can be -adjusted on the fly via 'mount -o remount ...' - -mpol=default use the process allocation policy - (see set_mempolicy(2)) -mpol=prefer:Node prefers to allocate memory from the given Node -mpol=bind:NodeList allocates memory only from nodes in NodeList -mpol=interleave prefers to allocate from each node in turn -mpol=interleave:NodeList allocates from each node of NodeList in turn -mpol=local prefers to allocate memory from the local node - -NodeList format is a comma-separated list of decimal numbers and ranges, -a range being two hyphen-separated decimal numbers, the smallest and -largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15 - -A memory policy with a valid NodeList will be saved, as specified, for -use at file creation time. When a task allocates a file in the file -system, the mount option memory policy will be applied with a NodeList, -if any, modified by the calling task's cpuset constraints -[See Documentation/admin-guide/cgroup-v1/cpusets.rst] and any optional flags, listed -below. If the resulting NodeLists is the empty set, the effective memory -policy for the file will revert to "default" policy. - -NUMA memory allocation policies have optional flags that can be used in -conjunction with their modes. These optional flags can be specified -when tmpfs is mounted by appending them to the mode before the NodeList. -See Documentation/admin-guide/mm/numa_memory_policy.rst for a list of -all available memory allocation policy mode flags and their effect on -memory policy. - - =static is equivalent to MPOL_F_STATIC_NODES - =relative is equivalent to MPOL_F_RELATIVE_NODES - -For example, mpol=bind=static:NodeList, is the equivalent of an -allocation policy of MPOL_BIND | MPOL_F_STATIC_NODES. - -Note that trying to mount a tmpfs with an mpol option will fail if the -running kernel does not support NUMA; and will fail if its nodelist -specifies a node which is not online. If your system relies on that -tmpfs being mounted, but from time to time runs a kernel built without -NUMA capability (perhaps a safe recovery kernel), or with fewer nodes -online, then it is advisable to omit the mpol option from automatic -mount options. It can be added later, when the tmpfs is already mounted -on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'. - - -To specify the initial root directory you can use the following mount -options: - -mode: The permissions as an octal number -uid: The user id -gid: The group id - -These options do not have any effect on remount. You can change these -parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem. - - -So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs' -will give you tmpfs instance on /mytmpfs which can allocate 10GB -RAM/SWAP in 10240 inodes and it is only accessible by root. - - -Author: - Christoph Rohland , 1.12.01 -Updated: - Hugh Dickins, 4 June 2007 -Updated: - KOSAKI Motohiro, 16 Mar 2010 -- cgit From 688f118e3139f81f813ba1896931cf8fad93430d Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:27 +0100 Subject: docs: filesystems: convert ubifs-authentication.rst.txt to ReST - Add a SPDX header; - Mark some literals as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/0c36091b6660cd372f994bd98e1264491d766c22.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/ubifs-authentication.rst | 10 ++++++---- 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 27d37e7712da..bb14738df358 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -90,5 +90,6 @@ Documentation for filesystem implementations. sysfs sysv-fs tmpfs + ubifs-authentication.rst virtiofs vfat diff --git a/Documentation/filesystems/ubifs-authentication.rst b/Documentation/filesystems/ubifs-authentication.rst index 6a9584f6ff46..16efd729bf7c 100644 --- a/Documentation/filesystems/ubifs-authentication.rst +++ b/Documentation/filesystems/ubifs-authentication.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + :orphan: .. UBIFS Authentication @@ -92,11 +94,11 @@ UBIFS Index & Tree Node Cache ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Basic on-flash UBIFS entities are called *nodes*. UBIFS knows different types -of nodes. Eg. data nodes (`struct ubifs_data_node`) which store chunks of file -contents or inode nodes (`struct ubifs_ino_node`) which represent VFS inodes. -Almost all types of nodes share a common header (`ubifs_ch`) containing basic +of nodes. Eg. data nodes (``struct ubifs_data_node``) which store chunks of file +contents or inode nodes (``struct ubifs_ino_node``) which represent VFS inodes. +Almost all types of nodes share a common header (``ubifs_ch``) containing basic information like node type, node length, a sequence number, etc. (see -`fs/ubifs/ubifs-media.h`in kernel source). Exceptions are entries of the LPT +``fs/ubifs/ubifs-media.h`` in kernel source). Exceptions are entries of the LPT and some less important node types like padding nodes which are used to pad unusable content at the end of LEBs. -- cgit From 38e56b4ec44139b5781d6ff13f1b422e4b38f0d4 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:28 +0100 Subject: docs: filesystems: convert ubifs.txt to ReST - Add a SPDX header; - Add a document title; - Adjust section titles; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add lists markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/9043dc2965cafc64e6a521e2317c00ecc8303bf6.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/ubifs.rst | 137 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/ubifs.txt | 126 --------------------------------- 3 files changed, 138 insertions(+), 126 deletions(-) create mode 100644 Documentation/filesystems/ubifs.rst delete mode 100644 Documentation/filesystems/ubifs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index bb14738df358..58d57c9bf922 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -90,6 +90,7 @@ Documentation for filesystem implementations. sysfs sysv-fs tmpfs + ubifs ubifs-authentication.rst virtiofs vfat diff --git a/Documentation/filesystems/ubifs.rst b/Documentation/filesystems/ubifs.rst new file mode 100644 index 000000000000..e6ee99762534 --- /dev/null +++ b/Documentation/filesystems/ubifs.rst @@ -0,0 +1,137 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +UBI File System +=============== + +Introduction +============ + +UBIFS file-system stands for UBI File System. UBI stands for "Unsorted +Block Images". UBIFS is a flash file system, which means it is designed +to work with flash devices. It is important to understand, that UBIFS +is completely different to any traditional file-system in Linux, like +Ext2, XFS, JFS, etc. UBIFS represents a separate class of file-systems +which work with MTD devices, not block devices. The other Linux +file-system of this class is JFFS2. + +To make it more clear, here is a small comparison of MTD devices and +block devices. + +1 MTD devices represent flash devices and they consist of eraseblocks of + rather large size, typically about 128KiB. Block devices consist of + small blocks, typically 512 bytes. +2 MTD devices support 3 main operations - read from some offset within an + eraseblock, write to some offset within an eraseblock, and erase a whole + eraseblock. Block devices support 2 main operations - read a whole + block and write a whole block. +3 The whole eraseblock has to be erased before it becomes possible to + re-write its contents. Blocks may be just re-written. +4 Eraseblocks become worn out after some number of erase cycles - + typically 100K-1G for SLC NAND and NOR flashes, and 1K-10K for MLC + NAND flashes. Blocks do not have the wear-out property. +5 Eraseblocks may become bad (only on NAND flashes) and software should + deal with this. Blocks on hard drives typically do not become bad, + because hardware has mechanisms to substitute bad blocks, at least in + modern LBA disks. + +It should be quite obvious why UBIFS is very different to traditional +file-systems. + +UBIFS works on top of UBI. UBI is a separate software layer which may be +found in drivers/mtd/ubi. UBI is basically a volume management and +wear-leveling layer. It provides so called UBI volumes which is a higher +level abstraction than a MTD device. The programming model of UBI devices +is very similar to MTD devices - they still consist of large eraseblocks, +they have read/write/erase operations, but UBI devices are devoid of +limitations like wear and bad blocks (items 4 and 5 in the above list). + +In a sense, UBIFS is a next generation of JFFS2 file-system, but it is +very different and incompatible to JFFS2. The following are the main +differences. + +* JFFS2 works on top of MTD devices, UBIFS depends on UBI and works on + top of UBI volumes. +* JFFS2 does not have on-media index and has to build it while mounting, + which requires full media scan. UBIFS maintains the FS indexing + information on the flash media and does not require full media scan, + so it mounts many times faster than JFFS2. +* JFFS2 is a write-through file-system, while UBIFS supports write-back, + which makes UBIFS much faster on writes. + +Similarly to JFFS2, UBIFS supports on-the-flight compression which makes +it possible to fit quite a lot of data to the flash. + +Similarly to JFFS2, UBIFS is tolerant of unclean reboots and power-cuts. +It does not need stuff like fsck.ext2. UBIFS automatically replays its +journal and recovers from crashes, ensuring that the on-flash data +structures are consistent. + +UBIFS scales logarithmically (most of the data structures it uses are +trees), so the mount time and memory consumption do not linearly depend +on the flash size, like in case of JFFS2. This is because UBIFS +maintains the FS index on the flash media. However, UBIFS depends on +UBI, which scales linearly. So overall UBI/UBIFS stack scales linearly. +Nevertheless, UBI/UBIFS scales considerably better than JFFS2. + +The authors of UBIFS believe, that it is possible to develop UBI2 which +would scale logarithmically as well. UBI2 would support the same API as UBI, +but it would be binary incompatible to UBI. So UBIFS would not need to be +changed to use UBI2 + + +Mount options +============= + +(*) == default. + +==================== ======================================================= +bulk_read read more in one go to take advantage of flash + media that read faster sequentially +no_bulk_read (*) do not bulk-read +no_chk_data_crc (*) skip checking of CRCs on data nodes in order to + improve read performance. Use this option only + if the flash media is highly reliable. The effect + of this option is that corruption of the contents + of a file can go unnoticed. +chk_data_crc do not skip checking CRCs on data nodes +compr=none override default compressor and set it to "none" +compr=lzo override default compressor and set it to "lzo" +compr=zlib override default compressor and set it to "zlib" +auth_key= specify the key used for authenticating the filesystem. + Passing this option makes authentication mandatory. + The passed key must be present in the kernel keyring + and must be of type 'logon' +auth_hash_name= The hash algorithm used for authentication. Used for + both hashing and for creating HMACs. Typical values + include "sha256" or "sha512" +==================== ======================================================= + + +Quick usage instructions +======================== + +The UBI volume to mount is specified using "ubiX_Y" or "ubiX:NAME" syntax, +where "X" is UBI device number, "Y" is UBI volume number, and "NAME" is +UBI volume name. + +Mount volume 0 on UBI device 0 to /mnt/ubifs:: + + $ mount -t ubifs ubi0_0 /mnt/ubifs + +Mount "rootfs" volume of UBI device 0 to /mnt/ubifs ("rootfs" is volume +name):: + + $ mount -t ubifs ubi0:rootfs /mnt/ubifs + +The following is an example of the kernel boot arguments to attach mtd0 +to UBI and mount volume "rootfs": +ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs + +References +========== + +UBIFS documentation and FAQ/HOWTO at the MTD web site: + +- http://www.linux-mtd.infradead.org/doc/ubifs.html +- http://www.linux-mtd.infradead.org/faq/ubifs.html diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.txt deleted file mode 100644 index acc80442a3bb..000000000000 --- a/Documentation/filesystems/ubifs.txt +++ /dev/null @@ -1,126 +0,0 @@ -Introduction -============= - -UBIFS file-system stands for UBI File System. UBI stands for "Unsorted -Block Images". UBIFS is a flash file system, which means it is designed -to work with flash devices. It is important to understand, that UBIFS -is completely different to any traditional file-system in Linux, like -Ext2, XFS, JFS, etc. UBIFS represents a separate class of file-systems -which work with MTD devices, not block devices. The other Linux -file-system of this class is JFFS2. - -To make it more clear, here is a small comparison of MTD devices and -block devices. - -1 MTD devices represent flash devices and they consist of eraseblocks of - rather large size, typically about 128KiB. Block devices consist of - small blocks, typically 512 bytes. -2 MTD devices support 3 main operations - read from some offset within an - eraseblock, write to some offset within an eraseblock, and erase a whole - eraseblock. Block devices support 2 main operations - read a whole - block and write a whole block. -3 The whole eraseblock has to be erased before it becomes possible to - re-write its contents. Blocks may be just re-written. -4 Eraseblocks become worn out after some number of erase cycles - - typically 100K-1G for SLC NAND and NOR flashes, and 1K-10K for MLC - NAND flashes. Blocks do not have the wear-out property. -5 Eraseblocks may become bad (only on NAND flashes) and software should - deal with this. Blocks on hard drives typically do not become bad, - because hardware has mechanisms to substitute bad blocks, at least in - modern LBA disks. - -It should be quite obvious why UBIFS is very different to traditional -file-systems. - -UBIFS works on top of UBI. UBI is a separate software layer which may be -found in drivers/mtd/ubi. UBI is basically a volume management and -wear-leveling layer. It provides so called UBI volumes which is a higher -level abstraction than a MTD device. The programming model of UBI devices -is very similar to MTD devices - they still consist of large eraseblocks, -they have read/write/erase operations, but UBI devices are devoid of -limitations like wear and bad blocks (items 4 and 5 in the above list). - -In a sense, UBIFS is a next generation of JFFS2 file-system, but it is -very different and incompatible to JFFS2. The following are the main -differences. - -* JFFS2 works on top of MTD devices, UBIFS depends on UBI and works on - top of UBI volumes. -* JFFS2 does not have on-media index and has to build it while mounting, - which requires full media scan. UBIFS maintains the FS indexing - information on the flash media and does not require full media scan, - so it mounts many times faster than JFFS2. -* JFFS2 is a write-through file-system, while UBIFS supports write-back, - which makes UBIFS much faster on writes. - -Similarly to JFFS2, UBIFS supports on-the-flight compression which makes -it possible to fit quite a lot of data to the flash. - -Similarly to JFFS2, UBIFS is tolerant of unclean reboots and power-cuts. -It does not need stuff like fsck.ext2. UBIFS automatically replays its -journal and recovers from crashes, ensuring that the on-flash data -structures are consistent. - -UBIFS scales logarithmically (most of the data structures it uses are -trees), so the mount time and memory consumption do not linearly depend -on the flash size, like in case of JFFS2. This is because UBIFS -maintains the FS index on the flash media. However, UBIFS depends on -UBI, which scales linearly. So overall UBI/UBIFS stack scales linearly. -Nevertheless, UBI/UBIFS scales considerably better than JFFS2. - -The authors of UBIFS believe, that it is possible to develop UBI2 which -would scale logarithmically as well. UBI2 would support the same API as UBI, -but it would be binary incompatible to UBI. So UBIFS would not need to be -changed to use UBI2 - - -Mount options -============= - -(*) == default. - -bulk_read read more in one go to take advantage of flash - media that read faster sequentially -no_bulk_read (*) do not bulk-read -no_chk_data_crc (*) skip checking of CRCs on data nodes in order to - improve read performance. Use this option only - if the flash media is highly reliable. The effect - of this option is that corruption of the contents - of a file can go unnoticed. -chk_data_crc do not skip checking CRCs on data nodes -compr=none override default compressor and set it to "none" -compr=lzo override default compressor and set it to "lzo" -compr=zlib override default compressor and set it to "zlib" -auth_key= specify the key used for authenticating the filesystem. - Passing this option makes authentication mandatory. - The passed key must be present in the kernel keyring - and must be of type 'logon' -auth_hash_name= The hash algorithm used for authentication. Used for - both hashing and for creating HMACs. Typical values - include "sha256" or "sha512" - - -Quick usage instructions -======================== - -The UBI volume to mount is specified using "ubiX_Y" or "ubiX:NAME" syntax, -where "X" is UBI device number, "Y" is UBI volume number, and "NAME" is -UBI volume name. - -Mount volume 0 on UBI device 0 to /mnt/ubifs: -$ mount -t ubifs ubi0_0 /mnt/ubifs - -Mount "rootfs" volume of UBI device 0 to /mnt/ubifs ("rootfs" is volume -name): -$ mount -t ubifs ubi0:rootfs /mnt/ubifs - -The following is an example of the kernel boot arguments to attach mtd0 -to UBI and mount volume "rootfs": -ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs - -References -========== - -UBIFS documentation and FAQ/HOWTO at the MTD web site: -http://www.linux-mtd.infradead.org/doc/ubifs.html -http://www.linux-mtd.infradead.org/faq/ubifs.html -- cgit From c9817ad5d82f04fbc66278eda27bff094dcb3119 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:29 +0100 Subject: docs: filesystems: convert udf.txt to ReST - Add a SPDX header; - Add a document title; - Add table markups; - Add lists markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Jan Kara Link: https://lore.kernel.org/r/2887f8a3a813a31170389eab687e9f199327dc7d.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/udf.rst | 75 +++++++++++++++++++++++++++++++++++++ Documentation/filesystems/udf.txt | 66 -------------------------------- 3 files changed, 76 insertions(+), 66 deletions(-) create mode 100644 Documentation/filesystems/udf.rst delete mode 100644 Documentation/filesystems/udf.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 58d57c9bf922..ec03cb4d7353 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -92,5 +92,6 @@ Documentation for filesystem implementations. tmpfs ubifs ubifs-authentication.rst + udf virtiofs vfat diff --git a/Documentation/filesystems/udf.rst b/Documentation/filesystems/udf.rst new file mode 100644 index 000000000000..d9badbf285b2 --- /dev/null +++ b/Documentation/filesystems/udf.rst @@ -0,0 +1,75 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +UDF file system +=============== + +If you encounter problems with reading UDF discs using this driver, +please report them according to MAINTAINERS file. + +Write support requires a block driver which supports writing. Currently +dvd+rw drives and media support true random sector writes, and so a udf +filesystem on such devices can be directly mounted read/write. CD-RW +media however, does not support this. Instead the media can be formatted +for packet mode using the utility cdrwtool, then the pktcdvd driver can +be bound to the underlying cd device to provide the required buffering +and read-modify-write cycles to allow the filesystem random sector writes +while providing the hardware with only full packet writes. While not +required for dvd+rw media, use of the pktcdvd driver often enhances +performance due to very poor read-modify-write support supplied internally +by drive firmware. + +------------------------------------------------------------------------------- + +The following mount options are supported: + + =========== ====================================== + gid= Set the default group. + umask= Set the default umask. + mode= Set the default file permissions. + dmode= Set the default directory permissions. + uid= Set the default user. + bs= Set the block size. + unhide Show otherwise hidden files. + undelete Show deleted files in lists. + adinicb Embed data in the inode (default) + noadinicb Don't embed data in the inode + shortad Use short ad's + longad Use long ad's (default) + nostrict Unset strict conformance + iocharset= Set the NLS character set + =========== ====================================== + +The uid= and gid= options need a bit more explaining. They will accept a +decimal numeric value and all inodes on that mount will then appear as +belonging to that uid and gid. Mount options also accept the string "forget". +The forget option causes all IDs to be written to disk as -1 which is a way +of UDF standard to indicate that IDs are not supported for these files . + +For typical desktop use of removable media, you should set the ID to that of +the interactively logged on user, and also specify the forget option. This way +the interactive user will always see the files on the disk as belonging to him. + +The remaining are for debugging and disaster recovery: + + ===== ================================ + novrs Skip volume sequence recognition + ===== ================================ + +The following expect a offset from 0. + + ========== ================================================= + session= Set the CDROM session (default= last session) + anchor= Override standard anchor location. (default= 256) + lastblock= Set the last block of the filesystem/ + ========== ================================================= + +------------------------------------------------------------------------------- + + +For the latest version and toolset see: + https://github.com/pali/udftools + +Documentation on UDF and ECMA 167 is available FREE from: + - http://www.osta.org/ + - http://www.ecma-international.org/ diff --git a/Documentation/filesystems/udf.txt b/Documentation/filesystems/udf.txt deleted file mode 100644 index e2f2faf32f18..000000000000 --- a/Documentation/filesystems/udf.txt +++ /dev/null @@ -1,66 +0,0 @@ -* -* Documentation/filesystems/udf.txt -* - -If you encounter problems with reading UDF discs using this driver, -please report them according to MAINTAINERS file. - -Write support requires a block driver which supports writing. Currently -dvd+rw drives and media support true random sector writes, and so a udf -filesystem on such devices can be directly mounted read/write. CD-RW -media however, does not support this. Instead the media can be formatted -for packet mode using the utility cdrwtool, then the pktcdvd driver can -be bound to the underlying cd device to provide the required buffering -and read-modify-write cycles to allow the filesystem random sector writes -while providing the hardware with only full packet writes. While not -required for dvd+rw media, use of the pktcdvd driver often enhances -performance due to very poor read-modify-write support supplied internally -by drive firmware. - -------------------------------------------------------------------------------- -The following mount options are supported: - - gid= Set the default group. - umask= Set the default umask. - mode= Set the default file permissions. - dmode= Set the default directory permissions. - uid= Set the default user. - bs= Set the block size. - unhide Show otherwise hidden files. - undelete Show deleted files in lists. - adinicb Embed data in the inode (default) - noadinicb Don't embed data in the inode - shortad Use short ad's - longad Use long ad's (default) - nostrict Unset strict conformance - iocharset= Set the NLS character set - -The uid= and gid= options need a bit more explaining. They will accept a -decimal numeric value and all inodes on that mount will then appear as -belonging to that uid and gid. Mount options also accept the string "forget". -The forget option causes all IDs to be written to disk as -1 which is a way -of UDF standard to indicate that IDs are not supported for these files . - -For typical desktop use of removable media, you should set the ID to that of -the interactively logged on user, and also specify the forget option. This way -the interactive user will always see the files on the disk as belonging to him. - -The remaining are for debugging and disaster recovery: - - novrs Skip volume sequence recognition - -The following expect a offset from 0. - - session= Set the CDROM session (default= last session) - anchor= Override standard anchor location. (default= 256) - lastblock= Set the last block of the filesystem/ - -------------------------------------------------------------------------------- - - -For the latest version and toolset see: - https://github.com/pali/udftools - -Documentation on UDF and ECMA 167 is available FREE from: - http://www.osta.org/ - http://www.ecma-international.org/ -- cgit From 9a6108124c1d27192fee6f058b5de84f51ab62a0 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:30 +0100 Subject: docs: filesystems: convert zonefs.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Damien Le Moal Link: https://lore.kernel.org/r/42a7cfcd19f6b904a9a3188fd4af71bed5050052.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/zonefs.rst | 412 +++++++++++++++++++++++++++++++++++ Documentation/filesystems/zonefs.txt | 404 ---------------------------------- 3 files changed, 413 insertions(+), 404 deletions(-) create mode 100644 Documentation/filesystems/zonefs.rst delete mode 100644 Documentation/filesystems/zonefs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index ec03cb4d7353..53f46a88e6ec 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -95,3 +95,4 @@ Documentation for filesystem implementations. udf virtiofs vfat + zonefs diff --git a/Documentation/filesystems/zonefs.rst b/Documentation/filesystems/zonefs.rst new file mode 100644 index 000000000000..7e733e751e98 --- /dev/null +++ b/Documentation/filesystems/zonefs.rst @@ -0,0 +1,412 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================================ +ZoneFS - Zone filesystem for Zoned block devices +================================================ + +Introduction +============ + +zonefs is a very simple file system exposing each zone of a zoned block device +as a file. Unlike a regular POSIX-compliant file system with native zoned block +device support (e.g. f2fs), zonefs does not hide the sequential write +constraint of zoned block devices to the user. Files representing sequential +write zones of the device must be written sequentially starting from the end +of the file (append only writes). + +As such, zonefs is in essence closer to a raw block device access interface +than to a full-featured POSIX file system. The goal of zonefs is to simplify +the implementation of zoned block device support in applications by replacing +raw block device file accesses with a richer file API, avoiding relying on +direct block device file ioctls which may be more obscure to developers. One +example of this approach is the implementation of LSM (log-structured merge) +tree structures (such as used in RocksDB and LevelDB) on zoned block devices +by allowing SSTables to be stored in a zone file similarly to a regular file +system rather than as a range of sectors of the entire disk. The introduction +of the higher level construct "one file is one zone" can help reducing the +amount of changes needed in the application as well as introducing support for +different application programming languages. + +Zoned block devices +------------------- + +Zoned storage devices belong to a class of storage devices with an address +space that is divided into zones. A zone is a group of consecutive LBAs and all +zones are contiguous (there are no LBA gaps). Zones may have different types. + +* Conventional zones: there are no access constraints to LBAs belonging to + conventional zones. Any read or write access can be executed, similarly to a + regular block device. +* Sequential zones: these zones accept random reads but must be written + sequentially. Each sequential zone has a write pointer maintained by the + device that keeps track of the mandatory start LBA position of the next write + to the device. As a result of this write constraint, LBAs in a sequential zone + cannot be overwritten. Sequential zones must first be erased using a special + command (zone reset) before rewriting. + +Zoned storage devices can be implemented using various recording and media +technologies. The most common form of zoned storage today uses the SCSI Zoned +Block Commands (ZBC) and Zoned ATA Commands (ZAC) interfaces on Shingled +Magnetic Recording (SMR) HDDs. + +Solid State Disks (SSD) storage devices can also implement a zoned interface +to, for instance, reduce internal write amplification due to garbage collection. +The NVMe Zoned NameSpace (ZNS) is a technical proposal of the NVMe standard +committee aiming at adding a zoned storage interface to the NVMe protocol. + +Zonefs Overview +=============== + +Zonefs exposes the zones of a zoned block device as files. The files +representing zones are grouped by zone type, which are themselves represented +by sub-directories. This file structure is built entirely using zone information +provided by the device and so does not require any complex on-disk metadata +structure. + +On-disk metadata +---------------- + +zonefs on-disk metadata is reduced to an immutable super block which +persistently stores a magic number and optional feature flags and values. On +mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration +and populates the mount point with a static file tree solely based on this +information. File sizes come from the device zone type and write pointer +position managed by the device itself. + +The super block is always written on disk at sector 0. The first zone of the +device storing the super block is never exposed as a zone file by zonefs. If +the zone containing the super block is a sequential zone, the mkzonefs format +tool always "finishes" the zone, that is, it transitions the zone to a full +state to make it read-only, preventing any data write. + +Zone type sub-directories +------------------------- + +Files representing zones of the same type are grouped together under the same +sub-directory automatically created on mount. + +For conventional zones, the sub-directory "cnv" is used. This directory is +however created if and only if the device has usable conventional zones. If +the device only has a single conventional zone at sector 0, the zone will not +be exposed as a file as it will be used to store the zonefs super block. For +such devices, the "cnv" sub-directory will not be created. + +For sequential write zones, the sub-directory "seq" is used. + +These two directories are the only directories that exist in zonefs. Users +cannot create other directories and cannot rename nor delete the "cnv" and +"seq" sub-directories. + +The size of the directories indicated by the st_size field of struct stat, +obtained with the stat() or fstat() system calls, indicates the number of files +existing under the directory. + +Zone files +---------- + +Zone files are named using the number of the zone they represent within the set +of zones of a particular type. That is, both the "cnv" and "seq" directories +contain files named "0", "1", "2", ... The file numbers also represent +increasing zone start sector on the device. + +All read and write operations to zone files are not allowed beyond the file +maximum size, that is, beyond the zone size. Any access exceeding the zone +size is failed with the -EFBIG error. + +Creating, deleting, renaming or modifying any attribute of files and +sub-directories is not allowed. + +The number of blocks of a file as reported by stat() and fstat() indicates the +size of the file zone, or in other words, the maximum file size. + +Conventional zone files +----------------------- + +The size of conventional zone files is fixed to the size of the zone they +represent. Conventional zone files cannot be truncated. + +These files can be randomly read and written using any type of I/O operation: +buffered I/Os, direct I/Os, memory mapped I/Os (mmap), etc. There are no I/O +constraint for these files beyond the file size limit mentioned above. + +Sequential zone files +--------------------- + +The size of sequential zone files grouped in the "seq" sub-directory represents +the file's zone write pointer position relative to the zone start sector. + +Sequential zone files can only be written sequentially, starting from the file +end, that is, write operations can only be append writes. Zonefs makes no +attempt at accepting random writes and will fail any write request that has a +start offset not corresponding to the end of the file, or to the end of the last +write issued and still in-flight (for asynchrnous I/O operations). + +Since dirty page writeback by the page cache does not guarantee a sequential +write pattern, zonefs prevents buffered writes and writeable shared mappings +on sequential files. Only direct I/O writes are accepted for these files. +zonefs relies on the sequential delivery of write I/O requests to the device +implemented by the block layer elevator. An elevator implementing the sequential +write feature for zoned block device (ELEVATOR_F_ZBD_SEQ_WRITE elevator feature) +must be used. This type of elevator (e.g. mq-deadline) is the set by default +for zoned block devices on device initialization. + +There are no restrictions on the type of I/O used for read operations in +sequential zone files. Buffered I/Os, direct I/Os and shared read mappings are +all accepted. + +Truncating sequential zone files is allowed only down to 0, in which case, the +zone is reset to rewind the file zone write pointer position to the start of +the zone, or up to the zone size, in which case the file's zone is transitioned +to the FULL state (finish zone operation). + +Format options +-------------- + +Several optional features of zonefs can be enabled at format time. + +* Conventional zone aggregation: ranges of contiguous conventional zones can be + aggregated into a single larger file instead of the default one file per zone. +* File ownership: The owner UID and GID of zone files is by default 0 (root) + but can be changed to any valid UID/GID. +* File access permissions: the default 640 access permissions can be changed. + +IO error handling +----------------- + +Zoned block devices may fail I/O requests for reasons similar to regular block +devices, e.g. due to bad sectors. However, in addition to such known I/O +failure pattern, the standards governing zoned block devices behavior define +additional conditions that result in I/O errors. + +* A zone may transition to the read-only condition (BLK_ZONE_COND_READONLY): + While the data already written in the zone is still readable, the zone can + no longer be written. No user action on the zone (zone management command or + read/write access) can change the zone condition back to a normal read/write + state. While the reasons for the device to transition a zone to read-only + state are not defined by the standards, a typical cause for such transition + would be a defective write head on an HDD (all zones under this head are + changed to read-only). + +* A zone may transition to the offline condition (BLK_ZONE_COND_OFFLINE): + An offline zone cannot be read nor written. No user action can transition an + offline zone back to an operational good state. Similarly to zone read-only + transitions, the reasons for a drive to transition a zone to the offline + condition are undefined. A typical cause would be a defective read-write head + on an HDD causing all zones on the platter under the broken head to be + inaccessible. + +* Unaligned write errors: These errors result from the host issuing write + requests with a start sector that does not correspond to a zone write pointer + position when the write request is executed by the device. Even though zonefs + enforces sequential file write for sequential zones, unaligned write errors + may still happen in the case of a partial failure of a very large direct I/O + operation split into multiple BIOs/requests or asynchronous I/O operations. + If one of the write request within the set of sequential write requests + issued to the device fails, all write requests after queued after it will + become unaligned and fail. + +* Delayed write errors: similarly to regular block devices, if the device side + write cache is enabled, write errors may occur in ranges of previously + completed writes when the device write cache is flushed, e.g. on fsync(). + Similarly to the previous immediate unaligned write error case, delayed write + errors can propagate through a stream of cached sequential data for a zone + causing all data to be dropped after the sector that caused the error. + +All I/O errors detected by zonefs are notified to the user with an error code +return for the system call that trigered or detected the error. The recovery +actions taken by zonefs in response to I/O errors depend on the I/O type (read +vs write) and on the reason for the error (bad sector, unaligned writes or zone +condition change). + +* For read I/O errors, zonefs does not execute any particular recovery action, + but only if the file zone is still in a good condition and there is no + inconsistency between the file inode size and its zone write pointer position. + If a problem is detected, I/O error recovery is executed (see below table). + +* For write I/O errors, zonefs I/O error recovery is always executed. + +* A zone condition change to read-only or offline also always triggers zonefs + I/O error recovery. + +Zonefs minimal I/O error recovery may change a file size and a file access +permissions. + +* File size changes: + Immediate or delayed write errors in a sequential zone file may cause the file + inode size to be inconsistent with the amount of data successfully written in + the file zone. For instance, the partial failure of a multi-BIO large write + operation will cause the zone write pointer to advance partially, even though + the entire write operation will be reported as failed to the user. In such + case, the file inode size must be advanced to reflect the zone write pointer + change and eventually allow the user to restart writing at the end of the + file. + A file size may also be reduced to reflect a delayed write error detected on + fsync(): in this case, the amount of data effectively written in the zone may + be less than originally indicated by the file inode size. After such I/O + error, zonefs always fixes a file inode size to reflect the amount of data + persistently stored in the file zone. + +* Access permission changes: + A zone condition change to read-only is indicated with a change in the file + access permissions to render the file read-only. This disables changes to the + file attributes and data modification. For offline zones, all permissions + (read and write) to the file are disabled. + +Further action taken by zonefs I/O error recovery can be controlled by the user +with the "errors=xxx" mount option. The table below summarizes the result of +zonefs I/O error processing depending on the mount option and on the zone +conditions:: + + +--------------+-----------+-----------------------------------------+ + | | | Post error state | + | "errors=xxx" | device | access permissions | + | mount | zone | file file device zone | + | option | condition | size read write read write | + +--------------+-----------+-----------------------------------------+ + | | good | fixed yes no yes yes | + | remount-ro | read-only | fixed yes no yes no | + | (default) | offline | 0 no no no no | + +--------------+-----------+-----------------------------------------+ + | | good | fixed yes no yes yes | + | zone-ro | read-only | fixed yes no yes no | + | | offline | 0 no no no no | + +--------------+-----------+-----------------------------------------+ + | | good | 0 no no yes yes | + | zone-offline | read-only | 0 no no yes no | + | | offline | 0 no no no no | + +--------------+-----------+-----------------------------------------+ + | | good | fixed yes yes yes yes | + | repair | read-only | fixed yes no yes no | + | | offline | 0 no no no no | + +--------------+-----------+-----------------------------------------+ + +Further notes: + +* The "errors=remount-ro" mount option is the default behavior of zonefs I/O + error processing if no errors mount option is specified. +* With the "errors=remount-ro" mount option, the change of the file access + permissions to read-only applies to all files. The file system is remounted + read-only. +* Access permission and file size changes due to the device transitioning zones + to the offline condition are permanent. Remounting or reformating the device + with mkfs.zonefs (mkzonefs) will not change back offline zone files to a good + state. +* File access permission changes to read-only due to the device transitioning + zones to the read-only condition are permanent. Remounting or reformating + the device will not re-enable file write access. +* File access permission changes implied by the remount-ro, zone-ro and + zone-offline mount options are temporary for zones in a good condition. + Unmounting and remounting the file system will restore the previous default + (format time values) access rights to the files affected. +* The repair mount option triggers only the minimal set of I/O error recovery + actions, that is, file size fixes for zones in a good condition. Zones + indicated as being read-only or offline by the device still imply changes to + the zone file access permissions as noted in the table above. + +Mount options +------------- + +zonefs define the "errors=" mount option to allow the user to specify +zonefs behavior in response to I/O errors, inode size inconsistencies or zone +condition chages. The defined behaviors are as follow: + +* remount-ro (default) +* zone-ro +* zone-offline +* repair + +The I/O error actions defined for each behavior is detailed in the previous +section. + +Zonefs User Space Tools +======================= + +The mkzonefs tool is used to format zoned block devices for use with zonefs. +This tool is available on Github at: + +https://github.com/damien-lemoal/zonefs-tools + +zonefs-tools also includes a test suite which can be run against any zoned +block device, including null_blk block device created with zoned mode. + +Examples +-------- + +The following formats a 15TB host-managed SMR HDD with 256 MB zones +with the conventional zones aggregation feature enabled:: + + # mkzonefs -o aggr_cnv /dev/sdX + # mount -t zonefs /dev/sdX /mnt + # ls -l /mnt/ + total 0 + dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv + dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq + +The size of the zone files sub-directories indicate the number of files +existing for each type of zones. In this example, there is only one +conventional zone file (all conventional zones are aggregated under a single +file):: + + # ls -l /mnt/cnv + total 137101312 + -rw-r----- 1 root root 140391743488 Nov 25 13:23 0 + +This aggregated conventional zone file can be used as a regular file:: + + # mkfs.ext4 /mnt/cnv/0 + # mount -o loop /mnt/cnv/0 /data + +The "seq" sub-directory grouping files for sequential write zones has in this +example 55356 zones:: + + # ls -lv /mnt/seq + total 14511243264 + -rw-r----- 1 root root 0 Nov 25 13:23 0 + -rw-r----- 1 root root 0 Nov 25 13:23 1 + -rw-r----- 1 root root 0 Nov 25 13:23 2 + ... + -rw-r----- 1 root root 0 Nov 25 13:23 55354 + -rw-r----- 1 root root 0 Nov 25 13:23 55355 + +For sequential write zone files, the file size changes as data is appended at +the end of the file, similarly to any regular file system:: + + # dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct + 1+0 records in + 1+0 records out + 4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s + + # ls -l /mnt/seq/0 + -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0 + +The written file can be truncated to the zone size, preventing any further +write operation:: + + # truncate -s 268435456 /mnt/seq/0 + # ls -l /mnt/seq/0 + -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0 + +Truncation to 0 size allows freeing the file zone storage space and restart +append-writes to the file:: + + # truncate -s 0 /mnt/seq/0 + # ls -l /mnt/seq/0 + -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0 + +Since files are statically mapped to zones on the disk, the number of blocks of +a file as reported by stat() and fstat() indicates the size of the file zone:: + + # stat /mnt/seq/0 + File: /mnt/seq/0 + Size: 0 Blocks: 524288 IO Block: 4096 regular empty file + Device: 870h/2160d Inode: 50431 Links: 1 + Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root) + Access: 2019-11-25 13:23:57.048971997 +0900 + Modify: 2019-11-25 13:52:25.553805765 +0900 + Change: 2019-11-25 13:52:25.553805765 +0900 + Birth: - + +The number of blocks of the file ("Blocks") in units of 512B blocks gives the +maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone +size in this example. Of note is that the "IO block" field always indicates the +minimum I/O size for writes and corresponds to the device physical sector size. diff --git a/Documentation/filesystems/zonefs.txt b/Documentation/filesystems/zonefs.txt deleted file mode 100644 index 935bf22031ca..000000000000 --- a/Documentation/filesystems/zonefs.txt +++ /dev/null @@ -1,404 +0,0 @@ -ZoneFS - Zone filesystem for Zoned block devices - -Introduction -============ - -zonefs is a very simple file system exposing each zone of a zoned block device -as a file. Unlike a regular POSIX-compliant file system with native zoned block -device support (e.g. f2fs), zonefs does not hide the sequential write -constraint of zoned block devices to the user. Files representing sequential -write zones of the device must be written sequentially starting from the end -of the file (append only writes). - -As such, zonefs is in essence closer to a raw block device access interface -than to a full-featured POSIX file system. The goal of zonefs is to simplify -the implementation of zoned block device support in applications by replacing -raw block device file accesses with a richer file API, avoiding relying on -direct block device file ioctls which may be more obscure to developers. One -example of this approach is the implementation of LSM (log-structured merge) -tree structures (such as used in RocksDB and LevelDB) on zoned block devices -by allowing SSTables to be stored in a zone file similarly to a regular file -system rather than as a range of sectors of the entire disk. The introduction -of the higher level construct "one file is one zone" can help reducing the -amount of changes needed in the application as well as introducing support for -different application programming languages. - -Zoned block devices -------------------- - -Zoned storage devices belong to a class of storage devices with an address -space that is divided into zones. A zone is a group of consecutive LBAs and all -zones are contiguous (there are no LBA gaps). Zones may have different types. -* Conventional zones: there are no access constraints to LBAs belonging to - conventional zones. Any read or write access can be executed, similarly to a - regular block device. -* Sequential zones: these zones accept random reads but must be written - sequentially. Each sequential zone has a write pointer maintained by the - device that keeps track of the mandatory start LBA position of the next write - to the device. As a result of this write constraint, LBAs in a sequential zone - cannot be overwritten. Sequential zones must first be erased using a special - command (zone reset) before rewriting. - -Zoned storage devices can be implemented using various recording and media -technologies. The most common form of zoned storage today uses the SCSI Zoned -Block Commands (ZBC) and Zoned ATA Commands (ZAC) interfaces on Shingled -Magnetic Recording (SMR) HDDs. - -Solid State Disks (SSD) storage devices can also implement a zoned interface -to, for instance, reduce internal write amplification due to garbage collection. -The NVMe Zoned NameSpace (ZNS) is a technical proposal of the NVMe standard -committee aiming at adding a zoned storage interface to the NVMe protocol. - -Zonefs Overview -=============== - -Zonefs exposes the zones of a zoned block device as files. The files -representing zones are grouped by zone type, which are themselves represented -by sub-directories. This file structure is built entirely using zone information -provided by the device and so does not require any complex on-disk metadata -structure. - -On-disk metadata ----------------- - -zonefs on-disk metadata is reduced to an immutable super block which -persistently stores a magic number and optional feature flags and values. On -mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration -and populates the mount point with a static file tree solely based on this -information. File sizes come from the device zone type and write pointer -position managed by the device itself. - -The super block is always written on disk at sector 0. The first zone of the -device storing the super block is never exposed as a zone file by zonefs. If -the zone containing the super block is a sequential zone, the mkzonefs format -tool always "finishes" the zone, that is, it transitions the zone to a full -state to make it read-only, preventing any data write. - -Zone type sub-directories -------------------------- - -Files representing zones of the same type are grouped together under the same -sub-directory automatically created on mount. - -For conventional zones, the sub-directory "cnv" is used. This directory is -however created if and only if the device has usable conventional zones. If -the device only has a single conventional zone at sector 0, the zone will not -be exposed as a file as it will be used to store the zonefs super block. For -such devices, the "cnv" sub-directory will not be created. - -For sequential write zones, the sub-directory "seq" is used. - -These two directories are the only directories that exist in zonefs. Users -cannot create other directories and cannot rename nor delete the "cnv" and -"seq" sub-directories. - -The size of the directories indicated by the st_size field of struct stat, -obtained with the stat() or fstat() system calls, indicates the number of files -existing under the directory. - -Zone files ----------- - -Zone files are named using the number of the zone they represent within the set -of zones of a particular type. That is, both the "cnv" and "seq" directories -contain files named "0", "1", "2", ... The file numbers also represent -increasing zone start sector on the device. - -All read and write operations to zone files are not allowed beyond the file -maximum size, that is, beyond the zone size. Any access exceeding the zone -size is failed with the -EFBIG error. - -Creating, deleting, renaming or modifying any attribute of files and -sub-directories is not allowed. - -The number of blocks of a file as reported by stat() and fstat() indicates the -size of the file zone, or in other words, the maximum file size. - -Conventional zone files ------------------------ - -The size of conventional zone files is fixed to the size of the zone they -represent. Conventional zone files cannot be truncated. - -These files can be randomly read and written using any type of I/O operation: -buffered I/Os, direct I/Os, memory mapped I/Os (mmap), etc. There are no I/O -constraint for these files beyond the file size limit mentioned above. - -Sequential zone files ---------------------- - -The size of sequential zone files grouped in the "seq" sub-directory represents -the file's zone write pointer position relative to the zone start sector. - -Sequential zone files can only be written sequentially, starting from the file -end, that is, write operations can only be append writes. Zonefs makes no -attempt at accepting random writes and will fail any write request that has a -start offset not corresponding to the end of the file, or to the end of the last -write issued and still in-flight (for asynchrnous I/O operations). - -Since dirty page writeback by the page cache does not guarantee a sequential -write pattern, zonefs prevents buffered writes and writeable shared mappings -on sequential files. Only direct I/O writes are accepted for these files. -zonefs relies on the sequential delivery of write I/O requests to the device -implemented by the block layer elevator. An elevator implementing the sequential -write feature for zoned block device (ELEVATOR_F_ZBD_SEQ_WRITE elevator feature) -must be used. This type of elevator (e.g. mq-deadline) is the set by default -for zoned block devices on device initialization. - -There are no restrictions on the type of I/O used for read operations in -sequential zone files. Buffered I/Os, direct I/Os and shared read mappings are -all accepted. - -Truncating sequential zone files is allowed only down to 0, in which case, the -zone is reset to rewind the file zone write pointer position to the start of -the zone, or up to the zone size, in which case the file's zone is transitioned -to the FULL state (finish zone operation). - -Format options --------------- - -Several optional features of zonefs can be enabled at format time. -* Conventional zone aggregation: ranges of contiguous conventional zones can be - aggregated into a single larger file instead of the default one file per zone. -* File ownership: The owner UID and GID of zone files is by default 0 (root) - but can be changed to any valid UID/GID. -* File access permissions: the default 640 access permissions can be changed. - -IO error handling ------------------ - -Zoned block devices may fail I/O requests for reasons similar to regular block -devices, e.g. due to bad sectors. However, in addition to such known I/O -failure pattern, the standards governing zoned block devices behavior define -additional conditions that result in I/O errors. - -* A zone may transition to the read-only condition (BLK_ZONE_COND_READONLY): - While the data already written in the zone is still readable, the zone can - no longer be written. No user action on the zone (zone management command or - read/write access) can change the zone condition back to a normal read/write - state. While the reasons for the device to transition a zone to read-only - state are not defined by the standards, a typical cause for such transition - would be a defective write head on an HDD (all zones under this head are - changed to read-only). - -* A zone may transition to the offline condition (BLK_ZONE_COND_OFFLINE): - An offline zone cannot be read nor written. No user action can transition an - offline zone back to an operational good state. Similarly to zone read-only - transitions, the reasons for a drive to transition a zone to the offline - condition are undefined. A typical cause would be a defective read-write head - on an HDD causing all zones on the platter under the broken head to be - inaccessible. - -* Unaligned write errors: These errors result from the host issuing write - requests with a start sector that does not correspond to a zone write pointer - position when the write request is executed by the device. Even though zonefs - enforces sequential file write for sequential zones, unaligned write errors - may still happen in the case of a partial failure of a very large direct I/O - operation split into multiple BIOs/requests or asynchronous I/O operations. - If one of the write request within the set of sequential write requests - issued to the device fails, all write requests after queued after it will - become unaligned and fail. - -* Delayed write errors: similarly to regular block devices, if the device side - write cache is enabled, write errors may occur in ranges of previously - completed writes when the device write cache is flushed, e.g. on fsync(). - Similarly to the previous immediate unaligned write error case, delayed write - errors can propagate through a stream of cached sequential data for a zone - causing all data to be dropped after the sector that caused the error. - -All I/O errors detected by zonefs are notified to the user with an error code -return for the system call that trigered or detected the error. The recovery -actions taken by zonefs in response to I/O errors depend on the I/O type (read -vs write) and on the reason for the error (bad sector, unaligned writes or zone -condition change). - -* For read I/O errors, zonefs does not execute any particular recovery action, - but only if the file zone is still in a good condition and there is no - inconsistency between the file inode size and its zone write pointer position. - If a problem is detected, I/O error recovery is executed (see below table). - -* For write I/O errors, zonefs I/O error recovery is always executed. - -* A zone condition change to read-only or offline also always triggers zonefs - I/O error recovery. - -Zonefs minimal I/O error recovery may change a file size and a file access -permissions. - -* File size changes: - Immediate or delayed write errors in a sequential zone file may cause the file - inode size to be inconsistent with the amount of data successfully written in - the file zone. For instance, the partial failure of a multi-BIO large write - operation will cause the zone write pointer to advance partially, even though - the entire write operation will be reported as failed to the user. In such - case, the file inode size must be advanced to reflect the zone write pointer - change and eventually allow the user to restart writing at the end of the - file. - A file size may also be reduced to reflect a delayed write error detected on - fsync(): in this case, the amount of data effectively written in the zone may - be less than originally indicated by the file inode size. After such I/O - error, zonefs always fixes a file inode size to reflect the amount of data - persistently stored in the file zone. - -* Access permission changes: - A zone condition change to read-only is indicated with a change in the file - access permissions to render the file read-only. This disables changes to the - file attributes and data modification. For offline zones, all permissions - (read and write) to the file are disabled. - -Further action taken by zonefs I/O error recovery can be controlled by the user -with the "errors=xxx" mount option. The table below summarizes the result of -zonefs I/O error processing depending on the mount option and on the zone -conditions. - - +--------------+-----------+-----------------------------------------+ - | | | Post error state | - | "errors=xxx" | device | access permissions | - | mount | zone | file file device zone | - | option | condition | size read write read write | - +--------------+-----------+-----------------------------------------+ - | | good | fixed yes no yes yes | - | remount-ro | read-only | fixed yes no yes no | - | (default) | offline | 0 no no no no | - +--------------+-----------+-----------------------------------------+ - | | good | fixed yes no yes yes | - | zone-ro | read-only | fixed yes no yes no | - | | offline | 0 no no no no | - +--------------+-----------+-----------------------------------------+ - | | good | 0 no no yes yes | - | zone-offline | read-only | 0 no no yes no | - | | offline | 0 no no no no | - +--------------+-----------+-----------------------------------------+ - | | good | fixed yes yes yes yes | - | repair | read-only | fixed yes no yes no | - | | offline | 0 no no no no | - +--------------+-----------+-----------------------------------------+ - -Further notes: -* The "errors=remount-ro" mount option is the default behavior of zonefs I/O - error processing if no errors mount option is specified. -* With the "errors=remount-ro" mount option, the change of the file access - permissions to read-only applies to all files. The file system is remounted - read-only. -* Access permission and file size changes due to the device transitioning zones - to the offline condition are permanent. Remounting or reformating the device - with mkfs.zonefs (mkzonefs) will not change back offline zone files to a good - state. -* File access permission changes to read-only due to the device transitioning - zones to the read-only condition are permanent. Remounting or reformating - the device will not re-enable file write access. -* File access permission changes implied by the remount-ro, zone-ro and - zone-offline mount options are temporary for zones in a good condition. - Unmounting and remounting the file system will restore the previous default - (format time values) access rights to the files affected. -* The repair mount option triggers only the minimal set of I/O error recovery - actions, that is, file size fixes for zones in a good condition. Zones - indicated as being read-only or offline by the device still imply changes to - the zone file access permissions as noted in the table above. - -Mount options -------------- - -zonefs define the "errors=" mount option to allow the user to specify -zonefs behavior in response to I/O errors, inode size inconsistencies or zone -condition chages. The defined behaviors are as follow: -* remount-ro (default) -* zone-ro -* zone-offline -* repair - -The I/O error actions defined for each behavior is detailed in the previous -section. - -Zonefs User Space Tools -======================= - -The mkzonefs tool is used to format zoned block devices for use with zonefs. -This tool is available on Github at: - -https://github.com/damien-lemoal/zonefs-tools - -zonefs-tools also includes a test suite which can be run against any zoned -block device, including null_blk block device created with zoned mode. - -Examples --------- - -The following formats a 15TB host-managed SMR HDD with 256 MB zones -with the conventional zones aggregation feature enabled. - -# mkzonefs -o aggr_cnv /dev/sdX -# mount -t zonefs /dev/sdX /mnt -# ls -l /mnt/ -total 0 -dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv -dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq - -The size of the zone files sub-directories indicate the number of files -existing for each type of zones. In this example, there is only one -conventional zone file (all conventional zones are aggregated under a single -file). - -# ls -l /mnt/cnv -total 137101312 --rw-r----- 1 root root 140391743488 Nov 25 13:23 0 - -This aggregated conventional zone file can be used as a regular file. - -# mkfs.ext4 /mnt/cnv/0 -# mount -o loop /mnt/cnv/0 /data - -The "seq" sub-directory grouping files for sequential write zones has in this -example 55356 zones. - -# ls -lv /mnt/seq -total 14511243264 --rw-r----- 1 root root 0 Nov 25 13:23 0 --rw-r----- 1 root root 0 Nov 25 13:23 1 --rw-r----- 1 root root 0 Nov 25 13:23 2 -... --rw-r----- 1 root root 0 Nov 25 13:23 55354 --rw-r----- 1 root root 0 Nov 25 13:23 55355 - -For sequential write zone files, the file size changes as data is appended at -the end of the file, similarly to any regular file system. - -# dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct -1+0 records in -1+0 records out -4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s - -# ls -l /mnt/seq/0 --rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0 - -The written file can be truncated to the zone size, preventing any further -write operation. - -# truncate -s 268435456 /mnt/seq/0 -# ls -l /mnt/seq/0 --rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0 - -Truncation to 0 size allows freeing the file zone storage space and restart -append-writes to the file. - -# truncate -s 0 /mnt/seq/0 -# ls -l /mnt/seq/0 --rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0 - -Since files are statically mapped to zones on the disk, the number of blocks of -a file as reported by stat() and fstat() indicates the size of the file zone. - -# stat /mnt/seq/0 - File: /mnt/seq/0 - Size: 0 Blocks: 524288 IO Block: 4096 regular empty file -Device: 870h/2160d Inode: 50431 Links: 1 -Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root) -Access: 2019-11-25 13:23:57.048971997 +0900 -Modify: 2019-11-25 13:52:25.553805765 +0900 -Change: 2019-11-25 13:52:25.553805765 +0900 - Birth: - - -The number of blocks of the file ("Blocks") in units of 512B blocks gives the -maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone -size in this example. Of note is that the "IO block" field always indicates the -minimum I/O size for writes and corresponds to the device physical sector size. -- cgit