Diffstat (limited to 'Documentation/filesystems')
| -rw-r--r-- | Documentation/filesystems/ext4.txt | 125 |
| -rw-r--r-- | Documentation/filesystems/gfs2-glocks.txt | 114 |
| -rw-r--r-- | Documentation/filesystems/proc.txt | 29 |
| -rw-r--r-- | Documentation/filesystems/ubifs.txt | 164 |
4 files changed, 371 insertions, 61 deletions
| diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 0c5086db8352..80e193d82e2e 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt @@ -13,72 +13,93 @@ Mailing list: [email protected]  1. Quick usage instructions:  =========================== -  - Grab updated e2fsprogs from -    ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/ -    This is a patchset on top of e2fsprogs-1.39, which can be found at +  - Compile and install the latest version of e2fsprogs (as of this +    writing version 1.41) from: + +    http://sourceforge.net/project/showfiles.php?group_id=2406 +	 +	or +      ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/ -  - It's still mke2fs -j /dev/hda1 +	or grab the latest git repository from: + +    git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git + +  - Create a new filesystem using the ext4dev filesystem type: + +    	# mke2fs -t ext4dev /dev/hda1 + +    Or configure an existing ext3 filesystem to support extents and set +    the test_fs flag to indicate that it's ok for an in-development +    filesystem to touch this filesystem: -  - mount /dev/hda1 /wherever -t ext4dev +	# tune2fs -O extents -E test_fs /dev/hda1 -  - To enable extents, +    If the filesystem was created with 128 byte inodes, it can be +    converted to use 256 byte for greater efficiency via: -	mount /dev/hda1 /wherever -t ext4dev -o extents +        # tune2fs -I 256 /dev/hda1 -  - The filesystem is compatible with the ext3 driver until you add a file -    which has extents (ie: `mount -o extents', then create a file). +    (Note: we currently do not have tools to convert an ext4dev +    filesystem back to ext3; so please do not do try this on production +    filesystems.) -    NOTE: The "extents" mount flag is temporary.  
It will soon go away and -    extents will be enabled by the "-o extents" flag to mke2fs or tune2fs +  - Mounting: + +	# mount -t ext4dev /dev/hda1 /wherever    - When comparing performance with other filesystems, remember that -    ext3/4 by default offers higher data integrity guarantees than most.  So -    when comparing with a metadata-only journalling filesystem, use `mount -o -    data=writeback'.  And you might as well use `mount -o nobh' too along -    with it.  Making the journal larger than the mke2fs default often helps -    performance with metadata-intensive workloads. +    ext3/4 by default offers higher data integrity guarantees than most. +    So when comparing with a metadata-only journalling filesystem, such +    as ext3, use `mount -o data=writeback'.  And you might as well use +    `mount -o nobh' too along with it.  Making the journal larger than +    the mke2fs default often helps performance with metadata-intensive +    workloads.  2. Features  ===========  2.1 Currently available -* ability to use filesystems > 16TB +* ability to use filesystems > 16TB (e2fsprogs support not available yet)  * extent format reduces metadata overhead (RAM, IO for access, transactions)  * extent format more robust in face of on-disk corruption due to magics,  * internal redunancy in tree - -2.1 Previously available, soon to be enabled by default by "mkefs.ext4": - -* dir_index and resize inode will be on by default -* large inodes will be used by default for fast EAs, nsec timestamps, etc +* improved file allocation (multi-block alloc) +* fix 32000 subdirectory limit +* nsec timestamps for mtime, atime, ctime, create time +* inode version field on disk (NFSv4, Lustre) +* reduced e2fsck time via uninit_bg feature +* journal checksumming for robustness, performance +* persistent file preallocation (e.g for streaming media, databases) +* ability to pack bitmaps and inode tables into larger virtual groups via the +  flex_bg feature +* large file support +* Inode 
allocation using large virtual block groups via flex_bg +* delayed allocation +* large block (up to pagesize) support +* efficent new ordered mode in JBD2 and ext4(avoid using buffer head to force +  the ordering)  2.2 Candidate features for future inclusion -There are several under discussion, whether they all make it in is -partly a function of how much time everyone has to work on them: +* Online defrag (patches available but not well tested) +* reduced mke2fs time via lazy itable initialization in conjuction with +  the uninit_bg feature (capability to do this is available in e2fsprogs +  but a kernel thread to do lazy zeroing of unused inode table blocks +  after filesystem is first mounted is required for safety) -* improved file allocation (multi-block alloc, delayed alloc; basically done) -* fix 32000 subdirectory limit (patch exists, needs some e2fsck work) -* nsec timestamps for mtime, atime, ctime, create time (patch exists, -  needs some e2fsck work) -* inode version field on disk (NFSv4, Lustre; prototype exists) -* reduced mke2fs/e2fsck time via uninitialized groups (prototype exists) -* journal checksumming for robustness, performance (prototype exists) -* persistent file preallocation (e.g for streaming media, databases) +There are several others under discussion, whether they all make it in is +partly a function of how much time everyone has to work on them. Features like +metadata checksumming have been discussed and planned for a bit but no patches +exist yet so I'm not sure they're in the near-term roadmap. -Features like metadata checksumming have been discussed and planned for -a bit but no patches exist yet so I'm not sure they're in the near-term -roadmap. +The big performance win will come with mballoc, delalloc and flex_bg +grouping of bitmaps and inode tables.  Some test results available here: -The big performance win will come with mballoc and delalloc.  
CFS has -been using mballoc for a few years already with Lustre, and IBM + Bull -did a lot of benchmarking on it.  The reason it isn't in the first set of -patches is partly a manageability issue, and partly because it doesn't -directly affect the on-disk format (outside of much better allocation) -so it isn't critical to get into the first round of changes.  I believe -Alex is working on a new set of patches right now. + - http://www.bullopensource.org/ext4/20080530/ffsb-write-2.6.26-rc2.html + - http://www.bullopensource.org/ext4/20080530/ffsb-readwrite-2.6.26-rc2.html  3. Options  ========== @@ -222,9 +243,11 @@ stripe=n		Number of filesystem blocks that mballoc will try  			to use for allocation size and alignment. For RAID5/6  			systems this should be the number of data  			disks *  RAID chunk size in file system blocks. - +delalloc	(*)	Deferring block allocation until write-out time. +nodelalloc		Disable delayed allocation. Blocks are allocation +			when data is copied from user to page cache.  Data Mode ---------- +=========  There are 3 different data modes:  * writeback mode @@ -236,10 +259,10 @@ typically provide the best ext4 performance.  * ordered mode  In data=ordered mode, ext4 only officially journals metadata, but it logically -groups metadata and data blocks into a single unit called a transaction.  When -it's time to write the new metadata out to disk, the associated data blocks -are written first.  In general, this mode performs slightly slower than -writeback but significantly faster than journal mode. +groups metadata information related to data changes with the data blocks into a +single unit called a transaction.  When it's time to write the new metadata +out to disk, the associated data blocks are written first.  In general, +this mode performs slightly slower than writeback but significantly faster than journal mode.  * journal mode  data=journal mode provides full data and metadata journaling.  
All new data is @@ -247,7 +270,8 @@ written to the journal first, and then to its final location.  In the event of a crash, the journal can be replayed, bringing both data and  metadata into a consistent state.  This mode is the slowest except when data  needs to be read from and written to disk at the same time where it -outperforms all others modes. +outperforms all others modes.  Curently ext4 does not have delayed +allocation support if this data journalling mode is selected.  References  ========== @@ -256,7 +280,8 @@ kernel source:	<file:fs/ext4/>  		<file:fs/jbd2/>  programs:	http://e2fsprogs.sourceforge.net/ -		http://ext2resize.sourceforge.net  useful links:	http://fedoraproject.org/wiki/ext3-devel  		http://www.bullopensource.org/ext4/ +		http://ext4.wiki.kernel.org/index.php/Main_Page +		http://fedoraproject.org/wiki/Features/Ext4 diff --git a/Documentation/filesystems/gfs2-glocks.txt b/Documentation/filesystems/gfs2-glocks.txt new file mode 100644 index 000000000000..4dae9a3840bf --- /dev/null +++ b/Documentation/filesystems/gfs2-glocks.txt @@ -0,0 +1,114 @@ +                   Glock internal locking rules +                  ------------------------------ + +This documents the basic principles of the glock state machine +internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h) +has two main (internal) locks: + + 1. A spinlock (gl_spin) which protects the internal state such +    as gl_state, gl_target and the list of holders (gl_holders) + 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other +    threads from making calls to the DLM, etc. at the same time. If a +    thread takes this lock, it must then call run_queue (usually via the +    workqueue) when it releases it in order to ensure any pending tasks +    are completed. + +The gl_holders list contains all the queued lock requests (not +just the holders) associated with the glock. If there are any +held locks, then they will be contiguous entries at the head +of the list. 
Locks are granted in strictly the order that they +are queued, except for those marked LM_FLAG_PRIORITY which are +used only during recovery, and even then only for journal locks. + +There are three lock states that users of the glock layer can request, +namely shared (SH), deferred (DF) and exclusive (EX). Those translate +to the following DLM lock modes: + +Glock mode    | DLM lock mode +------------------------------ +    UN        |    IV/NL  Unlocked (no DLM lock associated with glock) or NL +    SH        |    PR     (Protected read) +    DF        |    CW     (Concurrent write) +    EX        |    EX     (Exclusive) + +Thus DF is basically a shared mode which is incompatible with the "normal" +shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O +operations. The glocks are basically a lock plus some routines which deal +with cache management. The following rules apply for the cache: + +Glock mode   |  Cache data | Cache Metadata | Dirty Data | Dirty Metadata +-------------------------------------------------------------------------- +    UN       |     No      |       No       |     No     |      No +    SH       |     Yes     |       Yes      |     No     |      No +    DF       |     No      |       Yes      |     No     |      No +    EX       |     Yes     |       Yes      |     Yes    |      Yes + +These rules are implemented using the various glock operations which +are defined for each type of glock. Not all types of glocks use +all the modes. Only inode glocks use the DF mode for example. + +Table of glock operations and per type constants: + +Field            | Purpose +---------------------------------------------------------------------------- +go_xmote_th      | Called before remote state change (e.g. to sync dirty data) +go_xmote_bh      | Called after remote state change (e.g. 
to refill cache) +go_inval         | Called if remote state change requires invalidating the cache +go_demote_ok     | Returns boolean value of whether its ok to demote a glock +                 | (e.g. checks timeout, and that there is no cached data) +go_lock          | Called for the first local holder of a lock +go_unlock        | Called on the final local unlock of a lock +go_dump          | Called to print content of object for debugfs file, or on +                 | error to dump glock to the log. +go_type;         | The type of the glock, LM_TYPE_..... +go_min_hold_time | The minimum hold time + +The minimum hold time for each lock is the time after a remote lock +grant for which we ignore remote demote requests. This is in order to +prevent a situation where locks are being bounced around the cluster +from node to node with none of the nodes making any progress. This +tends to show up most with shared mmaped files which are being written +to by multiple nodes. By delaying the demotion in response to a +remote callback, that gives the userspace program time to make +some progress before the pages are unmapped. + +There is a plan to try and remove the go_lock and go_unlock callbacks +if possible, in order to try and speed up the fast path though the locking. +Also, eventually we hope to make the glock "EX" mode locally shared +such that any local locking will be done with the i_mutex as required +rather than via the glock. 
+ +Locking rules for glock operations: + +Operation     |  GLF_LOCK bit lock held |  gl_spin spinlock held +----------------------------------------------------------------- +go_xmote_th   |       Yes               |       No +go_xmote_bh   |       Yes               |       No +go_inval      |       Yes               |       No +go_demote_ok  |       Sometimes         |       Yes +go_lock       |       Yes               |       No +go_unlock     |       Yes               |       No +go_dump       |       Sometimes         |       Yes + +N.B. Operations must not drop either the bit lock or the spinlock +if its held on entry. go_dump and do_demote_ok must never block. +Note that go_dump will only be called if the glock's state +indicates that it is caching uptodate data. + +Glock locking order within GFS2: + + 1. i_mutex (if required) + 2. Rename glock (for rename only) + 3. Inode glock(s) +    (Parents before children, inodes at "same level" with same parent in +     lock number order) + 4. Rgrp glock(s) (for (de)allocation operations) + 5. Transaction glock (via gfs2_trans_begin) for non-read operations + 6. Page lock  (always last, very important!) + +There are two glocks per inode. One deals with access to the inode +itself (locking order as above), and the other, known as the iopen +glock is used in conjunction with the i_nlink field in the inode to +determine the lifetime of the inode in question. Locking of inodes +is on a per-inode basis. Locking of rgrps is on a per rgrp basis. + diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index dbc3c6a3650f..7f268f327d75 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -380,28 +380,35 @@ i386 and x86_64 platforms support the new IRQ vector displays.  Of some interest is the introduction of the /proc/irq directory to 2.4.  
It could be used to set IRQ to CPU affinity, this means that you can "hook" an  IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the -irq subdir is one subdir for each IRQ, and one file; prof_cpu_mask +irq subdir is one subdir for each IRQ, and two files; default_smp_affinity and +prof_cpu_mask.  For example     > ls /proc/irq/    0  10  12  14  16  18  2  4  6  8  prof_cpu_mask -  1  11  13  15  17  19  3  5  7  9 +  1  11  13  15  17  19  3  5  7  9  default_smp_affinity    > ls /proc/irq/0/    smp_affinity -The contents of the prof_cpu_mask file and each smp_affinity file for each IRQ -is the same by default: +smp_affinity is a bitmask, in which you can specify which CPUs can handle the +IRQ, you can set it by doing: -  > cat /proc/irq/0/smp_affinity  -  ffffffff +  > echo 1 > /proc/irq/10/smp_affinity + +This means that only the first CPU will handle the IRQ, but you can also echo +5 which means that only the first and fourth CPU can handle the IRQ. -It's a bitmask, in which you can specify which CPUs can handle the IRQ, you can -set it by doing: +The contents of each smp_affinity file is the same by default: + +  > cat /proc/irq/0/smp_affinity +  ffffffff -  > echo 1 > /proc/irq/prof_cpu_mask +The default_smp_affinity mask applies to all non-active IRQs, which are the +IRQs which have not yet been allocated/activated, and hence which lack a +/proc/irq/[0-9]* directory. -This means that only the first CPU will handle the IRQ, but you can also echo 5 -which means that only the first and fourth CPU can handle the IRQ. +prof_cpu_mask specifies which CPUs are to be profiled by the system wide +profiler. Default value is ffffffff (all cpus).  The way IRQs are routed is handled by the IO-APIC, and it's Round Robin  between all the CPUs which are allowed to handle it. 
As usual the kernel has diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.txt new file mode 100644 index 000000000000..540e9e7f59c5 --- /dev/null +++ b/Documentation/filesystems/ubifs.txt @@ -0,0 +1,164 @@ +Introduction +============= + +UBIFS file-system stands for UBI File System. UBI stands for "Unsorted +Block Images". UBIFS is a flash file system, which means it is designed +to work with flash devices. It is important to understand, that UBIFS +is completely different to any traditional file-system in Linux, like +Ext2, XFS, JFS, etc. UBIFS represents a separate class of file-systems +which work with MTD devices, not block devices. The other Linux +file-system of this class is JFFS2. + +To make it more clear, here is a small comparison of MTD devices and +block devices. + +1 MTD devices represent flash devices and they consist of eraseblocks of +  rather large size, typically about 128KiB. Block devices consist of +  small blocks, typically 512 bytes. +2 MTD devices support 3 main operations - read from some offset within an +  eraseblock, write to some offset within an eraseblock, and erase a whole +  eraseblock. Block  devices support 2 main operations - read a whole +  block and write a whole block. +3 The whole eraseblock has to be erased before it becomes possible to +  re-write its contents. Blocks may be just re-written. +4 Eraseblocks become worn out after some number of erase cycles - +  typically 100K-1G for SLC NAND and NOR flashes, and 1K-10K for MLC +  NAND flashes. Blocks do not have the wear-out property. +5 Eraseblocks may become bad (only on NAND flashes) and software should +  deal with this. Blocks on hard drives typically do not become bad, +  because hardware has mechanisms to substitute bad blocks, at least in +  modern LBA disks. + +It should be quite obvious why UBIFS is very different to traditional +file-systems. + +UBIFS works on top of UBI. 
UBI is a separate software layer which may be +found in drivers/mtd/ubi. UBI is basically a volume management and +wear-leveling layer. It provides so called UBI volumes which is a higher +level abstraction than a MTD device. The programming model of UBI devices +is very similar to MTD devices - they still consist of large eraseblocks, +they have read/write/erase operations, but UBI devices are devoid of +limitations like wear and bad blocks (items 4 and 5 in the above list). + +In a sense, UBIFS is a next generation of JFFS2 file-system, but it is +very different and incompatible to JFFS2. The following are the main +differences. + +* JFFS2 works on top of MTD devices, UBIFS depends on UBI and works on +  top of UBI volumes. +* JFFS2 does not have on-media index and has to build it while mounting, +  which requires full media scan. UBIFS maintains the FS indexing +  information on the flash media and does not require full media scan, +  so it mounts many times faster than JFFS2. +* JFFS2 is a write-through file-system, while UBIFS supports write-back, +  which makes UBIFS much faster on writes. + +Similarly to JFFS2, UBIFS supports on-the-flight compression which makes +it possible to fit quite a lot of data to the flash. + +Similarly to JFFS2, UBIFS is tolerant of unclean reboots and power-cuts. +It does not need stuff like ckfs.ext2. UBIFS automatically replays its +journal and recovers from crashes, ensuring that the on-flash data +structures are consistent. + +UBIFS scales logarithmically (most of the data structures it uses are +trees), so the mount time and memory consumption do not linearly depend +on the flash size, like in case of JFFS2. This is because UBIFS +maintains the FS index on the flash media. However, UBIFS depends on +UBI, which scales linearly. So overall UBI/UBIFS stack scales linearly. +Nevertheless, UBI/UBIFS scales considerably better than JFFS2. 
+ +The authors of UBIFS believe, that it is possible to develop UBI2 which +would scale logarithmically as well. UBI2 would support the same API as UBI, +but it would be binary incompatible to UBI. So UBIFS would not need to be +changed to use UBI2 + + +Mount options +============= + +(*) == default. + +norm_unmount (*)	commit on unmount; the journal is committed +			when the file-system is unmounted so that the +			next mount does not have to replay the journal +			and it becomes very fast; +fast_unmount		do not commit on unmount; this option makes +			unmount faster, but the next mount slower +			because of the need to replay the journal. + + +Quick usage instructions +======================== + +The UBI volume to mount is specified using "ubiX_Y" or "ubiX:NAME" syntax, +where "X" is UBI device number, "Y" is UBI volume number, and "NAME" is +UBI volume name. + +Mount volume 0 on UBI device 0 to /mnt/ubifs: +$ mount -t ubifs ubi0_0 /mnt/ubifs + +Mount "rootfs" volume of UBI device 0 to /mnt/ubifs ("rootfs" is volume +name): +$ mount -t ubifs ubi0:rootfs /mnt/ubifs + +The following is an example of the kernel boot arguments to attach mtd0 +to UBI and mount volume "rootfs": +ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs + + +Module Parameters for Debugging +=============================== + +When UBIFS has been compiled with debugging enabled, there are 3 module +parameters that are available to control aspects of testing and debugging. +The parameters are unsigned integers where each bit controls an option. 
+The parameters are: + +debug_msgs	Selects which debug messages to display, as follows: + +		Message Type				Flag value + +		General messages			1 +		Journal messages			2 +		Mount messages				4 +		Commit messages				8 +		LEB search messages			16 +		Budgeting messages			32 +		Garbage collection messages		64 +		Tree Node Cache (TNC) messages		128 +		LEB properties (lprops) messages	256 +		Input/output messages			512 +		Log messages				1024 +		Scan messages				2048 +		Recovery messages			4096 + +debug_chks	Selects extra checks that UBIFS can do while running: + +		Check					Flag value + +		General checks				1 +		Check Tree Node Cache (TNC)		2 +		Check indexing tree size		4 +		Check orphan area			8 +		Check old indexing tree			16 +		Check LEB properties (lprops)		32 +		Check leaf nodes and inodes		64 + +debug_tsts	Selects a mode of testing, as follows: + +		Test mode				Flag value + +		Force in-the-gaps method		2 +		Failure mode for recovery testing	4 + +For example, set debug_msgs to 5 to display General messages and Mount +messages. + + +References +========== + +UBIFS documentation and FAQ/HOWTO at the MTD web site: +http://www.linux-mtd.infradead.org/doc/ubifs.html +http://www.linux-mtd.infradead.org/faq/ubifs.html |
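
Note on the bitmask values used in this patch: the UBIFS debug_msgs, debug_chks and debug_tsts parameters, and the /proc/irq smp_affinity files, are all plain bitmasks. The sketch below is only an illustration of how such values decompose. The helper names decode_flags and cpus_in_mask are hypothetical and not part of any kernel interface, the flag table is copied from the ubifs.txt hunk above, and the affinity mask is assumed to be written in hexadecimal as in the "ffffffff" example from proc.txt.

```python
# Flag values copied from the debug_msgs table in the ubifs.txt hunk.
DEBUG_MSGS = {
    1: "General messages",
    2: "Journal messages",
    4: "Mount messages",
    8: "Commit messages",
    16: "LEB search messages",
    32: "Budgeting messages",
    64: "Garbage collection messages",
    128: "Tree Node Cache (TNC) messages",
    256: "LEB properties (lprops) messages",
    512: "Input/output messages",
    1024: "Log messages",
    2048: "Scan messages",
    4096: "Recovery messages",
}


def decode_flags(value, table):
    """Return the names of all flags in `table` whose bit is set in `value`.

    Hypothetical helper for illustration only.
    """
    return [name for bit, name in sorted(table.items()) if value & bit]


def cpus_in_mask(hex_mask):
    """Return the zero-based CPU indices selected by a hexadecimal mask.

    Hypothetical helper; assumes the mask is a hex string such as the
    contents of an smp_affinity file.
    """
    mask = int(hex_mask, 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


if __name__ == "__main__":
    # 5 = 1 + 4, so General and Mount messages are selected.
    print(decode_flags(5, DEBUG_MSGS))
    # 5 is binary 101, so bits 0 and 2 are set.
    print(cpus_in_mask("5"))
```

Under these assumptions, debug_msgs=5 selects General and Mount messages, which matches the example in ubifs.txt. The same arithmetic applied to an affinity mask of 5 gives CPU indices 0 and 2, that is, the first and third CPUs, which is worth keeping in mind when reading the "first and fourth CPU" wording in the proc.txt hunk.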