path: root/fs/btrfs/disk-io.c
Age  Commit message  Author  Files  Lines
2008-09-25  Remove Btrfs compat code for older kernels  Chris Mason  1  -28/+0
Btrfs had compatibility code for kernels back to 2.6.18. That code has been removed and will be maintained in a separate backport git tree from now on. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Full back reference support  Zheng Yan  1  -2/+2
This patch makes the back reference system explicitly record the location of the parent node for all types of extents. The location of the parent node is placed into the offset field of the backref key. Every time a tree block is balanced, the back references for the affected lower-level extents are updated. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Checksum tree blocks in the background  Chris Mason  1  -9/+17
Tree blocks were using async bio submission, but the checksum was still being computed synchronously during writepage. This moves the checksumming into the worker thread. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: free space accounting redo  Josef Bacik  1  -4/+3
1) Replace the per-fs_info extent_io_tree that tracked free space with two rb-trees per block group to track free space areas by offset and by size (see the sketch below). Most allocations come with a hint byte for where to start, so we can usually find a chunk of free space at that hint byte to satisfy the allocation and get good space packing. If we cannot find free space at or after the given offset, we fall back on looking for a chunk of the given size as close to the given offset as possible. When we fall back on the size search, we also try to find a slot as close to the size we want as possible, to avoid breaking small chunks off of huge areas.
2) Remove the extent_io_tree that tracked the block group cache from fs_info and replace it with an rb-tree that tracks the block group cache by offset. Also add a per-space_info list that tracks the block group cache for the particular space, so we can look up related block groups easily.
3) Clean up the allocation code to make it a little easier to read and a little less complicated. Basically there are three steps: first look from our provided hint. If we couldn't find from that given hint, start back at our original search start and look for space from there. If that fails, try to allocate space if we can and start looking again. If not, we're screwed and need to start over again.
4) Small fixes. There were some issues in volumes.c where we wouldn't allocate the rest of the disk. Fixed cow_file_range to actually pass the alloc_hint, which has helped a good bit in making the fs_mark test I run have semi-normal results as we run out of space. Generally with data allocations we don't track where we last allocated from, so every time we did a data allocation we'd search through every block group that we have looking for free space. Searching a block group with no free space isn't terribly time consuming, but it was causing a slight degradation as we got more data block groups. The alloc_hint has fixed this slight degradation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when there is definitely plenty of space. This only happens with metadata allocations, and only when we are almost full. So you generally hit the 85% mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm still tracking it down, but until then this seems to be pretty stable and makes a significant performance gain. Signed-off-by: Chris Mason <[email protected]>
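A minimal userspace model of the two-step search (all names hypothetical, and a plain array with linear scans standing in for the kernel's rb-trees): try the hint offset first, then fall back to best fit by size.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical model of one block group's free space entries. */
struct free_space {
	uint64_t offset;
	uint64_t bytes;
};

/* Step 1: look for free space at or after the hint byte. */
static struct free_space *find_by_offset(struct free_space *e, size_t n,
					 uint64_t hint, uint64_t want)
{
	for (size_t i = 0; i < n; i++)
		if (e[i].offset >= hint && e[i].bytes >= want)
			return &e[i];
	return NULL;
}

/* Step 2 (fallback): best fit by size, preferring the smallest
 * chunk that still fits so huge areas aren't chipped apart. */
static struct free_space *find_by_size(struct free_space *e, size_t n,
				       uint64_t want)
{
	struct free_space *best = NULL;

	for (size_t i = 0; i < n; i++)
		if (e[i].bytes >= want && (!best || e[i].bytes < best->bytes))
			best = &e[i];
	return best;
}

static struct free_space *find_free_space(struct free_space *e, size_t n,
					  uint64_t hint, uint64_t want)
{
	struct free_space *ret = find_by_offset(e, n, hint, want);

	return ret ? ret : find_by_size(e, n, want);
}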
2008-09-25  Btrfs: Fix mismerge in block header checks  Chris Mason  1  -1/+1
I had incorrectly disabled the check for the block number being correct in the header block. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Record dirty tree-log pages in an extent_io tree  Chris Mason  1  -2/+15
This is the same way the transaction code makes sure that all the other tree blocks are safely on disk. There's an extent_io tree for each root, and any blocks allocated to the tree logs are recorded in that tree. At tree-log sync, the extent_io tree is walked to flush down the dirty pages and wait for them. The main benefit is less time spent walking the tree log and skipping clean pages, and getting sequential IO down to the drive. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Optimize tree log block allocations  Chris Mason  1  -3/+2
Since tree log blocks get freed every transaction, they never really need to be written to disk. This skips the step where we update metadata to record they were allocated. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Optimize btree walking while logging inodes  Chris Mason  1  -1/+1
Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Fix releasepage to properly keep dirty and writeback pages  Chris Mason  1  -1/+4
Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Tree logging fixes  Chris Mason  1  -3/+30
* Pin down data blocks to prevent them from being reallocated like so:
  trans 1: allocate file extent
  trans 2: free file extent
  trans 3: free file extent during old snapshot deletion
  trans 3: allocate file extent to new file
  trans 3: fsync new file
  Before the tree logging code, this was legal because the fsync would commit the transaction that did the final data extent free and the transaction that allocated the extent to the new file at the same time. With the tree logging code, the tree log subtransaction can commit before the transaction that freed the extent. If we crash, we're left with two different files using the extent.
* Don't wait in start_transaction if log replay is going on. This avoids deadlocks from iput while we're cleaning up link counts in the replay code.
* Don't deadlock in replay_one_name by trying to read an inode off the disk while holding paths for the directory.
* Hold the buffer lock while we mark a buffer as written. This closes a race where someone is changing a buffer while we write it. They are supposed to mark it dirty again after they change it, but this violates the cow rules.
Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Add a write ahead tree log to optimize synchronous operations  Chris Mason  1  -21/+117
File syncs and directory syncs are optimized by copying their items into a special (copy-on-write) log tree. There is one log tree per subvolume and the btrfs super block points to a tree of log tree roots. After a crash, items are copied out of the log tree and back into the subvolume. See tree-log.c for all the details. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Add debugging checks to track down corrupted metadata  Chris Mason  1  -1/+6
Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Throttle for async bio submits higher up the chain  Chris Mason  1  -1/+7
The current code waits for the count of async bio submits to get below a given threshold right after adding the latest bio to the work queue. This isn't optimal because the caller may have adjacent sequential bios pending that it is waiting to send down the pipe. This changeset moves the wait on the async bio count up to the caller, and changes the async checksumming submits to wait for async bios any time they self-throttle. The end result is much higher sequential throughput. Signed-off-by: Chris Mason <[email protected]>
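A pthread sketch of the self-throttle (hypothetical names and limit; the kernel uses its own counters and wait queues): submitters block while too many async bios are in flight, and completions wake them.

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  below_limit = PTHREAD_COND_INITIALIZER;
static int nr_async_bios;
#define ASYNC_LIMIT 256          /* hypothetical threshold */

/* Called by submitters before queueing more work; blocks while
 * too many async bios are already in flight. */
static void wait_on_async_bios(void)
{
	pthread_mutex_lock(&lock);
	while (nr_async_bios >= ASYNC_LIMIT)
		pthread_cond_wait(&below_limit, &lock);
	pthread_mutex_unlock(&lock);
}

static void queue_async_bio(void)
{
	pthread_mutex_lock(&lock);
	nr_async_bios++;
	pthread_mutex_unlock(&lock);
}

/* Called from IO completion; wakes any throttled submitters. */
static void async_bio_done(void)
{
	pthread_mutex_lock(&lock);
	nr_async_bios--;
	pthread_cond_broadcast(&below_limit);
	pthread_mutex_unlock(&lock);
}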
2008-09-25  Btrfs: Wait for async bio submissions to make some progress at queue time  Chris Mason  1  -7/+9
Before, the btrfs bdi congestion function was used to test for too many async bios. This keeps that check to throttle pdflush, but also adds a check while queuing bios. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Lower contention on the csum mutex  Chris Mason  1  -1/+8
This takes the csum mutex deeper in the call chain and releases it more often. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Wait for kernel threads to make progress during async submission  Chris Mason  1  -19/+26
Before this change, btrfs would use a bdi congestion function to make sure there weren't too many pending async checksum work items. This change makes the process creating async work items wait instead, leading to fewer congestion returns from the bdi. This improves pdflush background_writeout scanning. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Give all the worker threads descriptive names  Chris Mason  1  -7/+15
Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Transaction commit: don't use filemap_fdatawait  Chris Mason  1  -7/+4
After writing out all the remaining btree blocks in the transaction, the commit code would use filemap_fdatawait to make sure it was all on disk. This means it would wait for blocks written by other procs as well. The new code walks the list of blocks for this transaction again and waits only for those required by this transaction. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Count async bios separately from async checksum work items  Chris Mason  1  -3/+22
Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Limit the number of async bio submission kthreads to the number of devices  Chris Mason  1  -1/+3
Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Avoid calling into the FS for the final iput on fake root inodes  Chris Mason  1  -0/+1
Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Maintain a list of inodes that are delalloc and a way to wait on them  Chris Mason  1  -0/+1
Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: More throttle tuning  Chris Mason  1  -11/+2
* Make walk_down_tree wake up throttled tasks more often
* Make walk_down_tree call cond_resched during long loops
* As the size of the ref cache grows, wait longer in throttle
* Get rid of the reada code in walk_down_tree; the leaves don't get read anymore, thanks to the ref cache.
Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Fix streaming read performance with checksumming on  Chris Mason  1  -0/+15
Large streaming reads make for large bios, which means each entry on the async work queues represents a large amount of data. IO congestion throttling on the device was kicking in before the async worker threads decided a single thread was busy and needed some help. The end result was that a streaming read would leave a single CPU running at 100% instead of balancing the work off to other CPUs. This patch also changes the pre-IO checksum lookup done by reads to work on a per-bio basis instead of per-page. This results in many fewer btree lookups on large streaming reads, since doing the checksum lookup right before bio submit allows us to reuse searches while processing adjacent offsets. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: implement memory reclaim for leaf reference cache  Yan  1  -2/+3
The memory reclaiming issue happens when snapshots exist. In that case, some cache entries may not be used during old snapshot dropping, so they will remain in the cache until umount. The patch adds a field to struct btrfs_leaf_ref to record the create time. It also links all dead roots of a given snapshot together in order of create time. After an old snapshot has been completely dropped, we check the dead root list and remove all cache entries created before the oldest dead root in the list. Signed-off-by: Chris Mason <[email protected]>
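A hedged C sketch of that reclaim rule (hypothetical structures; the real cache lives in struct btrfs_leaf_ref and per-root lists): entries are stamped with the generation that created them, and anything older than the oldest remaining dead root is freed.

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cache entry stamped with its create time (transid). */
struct leaf_ref {
	uint64_t gen;
	struct leaf_ref *next;
};

/* Dead roots are kept ordered by create time; oldest_dead_gen is
 * the create time of the oldest snapshot still being dropped.
 * Entries created before it can never be used again. */
static void reclaim_old_refs(struct leaf_ref **cache, uint64_t oldest_dead_gen)
{
	struct leaf_ref **p = cache;

	while (*p) {
		if ((*p)->gen < oldest_dead_gen) {
			struct leaf_ref *victim = *p;

			*p = victim->next;
			free(victim);
		} else {
			p = &(*p)->next;
		}
	}
}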
2008-09-25  Btrfs: Fix verify_parent_transid  Chris Mason  1  -1/+1
It was incorrectly clearing the up to date flag on the buffer even when the buffer verified properly. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Throttle operations if the reference cache gets too large  Chris Mason  1  -2/+5
A large reference cache is directly related to a lot of work pending for the cleaner thread. This throttles back new operations based on the size of the reference cache so the cleaner thread will be able to keep up. Overall, this actually makes the FS faster because the cleaner thread will be more likely to find things in cache. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Leaf reference cache update  Chris Mason  1  -3/+5
This changes the reference cache to make a single cache per root instead of one cache per transaction, and to key by the byte number of the disk block instead of the keys inside. This makes it much less likely to have cache misses if a snapshot or something has an extra reference on a higher node or a leaf while the first transaction that added the leaf into the cache is dropping. Some throttling is added to functions that free blocks heavily so they wait for old transactions to drop. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Add a leaf reference cache  Yan Zheng  1  -0/+14
Much of the IO done while dropping snapshots is done looking up leaves in the filesystem trees to see if they point to any extents and to drop the references on any extents found. This creates a cache so that IO isn't required. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Create orphan inode records to prevent lost files after a crash  Josef Bacik  1  -0/+2
Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Fix the defragmentation code and the block relocation code for data=ordered  Chris Mason  1  -0/+3
Before setting an extent to delalloc, the code needs to wait for pending ordered extents. Also, the relocation code needs to wait for ordered IO before scanning the block group again. This is because the extents are not removed until the IO for the new extents is finished. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Search data ordered extents first for checksums on read  Chris Mason  1  -1/+8
Checksum items are not inserted into the tree until all of the IO from a given extent is complete. This means one dirty page from an extent may be written, freed, and then read again before the entire extent is on disk and the checksum item is inserted. The checksums themselves are stored in the ordered extent so they can be inserted in bulk when the IO is complete. On read, if a checksum item isn't found, the ordered extents are searched for a checksum record. This all worked most of the time, but the checksum insertion code tries to reduce the number of tree operations by pre-inserting checksum items based on i_size and a few other factors, so the read code might find a checksum item that hasn't yet really been filled in. This commit changes things to check the ordered extents first and only dive into the btree if nothing is found. This removes the need for extra locking and is more reliable. Signed-off-by: Chris Mason <[email protected]>
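The read-side ordering as a tiny C sketch (both helpers are hypothetical stand-ins with stub bodies): consult the in-memory ordered extents first, and only search the btree when nothing is found there.

#include <stdint.h>

/* Stub: look for the csum in the in-memory ordered extents. */
static int ordered_extent_csum(uint64_t start, uint32_t *csum)
{
	(void)start; (void)csum;
	return -1;               /* pretend it wasn't found in memory */
}

/* Stub: fall back to a search of the on-disk checksum items. */
static int btree_csum_lookup(uint64_t start, uint32_t *csum)
{
	(void)start;
	*csum = 0;
	return 0;
}

/* Ordered extents first: a csum item found in the btree may be a
 * pre-inserted placeholder that hasn't been filled in yet. */
int lookup_data_csum(uint64_t start, uint32_t *csum)
{
	if (ordered_extent_csum(start, csum) == 0)
		return 0;
	return btree_csum_lookup(start, csum);
}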
2008-09-25  Btrfs: Index extent buffers in an rbtree  Chris Mason  1  -17/+9
Before, extent buffers were a temporary object, meant to map a number of pages at once and collect operations on them. But a few extra fields have crept in, and they are also the best place to store a per-tree-block lock field. This commit puts the extent buffers into an rbtree and ensures a single extent buffer for each tree block. Signed-off-by: Chris Mason <[email protected]>
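A userspace sketch of "one extent buffer per tree block", with POSIX tsearch() standing in for the kernel rb-tree (names hypothetical): lookup and insert share one path, so a second caller asking for the same block start gets the existing buffer back.

#include <search.h>
#include <stdint.h>
#include <stdlib.h>

struct extent_buffer {
	uint64_t start;          /* disk byte offset of the tree block */
	int refs;
};

static void *eb_root;            /* binary tree keyed by ->start */

static int eb_cmp(const void *a, const void *b)
{
	const struct extent_buffer *x = a, *y = b;

	return (x->start > y->start) - (x->start < y->start);
}

/* Return the unique buffer for this block, allocating on first use. */
static struct extent_buffer *find_or_create_eb(uint64_t start)
{
	struct extent_buffer key = { .start = start };
	struct extent_buffer **slot = tsearch(&key, &eb_root, eb_cmp);

	if (*slot == &key) {     /* newly inserted: swap in a real one */
		struct extent_buffer *eb = calloc(1, sizeof(*eb));

		eb->start = start;
		*slot = eb;
	}
	(*slot)->refs++;
	return *slot;
}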
2008-09-25  btrfs_start_transaction: wait for commits in progress to finish  Chris Mason  1  -0/+1
btrfs_commit_transaction has to loop waiting for any writers in the transaction to finish before it can proceed. btrfs_start_transaction should be polite and not join a transaction that is in the process of being finished off. There are a few places that can't wait, basically the ones doing IO that might be needed to finish the transaction. For them, btrfs_join_transaction is added. Signed-off-by: Chris Mason <[email protected]>
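A hedged pthread model of the distinction (all state hypothetical): start_transaction politely waits out a commit in progress, while join_transaction enters regardless because its caller may be doing IO the commit itself is waiting on.

#include <pthread.h>

static pthread_mutex_t trans_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  trans_open = PTHREAD_COND_INITIALIZER;
static int commit_in_progress;
static int num_writers;

/* Polite path: don't pile onto a transaction being finished off. */
static void start_transaction(void)
{
	pthread_mutex_lock(&trans_lock);
	while (commit_in_progress)
		pthread_cond_wait(&trans_open, &trans_lock);
	num_writers++;
	pthread_mutex_unlock(&trans_lock);
}

/* IO helpers the commit may depend on must not wait here. */
static void join_transaction(void)
{
	pthread_mutex_lock(&trans_lock);
	num_writers++;
	pthread_mutex_unlock(&trans_lock);
}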
2008-09-25  Btrfs: Use async helpers to deal with pages that have been improperly dirtied  Chris Mason  1  -0/+4
Higher layers sometimes call set_page_dirty without asking the filesystem to help. This causes many problems for the data=ordered and cow code. This commit detects pages that haven't been properly setup for IO and kicks off an async helper to deal with them. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: New data=ordered implementation  Chris Mason  1  -1/+12
The old data=ordered code would force commit to wait until all the data extents from the transaction were fully on disk. This introduced large latencies into the commit and stalled new writers in the transaction for a long time. The new code changes the way data allocations and extents work:
* When delayed allocation is filled, data extents are reserved, and the extent bit EXTENT_ORDERED is set on the entire range of the extent. A struct btrfs_ordered_extent is allocated and inserted into a per-inode rbtree to track the pending extents.
* As each page is written, EXTENT_ORDERED is cleared on the bytes corresponding to that page.
* When all of the bytes corresponding to a single struct btrfs_ordered_extent are written, the previously reserved extent is inserted into the FS btree and into the extent allocation trees. The checksums for the file data are also updated.
Signed-off-by: Chris Mason <[email protected]>
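A compact model of that lifecycle (hypothetical names; the kernel tracks the EXTENT_ORDERED bits in an extent_io tree rather than a plain byte counter): mark the whole range ordered at reservation, clear a piece per page written, and insert the metadata only when the count reaches zero.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for struct btrfs_ordered_extent. */
struct ordered_extent {
	uint64_t start;
	uint64_t len;
	uint64_t bytes_left;     /* bytes still marked ordered */
};

static void ordered_extent_init(struct ordered_extent *oe,
				uint64_t start, uint64_t len)
{
	oe->start = start;
	oe->len = len;
	oe->bytes_left = len;    /* the whole range starts out ordered */
}

/* Called as each page of the extent finishes its IO. */
static void ordered_page_written(struct ordered_extent *oe, uint64_t bytes)
{
	oe->bytes_left -= bytes;
	if (oe->bytes_left == 0)
		/* All data is safely on disk: only now do the file
		 * extent item and checksums go into the btrees. */
		printf("insert extent [%llu, %llu) into the FS tree\n",
		       (unsigned long long)oe->start,
		       (unsigned long long)(oe->start + oe->len));
}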
2008-09-25  Btrfs: Drop some verbose printks  Chris Mason  1  -2/+0
Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Add locking around volume management (device add/remove/balance)  Chris Mason  1  -0/+1
Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Online btree defragmentation fixes  Chris Mason  1  -58/+3
The btree defragger wasn't making forward progress because the new key wasn't being saved by the btrfs_search_forward function. This also disables the automatic btree defrag; it wasn't scaling well to huge filesystems. The auto-defrag needs to be done differently. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Replace the transaction work queue with kthreads  Chris Mason  1  -9/+107
This creates one kthread for commits and one kthread for deleting old snapshots. All the work queues are removed. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Add btrfs_end_transaction_throttle to force writers to wait for pending commits  Chris Mason  1  -18/+0
The existing throttle mechanism was often not sufficient to prevent new writers from coming in and making a given transaction run forever. This adds an explicit wait at the end of most operations so they will allow the current transaction to close. There is no wait inside file_write, inode updates, or cow filling, all of which have different deadlock possibilities. This is a temporary measure until better asynchronous commit support is added. This code leads to stalls as it waits for data=ordered writeback, and it really needs to be fixed. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Fix snapshot deletion to release the alloc_mutex much more often.  Chris Mason  1  -0/+2
This lowers the impact of snapshot deletion on the rest of the FS. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Drop locks in btrfs_search_slot when reading a tree block.  Chris Mason  1  -0/+1
One lock per btree block can make for significant contention if everyone has to wait for IO at the high levels of the btree. This drops the locks held by a path when doing reads during a tree search. Signed-off-by: Chris Mason <[email protected]>
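A hedged sketch of the pattern (hypothetical helpers with stub bodies): if the child block would need IO, drop the path's locks first, read outside the locks, and have the caller restart the search rather than sleep while holding the btree locked.

#include <stdio.h>

/* Stubs standing in for the cache check, path unlock and block read. */
static int block_is_cached(unsigned long long blocknr) { return blocknr & 1; }
static void release_path_locks(void) { puts("unlock path"); }
static void read_tree_block(unsigned long long b) { printf("read %llu\n", b); }

/* Returns 0 if the caller may descend directly, or 1 if it must
 * re-search because the locks were dropped around the read. */
int read_block_for_search(unsigned long long blocknr)
{
	if (block_is_cached(blocknr))
		return 0;

	release_path_locks();     /* don't block other searchers on IO */
	read_tree_block(blocknr); /* the slow part happens unlocked */
	return 1;                 /* tree may have changed: retry */
}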
2008-09-25  Btrfs: Replace the big fs_mutex with a collection of other locks  Chris Mason  1  -8/+7
Extent allocations are still protected by a large alloc_mutex. Objectid allocations are covered by an objectid mutex. Other btree operations are protected by a lock on individual btree nodes. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Start btree concurrency work.  Chris Mason  1  -1/+12
The allocation trees and the chunk trees are serialized via their own dedicated mutexes. This means allocation locking is still not very fine grained. The main FS btree is protected by locks on each block in the btree. Locks are taken top-down, and as processing finishes on a given level of the tree, the lock is released after locking the lower level. The end result of a search is a path where only the lowest level is locked. Releasing or freeing the path drops any locks held. Signed-off-by: Chris Mason <[email protected]>
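The descent order as a small pthread sketch (hypothetical node layout): classic lock coupling, where each child is locked before its parent is released, so a finished search holds only the lowest-level lock.

#include <pthread.h>
#include <stddef.h>

/* Hypothetical btree node: one lock per block. */
struct node {
	pthread_mutex_t lock;
	struct node *child;      /* next node on the search path */
};

/* Walk from the root toward the leaf, never holding more than two
 * locks at once; only the lowest block stays locked at the end. */
static struct node *search_to_leaf(struct node *root)
{
	struct node *cur = root;

	pthread_mutex_lock(&cur->lock);
	while (cur->child) {
		struct node *child = cur->child;

		pthread_mutex_lock(&child->lock);   /* lock lower level */
		pthread_mutex_unlock(&cur->lock);   /* then drop parent */
		cur = child;
	}
	return cur;              /* returned locked; caller unlocks */
}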
2008-09-25  Btrfs: Add a thread pool just for submit_bio  Chris Mason  1  -0/+4
If a bio submission is queued behind a lock holder that is waiting for that bio on the work queue, it is possible to deadlock. Move the bios into their own pool. Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: Add a mount option to control worker thread pool size  Chris Mason  1  -15/+15
mount -o thread_pool_size changes the default, which is min(num_cpus + 2, 8). Larger thread pools would make more sense on very large disk arrays. This mount option controls the max size of each thread pool. There are multiple thread pools, so the total worker count will be larger than the mount option. Signed-off-by: Chris Mason <[email protected]>
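The quoted default as a userspace one-liner (sysconf standing in for the kernel's online CPU count; the function name is hypothetical):

#include <unistd.h>

/* Default pool size: min(num_cpus + 2, 8), per the commit message. */
static int default_thread_pool_size(void)
{
	long size = sysconf(_SC_NPROCESSORS_ONLN) + 2;

	return (int)(size < 8 ? size : 8);
}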
2008-09-25  Btrfs: Add async worker threads for pre and post IO checksumming  Chris Mason  1  -118/+82
Btrfs has been using workqueues to spread the checksumming load across other CPUs in the system. But, workqueues only schedule work on the same CPU that queued the work, giving them a limited benefit for systems with higher CPU counts. This code adds a generic facility to schedule work with pools of kthreads, and changes the bio submission code to queue bios up. The queueing is important to make sure large numbers of procs on the system don't turn streaming workloads into random workloads by sending IO down concurrently. The end result of all of this is much higher performance (and CPU usage) when doing checksumming on large machines. Two worker pools are created, one for writes and one for endio processing. The two could deadlock if we tried to service both from a single pool. Signed-off-by: Chris Mason <[email protected]>
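A minimal pthread rendition of such a kthread pool (all names hypothetical): a fixed set of threads pulls queued work in FIFO order, and the write-side checksumming and endio processing each get their own pool so neither can starve the other into a deadlock.

#include <pthread.h>
#include <stdlib.h>

/* Hypothetical work item and pool, loosely modeled on btrfs_workers. */
struct work {
	void (*fn)(void *arg);
	void *arg;
	struct work *next;
};

struct worker_pool {
	pthread_mutex_t lock;
	pthread_cond_t  more;
	struct work *head, **tail;
};

static void *worker_main(void *p)
{
	struct worker_pool *pool = p;

	for (;;) {
		pthread_mutex_lock(&pool->lock);
		while (!pool->head)
			pthread_cond_wait(&pool->more, &pool->lock);
		struct work *w = pool->head;
		pool->head = w->next;
		if (!pool->head)
			pool->tail = &pool->head;
		pthread_mutex_unlock(&pool->lock);

		w->fn(w->arg);   /* e.g. checksum one queued bio */
		free(w);
	}
	return NULL;
}

static void pool_init(struct worker_pool *pool, int nthreads)
{
	pthread_mutex_init(&pool->lock, NULL);
	pthread_cond_init(&pool->more, NULL);
	pool->head = NULL;
	pool->tail = &pool->head;
	while (nthreads--) {
		pthread_t t;

		pthread_create(&t, NULL, worker_main, pool);
	}
}

/* FIFO queueing preserves submission order, so concurrent writers
 * don't turn streaming IO into random IO. */
static void pool_queue(struct worker_pool *pool, void (*fn)(void *), void *arg)
{
	struct work *w = malloc(sizeof(*w));

	w->fn = fn;
	w->arg = arg;
	w->next = NULL;
	pthread_mutex_lock(&pool->lock);
	*pool->tail = w;
	pool->tail = &w->next;
	pthread_cond_signal(&pool->more);
	pthread_mutex_unlock(&pool->lock);
}

/* Two pools: one for pre-IO checksums, one for endio completion. */
static struct worker_pool workers, endio_workers;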
2008-09-25  btrfs: sanity mount option parsing and early mount code  Christoph Hellwig  1  -1/+4
Also adds lots of comments to describe what's going on here. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2008-09-25  Btrfs: bdi_init and bdi_destroy come with 2.6.23  Jan Engelhardt  1  -3/+3
Signed-off-by: Chris Mason <[email protected]>