aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2013-11-11Btrfs: stop all workers after we free block groupsJosef Bacik1-2/+2
Stefan was hitting a panic in the async worker stuff because we had outstanding read bios while we were stopping the worker threads. You could reproduce this easily if you mount -o nospace_cache and ran generic/273. This is because the caching thread stuff is still going and we were stopping all the worker threads. We need to stop the workers after this work is done, and the free block groups code will wait for all the caching threads to stop first so we don't run into this problem. With this patch we no longer panic. Thanks, Cc: [email protected] Reported-by: Stefan Behrens <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: add tests for btrfs_get_extentJosef Bacik7-2/+861
I'm going to be removing hole extents in the near future so I wanted to make a sanity test for btrfs_get_extent to make sure I don't break anything in the meantime. This patch just puts btrfs_get_extent through its paces by giving it a completely unreasonable mapping to look at and make sure it is giving us back maps that make sense. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: add tests for find_lock_delalloc_rangeJosef Bacik8-8/+388
So both Liu and I made huge messes of find_lock_delalloc_range trying to fix stuff, me first by fixing extent size, then him by fixing something I broke and then me again telling him to fix it a different way. So this is obviously a candidate for some testing. This patch adds a pseudo fs so we can allocate fake inodes for tests that need an inode or pages. Then it addes a bunch of tests to make sure find_lock_delalloc_range is acting the way it is supposed to. With this patch and all of our previous patches to find_lock_delalloc_range I am sure it is working as expected now. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: free reserved space on error in a few placesJosef Bacik2-2/+21
While trying to track down a reserved space leak I noticed a few places where we won't properly clean up reserved space if we have an error, this patch fixes those up. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: fixup reserved trace pointsJosef Bacik1-2/+6
In trying to track down where we were leaking reserved space I noticed our reserve extent tracepoints are a little off. First we were saying that the reserved space had been alloced in btrfs_reserve_extent, which isn't the case, this needs to be triggered when we actually allocate the space when we run the delayed ref. We were also missing a few places where we should have been tracing the btrfs_reserve_extent_free tracepoint. With these in place I was able to put together where we were leaking reserved space. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: free up block groups after everythingJosef Bacik1-2/+2
If we abort a transaction we will do the tree log cleanup at unmount, but this happens after we free up the block groups. This makes all the leak detection warnings go off because we think we've leaked space but in reality we just haven't cleaned it up yet. So instead do the block group cleanup stuff after free'ing the fs roots so we don't get these warnings. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: cleanup reserved space when freeing tree log on errorJosef Bacik1-22/+25
On error we will wait and free the tree log at unmount without a transaction. This means that the actual freeing of the blocks doesn't happen which means we complain about space leaks on unmount. So to fix this just skip the transaction specific cleanup part of the tree log free'ing if we don't have a transaction and that way we can free up our reserved space and our counters stay happy. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: do not free the dirty bytes from the trans block rsv on cleanupJosef Bacik1-2/+0
The transactions should be cleaning up their reservations on failure, this just causes us to have warnings on unmount because we go negative by free'ing reservations that have already been free'ed. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: fix memory leaks on transaction commit failureFilipe David Borba Manana1-4/+8
Structures of the types tree_mod_elem and qgroup_update are allocated during transaction commit but were not being released if the call to btrfs_run_delayed_items() returned an error. Stack trace reported by kmemleak: unreferenced object 0xffff880679f0b398 (size 128): comm "umount", pid 21508, jiffies 4295967793 (age 36718.112s) hex dump (first 32 bytes): 60 b5 f0 79 06 88 ff ff 00 00 00 00 00 00 00 00 `..y............ 00 00 00 00 00 00 00 00 50 1c 00 00 00 00 00 00 ........P....... backtrace: [<ffffffff81742d26>] kmemleak_alloc+0x26/0x50 [<ffffffff811889c2>] kmem_cache_alloc_trace+0x112/0x200 [<ffffffffa046f2d3>] tree_mod_log_insert_key.constprop.45+0x93/0x150 [btrfs] [<ffffffffa04720f9>] __btrfs_cow_block+0x299/0x4f0 [btrfs] [<ffffffffa0472510>] btrfs_cow_block+0x120/0x1f0 [btrfs] [<ffffffffa0476679>] btrfs_search_slot+0x449/0x930 [btrfs] [<ffffffffa048eecf>] btrfs_lookup_inode+0x2f/0xa0 [btrfs] [<ffffffffa04eb49c>] __btrfs_update_delayed_inode+0x1c/0x1d0 [btrfs] [<ffffffffa04eb9e2>] __btrfs_run_delayed_items+0x162/0x1e0 [btrfs] [<ffffffffa04eba63>] btrfs_delayed_inode_exit+0x3/0x20 [btrfs] [<ffffffffa0499c63>] btrfs_commit_transaction+0x203/0xa50 [btrfs] [<ffffffffa046b519>] btrfs_sync_fs+0x69/0x110 [btrfs] [<ffffffff811cb210>] __sync_filesystem+0x30/0x60 [<ffffffff811cb2bb>] sync_filesystem+0x4b/0x70 [<ffffffff8119ce7b>] generic_shutdown_super+0x3b/0xf0 [<ffffffff8119cfc6>] kill_anon_super+0x16/0x30 unreferenced object 0xffff880677e0dd88 (size 32): comm "umount", pid 21508, jiffies 4295967793 (age 36718.112s) hex dump (first 32 bytes): 78 75 11 a9 06 88 ff ff 00 c0 e0 77 06 88 ff ff xu.........w.... 40 c3 a2 70 06 88 ff ff 00 00 00 00 00 00 00 00 @..p............ backtrace: [<ffffffff81742d26>] kmemleak_alloc+0x26/0x50 [<ffffffff811889c2>] kmem_cache_alloc_trace+0x112/0x200 [<ffffffffa04fa54f>] btrfs_qgroup_record_ref+0xf/0x90 [btrfs] [<ffffffffa04e1914>] btrfs_add_delayed_tree_ref+0xf4/0x170 [btrfs] [<ffffffffa048518a>] btrfs_free_tree_block+0x9a/0x220 [btrfs] [<ffffffffa0472163>] __btrfs_cow_block+0x303/0x4f0 [btrfs] [<ffffffffa0472510>] btrfs_cow_block+0x120/0x1f0 [btrfs] [<ffffffffa0476679>] btrfs_search_slot+0x449/0x930 [btrfs] [<ffffffffa048eecf>] btrfs_lookup_inode+0x2f/0xa0 [btrfs] [<ffffffffa04eb49c>] __btrfs_update_delayed_inode+0x1c/0x1d0 [btrfs] [<ffffffffa04eb9e2>] __btrfs_run_delayed_items+0x162/0x1e0 [btrfs] [<ffffffffa04eba63>] btrfs_delayed_inode_exit+0x3/0x20 [btrfs] [<ffffffffa0499c63>] btrfs_commit_transaction+0x203/0xa50 [btrfs] [<ffffffffa046b519>] btrfs_sync_fs+0x69/0x110 [btrfs] [<ffffffff811cb210>] __sync_filesystem+0x30/0x60 [<ffffffff811cb2bb>] sync_filesystem+0x4b/0x70 Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: fix the dev-replace suspend sequenceIlya Dryomov1-2/+3
Replace progresses strictly from lower to higher offsets, and the progress is tracked in chunks, by storing the physical offset of the dev_extent which is being copied in the cursor_left field of btrfs_dev_replace_item. When we are done copying the chunk, left_cursor is updated to point one byte past the dev_extent, so that on resume we can skip the dev_extents that have already been copied. There is a major bug (which goes all the way back to the inception of dev-replace in 3.8) in the way left_cursor is bumped: the bump is done unconditionally, without any regard to the scrub_chunk return value. On suspend (and also on any kind of error) scrub_chunk returns early, i.e. without completing the copy. This leads to us skipping the chunk that hasn't been fully copied yet when resuming. Fix this by doing the cursor_left update only if scrub_chunk ret is 0. (On suspend scrub_chunk returns with -ECANCELED, so this fix covers both suspend and error cases.) Cc: Stefan Behrens <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: improve inode hash function/inode lookupFilipe David Borba Manana3-3/+25
Currently the hash value used for adding an inode to the VFS's inode hash table consists of the plain inode number, which is a 64 bits integer. This results in hash table buckets (hlist_head lists) with too many elements for at least 2 important scenarios: 1) When we have many subvolumes. Each subvolume has its own btree where its files and directories are added to, and each has its own objectid (inode number) namespace. This means that if we have N subvolumes, and all have inode number X associated to a file or directory, the corresponding inodes all map to the same hash table entry, resulting in a bucket (hlist_head list) with N elements; 2) On 32 bits machines. Th VFS hash values are unsigned longs, which are 32 bits wide on 32 bits machines, and the inode (objectid) numbers are 64 bits unsigned integers. We simply cast the inode numbers to hash values, which means that for all inodes with the same 32 bits lower half, the same hash bucket is used for all of them. For example, all inodes with a number (objectid) between 0x0000_0000_ffff_ffff and 0xffff_ffff_ffff_ffff will end up in the same hash table bucket. This change ensures the inode's hash value depends both on the objectid (inode number) and its subvolume's (btree root) objectid. For 32 bits machines, this change gives better entropy by making the hash value depend on both the upper and lower 32 bits of the 64 bits hash previously computed. Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: remove unnecessary tree search when logging inodeFilipe David Borba Manana1-5/+5
In tree-log.c:btrfs_log_inode(), we keep calling btrfs_search_forward() until it returns a key whose objectid is higher than our inode or until the key's type is higher than our maximum allowed type. At the end of the loop, we increment our mininum search key's objectid and type regardless of our desired target objectid and maximum desired type, which causes another loop iteration that will call again btrfs_search_forward() just to figure out we've gone beyond our maximum key and exit the loop. Therefore while incrementing our minimum key, don't do it blindly and exit the loop immiediately if the next search key's objectid or type is beyond what we seek. Also after incrementing the type, set the key's offset to 0, which was missing and could make us loose some of the inode's items. Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: remove unused max_key arg from btrfs_search_forwardFilipe David Borba Manana7-30/+7
It is not used for anything. Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: fix memory leak of chunks' extent mapLiu Bo1-0/+2
As we're hold a ref on looking up the extent map, we need to drop the ref before returning to callers. Signed-off-by: Liu Bo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: improve jitter performance of the sequential buffered writeMiao Xie1-3/+4
The performance was slowed down sometimes when we ran sysbench to measure the performance of the sequential buffered write by 2 or more threads. It was because the write order of the test threads might be confused by the task scheduler, and the coming write would be beyond the end of the file, in this case, we need insert dummy file extents and create a hole for the area we skip. But in order to avoid the ongoing ordered extents which are in the area, we need wait for them. Unfortunately, the current code doesn't check if there are ordered extents in the area or not, try to find and flush the dirty pages directly, but in fact, there is no dirty page in that area, this step of the current code is unnecessary, and just wastes time. Sometimes, it would increase the contention of some locks, and makes the performance slow down suddenly. So we remove the ordered extent flush function before the check, and flush the dirty pages and wait for the ordered extents only when we find them. According to my test, we got 1-2 times of the performance regression when we ran the test by 10 times before applying this patch. After applying this patch, the regression went away. Test Environment: CPU: 1CPU * 4Cores Memory: 6GB Partition: 20GB Test Command: # sysbench --test=fileio --file-total-size=16G --file-test-mode=seqwr \ > --num-threads=512 --file-block-size=16384 --max-time=60 --max-requests=0 run Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: fix BUG_ON() casued by the reserved space migrationMiao Xie3-3/+28
When we did space balance and snapshot creation at the same time, we might meet the following oops: kernel BUG at fs/btrfs/inode.c:3038! [SNIP] Call Trace: [<ffffffffa0411ec7>] btrfs_orphan_cleanup+0x293/0x407 [btrfs] [<ffffffffa042dc45>] btrfs_mksubvol.isra.28+0x259/0x373 [btrfs] [<ffffffffa042de85>] btrfs_ioctl_snap_create_transid+0x126/0x156 [btrfs] [<ffffffffa042dff1>] btrfs_ioctl_snap_create_v2+0xd0/0x121 [btrfs] [<ffffffffa0430b2c>] btrfs_ioctl+0x414/0x1854 [btrfs] [<ffffffff813b60b7>] ? __do_page_fault+0x305/0x379 [<ffffffff811215a9>] vfs_ioctl+0x1d/0x39 [<ffffffff81121d7c>] do_vfs_ioctl+0x32d/0x3e2 [<ffffffff81057fe7>] ? finish_task_switch+0x80/0xb8 [<ffffffff81121e88>] SyS_ioctl+0x57/0x83 [<ffffffff813b39ff>] ? do_device_not_available+0x12/0x14 [<ffffffff813b99c2>] system_call_fastpath+0x16/0x1b [SNIP] RIP [<ffffffffa040da40>] btrfs_orphan_add+0xc3/0x126 [btrfs] The reason of the problem is that the relocation root creation stole the reserved space, which was reserved for orphan item deletion. There are several ways to fix this problem, one is to increasing the reserved space size of the space balace, and then we can use that space to create the relocation tree for each fs/file trees. But it is hard to calculate the suitable size because we doesn't know how many fs/file trees we need relocate. We fixed this problem by reserving the space for relocation root creation actively since the space it need is very small (one tree block, used for root node copy), then we use that reserved space to create the relocation tree. If we don't reserve space for relocation tree creation, we will use the reserved space of the balance. Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11btrfs: remove unused parameter from btrfs_header_fsidRoss Kirk4-10/+10
Remove unused parameter, 'eb'. Unused since introduction in 5f39d397dfbe140a14edecd4e73c34ce23c4f9ee Updated to be rebased against current upstream and correct diff supplied this time! Signed-off-by: Ross Kirk <[email protected]> Reviewed-by: Eric Sandeen <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: fix two use-after-free bugs with transaction cleanupJosef Bacik3-82/+52
I was noticing the slab redzone stuff going off every once and a while during transaction aborts. This was caused by two things 1) We would walk the pending snapshots and set their error to -ECANCELED. We don't need to do this, the snapshot stuff waits for a transaction commit and if there is a problem we just free our pending snapshot object and exit. Doing this was causing us to touch the pending snapshot object after the thing had already been freed. 2) We were freeing the transaction manually with wanton disregard for it's use_count reference counter. To fix this I cleaned up the transaction freeing loop to either wait for the transaction commit to finish if it was in the middle of that (since it will be cleaned and freed up there) or to do the cleanup oursevles. I also moved the global "kill all things dirty everywhere" stuff outside of the transaction cleanup loop since that only needs to be done once. With this patch I'm no longer seeing slab corruption because of use after frees. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: remove all BUG_ON()'s from commit_cowonly_rootsJosef Bacik1-5/+8
Noticed this when forcing errors to happen during delayed ref running. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: don't delete ordered roots from list during cleanupJosef Bacik1-1/+2
During transaction cleanup after an abort we are just removing roots from the ordered roots list which is incorrect. We have a BUG_ON() to make sure that the root is still part of the ordered roots list when we put our ordered extent which we were tripping in this case. So do like we do everywhere else and just move it to the tail of the ordered roots list and allow the normal cleanup to take care of stuff. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: cleanup transaction on abortJosef Bacik2-1/+6
If we abort not during a transaction commit we won't clean up anything until we unmount. Unfortunately if we abort in the middle of writing out an ordered extent we won't clean it up and if somebody is waiting on that ordered extent they will wait forever. To fix this just make the transaction kthread call the cleanup transaction stuff if it notices theres an error, and make btrfs_end_transaction wake up the transaction kthread if there is an error. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: do not release metadata for space cache inodesJosef Bacik1-1/+7
I've been testing our error paths and I was tripping the BUG_ON() in drop_outstanding_extent because our outstanding_extents is 0 for space cache inodes. This is because we don't reserve metadata space for these inodes since we depend on the global block reserve for our space. To fix this we need to make sure the DO_ACCOUNTING stuff doesn't actually call release_metadata for space cache inodes. With this patch I'm no longer panicing. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: reset intwrite on transaction abortJosef Bacik1-0/+2
If we abort a transaction in the middle of a commit we weren't undoing the intwrite locking. This patch fixes that problem. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: relocate csums properly with prealloc extentsJosef Bacik1-3/+15
A user reported a problem where they were getting csum errors when running a balance and running systemd's journal. This is because systemd is awesome and fallocate()'s its log space and writes into it. Unfortunately we assume that when we read in all the csums for an extent that they are sequential starting at the bytenr we care about. This obviously isn't the case for prealloc extents, where we could have written to the middle of the prealloc extent only, which means the csum would be for the bytenr in the middle of our range and not the front of our range. Fix this by offsetting the new bytenr we are logging to based on the original bytenr the csum was for. With this patch I no longer see the csum errors I was seeing. Thanks, Cc: [email protected] Reported-by: Chris Murphy <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: don't leak block group on errorFilipe David Borba Manana1-2/+1
In extent-tree.c:btrfs_write_dirty_block_groups(), if the call to write_one_cache_group() failed, we would return without putting the block group first. Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: fix sync fs to actually wait for all data to be persistedFilipe David Borba Manana1-3/+9
Currently the fs sync function (super.c:btrfs_sync_fs()) doesn't wait for delayed work to finish before returning success to the caller. This change fixes this, ensuring that there's no data loss if a power failure happens right after fs sync returns success to the caller and before the next commit happens. Steps to reproduce the data loss issue: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ perl -e '$d = ("\x41" x 6001); open($f,">","/mnt/btrfs/foobar"); print $f $d; close($f);' && btrfs fi sync /mnt/btrfs Right after the btrfs fi sync command (a second or 2 for example), power off the machine and reboot it. The file will be empty, as it can be verified after mounting the filesystem and through btrfs-debug-tree: $ btrfs-debug-tree /dev/sdb3 | egrep '\(257 INODE_ITEM 0\) itemoff' -B 3 -A 8 item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36 location key (257 INODE_ITEM 0) type FILE namelen 6 datalen 0 name: foobar item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160 inode generation 7 transid 7 size 0 block group 0 mode 100644 links 1 item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16 inode ref index 2 namelen 6 name: foobar checksum tree key (CSUM_TREE ROOT_ITEM 0) leaf 29429760 items 0 free space 3995 generation 7 owner 7 fs uuid 6192815c-af2a-4b75-b3db-a959ffb6166e chunk uuid b529c44b-938c-4d3d-910a-013b4700bcae uuid tree key (UUID_TREE ROOT_ITEM 0) After this patch, the data loss no longer happens after a power failure and btrfs-debug-tree shows: $ btrfs-debug-tree /dev/sdb3 | egrep '\(257 INODE_ITEM 0\) itemoff' -B 3 -A 8 item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36 location key (257 INODE_ITEM 0) type FILE namelen 6 datalen 0 name: foobar item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160 inode generation 6 transid 6 size 6001 block group 0 mode 100644 links 1 item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16 inode ref index 2 namelen 6 name: foobar item 6 key (257 EXTENT_DATA 0) itemoff 3522 itemsize 53 extent data disk byte 12845056 nr 8192 extent data offset 0 nr 8192 ram 8192 extent compression 0 checksum tree key (CSUM_TREE ROOT_ITEM 0) Signed-off-by: Filipe David Borba Manana <[email protected]> Reviewed-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: fix tracking of orphan inode countFilipe David Borba Manana1-5/+8
In inode.c:btrfs_orphan_add() if we failed to insert the orphan item, we would return without decrementing the orphan count that we just incremented before attempting the insertion, leaving the orphan inode count wrong. In inode.c:btrfs_orphan_del(), we were decrementing the inode orphan count if the bit BTRFS_INODE_ORPHAN_META_RESERVED was set, which is logically wrong because it should be decremented if the bit BTRFS_INODE_HAS_ORPHAN_ITEM was set - after all we increment the count when we set the bit BTRFS_INODE_HAS_ORPHAN_ITEM elsewhere. Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: export btrfs space shared info to userspaceLiu Bo1-1/+33
Similar to ocfs2, btrfs also supports that extents can be shared by different inodes, and there are some userspace tools requesting for this kind of 'space shared infomation'.[1] ocfs2 uses flag FIEMAP_EXTENT_SHARED, so does btrfs. [1]: http://thr3ads.net/ocfs2-devel/2010/09/489052-PATCH-3-3-shared-du-using-fiemap-to-figure-up-the-shared-extents-per-file-and-the-footprint-in Reviewed-by: David Sterba <[email protected]> Signed-off-by: Liu Bo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: remove path arg from btrfs_truncate_free_space_cacheFilipe David Borba Manana5-15/+3
Not used for anything, and removing it avoids caller's need to allocate a path structure. Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: remove duplicated ino cache's inode lookupFilipe David Borba Manana3-9/+5
We're doing a unnecessary extra lookup of the ino cache's inode when we already have it (and holding a reference) during the process of saving the ino cache contents to disk. Therefore remove this extra lookup. Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: do a full search everytime in btrfs_search_old_slotJosef Bacik1-2/+6
While running some snashot aware defrag tests I noticed I was panicing every once and a while in key_search. This is because of the optimization that says if we find a key at slot 0 it will be at slot 0 all the way down the rest of the tree. This isn't the case for btrfs_search_old_slot since it will likely replay changes to a buffer if something has changed since we took our sequence number. So short circuit this optimization by setting prev_cmp to -1 every time we call key_search so we will do our normal binary search. With this patch I am no longer seeing the panics I was seeing before. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: add a sanity test for btrfs_split_itemJosef Bacik7-7/+283
While looking at somebodys corruption I became completely convinced that btrfs_split_item was broken, so I wrote this test to verify that it was working as it was supposed to. Thankfully it appears to be working as intended, so just add this test to make sure nobody breaks it in the future. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11btrfs: drop unused parameter from btrfs_item_nrRoss Kirk8-32/+31
Remove unused eb parameter from btrfs_item_nr Signed-off-by: Ross Kirk <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: don't store NULL byte in symlink extentsFilipe David Borba Manana1-2/+2
It is not necessary to store the NULL byte in a symlink inline file extent. There's currently no code that requires the NULL byte to be present in the extent. This change also doesn't break file format compatibility nor the send/receive feature. The VFS also doesn't need the NULL byte to be present in the extent, as it reads up to inode->i_size bytes (which already excluded the NULL byte) and sets the NULL byte for us (in fs/namei.c:page_getlink()). So with this change we save 1 byte per symlink file extent (which is always inlined in the btree leaf) without losing backward and forward compatibility. Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-11-11Btrfs: eliminate the exceptional root_tree refs=0Stefan Behrens3-15/+10
The fact that btrfs_root_refs() returned 0 for the tree_root caused bugs in the past, therefore it is set to 1 with this patch and (hopefully) all affected code is adapted to this change. I verified this change by temporarily adding WARN_ON() checks everywhere where btrfs_root_refs() is used, checking whether the logic of the code is changed by btrfs_root_refs() returning 1 instead of 0 for root->root_key.objectid == BTRFS_ROOT_TREE_OBJECTID. With these added checks, I ran the xfstests './check -g auto'. The two roots chunk_root and log_root_tree that are only referenced by the superblock and the log_roots below the log_root_tree still have btrfs_root_refs() == 0, only the tree_root is changed. Signed-off-by: Stefan Behrens <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-10-31vfs: decrapify dput(), fix cache behavior under normal loadLinus Torvalds1-2/+3
We do not want to dirty the dentry->d_flags cacheline in dput() just to set the DCACHE_REFERENCED flag when it is already set in the common case anyway. This way the first cacheline of the dentry (which contains the RCU lookup information etc) can stay shared among multiple CPU's. This finishes off some of the details of all the scalability patches merged during the merge window. Also don't mark dentry_kill() for inlining, since it's the uncommon path and inlining it just makes the common path slower due to extra function entry/exit overhead. Signed-off-by: Linus Torvalds <[email protected]>
2013-10-30Revert "select: use freezable blocking call"Rafael J. Wysocki1-2/+1
This reverts commit 9745cdb36da8 (select: use freezable blocking call) that triggers problems during resume from suspend to RAM on Paul Bolle's 32-bit x86 machines. Paul says: Ever since I tried running (release candidates of) v3.11 on the two working i686s I still have lying around I ran into issues on resuming from suspend. Reverting 9745cdb36da8 (select: use freezable blocking call) resolves those issues. Resuming from suspend on i686 on (release candidates of) v3.11 and later triggers issues like: traps: systemd[1] general protection ip:b738e490 sp:bf882fc0 error:0 in libc-2.16.so[b731c000+1b0000] and traps: rtkit-daemon[552] general protection ip:804d6e5 sp:b6cb32f0 error:0 in rtkit-daemon[8048000+d000] Once I hit the systemd error I can only get out of the mess that the system is at that point by power cycling it. Since we are reverting another freezer-related change causing similar problems to happen, this one should be reverted as well. References: https://lkml.org/lkml/2013/10/29/583 Reported-by: Paul Bolle <[email protected]> Fixes: 9745cdb36da8 (select: use freezable blocking call) Signed-off-by: Rafael J. Wysocki <[email protected]> Cc: 3.11+ <[email protected]> # 3.11+
2013-10-30Revert "epoll: use freezable blocking call"Rafael J. Wysocki1-3/+1
This reverts commit 1c441e921201 (epoll: use freezable blocking call) which is reported to cause user space memory corruption to happen after suspend to RAM. Since it appears to be extremely difficult to root cause this problem, it is best to revert the offending commit and try to address the original issue in a better way later. References: https://bugzilla.kernel.org/show_bug.cgi?id=61781 Reported-by: Natrio <[email protected]> Reported-by: Jeff Pohlmeyer <[email protected]> Bisected-by: Leo Wolf <[email protected]> Fixes: 1c441e921201 (epoll: use freezable blocking call) Signed-off-by: Rafael J. Wysocki <[email protected]> Cc: 3.11+ <[email protected]> # 3.11+
2013-10-25Merge branch 'for-linus' of ↵Linus Torvalds2-2/+4
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes (try two) from Al Viro: "nfsd performance regression fix + seq_file lseek(2) fix" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: seq_file: always update file->f_pos in seq_lseek() nfsd regression since delayed fput()
2013-10-25seq_file: always update file->f_pos in seq_lseek()Gu Zheng1-0/+2
This issue was first pointed out by Jiaxing Wang several months ago, but no further comments: https://lkml.org/lkml/2013/6/29/41 As we know pread() does not change f_pos, so after pread(), file->f_pos and m->read_pos become different. And seq_lseek() does not update file->f_pos if offset equals to m->read_pos, so after pread() and seq_lseek()(lseek to m->read_pos), then a subsequent read may read from a wrong position, the following program produces the problem: char str1[32] = { 0 }; char str2[32] = { 0 }; int poffset = 10; int count = 20; /*open any seq file*/ int fd = open("/proc/modules", O_RDONLY); pread(fd, str1, count, poffset); printf("pread:%s\n", str1); /*seek to where m->read_pos is*/ lseek(fd, poffset+count, SEEK_SET); /*supposed to read from poffset+count, but this read from position 0*/ read(fd, str2, count); printf("read:%s\n", str2); out put: pread: ck_netbios_ns 12665 read: nf_conntrack_netbios /proc/modules: nf_conntrack_netbios_ns 12665 0 - Live 0xffffffffa038b000 nf_conntrack_broadcast 12589 1 nf_conntrack_netbios_ns, Live 0xffffffffa0386000 So we always update file->f_pos to offset in seq_lseek() to fix this issue. Signed-off-by: Jiaxing Wang <[email protected]> Signed-off-by: Gu Zheng <[email protected]> Signed-off-by: Al Viro <[email protected]>
2013-10-25Merge tag 'ecryptfs-3.12-rc7-fixes' of ↵Linus Torvalds2-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs Pull ecryptfs fixes from Tyler Hicks: "Two important fixes - Fix long standing memory leak in the (rarely used) public key support - Fix large file corruption on 32 bit architectures" * tag 'ecryptfs-3.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs: eCryptfs: fix 32 bit corruption issue ecryptfs: Fix memory leakage in keystore.c
2013-10-24eCryptfs: fix 32 bit corruption issueColin Ian King1-1/+1
Shifting page->index on 32 bit systems was overflowing, causing data corruption of > 4GB files. Fix this by casting it first. https://launchpad.net/bugs/1243636 Signed-off-by: Colin Ian King <[email protected]> Reported-by: Lars Duesing <[email protected]> Cc: [email protected] # v3.11+ Signed-off-by: Tyler Hicks <[email protected]>
2013-10-22vfs: fix new kernel-doc warningsRandy Dunlap1-8/+7
Move kernel-doc notation to immediately before its function to eliminate kernel-doc warnings introduced by commit db14fc3abcd5 ("vfs: add d_walk()") Warning(fs/dcache.c:1343): No description found for parameter 'data' Warning(fs/dcache.c:1343): No description found for parameter 'dentry' Warning(fs/dcache.c:1343): Excess function parameter 'parent' description in 'check_mount' Signed-off-by: Randy Dunlap <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-10-22fs/namei.c: fix new kernel-doc warningRandy Dunlap1-1/+2
Add @path parameter to fix kernel-doc warning. Also fix a spello/typo. Warning(fs/namei.c:2304): No description found for parameter 'path' Signed-off-by: Randy Dunlap <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-10-22Merge tag 'jfs-3.12' of git://github.com/kleikamp/linux-shaggyLinus Torvalds1-2/+1
Pull jfs bugfix from David Kleikamp: "Just a patch to fix an oops in an error path" * tag 'jfs-3.12' of git://github.com/kleikamp/linux-shaggy: jfs: fix error path in ialloc
2013-10-20nfsd regression since delayed fput()Al Viro1-2/+2
Background: nfsd v[23] had throughput regression since delayed fput went in; every read or write ends up doing fput() and we get a pair of extra context switches out of that (plus quite a bit of work in queue_work itselfi, apparently). Use of schedule_delayed_work() gives it a chance to accumulate a bit before we do __fput() on all of them. I'm not too happy about that solution, but... on at least one real-world setup it reverts about 10% throughput loss we got from switch to delayed fput. Signed-off-by: Al Viro <[email protected]>
2013-10-18Merge branch 'for-linus' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "Sage hit a deadlock with ceph on btrfs, and Josef tracked it down to a regression in our initial rc1 pull. When doing nocow writes we were sometimes starting a transaction with locks held" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: release path before starting transaction in can_nocow_extent
2013-10-18Btrfs: release path before starting transaction in can_nocow_extentJosef Bacik1-0/+1
We can't be holding tree locks while we try to start a transaction, we will deadlock. Thanks, Reported-by: Sage Weil <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-10-17Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds8-21/+93
Pull CIFS fixes from Steve French: "Five small cifs fixes (includes fixes for: unmount hang, 2 security related, symlink, large file writes)" * 'for-linus' of git://git.samba.org/sfrench/cifs-2.6: cifs: ntstatus_to_dos_map[] is not terminated cifs: Allow LANMAN auth method for servers supporting unencapsulated authentication methods cifs: Fix inability to write files >2GB to SMB2/3 shares cifs: Avoid umount hangs with smb2 when server is unresponsive do not treat non-symlink reparse points as valid symlinks
2013-10-16Merge branch 'akpm' (fixes from Andrew Morton)Linus Torvalds3-6/+22
Merge misc fixes from Andrew Morton. * emailed patches from Andrew Morton <[email protected]>: (21 commits) mm: revert mremap pud_free anti-fix mm: fix BUG in __split_huge_page_pmd swap: fix set_blocksize race during swapon/swapoff procfs: call default get_unmapped_area on MMU-present architectures procfs: fix unintended truncation of returned mapped address writeback: fix negative bdi max pause percpu_refcount: export symbols fs: buffer: move allocation failure loop into the allocator mm: memcg: handle non-error OOM situations more gracefully tools/testing/selftests: fix uninitialized variable block/partitions/efi.c: treat size mismatch as a warning, not an error mm: hugetlb: initialize PG_reserved for tail pages of gigantic compound pages mm/zswap: bugfix: memory leak when re-swapon mm: /proc/pid/pagemap: inspect _PAGE_SOFT_DIRTY only on present pages mm: migration: do not lose soft dirty bit if page is in migration state gcov: MAINTAINERS: Add an entry for gcov mm/hugetlb.c: correct missing private flag clearing mm/vmscan.c: don't forget to free shrinker->nr_deferred ipc/sem.c: synchronize semop and semctl with IPC_RMID ipc: update locking scheme comments ...