aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2015-02-02Btrfs: add missing cleanup on sysfs init failureFilipe Manana1-1/+9
If we failed during initialization of sysfs, we weren't unregistering the top level btrfs sysfs entry nor the debugfs stuff. Not unregistering the top level sysfs entry makes future attempts to reload the btrfs module impossible and the following is reported in dmesg: [ 2246.451296] WARNING: CPU: 3 PID: 10999 at fs/sysfs/dir.c:486 sysfs_warn_dup+0x91/0xb0() [ 2246.451298] sysfs: cannot create duplicate filename '/fs/btrfs' [ 2246.451298] Modules linked in: btrfs(+) raid6_pq xor bnep rfcomm bluetooth binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc parport_pc parport psmouse serio_raw pcspkr evbug i2c_piix4 e1000 floppy [last unloaded: btrfs] [ 2246.451310] CPU: 3 PID: 10999 Comm: modprobe Tainted: G W 3.13.0-fdm-btrfs-next-24+ #7 [ 2246.451311] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 2246.451312] 0000000000000009 ffff8800d353fa08 ffffffff816f1da6 0000000000000410 [ 2246.451314] ffff8800d353fa58 ffff8800d353fa48 ffffffff8104a32c ffff88020821a290 [ 2246.451316] ffff88020821a290 ffff88020821a290 ffff8802148f0000 ffff8800d353fb80 [ 2246.451318] Call Trace: [ 2246.451322] [<ffffffff816f1da6>] dump_stack+0x4e/0x68 [ 2246.451324] [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0 [ 2246.451325] [<ffffffff8104a416>] warn_slowpath_fmt+0x46/0x50 [ 2246.451328] [<ffffffff81367dc5>] ? strlcat+0x65/0x90 (....) This fixes the following change: btrfs: add simple debugfs interface commit 1bae30982bc86ab66d61ccb6e22792593b45d44d Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-02-02Btrfs: fix race between transaction commit and empty block group removalFilipe Manana3-1/+22
Committing a transaction can race with automatic removal of empty block groups (cleaner kthread), leading to a BUG_ON() in the transaction commit code while running btrfs_finish_extent_commit(). The following sequence diagram shows how it can happen: CPU 1 CPU 2 btrfs_commit_transaction() fs_info->running_transaction = NULL btrfs_finish_extent_commit() find_first_extent_bit() -> found range for block group X in fs_info->freed_extents[] btrfs_delete_unused_bgs() -> found block group X Removed block group X's range from fs_info->freed_extents[] btrfs_remove_chunk() btrfs_remove_block_group(bg X) unpin_extent_range(bg X range) btrfs_lookup_block_group(bg X) -> returns NULL -> BUG_ON() The trace that results from the BUG_ON() is: [48665.187808] ------------[ cut here ]------------ [48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675! [48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC [48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode [48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G W 3.19.0-rc5-btrfs-next-4+ #1 [48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] [48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000 [48665.197388] RIP: 0010:[<ffffffffa0350d05>] [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs] [48665.197388] RSP: 0018:ffff8801b56a7b88 EFLAGS: 00010246 [48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8 [48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0 [48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 0000000000000000 [48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000 [48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000 [48665.197388] FS: 0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000 [48665.197388] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0 [48665.197388] Stack: [48665.197388] ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff [48665.197388] ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0 [48665.197388] ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227 [48665.197388] Call Trace: [48665.197388] [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs] [48665.197388] [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs] [48665.197388] [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs] [48665.197388] [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33 [48665.197388] [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs] [48665.197388] [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab [48665.197388] [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab [48665.197388] [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf [48665.197388] [<ffffffff8105a55b>] worker_thread+0x210/0x2d0 [48665.197388] [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3 [48665.197388] [<ffffffff8105e5c0>] kthread+0xef/0xf7 [48665.197388] [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39 [48665.197388] [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad [48665.197388] [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0 [48665.197388] [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad [48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b [48665.197388] RIP [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs] [48665.197388] RSP <ffff8801b56a7b88> [48665.272246] ---[ end trace b9c6ab9957521376 ]--- Fix this by ensuring that unpining the block group's range in btrfs_finish_extent_commit() is done in a synchronized fashion with removing the block group's range from freed_extents[] in btrfs_delete_unused_bgs() This race got introduced with the change: Btrfs: remove empty block groups automatically commit 47ab2a6c689913db23ccae38349714edf8365e0a Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-02-02btrfs: add more checks to btrfs_read_sys_arrayDavid Sterba1-3/+23
Verify that the sys_array has enough bytes to read the next item. Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-02-02btrfs: cleanup, rename a few variables in btrfs_read_sys_arrayDavid Sterba1-15/+16
There's a pointer to buffer, integer offset and offset passed as pointer, try to find matching names for them. Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-02-02btrfs: add checks for sys_chunk_array sizesDavid Sterba1-0/+19
Verify that possible minimum and maximum size is set, validity of contents is checked in btrfs_read_sys_array. Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-02-02btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesizeDavid Sterba1-0/+19
I received a few crafted images from Jiri, all got through the recently added superblock checks. The lower bounds checks for num_devices and sector/node -sizes were missing and caused a crash during mount. Tools for symbolic code execution were used to prepare the images contents. Reported-by: Jiri Slaby <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-02-02Btrfs: Add code to support file creation timechandan r3-2/+36
This patch adds a new member to the 'struct btrfs_inode' structure to hold the file creation time. Signed-off-by: chandan <[email protected]> [refreshed, removed btrfs_inode_otime] Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-02-02btrfs: kill btrfs_inode_*time helpersDavid Sterba5-69/+33
They just opencode taking address of the timespec member. Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-22Btrfs: insert_new_root: Fix lock type of the extent buffer.chandan1-1/+1
btrfs_alloc_tree_block() returns an extent buffer on which a blocked lock has been taken. Hence assign the appropriate value to path->locks[level]. Signed-off-by: Chandan Rajendra <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: fix unused members in struct btrfs_rootAnand Jain2-4/+0
There isn't any real use of following members of struct btrfs_root so delete them. struct kobject root_kobj; struct completion kobj_unregister; Signed-off-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21btrfs: qgroup: move WARN_ON() to the correct location.Yang Dongsheng1-2/+1
In function qgroup_excl_accounting(), we need to WARN when qg->excl is less than what we want to free, same to child and parents. But currently, for parent qgroup, the WARN_ON() is located after freeing qg->excl. It will WARN out even we free it normally. This patch move this WARN_ON() before freeing qg->excl. Signed-off-by: Dongsheng Yang <[email protected]> Reviewed-by: Satoru Takeuchi <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: cleanup unused run_mostLiu Bo1-4/+1
"run_most" is not used anymore. Signed-off-by: Liu Bo <[email protected]> Reviewed-by: Satoru Takeuchi <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Rename all ref_count to refs in structZhao Lei1-13/+13
refs is better than ref_count to record a struct's ref count. Signed-off-by: Zhao Lei <[email protected]> Suggested-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: Introduce BTRFS_BLOCK_GROUP_RAID56_MASK to check raid56 simplyZhao Lei4-28/+19
So we can check raid56 with: (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) instead of long: (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6)) Signed-off-by: Zhao Lei <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: Include map_type in raid_bioZhao Lei4-18/+22
Corrent code use many kinds of "clever" way to determine operation target's raid type, as: raid_map != NULL or raid_map[MAX_NR] == RAID[56]_Q_STRIPE To make code easy to maintenance, this patch put raid type into bbio, and we can always get raid type from bbio with a "stupid" way. Signed-off-by: Zhao Lei <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: Simplify scrub_setup_recheck_block()'s argumentZhao Lei1-16/+9
scrub_setup_recheck_block() have many arguments but most of them can be get from one of them, we can remove them to make code clean. Some other cleanup for that function also included in this patch. Signed-off-by: Zhao Lei <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: Combine per-page recover in dev-replace and scrubZhao Lei1-72/+48
The code are similar, combine them to make code clean and easy to maintenance. Some lost condition are also completed with benefit of this combination. Signed-off-by: Zhao Lei <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: Separate finding-right-mirror and writing-to-target's process in ↵Zhao Lei1-27/+17
scrub_handle_errored_block() In corrent code, code of finding-right-mirror and writing-to-target are mixed in logic, if we find a right mirror but failed in writing to target, it will treat as "hadn't found right block", and fill the target with sblock_bad. Actually, "failed in writing to target" does not mean "source block is wrong", this patch separate above two condition in logic, and do some cleanup to make code clean. Signed-off-by: Zhao Lei <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: Break loop when reach BTRFS_MAX_MIRRORS in scrub_setup_recheck_block()Zhao Lei1-1/+1
Use break instead of useless loop should be more suitable in this case. Signed-off-by: Zhao Lei <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: btrfs_rm_dev_replace_blocked(): Use wait_event()Zhao Lei1-11/+2
Signed-off-by: Zhao Lei <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: Cleanup btrfs_bio_counter_inc_blocked()Zhao Lei1-6/+6
1: Remove no-need DEFINE_WAIT(wait) 2: Add likely() for BTRFS_FS_STATE_DEV_REPLACING condition 3: Use while loop instead of goto Signed-off-by: Zhao Lei <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: Remove noneed force_write in scrub_write_block_to_dev_replaceZhao Lei1-12/+7
It is always 1 in this place, because !1 case was already jumped out in previous code. Signed-off-by: Zhao Lei <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: Fix a jump typo of nodatasum_case to avoid wrong WARN_ON()Zhao Lei1-1/+2
if (sctx->is_dev_replace && !is_metadata && !have_csum) { ... goto nodatasum_case; } ... nodatasum_case: WARN_ON(sctx->is_dev_replace); In above code, nodatasum_case marker should be moved after WARN_ON(). Signed-off-by: Zhao Lei <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: add ref_count and free function for btrfs_bioZhao Lei7-54/+57
1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag to keep btrfs_bio's memory in raid56 recovery implement. 2: free function for bbio will make code clean and flexible, plus forced data type checking in compile. Changelog v1->v2: Rename following by David Sterba's suggestion: put_btrfs_bio() -> btrfs_put_bio() get_btrfs_bio() -> btrfs_get_bio() bbio->ref_count -> bbio->refs Signed-off-by: Zhao Lei <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: Make raid_map array be inlined in btrfs_bio structureZhao Lei5-125/+105
It can make code more simple and clear, we need not care about free bbio and raid_map together. Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Zhao Lei <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: sort raid_map before adding tgtdev stripesZhao Lei1-14/+8
It can avoid complex calculation of real stripes in sort, moreover, we can clean up code of sorting tgtdev_map because it will be in order initially. Signed-off-by: Zhao Lei <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: fix a out-of-bound access of raid_mapZhao Lei1-2/+5
We add the number of stripes on target devices into bbio->num_stripes if we are under device replacement, and we just sort the raid_map of those stripes that not on the target devices, so if when we need real raid_map, we need skip the stripes on the target devices. Signed-off-by: Zhao Lei <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefsFilipe Manana2-5/+43
If we have an inode with a large number of hard links, some of which may be extrefs, turn a regular ref into an extref, fsync the inode and then replay the fsync log (after a crash/reboot), we can endup with an fsync log that makes the replay code always fail with -EOVERFLOW when processing the inode's references. This is easy to reproduce with the test case I made for xfstests. Its steps are the following: _scratch_mkfs "-O extref" >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create a test file with 3001 hard links. This number is large enough to # make btrfs start using extrefs at some point even if the fs has the maximum # possible leaf/node size (64Kb). echo "hello world" > $SCRATCH_MNT/foo for i in `seq 1 3000`; do ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i` done # Make sure all metadata and data are durably persisted. sync # Now remove one link, add a new one with a new name, add another new one with # the same name as the one we just removed and fsync the inode. rm -f $SCRATCH_MNT/foo_link_0001 ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001 ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_0001 rm -f $SCRATCH_MNT/foo_link_0002 ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3002 ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3003 $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo # Simulate a crash/power loss. This makes sure the next mount # will see an fsync log and will replay that log. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Check that the number of hard links is correct, we are able to remove all # the hard links and read the file's data. This is just to verify we don't # get stale file handle errors (due to dangling directory index entries that # point to inodes that no longer exist). echo "Link count: $(stat --format=%h $SCRATCH_MNT/foo)" [ -f $SCRATCH_MNT/foo ] || echo "Link foo is missing" for ((i = 1; i <= 3003; i++)); do name=foo_link_`printf "%04d" $i` if [ $i -eq 2 ]; then [ -f $SCRATCH_MNT/$name ] && echo "Link $name found" else [ -f $SCRATCH_MNT/$name ] || echo "Link $name is missing" fi done rm -f $SCRATCH_MNT/foo_link_* cat $SCRATCH_MNT/foo rm -f $SCRATCH_MNT/foo status=0 exit The fix is simply to correct the overflow condition when overwriting a reference item because it was wrong, trying to increase the item in the fs/subvol tree by an impossible amount. Also ensure that we don't insert one normal ref and one ext ref for the same dentry - this happened because processing a dir index entry from the parent in the log happened when the normal ref item was full, which made the logic insert an extref and later when the normal ref had enough room, it would be inserted again when processing the ref item from the child inode in the log. This issue has been present since the introduction of the extrefs feature (2012). A test case for xfstests follows soon. This test only passes if the previous patch titled "Btrfs: fix fsync when extend references are added to an inode" is applied too. Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: fix fsync when extend references are added to an inodeFilipe Manana1-11/+16
If we added an extended reference to an inode and fsync'ed it, the log replay code would make our inode have an incorrect link count, which was lower then the expected/correct count. This resulted in stale directory index entries after deleting some of the hard links, and any access to the dangling directory entries resulted in -ESTALE errors because the entries pointed to inode items that don't exist anymore. This is easy to reproduce with the test case I made for xfstests, and the bulk of that test is: _scratch_mkfs "-O extref" >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create a test file with 3001 hard links. This number is large enough to # make btrfs start using extrefs at some point even if the fs has the maximum # possible leaf/node size (64Kb). echo "hello world" > $SCRATCH_MNT/foo for i in `seq 1 3000`; do ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i` done # Make sure all metadata and data are durably persisted. sync # Add one more link to the inode that ends up being a btrfs extref and fsync # the inode. ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001 $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo # Simulate a crash/power loss. This makes sure the next mount # will see an fsync log and will replay that log. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Now after the fsync log replay btrfs left our inode with a wrong link count N, # which was smaller than the correct link count M (N < M). # So after removing N hard links, the remaining M - N directory entries were # still visible to user space but it was impossible to do anything with them # because they pointed to an inode that didn't exist anymore. This resulted in # stale file handle errors (-ESTALE) when accessing those dentries for example. # # So remove all hard links except the first one and then attempt to read the # file, to verify we don't get an -ESTALE error when accessing the inodel # # The btrfs fsck tool also detected the incorrect inode link count and it # reported an error message like the following: # # root 5 inode 257 errors 2001, no inode item, link count wrong # unresolved ref dir 256 index 2978 namelen 13 name foo_link_2976 filetype 1 errors 4, no inode ref # # The fstests framework automatically calls fsck after a test is run, so we # don't need to call fsck explicitly here. rm -f $SCRATCH_MNT/foo_link_* cat $SCRATCH_MNT/foo status=0 exit So make sure an fsync always flushes the delayed inode item, so that the fsync log contains it (needed in order to trigger the link count fixup code) and fix the extref counting function, which always return -ENOENT to its caller (and made it assume there were always 0 extrefs). This issue has been present since the introduction of the extrefs feature (2012). A test case for xfstests follows soon. Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: fix directory inconsistency after fsync log replayFilipe Manana1-2/+18
If we have an inode (file) with a link count greater than 1, remove one of its hard links, fsync the inode, power fail/crash and then replay the fsync log on the next mount, we end up getting the parent directory's metadata inconsistent - its i_size still reflects the deleted hard link and has dangling index entries (with no matching inode reference entries). This prevents the directory from ever being deletable, as its i_size can never decrease to BTRFS_EMPTY_DIR_SIZE even if all of its children inodes are deleted, and the dangling index entries can never be removed (as they point to an inode that does not exist anymore). This is easy to reproduce with the following excerpt from the test case for xfstests that I just made: _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create a test file with 2 hard links in the same directory. mkdir -p $SCRATCH_MNT/a/b echo "hello world" > $SCRATCH_MNT/a/b/foo ln $SCRATCH_MNT/a/b/foo $SCRATCH_MNT/a/b/bar # Make sure all metadata and data are durably persisted. sync # Now remove one of the hard links and fsync the inode. rm -f $SCRATCH_MNT/a/b/bar $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/b/foo # Simulate a crash/power loss. This makes sure the next mount # will see an fsync log and will replay that log. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Remove the last hard link of the file and attempt to remove its parent # directory - this failed in btrfs because the fsync log and replay code # didn't decrement the parent directory's i_size and left dangling directory # index entries - this made the btrfs rmdir implementation always fail with # the error -ENOTEMPTY. # # The dangling directory index entries were visible to user space, but it was # impossible to do anything on them (unlink, open, read, write, stat, etc) # because the inode they pointed to did not exist anymore. # # The parent directory's metadata inconsistency (stale index entries) was # also detected by btrfs' fsck tool, which is run automatically by the fstests # framework when the test finishes. The error message reported by fsck was: # # root 5 inode 259 errors 2001, no inode item, link count wrong # unresolved ref dir 258 index 3 namelen 3 name bar filetype 1 errors 4, no inode ref # rm -f $SCRATCH_MNT/a/b/* rmdir $SCRATCH_MNT/a/b rmdir $SCRATCH_MNT/a To fix this just make sure that after an unlink, if the inode is fsync'ed, he parent inode is fully logged in the fsync log. A test case for xfstests follows soon. Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: lookup for block group only if needed when freeing a tree blockFilipe Manana1-4/+6
Very often our extent buffer's header generation doesn't match the current transaction's id or it is also referenced by other trees (snapshots), so we don't need the corresponding block group cache object. Therefore only search for it if we are going to use it, so we avoid an unnecessary search in the block groups rbtree (and acquiring and releasing its spinlock). Freeing a tree block is performed when COWing or deleting a node/leaf, which implies we are holding the node/leaf's parent node lock, therefore reducing the amount of time spent when freeing a tree block helps reducing the amount of time we are holding the parent node's lock. For example, for a run of xfstests/generic/083, the block group cache object was needed only 682 times for a total of 226691 calls to free a tree block. Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21btrfs: remove a no-op unfreeze superbock callbackDavid Sterba1-6/+0
Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21btrfs: switch extent_state state to unsignedDavid Sterba4-55/+55
Currently there's a 4B hole in the structure between refs and state and there are only 16 bits used so we can make it unsigned. This will get a better packing and may save some stack space for local variables. The size of extent_state gets reduced by 8B and there are usually a lot of slab objects. struct extent_state { u64 start; /* 0 8 */ u64 end; /* 8 8 */ struct rb_node rb_node; /* 16 24 */ wait_queue_head_t wq; /* 40 24 */ /* --- cacheline 1 boundary (64 bytes) --- */ atomic_t refs; /* 64 4 */ /* XXX 4 bytes hole, try to pack */ long unsigned int state; /* 72 8 */ u64 private; /* 80 8 */ /* size: 88, cachelines: 2, members: 7 */ /* sum members: 84, holes: 1, sum holes: 4 */ /* last cacheline: 24 bytes */ }; Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21btrfs: set proper message level for skinny metadataDavid Sterba1-1/+1
This has been confusing people for too long, the message is really just informative. CC: <[email protected]> # 3.10+ Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21btrfs: update message levels after checksum errorsDavid Sterba2-2/+2
The errors are worth noting and might get missed with INFO level. Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21btrfs: update message levels during failed mountDavid Sterba1-8/+8
All error conditions from open_ctree shall be ERR. Warning would suggest that something's wrong and we can continue. Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21btrfs: update message levels for errorsDavid Sterba2-5/+6
Several messages that point to some internal problem, level INFO is wrong here. Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: fix setup_leaf_for_split() to avoid leaf corruptionFilipe Manana1-2/+4
We were incorrectly detecting when the target key didn't exist anymore after releasing the path and re-searching the tree. This could make us split or duplicate (btrfs_split_item() and btrfs_duplicate_item() are its only callers at the moment) an item when we should not. For the case of duplicating an item, we currently only duplicate checksum items (csum tree) and file extent items (fs/subvol trees). For the checksum items we end up overriding the item completely, but for file extent items we update only some of their fields in the copy (done in __btrfs_drop_extents), which means we can end up having a logical corruption for some values. Also for the case where we duplicate a file extent item it will make us produce a leaf with a wrong key order, as btrfs_duplicate_item() advances us to the next slot and then its caller sets a smaller key on the new item at that slot (like in __btrfs_drop_extents() e.g.). Alternatively if the tree search in setup_leaf_for_split() leaves with path->slots[0] == btrfs_header_nritems(path->nodes[0]), we end up accessing beyond the leaf's end (when we check if the item's size has changed) and make our caller insert an item at the invalid slot btrfs_header_nritems(path->nodes[0]) + 1, causing an invalid memory access if the leaf is full or nearly full. This issue has been present since the introduction of this function in 2009: Btrfs: Add btrfs_duplicate_item commit ad48fd754676bfae4139be1a897b1ea58f9aaf21 Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Merge branch 'cleanup/blocksize-diet-part2' of ↵Chris Mason13-73/+84
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
2015-01-21Merge branch 'fix/find-item-path-leak' of ↵Chris Mason7-49/+46
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
2015-01-21Btrfs: track dirty block groups on their own listJosef Bacik5-124/+72
Currently any time we try to update the block groups on disk we will walk _all_ block groups and check for the ->dirty flag to see if it is set. This function can get called several times during a commit. So if you have several terabytes of data you will be a very sad panda as we will loop through _all_ of the block groups several times, which makes the commit take a while which slows down the rest of the file system operations. This patch introduces a dirty list for the block groups that we get added to when we dirty the block group for the first time. Then we simply update any block groups that have been dirtied since the last time we called btrfs_write_dirty_block_groups. This allows us to clean up how we write the free space cache out so it is much cleaner. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-21Btrfs: change how we track dirty rootsJosef Bacik3-5/+21
I've been overloading root->dirty_list to keep track of dirty roots and which roots need to have their commit roots switched at transaction commit time. This could cause us to lose an update to the root which could corrupt the file system. To fix this use a state bit to know if the root is dirty, and if it isn't set we go ahead and move the root to the dirty list. This way if we re-dirty the root after adding it to the switch_commit list we make sure to update it. This also makes it so that the extent root is always the last root on the dirty list to try and keep the amount of churn down at this point in the commit. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-01-18Linux 3.19-rc5Linus Torvalds1-1/+1
2015-01-18Merge tag 'armsoc-for-linus' of ↵Linus Torvalds32-80/+272
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull ARM SoC fixes from Olof Johansson: "We've been sitting on our fixes branch for a while, so this batch is unfortunately on the large side. A lot of these are tweaks and fixes to device trees, fixing various bugs around clocks, reg ranges, etc. There's also a few defconfig updates (which are on the late side, no more of those). All in all the diffstat is bigger than ideal at this time, but nothing in here seems particularly risky" * tag 'armsoc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (31 commits) reset: sunxi: fix spinlock initialization ARM: dts: disable CCI on exynos5420 based arndale-octa drivers: bus: check cci device tree node status ARM: rockchip: disable jtag/sdmmc autoswitching on rk3288 ARM: nomadik: fix up leftover device tree pins ARM: at91: board-dt-sama5: add phy_fixup to override NAND_Tree ARM: at91/dt: sam9263: Add missing clocks to lcdc node ARM: at91: sama5d3: dt: correct the sound route ARM: at91/dt: sama5d4: fix the timer reg length ARM: exynos_defconfig: Enable LM90 driver ARM: exynos_defconfig: Enable options for display panel support arm: dts: Use pmu_system_controller phandle for dp phy ARM: shmobile: sh73a0 legacy: Set .control_parent for all irqpin instances ARM: dts: berlin: correct BG2Q's SM GPIO location. ARM: dts: berlin: add broken-cd and set bus width for eMMC in Marvell DMP DT ARM: dts: berlin: fix io clk and add missing core clk for BG2Q sdhci2 host ARM: dts: Revert disabling of smc91x for n900 ARM: dts: imx51-babbage: Fix ULPI PHY reset modelling ARM: dts: dra7-evm: fix qspi device tree partition size ARM: omap2plus_defconfig: use CONFIG_CPUFREQ_DT ...
2015-01-18Merge tag 'clk-fixes-for-linus' of ↵Linus Torvalds7-28/+69
git://git.linaro.org/people/mike.turquette/linux Pull clock driver fixes from Mike Turquette: "Small number of fixes for clock drivers and a single null pointer dereference fix in the framework core code. The driver fixes vary from fixing section mismatch warnings to preventing machines from hanging (and preventing developers from crying)" * tag 'clk-fixes-for-linus' of git://git.linaro.org/people/mike.turquette/linux: clk: fix possible null pointer dereference Revert "clk: ppc-corenet: Fix Section mismatch warning" clk: rockchip: fix deadlock possibility in cpuclk clk: berlin: bg2q: remove non-exist "smemc" gate clock clk: at91: keep slow clk enabled to prevent system hang clk: rockchip: fix rk3288 cpuclk core dividers clk: rockchip: fix rk3066 pll lock bit location clk: rockchip: Fix clock gate for rk3188 hclk_emem_peri clk: rockchip: add CLK_IGNORE_UNUSED flag to fix rk3066/rk3188 USB Host
2015-01-18Merge tag 'scsi-fixes' of ↵Linus Torvalds2-4/+2
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "This is one fix for a Multiqueue sleeping in invalid context problem and a MAINTAINER file update for Qlogic" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: ->queue_rq can't sleep MAINTAINERS: Update maintainer list for qla4xxx
2015-01-17clk: fix possible null pointer dereferenceStanimir Varbanov1-1/+1
The commit 646cafc6 (clk: Change clk_ops->determine_rate to return a clk_hw as the best parent) opens a possibility for null pointer dereference, fix this. Signed-off-by: Stanimir Varbanov <[email protected]> Reviewed-by: Stephen Boyd <[email protected]> Signed-off-by: Michael Turquette <[email protected]>
2015-01-17Revert "clk: ppc-corenet: Fix Section mismatch warning"Kevin Hao1-1/+1
This reverts commit da788acb28386aa896224e784954bb73c99ff26c. That commit tried to fix the section mismatch warning by moving the ppc_corenet_clk_driver struct to init section. This is definitely wrong because the kernel would free the memories occupied by this struct after boot while this driver is still registered in the driver core. The kernel would panic when accessing this driver struct. Cc: [email protected] # 3.17 Signed-off-by: Kevin Hao <[email protected]> Acked-by: Scott Wood <[email protected]> Signed-off-by: Michael Turquette <[email protected]>
2015-01-17clk: rockchip: fix deadlock possibility in cpuclkHeiko Stübner1-4/+6
Lockdep reported a possible deadlock between the cpuclk lock and for example the i2c driver. CPU0 CPU1 ---- ---- lock(clk_lock); local_irq_disable(); lock(&(&i2c->lock)->rlock); lock(clk_lock); <Interrupt> lock(&(&i2c->lock)->rlock); *** DEADLOCK *** The generic clock-types of the core ccf already use spin_lock_irqsave when touching clock registers, so do the same for the cpuclk. Signed-off-by: Heiko Stuebner <[email protected]> Reviewed-by: Doug Anderson <[email protected]> Signed-off-by: Michael Turquette <[email protected]> [[email protected]: removed initialization of "flags"]
2015-01-18Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dmaLinus Torvalds3-3/+6
Pull dmaengine fixes from Vinod Koul: "Two patches, the first by Andy to fix dw dmac runtime pm and second one by me to fix the dmaengine headers in MAINTAINERS" * 'fixes' of git://git.infradead.org/users/vkoul/slave-dma: dmaengine: dw: balance PM runtime calls MAINTAINERS: dmaengine: fix the header file for dmaengine