Age | Commit message | Author | Files | Lines
2014-01-28 | btrfs: fix defrag 32-bit integer overflow | Justin Maggard | 1 | -3/+3
When defragging a very large file, the cluster variable can wrap its 32-bit signed int type and become negative, which eventually gets passed to btrfs_force_ra() as a very large unsigned long value. On 32-bit platforms, this eventually results in an Oops from the SLAB allocator. Change the cluster and max_cluster signed int variables to unsigned long to match the readahead functions. This also allows the min() comparison in btrfs_defrag_file() to work as intended. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
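The wraparound described in this commit message can be modeled outside the kernel. The following is a minimal Python sketch, not the kernel code: it assumes a 32-bit platform where `unsigned long` is 32 bits wide, and the helper names `to_int32` and `to_uint32` as well as the sample value are hypothetical.

```python
def to_int32(value: int) -> int:
    """Wrap an integer the way a 32-bit signed int does (two's complement)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def to_uint32(value: int) -> int:
    """Reinterpret a value as a 32-bit unsigned long (assumed 32-bit platform)."""
    return value & 0xFFFFFFFF


# Illustrative cluster count that no longer fits in a signed 32-bit int.
cluster = to_int32(2**31 + 16)
max_cluster = 256  # hypothetical upper bound

print(cluster)                    # negative after wrapping
print(to_uint32(cluster))         # very large once treated as unsigned long
print(min(cluster, max_cluster))  # the signed comparison picks the negative value
```

With both variables held as unsigned values of the same width, the comparison and the later readahead call see the same number, which is the effect the commit describes.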
2014-01-28 | btrfs: sysfs: list the NO_HOLES feature | David Sterba | 1 | -0/+2
Signed-off-by: David Sterba <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | btrfs: sysfs: don't show reserved incompat feature | David Sterba | 1 | -2/+0
The COMPRESS_LZOv2 incompat feature is currently not implemented and the bit is only reserved, so there is no point in listing it in sysfs. Signed-off-by: David Sterba <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | btrfs: call permission checks earlier in ioctls and return EPERM | David Sterba | 1 | -13/+9
The owner and capability checks in IOC_SUBVOL_SETFLAGS and SET_RECEIVED_SUBVOL should be called before any other checks are done. Also unify the error code to EPERM. Signed-off-by: David Sterba <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | btrfs: restrict snapshotting to own subvolumes | David Sterba | 1 | -0/+6
Currently, any user can snapshot any subvolume if the path is accessible and thus indirectly create and keep files he does not own under his directories. This is not possible with traditional directories. In a security context, a user can snapshot the root filesystem and pin any potentially buggy binaries, even if the updates are applied. All the snapshots are visible to the administrator, so it's possible to verify if there are suspicious snapshots. Another more practical problem is that any user can pin the space used by e.g. root and cause ENOSPC. Original report: https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/484786 CC: [email protected] Signed-off-by: David Sterba <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: fix wrong block group in trace during the free space allocation | Miao Xie | 1 | -1/+2
We allocate the free space from the former block group, not the current one, so should use the former one to output the trace information. Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: cleanup the code of used_block_group in find_free_extent() | Miao Xie | 1 | -20/+13
used_block_group is only needed for a space cluster that does not belong to the current block group; the other places need not use it. Otherwise the logic of the code is unclear. Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: cleanup the redundant code for the block group allocation and init | Miao Xie | 1 | -50/+44
Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: change the members' order of btrfs_space_info structure to reduce the cache miss | Miao Xie | 1 | -14/+15
It is better for a lock to be positioned close to the data it protects: they may then be in the same cache line, so we load fewer cache lines when we access them. So we rearrange the members of the btrfs_space_info structure to place the lock closer to its data. Signed-off-by: Miao Xie <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
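The effect of member ordering on cache lines can be illustrated with `ctypes`, which exposes field offsets. This is only a sketch under stated assumptions: a 64-byte cache line, a structure that starts on a cache-line boundary, and two hypothetical layouts whose field names are not taken from btrfs_space_info.

```python
import ctypes

CACHE_LINE = 64  # assumed cache line size in bytes


class LockFarFromData(ctypes.Structure):
    # Hypothetical layout: unrelated counters sit between the lock and its data.
    _fields_ = [
        ("lock", ctypes.c_uint32),
        ("unrelated", ctypes.c_uint64 * 8),
        ("bytes_used", ctypes.c_uint64),
    ]


class LockNearData(ctypes.Structure):
    # Hypothetical layout: the lock is placed right before the data it protects.
    _fields_ = [
        ("unrelated", ctypes.c_uint64 * 8),
        ("lock", ctypes.c_uint32),
        ("bytes_used", ctypes.c_uint64),
    ]


def cache_line(struct, field):
    """Index of the cache line holding the start of a field."""
    return getattr(struct, field).offset // CACHE_LINE


for layout in (LockFarFromData, LockNearData):
    same = cache_line(layout, "lock") == cache_line(layout, "bytes_used")
    print(layout.__name__, "lock and data share a cache line:", same)
```

In the second layout, taking the lock already pulls in the line that holds the protected field, which is the saving the commit message argues for.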
2014-01-28 | Btrfs: fix wrong search path initialization before searching tree root | Wang Shilong | 1 | -1/+1
To search tree root without transaction protection, we should neither search commit root nor skip locking here, fix it. Signed-off-by: Wang Shilong <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: flush the dirty pages of the ordered extent aggressively during logging csum | Miao Xie | 1 | -1/+5
The performance of fsync sometimes dropped suddenly. The main reason was that we might flush only part of the dirty pages in an ordered extent, then get that ordered extent and wait for the csum calculation. But if no task flushed the remaining part, we would wait until the flusher flushed it, sometimes for several seconds, which made the performance drop suddenly. (On my box, it dropped from 56MB/s to 4-10MB/s.)

This patch improves on the above problem by flushing the remaining dirty pages aggressively.

Test Environment:
CPU: 2CPU * 2Cores
Memory: 4GB
Partition: 20GB(HDD)

Test Command:
 # sysbench --num-threads=8 --test=fileio --file-num=1 \
 > --file-total-size=8G --file-block-size=32768 \
 > --file-io-mode=sync --file-fsync-freq=100 \
 > --file-fsync-end=no --max-requests=10000 \
 > --file-test-mode=rndwr run

Signed-off-by: Miao Xie <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: fix transaction abortion when remounting btrfs from RW to RO | Wang Shilong | 1 | -2/+2
Steps to reproduce:
 # mkfs.btrfs -f /dev/sda8
 # mount /dev/sda8 /mnt -o flushoncommit
 # dd if=/dev/zero of=/mnt/data bs=4k count=102400 &
 # mount /dev/sda8 /mnt -o remount,ro

When remounting from RW to RO, the logic is to first set the RO flag and then commit the transaction. However, with the flushoncommit option enabled, we do an RO check while committing the transaction, so we get a transaction abort here. Actually, this check is wrong: we should check whether FS_STATE_ERROR is set. Fix it.

Reported-by: Qu Wenruo <[email protected]>
Suggested-by: Miao Xie <[email protected]>
Signed-off-by: Wang Shilong <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: faster file extent item search in clone ioctl | Filipe David Borba Manana | 1 | -9/+14
When we are looking for file extent items that intersect the cloning range, for each one that falls completely outside the range, don't release the path and do another full tree search - just move on to the next slot and copy the file extent item into our buffer only if the item intersects the cloning range. Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: fix extent state leak on transaction abortion | Liu Bo | 1 | -5/+9
When transaction is aborted, we fail to commit transaction, instead we do cleanup work. After that when we umount btrfs, we get to free fs roots' log trees respectively, but that happens after we unpin extents, so those extents pinned by freeing log trees will remain in memory and lead to the leak. Signed-off-by: Liu Bo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | btrfs: Cleanup the btrfs_parse_options for remount. | Qu Wenruo | 1 | -60/+77
Since remount appends the new mount options to the original mount options, btrfs_parse_options checks the old options and then the new options, causing some stupid output like "enabling XXX" followed by "disable XXX". This patch adds an extra check before every btrfs_info to skip the output from checking the old options. Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | btrfs: Add noinode_cache mount option | Qu Wenruo | 4 | -2/+22
Add noinode_cache mount option for btrfs. The inode map cache involves all the btrfs_find_free_ino/return_ino handling, so if we just toggle mount_opt, an inode number obtained from the inode map cache will not be returned to the inode map cache. To keep finding and returning inode numbers consistent, a new bit in mount_opt, CHANGE_INODE_CACHE, is introduced. CHANGE_INODE_CACHE is set/cleared on remount, and the original INODE_MAP_CACHE is set/cleared according to CHANGE_INODE_CACHE after a successful transaction. Since finding/returning inode numbers is all done between btrfs_start_transaction and btrfs_commit_transaction, this keeps the behavior consistent. Also, the noinode_cache mount option will not stop the caching_kthread. Cc: David Sterba <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: fix to search previous metadata extent item since skinny metadata | Wang Shilong | 3 | -2/+46
There is a bug in using btrfs_previous_item() to search for a metadata extent item. This is because btrfs_previous_item() needs a type match; however, since skinny metadata was introduced by Josef, we may mix the two types, so just using btrfs_previous_item() does not work correctly. To keep btrfs_previous_item() behaving like a normal tree search, I introduce another function, btrfs_previous_extent_item(). Signed-off-by: Wang Shilong <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: fix missing skinny metadata check in scrub_stripe() | Wang Shilong | 1 | -1/+4
Check if we support skinny metadata firstly and fix to use right type to search. Signed-off-by: Wang Shilong <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: fix send to not send non-aligned clone operations | Filipe David Borba Manana | 1 | -1/+2
It is possible for the send feature to send clone operations that request a cloning range (offset + length) that is not aligned with the block size. This makes the btrfs receive command issue a clone ioctl call that will fail, as the ioctl will return an -EINVAL error because of the unaligned range.

Fix this by not sending clone operations for non block aligned ranges, and instead send regular write operations for these (less common) cases.

The following xfstest reproduces this issue, which fails on the second btrfs receive command without this change:

    seq=`basename $0`
    seqres=$RESULT_DIR/$seq
    echo "QA output created by $seq"

    tmp=`mktemp -d`
    status=1	# failure is the default!
    trap "_cleanup; exit \$status" 0 1 2 3 15

    _cleanup()
    {
        rm -fr $tmp
    }

    # get standard environment, filters and checks
    . ./common/rc
    . ./common/filter

    # real QA test starts here
    _supported_fs btrfs
    _supported_os Linux
    _require_scratch
    _need_to_be_root

    rm -f $seqres.full

    _scratch_mkfs >/dev/null 2>&1
    _scratch_mount

    $XFS_IO_PROG -f -c "truncate 819200" $SCRATCH_MNT/foo | _filter_xfs_io
    $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch

    $XFS_IO_PROG -c "falloc -k 819200 667648" $SCRATCH_MNT/foo | _filter_xfs_io
    $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch

    $XFS_IO_PROG -f -c "pwrite 1482752 2978" $SCRATCH_MNT/foo | _filter_xfs_io
    $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch

    $BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1 | \
        _filter_scratch

    $XFS_IO_PROG -f -c "truncate 883305" $SCRATCH_MNT/foo | _filter_xfs_io
    $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch

    $BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2 | \
        _filter_scratch

    $BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap 2>&1 | _filter_scratch
    $BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \
        -f $tmp/2.snap 2>&1 | _filter_scratch

    md5sum $SCRATCH_MNT/foo | _filter_scratch
    md5sum $SCRATCH_MNT/mysnap1/foo | _filter_scratch
    md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch

    _scratch_unmount
    _check_btrfs_filesystem $SCRATCH_DEV
    _scratch_mkfs >/dev/null 2>&1
    _scratch_mount

    $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap
    md5sum $SCRATCH_MNT/mysnap1/foo | _filter_scratch

    $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap
    md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch

    _scratch_unmount
    _check_btrfs_filesystem $SCRATCH_DEV

    status=0
    exit

The test's expected output is:

    QA output created by 025
    FSSync 'SCRATCH_MNT'
    FSSync 'SCRATCH_MNT'
    wrote 2978/2978 bytes at offset 1482752
    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    FSSync 'SCRATCH_MNT'
    Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap1'
    FSSync 'SCRATCH_MNT'
    Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap2'
    At subvol SCRATCH_MNT/mysnap1
    At subvol SCRATCH_MNT/mysnap2
    129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/foo
    42b6369eae2a8725c1aacc0440e597aa SCRATCH_MNT/mysnap1/foo
    129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/mysnap2/foo
    At subvol mysnap1
    42b6369eae2a8725c1aacc0440e597aa SCRATCH_MNT/mysnap1/foo
    At snapshot mysnap2
    129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/mysnap2/foo

Signed-off-by: Filipe David Borba Manana <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: Chris Mason <[email protected]>
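The decision this commit describes (clone only block-aligned ranges, otherwise fall back to a regular write) might be sketched as follows. This is a simplified Python model, not the kernel change: the function name `plan_send_op`, the 4096-byte block size, and the sample ranges are assumptions, and the exact condition tested in the kernel may differ.

```python
BLOCK_SIZE = 4096  # assumed filesystem block size


def plan_send_op(offset: int, length: int, block_size: int = BLOCK_SIZE) -> str:
    """Choose 'clone' only when the range starts and ends on block boundaries."""
    start_aligned = offset % block_size == 0
    end_aligned = (offset + length) % block_size == 0
    return "clone" if start_aligned and end_aligned else "write"


# Illustrative ranges: the second one ends at 883305, which is not a block boundary.
print(plan_send_op(819200, 8192))
print(plan_send_op(819200, 883305 - 819200))
```

Falling back to a write costs some space sharing on the receiving side but avoids issuing a clone ioctl that would return -EINVAL.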
2014-01-28 | Btrfs: fix btrfs boot when compiled as built-in | Filipe David Borba Manana | 6 | -9/+73
After the change titled "Btrfs: add support for inode properties", if btrfs was built-in the kernel (i.e. not as a module), it would cause a kernel panic, as reported recently by Fengguang:

[ 2.024722] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 2.027814] IP: [<ffffffff81501594>] crc32c+0xc/0x6b
[ 2.028684] PGD 0
[ 2.028684] Oops: 0000 [#1] SMP
[ 2.028684] Modules linked in:
[ 2.028684] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc7-04795-ga7b57c2 #1
[ 2.028684] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 2.028684] task: ffff88000edba100 ti: ffff88000edd6000 task.ti: ffff88000edd6000
[ 2.028684] RIP: 0010:[<ffffffff81501594>] [<ffffffff81501594>] crc32c+0xc/0x6b
[ 2.028684] RSP: 0000:ffff88000edd7e58 EFLAGS: 00010246
[ 2.028684] RAX: 0000000000000000 RBX: ffffffff82295550 RCX: 0000000000000000
[ 2.028684] RDX: 0000000000000011 RSI: ffffffff81efe393 RDI: 00000000fffffffe
[ 2.028684] RBP: ffff88000edd7e60 R08: 0000000000000003 R09: 0000000000015d20
[ 2.028684] R10: ffffffff81ef225e R11: ffffffff811b0222 R12: ffffffffffffffff
[ 2.028684] R13: 0000000000000239 R14: 0000000000000000 R15: 0000000000000000
[ 2.028684] FS: 0000000000000000(0000) GS:ffff88000fa00000(0000) knlGS:0000000000000000
[ 2.028684] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2.028684] CR2: 0000000000000000 CR3: 000000000220c000 CR4: 00000000000006f0
[ 2.028684] Stack:
[ 2.028684] ffffffff82295550 ffff88000edd7e80 ffffffff8238af62 ffffffff8238ac05
[ 2.028684] 0000000000000000 ffff88000edd7e98 ffffffff8238ac0f ffffffff8238ac05
[ 2.028684] ffff88000edd7f08 ffffffff810002ba ffff88000edd7f00 ffffffff810e2404
[ 2.028684] Call Trace:
[ 2.028684] [<ffffffff8238af62>] btrfs_props_init+0x4f/0x96
[ 2.028684] [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
[ 2.028684] [<ffffffff8238ac0f>] init_btrfs_fs+0xa/0xf0
[ 2.028684] [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
[ 2.028684] [<ffffffff810002ba>] do_one_initcall+0xa4/0x13a
[ 2.028684] [<ffffffff810e2404>] ? parse_args+0x25f/0x33d
[ 2.028684] [<ffffffff8234cf75>] kernel_init_freeable+0x1aa/0x230
[ 2.028684] [<ffffffff8234c785>] ? do_early_param+0x88/0x88
[ 2.028684] [<ffffffff819f61b5>] ? rest_init+0x89/0x89
[ 2.028684] [<ffffffff819f61c3>] kernel_init+0xe/0x109

The issue here is that the initialization function of btrfs (super.c:init_btrfs_fs) started using crc32c (from lib/libcrc32c.c). But when it needs to call crc32c (as part of the properties initialization routine), libcrc32c is not yet initialized, so crc32c dereferenced a NULL pointer (lib/libcrc32c.c:tfm), causing the kernel panic on boot.

The approach to fix this is to use the crypto component directly to get its crc32c (which is basically what lib/libcrc32c.c is, a wrapper around crypto). This is what ext4 is doing as well: it uses crypto directly to get crc32c functionality.

Verified this works both when btrfs is built-in and when it's a loadable kernel module.

Signed-off-by: Filipe David Borba Manana <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: unlock inodes in correct order in clone ioctl | Filipe David Borba Manana | 1 | -3/+11
In the clone ioctl, when the source and target inodes are different, we can acquire their mutexes in 2 possible different orders. After we're done cloning, we were always releasing the mutexes in the same order - the most correct way of doing it is to release them in the reverse of the order in which they were acquired. Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
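The locking discipline described here can be sketched with ordinary Python locks. This is an illustration only: `lock_pair` and `unlock_pair` are hypothetical helpers, and ordering the two locks by `id()` is an assumption standing in for whatever rule the kernel uses to pick the acquisition order.

```python
import threading


def lock_pair(a: threading.Lock, b: threading.Lock):
    """Acquire two distinct locks in a stable order; return them in acquisition order."""
    first, second = (a, b) if id(a) < id(b) else (b, a)
    first.acquire()
    second.acquire()
    return [first, second]


def unlock_pair(acquired):
    """Release the locks in the reverse of the order they were acquired."""
    released = []
    for lock in reversed(acquired):
        lock.release()
        released.append(lock)
    return released


src_lock, dst_lock = threading.Lock(), threading.Lock()
held = lock_pair(src_lock, dst_lock)
released = unlock_pair(held)
print(released == held[::-1])
```

Remembering the acquisition order, rather than hard-coding the release order, is what keeps the release correct for both possible orderings.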
2014-01-28 | Btrfs: optimize to remove unnecessary removal with ulist reallocation | Wang Shilong | 1 | -3/+1
Here we are not going to free memory, no need to remove every node one by one, just init root node here is ok. Cc: Liu Bo <[email protected]> Signed-off-by: Wang Shilong <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: release subvolume's block_rsv before transaction commit | Liu Bo | 1 | -7/+7
We don't have to keep subvolume's block_rsv during transaction commit, and within transaction commit, we may also need the free space reclaimed from this block_rsv to process delayed refs. Signed-off-by: Liu Bo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: fix the race between write back and nocow buffered write | Miao Xie | 1 | -2/+5
When we ran the 274th case of xfstests with the nodatacow mount option, we met the following warning message:

 WARNING: CPU: 1 PID: 14185 at fs/btrfs/extent-tree.c:3734 btrfs_free_reserved_data_space+0xa6/0xd0

It is caused by the race between the write back and the nocow buffered write:

  Task1                                 Task2
  __btrfs_buffered_write()
    skip data reservation
    reserve the metadata space
    copy the data
    dirty the pages
    unlock the pages
                                        write back the pages
                                        release the data space because
                                          there is no noreserve flag
    set the noreserve flag

This patch fixes this problem by unlocking the pages after the noreserve flag is set.

Reported-by: Tsutomu Itoh <[email protected]>
Signed-off-by: Miao Xie <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: only process as many file extents as there are refs | Josef Bacik | 1 | -8/+9
The backref walking code will search down to the key it is looking for and then proceed to walk _all_ of the extents on the file until it hits the end. This is suboptimal with large files; we only need to look for as many extents as we have references for that inode. I have a testcase that creates a randomly written 4 gig file and before this patch it took 6min 30sec to do the initial send, with this patch it takes 2min 30sec to do the initial send. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
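The early exit this commit describes can be shown with a small model. This is a hedged Python sketch rather than the backref code itself: the function name, the dictionary-based extent items, and the `disk_bytenr` key are illustrative assumptions.

```python
def find_referencing_extents(file_extents, wanted_bytenr, ref_count):
    """Walk a file's extent items, stopping once ref_count matches were found.

    Returns the matching items and how many items were visited.
    """
    matches, visited = [], 0
    for extent in file_extents:
        if len(matches) >= ref_count:
            break  # every known reference is accounted for; skip the rest
        visited += 1
        if extent["disk_bytenr"] == wanted_bytenr:
            matches.append(extent)
    return matches, visited


# Hypothetical file with 100 extent items, two of which point at extent 1000.
extents = [
    {"offset": i * 4096, "disk_bytenr": 1000 if i in (1, 3) else 2000 + i}
    for i in range(100)
]
matches, visited = find_referencing_extents(extents, 1000, ref_count=2)
print(len(matches), "matches after visiting", visited, "of", len(extents), "items")
```

Without the reference count as a bound, the loop would visit every item of the file even though the remaining ones cannot match.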
2014-01-28 | Btrfs: fix qgroup rescan to work with skinny metadata | Josef Bacik | 1 | -5/+13
Could have sworn I fixed this before but apparently not. This makes us pass btrfs/022 with skinny metadata enabled. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: fix extent_from_logical to deal with skinny metadata | Josef Bacik | 1 | -8/+33
I don't think this is an issue and I've not seen it in practice but extent_from_logical will fail to find a skinny extent because it uses btrfs_previous_item and gives it the normal extent item type. This is just not a place to use btrfs_previous_item since we care about either normal extents or skinny extents, so open code btrfs_previous_item to properly check. This would only affect metadata and the only place this is used for metadata is scrub and I'm pretty sure it's just for printing stuff out, not actually doing any work so hopefully it was never a problem other than a cosmetic one. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: throttle delayed refs better | Josef Bacik | 4 | -4/+46
On one of our gluster clusters we noticed some pretty big lag spikes. This turned out to be because our transaction commit was taking like 3 minutes to complete. This is because we have like 30 gigs of metadata, so our global reserve would end up being the max which is like 512 mb. So our throttling code would allow a ridiculous amount of delayed refs to build up and then they'd all get run at transaction commit time, and for a cold mounted file system that could take up to 3 minutes to run. So fix the throttling to be based on both the size of the global reserve and how long it takes us to run delayed refs. This patch tracks the time it takes to run delayed refs and then only allows 1 second's worth of outstanding delayed refs at a time. This way it will auto-tune itself from cold cache up to when everything is in memory and it no longer has to go to disk. This makes our transaction commits take much less time to run. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
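The time-based half of this throttling idea might look roughly like the sketch below. It is a simplified Python model, not the kernel implementation: the class and method names are hypothetical, the 3:1 weighted average is an assumption chosen for illustration, and the global reserve check mentioned in the commit message is left out.

```python
NSEC_PER_SEC = 1_000_000_000


class DelayedRefThrottle:
    """Track how long delayed refs take and cap the backlog at one second's worth."""

    def __init__(self):
        self.avg_runtime_ns = 0  # running average cost of one delayed ref

    def record_run(self, refs_run: int, elapsed_ns: int) -> None:
        if refs_run <= 0:
            return
        per_ref = elapsed_ns // refs_run
        # Weighted average (assumed 3:1) so one slow run does not dominate.
        self.avg_runtime_ns = (self.avg_runtime_ns * 3 + per_ref) // 4

    def should_throttle(self, pending_refs: int) -> bool:
        return pending_refs * self.avg_runtime_ns >= NSEC_PER_SEC


throttle = DelayedRefThrottle()
throttle.record_run(refs_run=1000, elapsed_ns=4 * NSEC_PER_SEC)  # cold cache: slow refs
print(throttle.avg_runtime_ns, throttle.should_throttle(500), throttle.should_throttle(2000))
```

Because the average follows observed run times, the allowed backlog grows as the cache warms up and refs become cheaper, which is the auto-tuning behavior the message describes.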
2014-01-28 | Btrfs: attach delayed ref updates to delayed ref heads | Josef Bacik | 6 | -405/+267
Currently we have two rb-trees, one for delayed ref heads and one for all of the delayed refs, including the delayed ref heads. When we process the delayed refs we have to hold onto the delayed ref lock for all of the selecting and merging and such, which results in quite a bit of lock contention. This was solved by having a waitqueue and only one flusher at a time, however this hurts if we get a lot of delayed refs queued up. So instead just have an rb tree for the delayed ref heads, and then attach the delayed ref updates to an rb tree that is per delayed ref head. Then we only need to take the delayed ref lock when adding new delayed refs and when selecting a delayed ref head to process, all the rest of the time we deal with a per delayed ref head lock which will be much less contentious. The locking rules for this get a little more complicated since we have to lock up to 3 things to properly process delayed refs, but I will address that problem later. For now this passes all of xfstests and my overnight stress tests. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
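The two-level locking scheme described in this commit can be sketched as a data structure. This is an illustrative Python model under stated assumptions: plain dictionaries and lists stand in for the rb-trees, updates are modeled as +1/-1 reference count deltas, and all class and function names are hypothetical.

```python
import threading


class DelayedRefHead:
    """One head per extent, carrying its own lock and its own updates."""

    def __init__(self, bytenr):
        self.bytenr = bytenr
        self.lock = threading.Lock()
        self.updates = []


class DelayedRefRoot:
    """Holds only the heads; its lock is taken just to add refs or pick a head."""

    def __init__(self):
        self.lock = threading.Lock()
        self.heads = {}

    def add_ref(self, bytenr, delta):
        with self.lock:
            head = self.heads.get(bytenr)
            if head is None:
                head = self.heads[bytenr] = DelayedRefHead(bytenr)
            head.updates.append(delta)

    def select_head(self):
        with self.lock:
            if not self.heads:
                return None
            return self.heads.pop(min(self.heads))


def run_head(head):
    """Merge and apply one head's updates while holding only that head's lock."""
    with head.lock:
        net = sum(head.updates)
        head.updates.clear()
        return net


root = DelayedRefRoot()
for bytenr, delta in [(100, +1), (100, +1), (100, -1), (50, +1)]:
    root.add_ref(bytenr, delta)
head = root.select_head()
print(head.bytenr, run_head(head))
```

The merging work happens under the per-head lock, so other tasks can keep adding refs or selecting other heads in the meantime.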
2014-01-28 | Btrfs: make fsync latency less sucky | Josef Bacik | 3 | -1/+15
Looking into some performance related issues with large amounts of metadata revealed that we can have some pretty huge swings in fsync() performance. If we have a lot of delayed refs backed up (as you will tend to do with lots of metadata) fsync() will wander off and try to run some of those delayed refs which can result in reading from disk and such. Since the actual act of fsync() doesn't create any delayed refs there is no need to make it throttle on delayed ref stuff, that will be handled by other people. With this patch we get much smoother fsync performance with large amounts of metadata. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: add support for inode properties | Filipe David Borba Manana | 10 | -10/+545
This change adds infrastructure to allow for generic properties for inodes. Properties are name/value pairs that can be associated with inodes for different purposes. They are stored as xattrs with the prefix "btrfs."

Properties can be inherited - this means when a directory inode has inheritable properties set, these are added to new inodes created under that directory. Further, subvolumes can also have properties associated with them, and they can be inherited from their parent subvolume. Naturally, directory properties have priority over subvolume properties (in practice a subvolume property is just a regular property associated with the root inode, objectid 256, of the subvolume's fs tree).

This change also adds one specific property implementation, named "compression", whose values can be "lzo" or "zlib" and it's an inheritable property.

The corresponding changes to btrfs-progs were also implemented. A patch with xfstests for this feature will follow once there's agreement on this change/feature.

Further, the script at the bottom of this commit message was used to do some benchmarks to measure any performance penalties of this feature. Basically the tests correspond to:

Test 1 - create a filesystem and mount it with compress-force=lzo, then sequentially create N files of 64Kb each, measure how long it took to create the files, unmount the filesystem, mount the filesystem and perform an 'ls -lha' against the test directory holding the N files, and report the time the command took.

Test 2 - create a filesystem and don't use any compression option when mounting it - instead set the compression property of the subvolume's root to 'lzo'. Then create N files of 64Kb, and report the time it took. Then unmount the filesystem, mount it again and perform an 'ls -lha' like in the former test. This means every single file ends up with a property (xattr) associated to it.

Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the compression property, have no real effect other than adding more work when inheriting properties and taking more btree leaf space.

Test 4 - same as test 3 but with 10 properties per file.

Results (in seconds, and averages of 5 runs each), for different N numbers of files follow.

* Without properties (test 1)

                    file creation time        ls -lha time
10 000 files              3.49                   0.76
100 000 files            47.19                   8.37
1 000 000 files         518.51                 107.06

* With 1 property (compression property set to lzo - test 2)

                    file creation time        ls -lha time
10 000 files              3.63                   0.93
100 000 files            48.56                   9.74
1 000 000 files         537.72                 125.11

* With 4 properties (test 3)

                    file creation time        ls -lha time
10 000 files              3.94                   1.20
100 000 files            52.14                  11.48
1 000 000 files         572.70                 142.13

* With 10 properties (test 4)

                    file creation time        ls -lha time
10 000 files              4.61                   1.35
100 000 files            58.86                  13.83
1 000 000 files         656.01                 177.61

The increased latencies with properties are essentially because of:

*) When creating an inode, we now synchronously write 1 more item (an xattr item) for each property inherited from the parent dir (or subvolume). This could be done in an asynchronous way such as we do for dir index items (delayed-inode.c), which could help reduce the file creation latency;

*) With properties, we now have larger fs trees. For this particular test each xattr item uses 75 bytes of leaf space in the fs tree. This could be less by using a new item for xattr items, instead of the current btrfs_dir_item, since we could cut the 'location' and 'type' fields (saving 18 bytes) and maybe 'transid' too (saving a total of 26 bytes per xattr item) from the btrfs_dir_item type.

Also tried batching the xattr insertions (ignoring proper hash collision handling, since it didn't exist) when creating files that inherit properties from their parent inode/subvolume, but the end results were (surprisingly) essentially the same.

Test script:

$ cat test.pl
#!/usr/bin/perl -w

use strict;
use Time::HiRes qw(time);
use constant NUM_FILES => 10_000;
use constant FILE_SIZES => (64 * 1024);
use constant DEV => '/dev/sdb4';
use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
use constant TEST_DIR => (MNT_POINT . '/testdir');

system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";

# following line for testing without properties
#system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";

# following 2 lines for testing with properties
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";

system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
my ($t1, $t2);

$t1 = time();
for (my $i = 1; $i <= NUM_FILES; $i++) {
    my $p = TEST_DIR . '/file_' . $i;
    open(my $f, '>', $p) or die "Error opening file!";
    $f->autoflush(1);
    for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
        print $f ('A' x 4096) or die "Error writing to file!";
    }
    close($f);
}
$t2 = time();
print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";

$t1 = time();
system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
$t2 = time();
print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";

Signed-off-by: Filipe David Borba Manana <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: Chris Mason <[email protected]>
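The inheritance rule described in this commit (inheritable properties of a directory are copied to inodes created under it) can be modeled in a few lines. This is a toy Python sketch, not the kernel data model: the `Inode` class and the `user.note` attribute are hypothetical, and the xattr name `btrfs.compression` is inferred from the "btrfs." prefix and the "compression" property named in the message.

```python
INHERITABLE = {"btrfs.compression"}  # the one inheritable property this change adds


class Inode:
    """Toy inode that copies inheritable properties from its parent directory."""

    def __init__(self, parent=None):
        self.xattrs = {}
        if parent is not None:
            for name, value in parent.xattrs.items():
                if name in INHERITABLE:
                    self.xattrs[name] = value


# The subvolume's root inode is treated as just another directory here.
root = Inode()
root.xattrs["btrfs.compression"] = "lzo"
root.xattrs["user.note"] = "not inherited"  # hypothetical non-property xattr

subdir = Inode(parent=root)
subdir.xattrs["btrfs.compression"] = "zlib"  # a directory's own setting wins below it

print(Inode(parent=root).xattrs)
print(Inode(parent=subdir).xattrs)
```

This also shows why test 2 above leaves one xattr on every file: each new inode gets its own copy of the inherited property.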
2014-01-28 | Btrfs: faster file extent item replace operations | Filipe David Borba Manana | 4 | -46/+114
When writing to a file we drop existing file extent items that cover the write range and then add a new file extent item that represents that write range.

Before this change we were doing a tree lookup to remove the file extent items, and then after we did another tree lookup to insert the new file extent item. Most of the time all the file extent items we need to drop are located within a single leaf - this is the leaf where our new file extent item ends up at. Therefore, in this common case just combine these 2 operations into a single one.

By avoiding the second btree navigation for insertion of the new file extent item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf COW operations, CPU time on btree node/leaf key binary searches, etc.

Besides for file writes, this is an operation that happens for file fsync's as well. However log btrees are much less likely to be as big as regular fs btrees, therefore the impact of this change is smaller.

The following benchmark was performed against an SSD drive and a HDD drive, both for random and sequential writes:

    sysbench --test=fileio --file-num=4096 --file-total-size=8G \
             --file-test-mode=[rndwr|seqwr] --num-threads=512 \
             --file-block-size=8192 \
             --max-requests=1000000 \
             --file-fsync-freq=0 --file-io-mode=sync [prepare|run]

All results below are averages of 10 runs of the respective test.

** SSD sequential writes

Before this change: 225.88 Mb/sec
After this change:  277.26 Mb/sec

** SSD random writes

Before this change: 49.91 Mb/sec
After this change:  56.39 Mb/sec

** HDD sequential writes

Before this change: 68.53 Mb/sec
After this change:  69.87 Mb/sec

** HDD random writes

Before this change: 13.04 Mb/sec
After this change:  14.39 Mb/sec

Signed-off-by: Filipe David Borba Manana <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: Chris Mason <[email protected]>
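To put the benchmark figures above into relative terms, the throughput change can be computed directly from the reported numbers. The snippet below is a small Python helper written for this log, not part of the commit; only the before/after values come from the message, while `gain_percent` is a hypothetical name.

```python
# (before, after) throughput in Mb/sec, copied from the results above.
results = {
    "SSD sequential writes": (225.88, 277.26),
    "SSD random writes": (49.91, 56.39),
    "HDD sequential writes": (68.53, 69.87),
    "HDD random writes": (13.04, 14.39),
}


def gain_percent(before: float, after: float) -> float:
    """Relative throughput change in percent."""
    return (after - before) / before * 100.0


for name, (before, after) in results.items():
    print(f"{name}: {gain_percent(before, after):+.1f}%")
```

The sequential SSD case benefits the most, while sequential writes on the HDD change by only a couple of percent.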
2014-01-28 | Btrfs: handle EAGAIN case properly in btrfs_drop_snapshot() | Wang Shilong | 1 | -1/+1
We may return early in btrfs_drop_snapshot(); we shouldn't call btrfs_std_error() for this case, fix it. Cc: [email protected] Signed-off-by: Wang Shilong <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: remove unnecessary transaction commit before send | Wang Shilong | 1 | -29/+0
We will finish orphan cleanups during snapshot, so we don't have to commit transaction here. Signed-off-by: Wang Shilong <[email protected]> Reviewed-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28 | Btrfs: fix protection between send and root deletion | Wang Shilong | 2 | -0/+29
We should guarantee that the parent and clone roots cannot be destroyed during send. There are two ways to do this:

1. Hold @subvol_sem. This might be a nightmare, because it would block all subvolume deletion for a long time.
2. Miao pointed out that we can reuse @send_in_progress, which means we skip snapshot deletion if a send of that root is in progress.

Here we adopt the second approach, since it won't block deletion of other subvolumes for a long time.

Besides, in btrfs_clean_one_deleted_snapshot() we only check the first root. If this root is involved in a send, we return directly rather than continuing to check. There are several reasons for this:

1. This case seldom happens.
2. After the send finishes, the cleaner thread can continue to drop that root.
3. It keeps the code simple.

Cc: David Sterba <[email protected]> Signed-off-by: Wang Shilong <[email protected]> Reviewed-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
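The @send_in_progress approach described in the entry above can be sketched in user-space C as follows. This is only an illustration of the idea, not the kernel code: `struct root_sketch`, `send_begin()`, `send_end()` and `clean_one_deleted_root()` are hypothetical names, and the sketch is single-threaded, so it omits the locking a real implementation would need around the counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a subvolume root (not struct btrfs_root). */
struct root_sketch {
    int send_in_progress;   /* number of sends currently using this root */
    bool dropped;           /* set once the cleaner has dropped the root */
};

/* A send takes a reference on every root it reads from. */
static void send_begin(struct root_sketch *r)
{
    r->send_in_progress++;
}

/* Each reference must be released exactly once when the send ends. */
static void send_end(struct root_sketch *r)
{
    assert(r->send_in_progress > 0);
    r->send_in_progress--;
}

/*
 * Cleaner step: look only at the first pending root. If a send is using
 * it, return without dropping anything; the cleaner retries later.
 * Returns true if a root was dropped, false if nothing was done.
 */
static bool clean_one_deleted_root(struct root_sketch **pending, size_t n)
{
    if (n == 0)
        return false;
    if (pending[0]->send_in_progress > 0)
        return false;   /* skip: root is involved in a send */
    pending[0]->dropped = true;
    return true;
}
```

Checking only the first pending root keeps the cleaner step trivial, at the cost of delaying deletion of later roots until the send finishes.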
2014-01-28Btrfs: fix wrong send_in_progress accountingWang Shilong1-3/+13
Steps to reproduce:

  # mkfs.btrfs -f /dev/sda8
  # mount /dev/sda8 /mnt
  # btrfs sub snapshot -r /mnt /mnt/snap1
  # btrfs sub snapshot -r /mnt /mnt/snap2
  # btrfs send /mnt/snap1 -p /mnt/snap2 -f /mnt/1
  # dmesg

The problem is that we sort the clone roots (which include @send_root). The sort may move @send_root to an earlier position, and then @send_root's @send_in_progress is decremented twice.

Cc: David Sterba <[email protected]> Signed-off-by: Wang Shilong <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28btrfs: Add treelog mount option.Qu Wenruo2-2/+9
Add the treelog mount option so that the tree log can be re-enabled on remount. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28btrfs: Add datasum mount option.Qu Wenruo2-1/+13
Add the datasum mount option so that data checksumming can be re-enabled on remount. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28btrfs: Add datacow mount option.Qu Wenruo2-3/+10
Add the datacow mount option so that copy-on-write can be re-enabled on remount. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28btrfs: Add acl mount option.Qu Wenruo2-2/+7
Add the acl mount option so that ACLs can be re-enabled on remount. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28btrfs: Add noflushoncommit mount option.Qu Wenruo2-1/+8
Add the noflushoncommit mount option so that flush on commit can be disabled on remount. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28btrfs: Add noenospc_debug mount option.Qu Wenruo2-1/+7
Add the noenospc_debug mount option so that ENOSPC debugging can be disabled on remount. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28btrfs: Add nodiscard mount option.Qu Wenruo2-3/+10
Add the nodiscard mount option so that discard can be disabled on remount. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28btrfs: Add noautodefrag mount option.Qu Wenruo2-4/+12
Btrfs has an autodefrag mount option but no matching noautodefrag option, which makes it impossible to disable autodefrag without unmounting. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28btrfs: Add "barrier" option to support "-o remount,barrier"Qu Wenruo2-7/+14
Btrfs can be remounted without barrier, but there is no "barrier" option, so nobody can remount btrfs with barriers turned back on. The only way to re-enable barriers is to umount and mount again, which is quite awkward. The mount options in the documentation are also changed slightly in preparation for the further paired-option changes. Reported-by: Daniel Blueman <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: Mike Fleetwood <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
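The paired-option idea behind this series of mount option entries (each "noX" option gets a matching "X" so a remount can flip the setting either way) can be sketched in user-space C. The flag names, bit values and `apply_mount_option()` below are hypothetical and are not the kernel's option parser or its BTRFS_MOUNT_* constants. The sketch only shows one option token setting a bit and its pair clearing it.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag bits for illustration only. */
#define OPT_NOBARRIER   (1u << 0)
#define OPT_AUTODEFRAG  (1u << 1)
#define OPT_DISCARD     (1u << 2)

/*
 * Apply one option token to the flag word.
 * Returns 0 on success, -1 if the token is not recognized.
 */
static int apply_mount_option(unsigned int *flags, const char *tok)
{
    assert(flags != NULL && tok != NULL);

    if (strcmp(tok, "nobarrier") == 0)
        *flags |= OPT_NOBARRIER;
    else if (strcmp(tok, "barrier") == 0)
        *flags &= ~OPT_NOBARRIER;    /* the pair clears the same bit */
    else if (strcmp(tok, "autodefrag") == 0)
        *flags |= OPT_AUTODEFRAG;
    else if (strcmp(tok, "noautodefrag") == 0)
        *flags &= ~OPT_AUTODEFRAG;
    else if (strcmp(tok, "discard") == 0)
        *flags |= OPT_DISCARD;
    else if (strcmp(tok, "nodiscard") == 0)
        *flags &= ~OPT_DISCARD;
    else
        return -1;
    return 0;
}
```

With both halves of each pair accepted, a remount can restore the default without a full umount and mount cycle.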
2014-01-28Btrfs: only fua the first superblock when writing supersWang Shilong1-1/+4
According to the comments, we only intend to FUA the first superblock on each device. Fix it. Signed-off-by: Wang Shilong <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
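A minimal sketch of the policy stated above, assuming each device holds several superblock copies indexed from 0: only copy 0 is written with FUA, and the remaining copies use an ordinary synchronous write. The enum and function names are hypothetical and do not correspond to the kernel's write flags or helpers.

```c
#include <assert.h>

/* Hypothetical write modes for illustration only. */
enum write_kind_sketch {
    WRITE_SYNC_SKETCH,  /* ordinary synchronous write */
    WRITE_FUA_SKETCH    /* forced unit access: must reach stable media */
};

/* Pick the write mode for superblock copy number mirror_index. */
static enum write_kind_sketch super_write_kind(int mirror_index)
{
    assert(mirror_index >= 0);
    return mirror_index == 0 ? WRITE_FUA_SKETCH : WRITE_SYNC_SKETCH;
}

/* Count how many FUA writes one device with nr_copies copies receives. */
static int count_fua_writes(int nr_copies)
{
    int fua = 0;

    for (int i = 0; i < nr_copies; i++)
        if (super_write_kind(i) == WRITE_FUA_SKETCH)
            fua++;
    return fua;
}
```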
2014-01-28Btrfs: return free space to global_rsv as much as possibleLiu Bo1-1/+1
@full is not protected within global_rsv.lock, so we may think global_rsv is already full but in fact it's not, so we miss the opportunity to return free space to global_rsv directly when we release other block_rsvs. Signed-off-by: Liu Bo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
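The race described above comes from reading a flag outside the lock that protects it. The following user-space sketch shows the safe pattern under stated assumptions: `struct rsv_sketch` and `rsv_refill()` are hypothetical, a pthread mutex stands in for the kernel spinlock, and the `full` flag is only examined while the lock is held.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reservation pool (not struct btrfs_block_rsv). */
struct rsv_sketch {
    pthread_mutex_t lock;
    uint64_t size;       /* target size of the reservation */
    uint64_t reserved;   /* bytes currently reserved, <= size */
    bool full;           /* only trustworthy while lock is held */
};

/*
 * Try to return num_bytes of freed space to dest.
 * The full flag is checked under the lock, so a stale unlocked read
 * can never make us skip a pool that still has room.
 * Returns the number of bytes that did not fit.
 */
static uint64_t rsv_refill(struct rsv_sketch *dest, uint64_t num_bytes)
{
    uint64_t moved = 0;

    assert(dest != NULL);
    pthread_mutex_lock(&dest->lock);
    assert(dest->reserved <= dest->size);
    if (!dest->full) {
        uint64_t room = dest->size - dest->reserved;

        moved = num_bytes < room ? num_bytes : room;
        dest->reserved += moved;
        if (dest->reserved >= dest->size)
            dest->full = true;
    }
    pthread_mutex_unlock(&dest->lock);
    return num_bytes - moved;
}
```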
2014-01-28Btrfs: fix an oops when we fail to relocate tree blocksWang Shilong1-0/+6
During a balance test, we hit an oops:

  [ 2013.841551] kernel BUG at fs/btrfs/relocation.c:1174!

The problem is that if we fail to relocate tree blocks, we should still update the backref cache. Otherwise some pending nodes are not updated, while the snapshot check sees that @cache->last_trans is within the same transaction and does not update them either, and then the oops happens.

Signed-off-by: Wang Shilong <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-01-28Btrfs: fix the wrong nocow range checkMiao Xie1-2/+5
The following warning message was output when running test case 274 of xfstests with the nodatacow option:

  BUG: Bad page state in process kswapd0 pfn:1c66f
  page:ffffea0000636848 count:0 mapcount:0 mapping:(null) index:0x78000
  page flags: 0x1000000000100a(error|uptodate|private_2)

The check of the nocow range was wrong. We should compare both the start and end positions of the extent with the write position to verify that the write position is inside the extent, but the current code only used the start position. So we got the wrong extent and told the caller that it was a nocow write. Later, when we wrote back the dirty pages, we found that we had to COW the extent, but by then there was no space left in the fs, so we had to set the error flag on the page. When someone reclaimed that page, the above warning was output.

Fix it.

Reported-by: Tsutomu Itoh <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
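The difference between a start-only check and a full range check, as described in the entry above, can be illustrated with a small C sketch. Both functions are hypothetical and written for this illustration. The exact form of the original faulty comparison is not shown in the log, so the first function is only one plausible instance of "using just the start position".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Incomplete check: accepts any position at or after the extent start,
 * including positions past the end of the extent.
 */
static bool pos_in_extent_start_only(uint64_t ext_start, uint64_t ext_len,
                                     uint64_t pos)
{
    (void)ext_len;   /* the length is ignored, which is the bug */
    return pos >= ext_start;
}

/*
 * Full check: pos must lie in [ext_start, ext_start + ext_len).
 * Written as a subtraction so ext_start + ext_len cannot overflow.
 */
static bool pos_in_extent(uint64_t ext_start, uint64_t ext_len, uint64_t pos)
{
    assert(ext_len > 0);
    return pos >= ext_start && pos - ext_start < ext_len;
}
```

For an extent covering bytes 4096 through 8191, a write at 12288 passes the start-only check but fails the full check, which is the kind of mismatch that leads to picking the wrong extent.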
2014-01-28Btrfs: fix an oops when we fail to merge reloc rootsWang Shilong1-7/+3
Previously, we freed the reloc root memory and then forced the filesystem to be read-only. The problem is that another thread may be committing a transaction, and it will try to access the freed reloc root while merging reloc roots. To keep the space shared by snapshots consistent, we should allow the snapshot to finish if possible, so here we don't free the reloc root memory. Signed-off-by: Wang Shilong <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>