aboutsummaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)AuthorFilesLines
2013-02-20Btrfs: fix trivial error in btrfs_ioctl_resize()Miao Xie1-6/+7
This patch fixes the following problem: - improper return value - unnecessary read-only check Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]>
2013-02-20Btrfs: use wrapper page_offsetMiao Xie4-18/+15
Use wrapper page_offset to get byte-offset into filesystem object for page. Signed-off-by: Liu Bo <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]>
2013-02-20Btrfs: flush all dirty inodes if writeback can not startMiao Xie1-9/+31
We may try to flush some dirty pages when there is no enough space to reserve. But it is possible that this operation fails, in order to get enough space to reserve successfully, we will sync all the delalloc file. This operation is safe, we needn't worry about the case that the filesystem goes from r/w to r/o. because the filesystem should guarantee all the dirty pages have been written into the disk after it becomes readonly, so the sync operation will do nothing if the filesystem is already readonly. Though it may waste lots of time, as a corner case, we needn't care. Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]>
2013-02-20Btrfs: make delayed ref lock logic more readableMiao Xie3-18/+38
Locking and unlocking delayed ref mutex are in the different functions, and the name of lock functions is not uniform, so the readability is not so good, this patch optimizes the lock logic and makes it more readable. Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]>
2013-02-20Btrfs: fix lots of orphan inodes when the space is not enoughMiao Xie3-17/+85
We're running into having 50-100 orphans left over with xfstests 83 because of ENOSPC when trying to start the transaction for the inode update. But in fact, it makes no sense in updating the inode for the new size while we're deleting the stupid thing. This patch fixes this problem. Reported-by: Josef Bacik <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]>
2013-02-20Btrfs: cleanup similar code in delayed inodeMiao Xie1-46/+37
The delayed item commit code in several functions is similar, so cleanup it. Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]>
2013-02-20Btrfs: use common work instead of delayed workMiao Xie1-4/+4
Since we do not want to delay the async transaction commit, we should use common work, not delayed work. Signed-off-by: Miao Xie <[email protected]>
2013-02-20Btrfs: cleanup unnecessary clear when freeing a transaction or a trans handleMiao Xie1-2/+0
We clear the transaction object and the trans handle when they are about to be freed, it is unnecessary, cleanup it. Signed-off-by: Miao Xie <[email protected]>
2013-02-20Btrfs: use slabs for delayed reference allocationMiao Xie5-21/+115
The delayed reference allocation is in the fast path of the IO, so use slabs to improve the speed of the allocation. And besides that, it can do check for leaked objects when the module is removed. Signed-off-by: Miao Xie <[email protected]>
2013-02-15btrfs: access superblock via pagecache in scan_one_deviceDavid Sterba1-6/+64
btrfs_scan_one_device is calling set_blocksize() which can race with a concurrent process making dirty page cache pages. It can end up dropping dirty page cache pages on the floor, which isn't very nice when someone is just running btrfs dev scan to find filesystems on the box. Now that udev is registering btrfs devices as it discovers them, we can actually end up racing with our own mkfs program too. When this happens, we drop some of the important blocks written by mkfs. This commit changes scan_one_device to read the super out of the page cache instead of trying to use bread. This way we don't have to care about the blocksize of the device. This also drops the invalidate_bdev() call. It wasn't very polite to invalidate during the scan either. mkfs is putting the super into the page cache, there's no reason to invalidate at this point. Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-02-14Btrfs: fix crash in log replay with qgroups enabledArne Jansen1-1/+1
When replaying a log tree with qgroups enabled, tree_mod_log_rewind does a sanity-check of the number of items against the maximum possible number. It calculates that number with the nodesize of fs_root. Unfortunately fs_root is not yet set at this stage. So instead use the nodesize from tree_root, which is already initialized. Signed-off-by: Arne Jansen <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-02-08Merge branch 'for-linus' of ↵Linus Torvalds8-36/+87
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "We've got corner cases for updating i_size that ceph was hitting, error handling for quotas when we run out of space, a very subtle snapshot deletion race, a crash while removing devices, and one deadlock between subvolume creation and the sb_internal code (thanks lockdep)." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: move d_instantiate outside the transaction during mksubvol Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata Btrfs: fix possible stale data exposure Btrfs: fix missing i_size update Btrfs: fix race between snapshot deletion and getting inode Btrfs: fix missing release of the space/qgroup reservation in start_transaction() Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write() Btrfs: do not merge logged extents if we've removed them from the tree btrfs: don't try to notify udev about missing devices
2013-02-06Btrfs: move d_instantiate outside the transaction during mksubvolChris Mason1-1/+4
Dave Sterba triggered a lockdep complaint about lock ordering between the sb_internal lock and the cleaner semaphore. btrfs_lookup_dentry() checks for orphans if we're looking up the inode for a subvolume, and subvolume creation is triggering the lookup with a transaction running. This commit moves the d_instantiate after the transaction closes. Signed-off-by: Chris Mason <[email protected]>
2013-02-06Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadataJan Schmidt1-12/+10
When btrfs_qgroup_reserve returned a failure, we were missing a counter operation for BTRFS_I(inode)->outstanding_extents++, leading to warning messages about outstanding extents and space_info->bytes_may_use != 0. Additionally, the error handling code didn't take into account that we dropped the inode lock which might require more cleanup. Luckily, all the cleanup code we need is already there and can be shared with reserve_metadata_bytes, which is exactly what this patch does. Reported-by: Lev Vainblat <[email protected]> Signed-off-by: Jan Schmidt <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-02-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git ↵Chris Mason16-119/+370
for-chris into for-linus
2013-02-05Btrfs: fix possible stale data exposureJosef Bacik1-1/+1
We specifically do not update the disk i_size if there are ordered extents outstanding for any area between the current disk_i_size and our ordered extent so that we do not expose stale data. The problem is the check we have only checks if the ordered extent starts at or after the current disk_i_size, which doesn't take into account an ordered extent that starts before the current disk_i_size and ends past the disk_i_size. Fix this by checking if the extent ends past the disk_i_size. Thanks, Signed-off-by: Josef Bacik <[email protected]>
2013-02-05Btrfs: fix missing i_size updateJosef Bacik1-2/+9
If we have an ordered extent before the ordered extent we are currently completing that is after the current disk_i_size we will put our i_size update into that ordered extent so that we do not expose stale data. The problem is that if our disk i_size is updated past the previous ordered extent we won't update the i_size with the pending i_size update. So check the pending i_size update and if its above the current disk i_size we need to go ahead and try to update. Thanks, Signed-off-by: Josef Bacik <[email protected]>
2013-02-05Btrfs: fix race between snapshot deletion and getting inodeLiu Bo2-9/+38
While running snapshot testscript created by Mitch and David, the race between autodefrag and snapshot deletion can lead to corruption of dead_root list so that we can get crash on btrfs_clean_old_snapshots(). And besides autodefrag, scrub also does the same thing, ie. read root first and get inode. Here is the story(take autodefrag as an example): (1) when we delete a snapshot or subvolume, it will set its root's refs to zero and do a iput() on its own inode, and if this inode happens to be the only active in-meory one in root's inode rbtree, it will add itself to the global dead_roots list for later cleanup. (2) after (1), the autodefrag thread may read another inode for defrag and the inode is just in the deleted snapshot/subvolume, but all of these are without checking if the root is still valid(refs > 0). So the end up result is adding the deleted snapshot/subvolume's root to the global dead_roots list AGAIN. Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu. So all we need to do is to take the lock to protect 'read root and get inode', since we synchronize to wait for the rcu grace period before adding something to the global dead_roots list. Reported-by: Mitch Harder <[email protected]> Signed-off-by: Liu Bo <[email protected]> Signed-off-by: Josef Bacik <[email protected]>
2013-02-05Btrfs: fix missing release of the space/qgroup reservation in ↵Miao Xie1-8/+19
start_transaction() When we fail to start a transaction, we need to release the reserved free space and qgroup space, fix it. Signed-off-by: Miao Xie <[email protected]> Reviewed-by: Jan Schmidt <[email protected]> Signed-off-by: Josef Bacik <[email protected]>
2013-02-05Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write()Miao Xie1-1/+2
If the checks at the beginning of btrfs_file_aio_write() fail, we needn't decrease ->sync_writers, because we have not increased it. Fix it. Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]>
2013-02-05Btrfs: do not merge logged extents if we've removed them from the treeJosef Bacik1-1/+2
You can run into this problem where if somebody is fsyncing and writing out the existing extents you will have removed the extent map from the em tree, but it's still valid for the current fsync so we go ahead and write it. The problem is we unconditionally try to merge it back into the em tree, but if we've removed it from the em tree that will cause use after free problems. Fix this to only merge if we are still a part of the tree. Thanks, Signed-off-by: Josef Bacik <[email protected]>
2013-02-05Merge branch 'for-linus' into raid56-experimentalChris Mason14-98/+300
Conflicts: fs/btrfs/volumes.c Signed-off-by: Chris Mason <[email protected]>
2013-02-05Btrfs: remove conflicting check for minimum number of devices in raid56Chris Mason1-8/+0
The device removal code was incorrectly checking against two different limits for raid5 and raid6. Signed-off-by: Chris Mason <[email protected]>
2013-02-05Btrfs: select XOR_BLOCKS in KconfigTomasz Torcz1-0/+1
The Btrfs raid56 uses the generic xor helpers. Signed-off-by: Chris Mason <[email protected]>
2013-02-01Btrfs: reduce CPU contention while waiting for delayed extent operationsChris Mason3-5/+70
We batch up operations to the extent allocation tree, which allows us to deal with the recursive nature of using the extent allocation tree to allocate extents to the extent allocation tree. It also provides a mechanism to sort and collect extent operations, which makes it much more efficient to record extents that are close together. The delayed extent operations must all be finished before the running transaction commits, so we have code to make sure and run a few of the batched operations when closing our transaction handles. This creates a great deal of contention for the locks in the delayed extent operation tree, and also contention for the lock on the extent allocation tree itself. All the extra contention just slows down the operations and doesn't get things done any faster. This commit changes things to use a wait queue instead. As procs want to run the delayed operations, one of them races in and gets permission to hit the tree, and the others step back and wait for progress to be made. Signed-off-by: Chris Mason <[email protected]>
2013-02-01Btrfs: reduce lock contention on extent buffer locksChris Mason1-0/+16
The extent buffers have a refs_lock which we use to make coordinate freeing the extent buffer with operations on the radix tree. On tree roots and other extent buffers that very cache hot, this can be highly contended. These are also the extent buffers that are basically pinned in memory. This commit adds code to cmpxchg our way through the ref modifications, and as long as the result of the reference change is still pinned in ram, we skip the expensive spinlock. Signed-off-by: Chris Mason <[email protected]>
2013-02-01Btrfs: fix cluster alignment for mount -o ssdChris Mason1-1/+6
With the new raid56 code, we want to make sure we're properly aligning our allocation clusters with -o ssd Signed-off-by: Chris Mason <[email protected]>
2013-02-01Btrfs: add a plugging callback to raid56 writesChris Mason1-4/+124
Buffered writes and DIRECT_IO writes will often break up big contiguous changes to the file into sub-stripe writes. This adds a plugging callback to gather those smaller writes full stripe writes. Example on flash: fio job to do 64K writes in batches of 3 (which makes a full stripe): With plugging: 450MB/s Without plugging: 220MB/s Signed-off-by: Chris Mason <[email protected]>
2013-02-01Btrfs: Add a stripe cache to raid56Chris Mason2-8/+324
The stripe cache allows us to avoid extra read/modify/write cycles by caching the pages we read off the disk. Pages are cached when: * They are read in during a read/modify/write cycle * They are written during a read/modify/write cycle * They are involved in a parity rebuild Pages are not cached if we're doing a full stripe write. We're assuming that a full stripe write won't be followed by another partial stripe write any time soon. This provides a substantial boost in performance for workloads that synchronously modify adjacent offsets in the file, and for the parity rebuild use case in general. The size of the stripe cache isn't tunable (yet) and is set at 1024 entries. Example on flash: dd if=/dev/zero of=/mnt/xxx bs=4K oflag=direct Without the stripe cache -- 2.1MB/s With the stripe cache 21MB/s Signed-off-by: Chris Mason <[email protected]>
2013-02-01Btrfs: RAID5 and RAID6David Woodhouse15-102/+2283
This builds on David Woodhouse's original Btrfs raid5/6 implementation. The code has changed quite a bit, blame Chris Mason for any bugs. Read/modify/write is done after the higher levels of the filesystem have prepared a given bio. This means the higher layers are not responsible for building full stripes, and they don't need to query for the topology of the extents that may get allocated during delayed allocation runs. It also means different files can easily share the same stripe. But, it does expose us to incorrect parity if we crash or lose power while doing a read/modify/write cycle. This will be addressed in a later commit. Scrub is unable to repair crc errors on raid5/6 chunks. Discard does not work on raid5/6 (yet) The stripe size is fixed at 64KiB per disk. This will be tunable in a later commit. Signed-off-by: Chris Mason <[email protected]>
2013-02-01Btrfs: add rw argument to merge_bio_hook()David Woodhouse5-11/+11
We'll want to merge writes so they can fill a full RAID[56] stripe, but not necessarily reads. Signed-off-by: David Woodhouse <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-02-01btrfs: don't try to notify udev about missing devicesEric Sandeen1-1/+2
If we remove a missing device, bdev is null, and if we send that off to btrfs_kobject_uevent we'll panic. Signed-off-by: Eric Sandeen <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-01-29Merge branch 'master' into for-nextJiri Kosina43-1828/+5523
Conflicts: drivers/devfreq/exynos4_bus.c Sync with Linus' tree to be able to apply patches that are against newer code (mvneta).
2013-01-25Merge 3.8-rc5 into driver-core-nextGreg Kroah-Hartman14-98/+300
This resolves a gpio driver merge issue pointed out in linux-next. Signed-off-by: Greg Kroah-Hartman <[email protected]>
2013-01-25Merge branch 'for-linus' of ↵Linus Torvalds14-98/+300
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "It turns out that we had two crc bugs when running fsx-linux in a loop. Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it all down. Miao also has a new OOM fix in this v2 pull as well. Ilya fixed a regression Liu Bo found in the balance ioctls for pausing and resuming a running balance across drives. Josef's orphan truncate patch fixes an obscure corruption we'd see during xfstests. Arne's patches address problems with subvolume quotas. If the user destroys quota groups incorrectly the FS will refuse to mount. The rest are smaller fixes and plugs for memory leaks." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits) Btrfs: fix repeated delalloc work allocation Btrfs: fix wrong max device number for single profile Btrfs: fix missed transaction->aborted check Btrfs: Add ACCESS_ONCE() to transaction->abort accesses Btrfs: put csums on the right ordered extent Btrfs: use right range to find checksum for compressed extents Btrfs: fix panic when recovering tree log Btrfs: do not allow logged extents to be merged or removed Btrfs: fix a regression in balance usage filter Btrfs: prevent qgroup destroy when there are still relations Btrfs: ignore orphan qgroup relations Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag Btrfs: fix unlock order in btrfs_ioctl_rm_dev Btrfs: fix unlock order in btrfs_ioctl_resize Btrfs: fix "mutually exclusive op is running" error code Btrfs: bring back balance pause/resume logic btrfs: update timestamps on truncate() btrfs: fix btrfs_cont_expand() freeing IS_ERR em Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents Btrfs: fix off-by-one in lseek ...
2013-01-24Btrfs: fix repeated delalloc work allocationMiao Xie1-14/+41
btrfs_start_delalloc_inodes() locks the delalloc_inodes list, fetches the first inode, unlocks the list, triggers btrfs_alloc_delalloc_work/ btrfs_queue_worker for this inode, and then it locks the list, checks the head of the list again. But because we don't delete the first inode that it deals with before, it will fetch the same inode. As a result, this function allocates a huge amount of btrfs_delalloc_work structures, and OOM happens. Fix this problem by splice this delalloc list. Reported-by: Alex Lyakas <[email protected]> Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]>
2013-01-24Btrfs: fix wrong max device number for single profileMiao Xie1-1/+1
The max device number of single profile is 1, not 0 (0 means 'as many as possible'). Fix it. Cc: Liu Bo <[email protected]> Signed-off-by: Miao Xie <[email protected]> Reviewed-by: Liu Bo <[email protected]> Signed-off-by: Josef Bacik <[email protected]>
2013-01-24Btrfs: fix missed transaction->aborted checkMiao Xie1-0/+16
First, though the current transaction->aborted check can stop the commit early and avoid unnecessary operations, it is too early, and some transaction handles don't end, those handles may set transaction->aborted after the check. Second, when we commit the transaction, we will wake up some worker threads to flush the space cache and inode cache. Those threads also allocate some transaction handles and may set transaction->aborted if some serious error happens. So we need more check for ->aborted when committing the transaction. Fix it. Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]>
2013-01-24Btrfs: Add ACCESS_ONCE() to transaction->abort accessesMiao Xie2-2/+3
We may access and update transaction->aborted on the different CPUs without lock, so we need ACCESS_ONCE() wrapper to prevent the compiler from creating unsolicited accesses and make sure we can get the right value. Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Josef Bacik <[email protected]>
2013-01-24Btrfs: put csums on the right ordered extentJosef Bacik1-2/+2
I noticed a WARN_ON going off when adding csums because we were going over the amount of csum bytes that should have been allowed for an ordered extent. This is a leftover from when we used to hold the csums privately for direct io, but now we use the normal ordered sum stuff so we need to make sure and check if we've moved on to another extent so that the csums are added to the right extent. Without this we could end up with csums for bytenrs that don't have extents to cover them yet. Thanks, Signed-off-by: Josef Bacik <[email protected]>
2013-01-24Btrfs: use right range to find checksum for compressed extentsLiu Bo1-0/+5
For compressed extents, the range of checksum is covered by disk length, and the disk length is different with ram length, so we need to use disk length instead to get us the right checksum. Signed-off-by: Liu Bo <[email protected]> Signed-off-by: Josef Bacik <[email protected]>
2013-01-24Btrfs: fix panic when recovering tree logJosef Bacik1-8/+12
A user reported a BUG_ON(ret) that occured during tree log replay. Ret was -EAGAIN, so what I think happened is that we removed an extent that covered a bitmap entry and an extent entry. We remove the part from the bitmap and return -EAGAIN and then search for the next piece we want to remove, which happens to be an entire extent entry, so we just free the sucker and return. The problem is ret is still set to -EAGAIN so we trip the BUG_ON(). The user used btrfs-zero-log so I'm not 100% sure this is what happened so I've added a WARN_ON() to catch the other possibility. Thanks, Reported-by: Jan Steffens <[email protected]> Signed-off-by: Josef Bacik <[email protected]>
2013-01-24Btrfs: do not allow logged extents to be merged or removedJosef Bacik3-3/+16
We drop the extent map tree lock while we're logging extents, so somebody could come in and merge another extent into this one and screw up our logging, or they could even remove us from the list which would keep us from logging the extent or freeing our ref on it, so we need to make sure to not clear LOGGING until after the extent is logged, and then we can merge it to adjacent extents. Thanks, Signed-off-by: Josef Bacik <[email protected]>
2013-01-21Btrfs: fix a regression in balance usage filterIlya Dryomov1-1/+8
Commit 3fed40cc ("Btrfs: cleanup duplicated division functions"), which was merged into 3.8-rc1, has introduced a regression by removing logic that was guarding us against bad user input. Bring it back. Signed-off-by: Ilya Dryomov <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-01-21Merge branch 'mutex-ops@next-for-chris' of ↵Chris Mason2-31/+86
git://github.com/idryomov/btrfs-unstable into linus
2013-01-21Merge branch 'for-chris' of ↵Chris Mason6-35/+91
git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into linus
2013-01-21Btrfs: prevent qgroup destroy when there are still relationsArne Jansen1-1/+12
Currently you can just destroy a qgroup even though it is in use by other qgroups or has qgroups assigned to it. This patch prevents destruction of qgroups unless they are completely unused. Otherwise destroy will return EBUSY. Reported-by: Eric Hopper <[email protected]> Signed-off-by: Arne Jansen <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-01-21Btrfs: ignore orphan qgroup relationsArne Jansen1-0/+7
If a qgroup that has still assignments is deleted by the user, the corresponding relations are left in the tree. This leads to an unmountable filesystem. With this patch, those relations are simple ignored. Reported-by: Eric Hopper <[email protected]> Signed-off-by: Arne Jansen <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2013-01-21fs/btrfs: remove depends on CONFIG_EXPERIMENTALKees Cook1-2/+1
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. Cc: Chris Mason <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2013-01-20Btrfs: reorder locks and sanity checks in btrfs_ioctl_defragIlya Dryomov1-8/+9
Operation-specific check (whether subvol is readonly or not) should go after the mutual exclusiveness check. Signed-off-by: Ilya Dryomov <[email protected]>