path: root/fs/btrfs
Age | Commit message | Author | Files | Lines
2018-05-28 | btrfs: remove wrong use of volume_mutex from btrfs_dev_replace_start | David Sterba | 2 | -10/+1
The volume mutex does not protect against anything in this case, the comment about scrub is right but not related to locking and looks confusing. The comment in btrfs_find_device_missing_or_by_path is wrong and confusing too. The device_list_mutex is not held here to protect device lookup, but in this case device replace cannot run in parallel with device removal (due to exclusive op protection), so we don't need further locking here. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: cleanup helpers that reset balance state | David Sterba | 2 | -20/+17
The name of __cancel_balance is easily confused with the cancel operation of balance, while the function really resets the state of balance back to zero. The unset_balance_control helper is called from only one place and is simple enough to be inlined. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: add sanity check when resuming balance after mount | David Sterba | 1 | -1/+13
Replace a WARN_ON with a proper check and message in case something goes really wrong and resumed balance cannot set up its exclusive status. The check is a user friendly assertion, I don't expect it to ever happen under normal circumstances. Also document that the paused balance starts here and owns the exclusive op status. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: add proper safety check before resuming dev-replace | David Sterba | 1 | -1/+11
The device replace is paused by unmount or read-only remount, and resumed on the next mount or write remount. The exclusive status should be checked properly as it's a global invariant and we must not allow 2 operations to run. In this case, the balance can also be paused and resumed under the same conditions. It's always checked first so dev-replace could see the EXCL_OP already taken, BUT, the ioctl would never let both start at the same time. Replace the WARN_ON with a message and return 0, indicating no error as this is purely theoretical and the user will be informed. Resolving that manually should be possible by waiting for the other operation to finish or by cancelling the paused state. Reviewed-by: Anand Jain <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: move clearing of EXCL_OP out of __cancel_balance | David Sterba | 2 | -7/+8
Make the clearing visible in the callers so we can pair it with the test_and_set part. Signed-off-by: David Sterba <[email protected]>
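The pairing this commit aims for can be sketched in userspace C. This is only an illustration, assuming a C11 atomic_flag as a stand-in for the BTRFS_FS_EXCL_OP bit; excl_op_try_start(), excl_op_finish() and run_exclusive() are hypothetical names, and the -EBUSY return value is an assumption rather than what the kernel reports.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the BTRFS_FS_EXCL_OP bit. */
static atomic_flag excl_op = ATOMIC_FLAG_INIT;

/* Returns true if the caller now owns the exclusive-operation status. */
static bool excl_op_try_start(void)
{
    /* atomic_flag_test_and_set() returns the previous value of the flag. */
    return !atomic_flag_test_and_set(&excl_op);
}

/* Gives up the exclusive-operation status. */
static void excl_op_finish(void)
{
    atomic_flag_clear(&excl_op);
}

/*
 * Hypothetical caller: the set and the clear sit next to each other in the
 * same function, so it is obvious that they belong together.
 */
static int run_exclusive(int (*op)(void))
{
    int ret;

    assert(op != NULL);
    if (!excl_op_try_start())
        return -EBUSY; /* assumption: another exclusive operation is running */
    ret = op();
    excl_op_finish();
    return ret;
}
```

Keeping both halves in the caller means a reader never has to look inside a helper to find out where the exclusive status is dropped.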
2018-05-28 | btrfs: move volume_mutex to callers of btrfs_rm_device | David Sterba | 2 | -2/+4
Move locking and unlocking next to the BTRFS_FS_EXCL_OP bit manipulation so it's obvious that the two happen at the same time. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: move btrfs_init_dev_replace_tgtdev to dev-replace.c and make static | David Sterba | 3 | -103/+99
The function logically belongs there and there's only a single caller, no need to export it. No code changes. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: export and rename free_device | David Sterba | 2 | -12/+13
The function will be used outside of volumes.c, the allocation btrfs_alloc_device is also exported. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: make success path out of btrfs_init_dev_replace_tgtdev more clear | David Sterba | 2 | -2/+7
This is a preparatory cleanup that will make clear that the only successful way out of btrfs_init_dev_replace_tgtdev will also set the device_out to a valid pointer. With this guarantee, the callers can be simplified. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: squeeze btrfs_dev_replace_continue_on_mount to its caller | David Sterba | 1 | -13/+3
The function is called once and is fairly small, we can merge it with the caller. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: cleanup btrfs_rm_device() promote fs_devices pointer | Anand Jain | 1 | -7/+6
This function uses fs_info::fs_devices a number of times, but the local pointer is declared and used only at the end. Declare it at the beginning of the function and use it throughout. Signed-off-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: cleanup find_device() drop list_head pointer | Anand Jain | 1 | -2/+1
find_device() declares a struct list_head *head pointer that is used only once; use the list directly instead. Signed-off-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: rename __btrfs_open_devices to open_fs_devices | Anand Jain | 1 | -4/+3
__btrfs_open_devices() is not exported, so drop the __ prefix and rename it to open_fs_devices(). Signed-off-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: rename __btrfs_close_devices to close_fs_devices | Anand Jain | 1 | -6/+6
__btrfs_close_devices() is un-exported, drop the __ prefix and rename it to close_fs_devices(). Signed-off-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: cleanup __btrfs_open_devices() drop head pointer | Anand Jain | 1 | -2/+1
__btrfs_open_devices() declares struct list_head *head, however head is used only once, instead use btrfs_fs_devices::devices directly. Signed-off-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: rename struct btrfs_fs_devices::list | Anand Jain | 3 | -10/+10
btrfs_fs_devices::list is the list of BTRFS fsids in the kernel. The generic name 'list' makes searching for it very difficult, so rename it to fs_list. Signed-off-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: Drop fs_info parameter from btrfs_merge_delayed_refs | Nikolay Borisov | 3 | -4/+2
It's provided by the transaction handle. Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: Drop fs_info parameter from add_delayed_data_ref | Nikolay Borisov | 1 | -7/+5
It's provided by the transaction handle. Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: Drop add_delayed_ref_head fs_info parameter | Nikolay Borisov | 1 | -11/+10
It's provided by the transaction handle. Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: Remove btrfs_wait_and_free_delalloc_work | Nikolay Borisov | 2 | -8/+2
This function is called from only 1 place and is effectively a wrapper over wait_completion/kfree. It doesn't really bring any value having those two calls in a separate function. Just open code it and remove it. No functional changes. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: Remove tree argument from extent_writepages | Nikolay Borisov | 3 | -9/+4
It can be directly referenced from the passed address_space so do that. No functional changes. Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: Use list_empty instead of list_empty_careful | Nikolay Borisov | 1 | -2/+2
list_empty_careful usually is a signal of something tricky going on. Its usage in btrfs is actually not needed since both lists it's used on are local to a function and cannot be modified concurrently. So switch to plain list_empty. No functional changes. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: Remove redundant tree argument from extent_readpages | Nikolay Borisov | 3 | -9/+7
This function is called only from btrfs_readpage and is already passed the mapping. Simplify its signature by moving the code obtaining reference to the extent tree in the function. No functional changes. Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: Remove map argument from try_release_extent_state | Nikolay Borisov | 1 | -3/+2
It's not used in the function so just remove it. No functional changes. Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: Sink extent_tree arguments in try_release_extent_mapping | Nikolay Borisov | 3 | -13/+5
This function already gets the page from which the two extent trees are referenced. Simplify its signature by moving the code getting the trees inside the function. No functional changes. Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: Allow rmdir(2) to delete an empty subvolume | Misono Tomohiro | 1 | -1/+1
Change the behavior of rmdir(2) and allow it to delete an empty subvolume by using btrfs_delete_subvolume() which is used by btrfs_ioctl_snap_destroy(). This is a change in behaviour and has been requested by users. Deleting the subvolume by ioctl requires root permissions while the rmdir way works with standard tools and syscalls for all users that can access the subvolume. The main usecase is to allow 'rm -rf /path/with/subvols' to simply work. We were not able to find any nasty usability surprises, the intention is to do the destructive rm. Without allowing rmdir, this would have to be followed by the ioctl subvolume deletion, which is more of an annoyance. Implementation details: The required lock for @dir and inode of @dentry is already acquired in vfs layer. We need some checks before deleting a subvolume. Permission check is done in vfs layer, emptiness check is in btrfs_rmdir() and additional checks (i.e. neither the subvolume is a default subvolume nor send is in progress) are in btrfs_delete_subvolume(). Note that in btrfs_ioctl_snap_destroy(), d_delete() is called after btrfs_delete_subvolume(). For rmdir(2), d_delete() is called in vfs layer later. Tested-by: Goffredo Baroncelli <[email protected]> Signed-off-by: Tomohiro Misono <[email protected]> Reviewed-by: David Sterba <[email protected]> [ enhance changelog ] Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: Factor out the main deletion process from btrfs_ioctl_snap_destroy() | Misono Tomohiro | 3 | -136/+144
Factor out the second half of btrfs_ioctl_snap_destroy() as btrfs_delete_subvolume(), which performs some subvolume specific checks before deletion: 1. send is not in progress 2. the subvolume is not the default subvolume 3. the subvolume does not contain other subvolumes and actual deletion process. btrfs_delete_subvolume() requires inode_lock for both @dir and inode of @dentry. The remaining part of btrfs_ioctl_snap_destroy() is mainly permission checks. Note that call of d_delete() is not included in btrfs_delete_subvolume() as this function will also be used by btrfs_rmdir() to delete an empty subvolume and in that case d_delete() is called in VFS layer. As a result, btrfs_unlink_subvol() and may_destroy_subvol() become static functions. No functional changes. Signed-off-by: Tomohiro Misono <[email protected]> Reviewed-by: David Sterba <[email protected]> [ minor comment updates ] Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: Move may_destroy_subvol() from ioctl.c to inode.c | Misono Tomohiro | 3 | -54/+56
This is a preparation work to refactor btrfs_ioctl_snap_destroy() and to allow rmdir(2) to delete an empty subvolume. Signed-off-by: Tomohiro Misono <[email protected]> Reviewed-by: David Sterba <[email protected]> [ minor update of the function comment ] Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: remove unused le_test_bit() | Howard McLauchlan | 1 | -5/+0
With commit b18253ec57c0 ("btrfs: optimize free space tree bitmap conversion"), there are no more callers to le_test_bit(). This patch removes le_test_bit(). Signed-off-by: Howard McLauchlan <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: optimize free space tree bitmap conversion | Howard McLauchlan | 1 | -38/+23
Presently, convert_free_space_to_extents() does a linear scan of the bitmap. We can speed this up with find_next_{bit,zero_bit}_le(). This patch replaces the linear scan with find_next_{bit,zero_bit}_le(). Testing shows a 20-33% decrease in execution time for convert_free_space_to_extents(). Since we change bitmap to be unsigned long, we have to do some casting for the bitmap cursor. In le_bitmap_set() it makes sense to use u8, as we are doing bit operations. Everywhere else, we're just using it for pointer arithmetic and not directly accessing it, so char seems more appropriate. Suggested-by: Omar Sandoval <[email protected]> Signed-off-by: Howard McLauchlan <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
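The idea of replacing a per-bit scan with "find next set bit" and "find next zero bit" searches can be sketched in userspace C. This is not the kernel code: next_bit_le(), bitmap_to_runs() and struct bit_run are hypothetical names, and the byte-skipping helper is a simplified stand-in for the word-at-a-time find_next_{bit,zero_bit}_le() helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One run of consecutive set bits, i.e. one free extent in bitmap units. */
struct bit_run {
    size_t start;
    size_t len;
};

/* Read one bit of a little-endian, byte-addressed bitmap. */
static int test_bit_le(const uint8_t *map, size_t bit)
{
    return (map[bit / 8] >> (bit % 8)) & 1;
}

/*
 * Return the first bit at or after 'start' whose value equals 'want',
 * or 'nbits' if there is none. Whole bytes that cannot contain a match
 * are skipped in one step.
 */
static size_t next_bit_le(const uint8_t *map, size_t nbits, size_t start,
                          int want)
{
    uint8_t skip = want ? 0x00 : 0xff; /* a byte with no matching bit */
    size_t bit = start;

    while (bit < nbits) {
        if (bit % 8 == 0 && bit + 8 <= nbits && map[bit / 8] == skip) {
            bit += 8;
            continue;
        }
        if (test_bit_le(map, bit) == want)
            return bit;
        bit++;
    }
    return nbits;
}

/* Convert the bitmap into runs of set bits; returns the number of runs. */
static size_t bitmap_to_runs(const uint8_t *map, size_t nbits,
                             struct bit_run *out, size_t max_runs)
{
    size_t count = 0;
    size_t start = next_bit_le(map, nbits, 0, 1);

    assert(out != NULL || max_runs == 0);
    while (start < nbits && count < max_runs) {
        /* The run ends at the next zero bit (or at the end of the map). */
        size_t end = next_bit_le(map, nbits, start, 0);

        out[count].start = start;
        out[count].len = end - start;
        count++;
        start = next_bit_le(map, nbits, end, 1);
    }
    return count;
}
```

Each loop iteration jumps directly from one run boundary to the next, so the amount of per-bit work depends on how fragmented the bitmap is rather than on its total size.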
2018-05-28 | btrfs: clean up le_bitmap_{set, clear}() | Howard McLauchlan | 3 | -43/+20
le_bitmap_set() is only used by free-space-tree, so move it there and make it static. le_bitmap_clear() is not used, so remove it. Signed-off-by: Howard McLauchlan <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: use fs_info for btrfs_handle_em_exist tracepoint | David Sterba | 4 | -8/+11
We really want to know to which filesystem the extent map events belong, but as it cannot be reached from the extent_map pointers, we need to pass it down the callchain. Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: tests: pass fs_info to extent_map tests | David Sterba | 1 | -16/+36
Preparatory work to pass fs_info to btrfs_add_extent_mapping so we can get a better tracepoint message. Extent maps do not need fs_info for anything so we only add a dummy one without any other initialization. Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: Consolidate error checking for btrfs_alloc_chunk | Nikolay Borisov | 1 | -5/+7
The second if is really a subcase of ret being less than 0. So introduce a generic if (ret < 0) check, and inside have another if which explicitly handles the -ENOSPC and any other errors. No functional changes. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
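The control-flow shape described here can be sketched as follows. This is a hedged illustration, not the kernel code: classify_alloc_result() and enum chunk_action are hypothetical, and the sketch assumes the allocator returns 0 on success or a negative errno value.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcomes a caller might distinguish. */
enum chunk_action {
    CHUNK_OK,       /* allocation succeeded */
    CHUNK_NO_SPACE, /* -ENOSPC, handled as its own case */
    CHUNK_ABORT     /* any other error */
};

static enum chunk_action classify_alloc_result(int ret)
{
    /* Assumption of this sketch: ret is 0 or a negative errno. */
    assert(ret <= 0);

    /* One generic error check, with -ENOSPC as an explicit subcase. */
    if (ret < 0) {
        if (ret == -ENOSPC)
            return CHUNK_NO_SPACE;
        return CHUNK_ABORT;
    }
    return CHUNK_OK;
}
```

Nesting the -ENOSPC test inside the generic check makes it visible that it is a special case of failure rather than a separate condition.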
2018-05-28 | btrfs: Fix lock release order | Nikolay Borisov | 1 | -1/+1
Locks should generally be released in the opposite order they are acquired. Generally lock acquisition ordering is used to ensure deadlocks don't happen. However, as the locking becomes more complicated it's best to also maintain proper unlock order so as to avoid possible deadlocks. This was found by code inspection and doesn't necessarily lead to a deadlock scenario. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: Use while loop instead of labels in __endio_write_update_ordered | Nikolay Borisov | 1 | -27/+25
Currently __endio_write_update_ordered uses labels to implement what is essentially a simple while loop. This makes the code more cumbersome to follow than it actually has to be. No functional changes. No xfstest regressions were found during testing. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: add comment about BTRFS_FS_EXCL_OP | Anand Jain | 1 | -0/+35
Adds comments about BTRFS_FS_EXCL_OP to existing comments about the device locks. Signed-off-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> [ minor updates ] Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: Drop delayed_refs argument from btrfs_check_delayed_seq | Nikolay Borisov | 3 | -10/+5
It's used to print its pointer in a debug statement but doesn't really bring any useful information to the error message. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: rename btrfs_get_block_group_info and make it static | Su Yue | 2 | -6/+4
The function btrfs_get_block_group_info() was introduced by the commit 5af3e8cce8b7 ("Btrfs: make filesystem read-only when submitting barrier fails") which used it in disk-io.c. However, the function is only called in ioctl.c now. Its parameter type btrfs_ioctl_space_info* is only for ioctl. So, make it static and rename it back to its original name, get_block_group_info. No functional change. Signed-off-by: Su Yue <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-28 | btrfs: Replace owner argument in add_pinned_bytes with a boolean | Nikolay Borisov | 1 | -7/+13
add_pinned_bytes really cares whether the bytes being pinned are data or metadata. To that effect it checks whether the 'owner' argument is less than BTRFS_FIRST_FREE_OBJECTID (256). This works because owner can really have 2 types of values:
a) For metadata extents it holds the level at which the parent is in the btree. This amounts to owner having the values 0-7.
b) In case of modifying data extents, owner is the inode number to which those extents belong.
Let's make this more explicit by converting the owner parameter to a boolean value and passing it directly when we know the type of extents we are working with (i.e. in btrfs_free_tree_block). In cases when the parent function can be called on both metadata/data extents, perform the check in the caller. This hopefully makes the interface of add_pinned_bytes more intuitive. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
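The interface change can be illustrated with a small userspace sketch. Only the constant 256 and the owner semantics come from the commit message; owner_is_metadata(), add_pinned() and struct pinned_totals are hypothetical and much simpler than the kernel's accounting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Value quoted in the commit message. */
#define BTRFS_FIRST_FREE_OBJECTID 256ULL

/* Hypothetical, simplified accounting structure. */
struct pinned_totals {
    int64_t data;
    int64_t metadata;
};

/*
 * The check that used to live inside the helper: tree levels (0-7) are
 * below the first free object id, inode numbers are not.
 */
static bool owner_is_metadata(uint64_t owner)
{
    return owner < BTRFS_FIRST_FREE_OBJECTID;
}

/* The helper only needs to know which kind of bytes it is accounting. */
static void add_pinned(struct pinned_totals *totals, int64_t num_bytes,
                       bool metadata)
{
    assert(totals != NULL);
    if (metadata)
        totals->metadata += num_bytes;
    else
        totals->data += num_bytes;
}
```

A caller that always frees tree blocks can pass true directly, while a caller that handles both kinds passes owner_is_metadata(owner), which keeps the encoding detail out of the helper.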
2018-05-24 | Merge tag 'for-4.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux | Linus Torvalds | 1 | -1/+2
Pull btrfs fix from David Sterba: "A one-liner that prevents leaking an internal error value 1 out of the ftruncate syscall. This has been observed in practice. The steps to reproduce make a common pattern (open/write/fsync/ftruncate) but also need the application to not check only for negative values and happens only for compressed inlined files. The conditions are narrow but as this could break userspace I think it's better to merge it now and not wait for the merge window"
* tag 'for-4.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix error handling in btrfs_truncate()
2018-05-24 | Btrfs: fix error handling in btrfs_truncate() | Omar Sandoval | 1 | -1/+2
Jun Wu at Facebook reported that an internal service was seeing a return value of 1 from ftruncate() on Btrfs in some cases. This is coming from the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items(). btrfs_truncate() uses two variables for error handling, ret and err. When btrfs_truncate_inode_items() returns non-zero, we set err to the return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we only set err if ret is an error (i.e., negative).

To reproduce the issue: mount a filesystem with -o compress-force=zstd and the following program will encounter return value of 1 from ftruncate:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[256] = { 0 };
        int ret;
        int fd;

        fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
        if (fd == -1) {
            perror("open");
            return EXIT_FAILURE;
        }

        if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
            perror("write");
            close(fd);
            return EXIT_FAILURE;
        }

        if (fsync(fd) == -1) {
            perror("fsync");
            close(fd);
            return EXIT_FAILURE;
        }

        /* A non-negative, non-zero return value is the bug being shown. */
        ret = ftruncate(fd, 128);
        if (ret) {
            printf("ftruncate() returned %d\n", ret);
            close(fd);
            return EXIT_FAILURE;
        }

        close(fd);
        return EXIT_SUCCESS;
    }

Fixes: ddfae63cc8e0 ("btrfs: move btrfs_truncate_block out of trans handle") CC: [email protected] # 4.15+ Reported-by: Jun Wu <[email protected]> Signed-off-by: Omar Sandoval <[email protected]> Signed-off-by: David Sterba <[email protected]>
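The essence of the fix, propagating only negative values as errors, can be sketched in isolation. This is an illustration rather than the kernel change itself: truncate_result() is a hypothetical helper, and the value 1 for NEED_TRUNCATE_BLOCK is taken from the return value reported in the commit message.

```c
#include <assert.h>
#include <errno.h>

/* Internal "more work needed" status; not an error. */
#define NEED_TRUNCATE_BLOCK 1

/*
 * Hypothetical helper: map an internal return value to what the syscall
 * should report. Only negative values are errors.
 */
static int truncate_result(int ret)
{
    int err = 0;

    /* Assumption of this sketch: ret is a negative errno, 0, or 1. */
    assert(ret <= NEED_TRUNCATE_BLOCK);

    if (ret < 0) /* the buggy form was: if (ret) */
        err = ret;
    return err;
}
```

With the old "if (ret)" test the internal status 1 would leak out to userspace, which is exactly what the reproducer above observes.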
2018-05-21 | Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds | 1 | -12/+4
Pull vfs fixes from Al Viro: "Assorted fixes all over the place"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  aio: fix io_destroy(2) vs. lookup_ioctx() race
  ext2: fix a block leak
  nfsd: vfs_mkdir() might succeed leaving dentry negative unhashed
  cachefiles: vfs_mkdir() might succeed leaving dentry negative unhashed
  unfuck sysfs_mount()
  kernfs: deal with kernfs_fill_super() failures
  cramfs: Fix IS_ENABLED typo
  befs_lookup(): use d_splice_alias()
  affs_lookup: switch to d_splice_alias()
  affs_lookup(): close a race with affs_remove_link()
  fix breakage caused by d_find_alias() semantics change
  fs: don't scan the inode cache before SB_BORN is set
  do d_instantiate/unlock_new_inode combinations safely
  iov_iter: fix memory leak in pipe_get_pages_alloc()
  iov_iter: fix return type of __pipe_get_pages()
2018-05-20 | Merge tag 'for-4.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux | Linus Torvalds | 7 | -48/+180
Pull btrfs fixes from David Sterba: "We've accumulated some fixes during the last week, some of them were in the works for a longer time but there are some newer ones too. Most of the fixes have a reproducer and fix user visible problems, also candidates for stable kernels. They IMHO qualify for a late rc, though I did not expect that many"
* tag 'for-4.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix crash when trying to resume balance without the resume flag
  btrfs: Fix delalloc inodes invalidation during transaction abort
  btrfs: Split btrfs_del_delalloc_inode into 2 functions
  btrfs: fix reading stale metadata blocks after degraded raid1 mounts
  btrfs: property: Set incompat flag if lzo/zstd compression is set
  Btrfs: fix duplicate extents after fsync of file with prealloc extents
  Btrfs: fix xattr loss after power failure
  Btrfs: send, fix invalid access to commit roots due to concurrent snapshotting
2018-05-17 | btrfs: fix crash when trying to resume balance without the resume flag | Anand Jain | 1 | -0/+9
We set the BTRFS_BALANCE_RESUME flag in the btrfs_recover_balance() only, which isn't called during the remount. So when resuming from the paused balance we hit the bug:

  kernel: kernel BUG at fs/btrfs/volumes.c:3890!
  ::
  kernel: balance_kthread+0x51/0x60 [btrfs]
  kernel: kthread+0x111/0x130
  ::
  kernel: RIP: btrfs_balance+0x12e1/0x1570 [btrfs] RSP: ffffba7d0090bde8

Reproducer: On a mounted filesystem:

  btrfs balance start --full-balance /btrfs
  btrfs balance pause /btrfs
  mount -o remount,ro /dev/sdb /btrfs
  mount -o remount,rw /dev/sdb /btrfs

To fix this set the BTRFS_BALANCE_RESUME flag in btrfs_resume_balance_async(). CC: [email protected] # 4.4+ Signed-off-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-17 | btrfs: Fix delalloc inodes invalidation during transaction abort | Nikolay Borisov | 1 | -11/+15
When a transaction is aborted btrfs_cleanup_transaction is called to clean up all the various in-flight bits and pieces which might be active. One of those is delalloc inodes - inodes which have dirty pages which haven't been persisted yet. Currently the process of freeing such delalloc inodes in exceptional circumstances such as transaction abort boils down to calling btrfs_invalidate_inodes whose sole job is to invalidate the dentries for all inodes related to a root. This is in fact wrong and insufficient since such delalloc inodes will likely have pending pages or ordered-extents and will be linked to the sb->s_inode_list. This means that unmounting a btrfs instance with an aborted transaction could potentially leave inodes and their pages visible to the system long after their superblock has been freed. This in turn leads to a "use-after-free" situation once page shrink is triggered.

This situation could be simulated by running generic/019 which would cause such inodes to be left hanging, followed by generic/176 which causes memory pressure and page eviction which lead to touching the freed super block instance. This situation is additionally detected by the unmount code of VFS with the following message: "VFS: Busy inodes after unmount of Self-destruct in 5 seconds. Have a nice day..." Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree)); in free_fs_root for the same reason.

This patch aims to rectify the situation by doing the following:
1. Change btrfs_destroy_delalloc_inodes so that it calls invalidate_inode_pages2 for every inode on the delalloc list, this ensures that all the pages of the inode are released. This function boils down to calling btrfs_releasepage. During test I observed cases where inodes on the delalloc list were having an i_count of 0, so this necessitates using igrab to be sure we are working on a non-freed inode.
2. Since calling btrfs_releasepage might queue delayed iputs, move the call out to btrfs_cleanup_transaction in btrfs_error_commit_super before calling run_delayed_iputs for the last time. This is necessary to ensure that delayed iputs are run.

Note: this patch is tagged for 4.14 stable but the fix applies to older versions too but needs to be backported manually due to conflicts. CC: [email protected] # 4.14.x: 2b8773313494: btrfs: Split btrfs_del_delalloc_inode into 2 functions CC: [email protected] # 4.14.x Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> [ add comment to igrab ] Signed-off-by: David Sterba <[email protected]>
2018-05-17 | btrfs: Split btrfs_del_delalloc_inode into 2 functions | Nikolay Borisov | 2 | -3/+12
This is in preparation of fixing delalloc inodes leakage on transaction abort. Also export the new function. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-17 | btrfs: fix reading stale metadata blocks after degraded raid1 mounts | Liu Bo | 1 | -3/+3
If a btree block, aka. extent buffer, is not available in the extent buffer cache, it'll be read out from the disk instead, i.e.

  btrfs_search_slot()
    read_block_for_search()  # hold parent and its lock, go to read child
      btrfs_release_path()
      read_tree_block()  # read child

Unfortunately, the parent lock got released before reading child, so commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had used 0 as parent transid to read the child block. It forces read_tree_block() not to check if parent transid differs from the generation id of the child that it reads out from disk.

A simple PoC is included in btrfs/124:
0. A two-disk raid1 btrfs,
1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
2. Mount this filesystem and put it in use, after a while, device tree's root got COW but block A hasn't been allocated/overwritten yet.
3. Umount it and reload the btrfs module to remove both disks from the global @fs_devices list.
4. mount -odegraded dev1 and write some data, so now block A is allocated to be a leaf in checksum tree. Note that only dev1 has the latest metadata of this filesystem.
5. Umount it and mount it again normally (with both disks), since raid1 can pick up one disk by the writer task's pid, if btrfs_search_slot() needs to read block A, dev2 which does NOT have the latest metadata might be read for block A, then we got a stale block A.
6. As parent transid is not checked, block A is marked as uptodate and put into the extent buffer cache, so the future search won't bother to read disk again, which means it'll make changes on this stale one and make it dirty and flush it onto disk.

To avoid the problem, parent transid needs to be passed to read_tree_block(). In order to get a valid parent transid, we need to hold the parent's lock until finishing reading child. This patch needs to be slightly adapted for stable kernels, the &first_key parameter added to read_tree_block() is from 4.16+ (581c1760415c4). The fix is to replace 0 by 'gen'. Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race") CC: [email protected] # 4.4+ Signed-off-by: Liu Bo <[email protected]> Reviewed-by: Filipe Manana <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> [ update changelog ] Signed-off-by: David Sterba <[email protected]>
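Why passing 0 as the parent transid hides a stale block can be shown with a small sketch. This models only the rule stated in the message (a transid of 0 disables the comparison); struct tree_block and block_generation_ok() are hypothetical and do not mirror the kernel's verification code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, minimal view of a tree block read from disk. */
struct tree_block {
    uint64_t generation; /* transaction id that last wrote the block */
};

/*
 * Accept the block if its generation matches what the parent expects.
 * A parent_transid of 0 means "do not check", which is what allowed a
 * stale copy from the out-of-date mirror to be accepted.
 */
static bool block_generation_ok(const struct tree_block *blk,
                                uint64_t parent_transid)
{
    assert(blk != NULL);
    if (parent_transid == 0)
        return true;
    return blk->generation == parent_transid;
}
```

For a stale copy with generation 7 where the parent expects 9, the check passes with a transid of 0 and fails with 9, so a caller that knows the expected generation can retry the read from the other mirror.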
2018-05-17 | btrfs: property: Set incompat flag if lzo/zstd compression is set | Misono Tomohiro | 1 | -4/+8
Incompat flag of LZO/ZSTD compression should be set at:
1. mount time (-o compress/compress-force)
2. when defrag is done
3. when property is set
Currently 3. is missing and this commit adds this. This could lead to a filesystem that uses ZSTD but is not marked as such. If a kernel without ZSTD support encounters a ZSTD compressed extent, it will handle that but this could be confusing to the user. Typically the filesystem is mounted with the ZSTD option, but the discrepancy can arise when a filesystem is never mounted with ZSTD and then the property on some file is set (and some new extents are written). A simple mount with -o compress=zstd will fix that up on an unpatched kernel. Same goes for LZO, but this has been around for a very long time (2.6.37) so it's unlikely that a pre-LZO kernel would be used. Fixes: 5c1aab1dd544 ("btrfs: Add zstd support") CC: [email protected] # 4.14+ Signed-off-by: Tomohiro Misono <[email protected]> Reviewed-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> [ add user visible impact ] Signed-off-by: David Sterba <[email protected]>
2018-05-17 | Btrfs: fix duplicate extents after fsync of file with prealloc extents | Filipe Manana | 1 | -25/+112
In commit 471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay"), on fsync, we started to always log all prealloc extents beyond an inode's i_size in order to avoid losing them after a power failure. However under some cases this can lead to the log replay code to create duplicate extent items, with different lengths, in the extent tree. That happens because, as of that commit, we can now log extent items based on extent maps that are not on the "modified" list of extent maps of the inode's extent map tree. Logging extent items based on extent maps is used during the fast fsync path to save time and for this to work reliably it requires that the extent maps are not merged with other adjacent extent maps - having the extent maps in the list of modified extents gives such guarantee.

Consider the following example, captured during a long run of fsstress, which illustrates this problem. We have inode 271, in the filesystem tree (root 5), to which all of the following operations and discussion apply.

A buffered write starts at offset 312391 with a length of 933471 bytes (end offset at 1245862). At this point we have, for this inode, the following extent maps with their field values:

  em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613, block_len 0, orig_block_len 0
  em B, start 40960, orig_start 40960, len 376832, block_start 1106399232, block_len 376832, orig_block_len 376832
  em C, start 417792, orig_start 417792, len 782336, block_start 18446744073709551613, block_len 0, orig_block_len 0
  em D, start 1200128, orig_start 1200128, len 835584, block_start 1106776064, block_len 835584, orig_block_len 835584
  em E, start 2035712, orig_start 2035712, len 245760, block_start 1107611648, block_len 245760, orig_block_len 245760

Extent map A corresponds to a hole and extent maps D and E correspond to preallocated extents.
Extent map D ends where extent map E begins (1106776064 + 835584 = 1107611648), but these extent maps were not merged because they are in the inode's list of modified extent maps.

An fsync against this inode is made, which triggers the fast path (BTRFS_INODE_NEEDS_FULL_SYNC is not set). This fsync triggers writeback of the data previously written using buffered IO, and when the respective ordered extent finishes, btrfs_drop_extents() is called against the (aligned) range 311296..1249279. This causes a split of extent map D at btrfs_drop_extent_cache(), replacing extent map D with a new extent map D', also added to the list of modified extents, with the following values:

  em D', start 1249280, orig_start of 1200128, block_start 1106825216 (= 1106776064 + 1249280 - 1200128), orig_block_len 835584, block_len 786432 (835584 - (1249280 - 1200128))

Then, during the fast fsync, btrfs_log_changed_extents() is called and extent maps D' and E are removed from the list of modified extents. The flag EXTENT_FLAG_LOGGING is also set on them.
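The arithmetic behind D' can be checked with a short sketch. This is a hedged userspace model, assuming an uncompressed extent where file offsets and disk offsets advance together; struct em and em_split_tail() are hypothetical and only mirror the field names used in the listing above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the extent map fields quoted in the example. */
struct em {
    uint64_t start;
    uint64_t orig_start;
    uint64_t len;
    uint64_t block_start;
    uint64_t block_len;
    uint64_t orig_block_len;
};

/*
 * Keep only the part of 'em' at or after file offset 'new_start'.
 * Assumes an uncompressed extent, so the same byte delta applies to the
 * file range and to the on-disk range.
 */
static struct em em_split_tail(const struct em *em, uint64_t new_start)
{
    struct em tail = *em;
    uint64_t diff;

    assert(new_start >= em->start && new_start < em->start + em->len);
    diff = new_start - em->start;

    tail.start = new_start;
    tail.len = em->len - diff;
    tail.block_start = em->block_start + diff;
    tail.block_len = em->block_len - diff;
    /* orig_start and orig_block_len keep describing the original extent. */
    return tail;
}
```

Applied to extent map D with new_start 1249280, the delta is 49152 bytes, which gives block_start 1106825216 and block_len 786432, the values listed for D'.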
After the extents are logged, clear_em_logging() is called on each of them, and that causes extent map E to be merged with extent map D' (try_merge_map()), resulting in D' being deleted and E adjusted to:

  em E, start 1249280, orig_start 1200128, len 1032192, block_start 1106825216, block_len 1032192, orig_block_len 245760

A direct IO write at offset 1847296 and length of 360448 bytes (end offset at 2207744) starts, and at that moment the following extent maps exist for our inode:

  em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613, block_len 0, orig_block_len 0
  em B, start 40960, orig_start 40960, len 270336, block_start 1106399232, block_len 270336, orig_block_len 376832
  em C, start 311296, orig_start 311296, len 937984, block_start 1112842240, block_len 937984, orig_block_len 937984
  em E (prealloc), start 1249280, orig_start 1200128, len 1032192, block_start 1106825216, block_len 1032192, orig_block_len 245760

The dio write results in drop_extent_cache() being called twice. The first time for a range that starts at offset 1847296 and ends at offset 2035711 (length of 188416), which results in a double split of extent map E, replacing it with two new extent maps:

  em F, start 1249280, orig_start 1200128, block_start 1106825216, block_len 598016, orig_block_len 598016
  em G, start 2035712, orig_start 1200128, block_start 1107611648, block_len 245760, orig_block_len 1032192

It also creates a new extent map that represents a part of the requested IO (through create_io_em()):

  em H, start 1847296, len 188416, block_start 1107423232, block_len 188416

The second call to drop_extent_cache() has a range with a start offset of 2035712 and end offset of 2207743 (length of 172032).
This leads to replacing extent map G with a new extent map I with the following values:

  em I, start 2207744, orig_start 1200128, block_start 1107783680, block_len 73728, orig_block_len 1032192

It also creates a new extent map that represents the second part of the requested IO (through create_io_em()):

  em J, start 2035712, len 172032, block_start 1107611648, block_len 172032

The dio write sets the inode's i_size to 2207744 bytes.

After the dio write the inode has the following extent maps:

  em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613, block_len 0, orig_block_len 0
  em B, start 40960, orig_start 40960, len 270336, block_start 1106399232, block_len 270336, orig_block_len 376832
  em C, start 311296, orig_start 311296, len 937984, block_start 1112842240, block_len 937984, orig_block_len 937984
  em F, start 1249280, orig_start 1200128, len 598016, block_start 1106825216, block_len 598016, orig_block_len 598016
  em H, start 1847296, orig_start 1200128, len 188416, block_start 1107423232, block_len 188416, orig_block_len 835584
  em J, start 2035712, orig_start 2035712, len 172032, block_start 1107611648, block_len 172032, orig_block_len 245760
  em I, start 2207744, orig_start 1200128, len 73728, block_start 1107783680, block_len 73728, orig_block_len 1032192

Now make some change to the file, such as adding an xattr, and then fsync it again. This triggers a fast fsync path, and as of commit 471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay"), we use the extent map I to log a file extent item because it's a prealloc extent and it starts at an offset matching the inode's i_size.
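
To see where the fields of extent map I come from, the merge and the two splits can be retraced with a small model. This is an illustrative Python sketch and not the kernel implementation: the `Em` tuple and the `merge` and `tail_after` helpers are hypothetical names, the merge rule (the merged map takes start, orig_start and block_start from the earlier map) is inferred from the values listed in the text, and the len of D' is assumed to equal its block_len since the extent is not compressed.

```python
from collections import namedtuple

# Hypothetical, simplified extent map: only the fields needed for this trace.
Em = namedtuple("Em", "start orig_start length block_start block_len")


def merge(prev, nxt):
    """Merge two maps that are adjacent both in the file and on disk."""
    assert prev.start + prev.length == nxt.start
    assert prev.block_start + prev.block_len == nxt.block_start
    return Em(prev.start, prev.orig_start, prev.length + nxt.length,
              prev.block_start, prev.block_len + nxt.block_len)


def tail_after(em, new_start):
    """Part of `em` that remains from file offset `new_start` onwards."""
    shift = new_start - em.start
    return Em(new_start, em.orig_start, em.length - shift,
              em.block_start + shift, em.block_len - shift)


em_d_prime = Em(1249280, 1200128, 786432, 1106825216, 786432)
em_e = Em(2035712, 2035712, 245760, 1107611648, 245760)

merged_e = merge(em_d_prime, em_e)      # E after clear_em_logging()
em_g = tail_after(merged_e, 2035712)    # right-hand side of the first dio split
em_i = tail_after(em_g, 2207744)        # what remains after the second dio split

print(merged_e)
print(em_g)
print(em_i)
```

The last map printed is extent map I: it starts exactly at the i_size of 2207744, but its orig_start is the value inherited from D' through the merge, not the 2035712 that E had originally.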
However when we log it, we create a file extent item with a value for the disk byte location that is wrong, as can be seen from the following output of "btrfs inspect-internal dump-tree":

  item 1 key (271 EXTENT_DATA 2207744) itemoff 3782 itemsize 53
      generation 22 type 2 (prealloc)
      prealloc data disk byte 1106776064 nr 1032192
      prealloc data offset 1007616 nr 73728

Here the disk byte value corresponds to a calculation based on some fields from the extent map I:

  1106776064 = block_start (1107783680) - 1007616 (extent_offset)
  extent_offset = 2207744 (start) - 1200128 (orig_start) = 1007616

The disk byte value of 1106776064 clashes with disk byte values of the file extent items at offsets 1249280 and 1847296 in the fs tree:

  item 6 key (271 EXTENT_DATA 1249280) itemoff 3568 itemsize 53
      generation 20 type 2 (prealloc)
      prealloc data disk byte 1106776064 nr 835584
      prealloc data offset 49152 nr 598016
  item 7 key (271 EXTENT_DATA 1847296) itemoff 3515 itemsize 53
      generation 20 type 1 (regular)
      extent data disk byte 1106776064 nr 835584
      extent data offset 647168 nr 188416 ram 835584
      extent compression 0 (none)
  item 8 key (271 EXTENT_DATA 2035712) itemoff 3462 itemsize 53
      generation 20 type 1 (regular)
      extent data disk byte 1107611648 nr 245760
      extent data offset 0 nr 172032 ram 245760
      extent compression 0 (none)
  item 9 key (271 EXTENT_DATA 2207744) itemoff 3409 itemsize 53
      generation 20 type 2 (prealloc)
      prealloc data disk byte 1107611648 nr 245760
      prealloc data offset 172032 nr 73728

Instead of the disk byte value of 1106776064, the value of 1107611648 should have been logged. Also the data offset value should have been 172032 and not 1007616.
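
The wrong and the expected values can be reproduced from the formula given above. The snippet below is a minimal Python sketch of that arithmetic only; `logged_extent` is a hypothetical helper name, and the second call assumes that the orig_start extent map E had before the merge (2035712) is the value that would yield the correct item.

```python
def logged_extent(start, orig_start, block_start):
    """Return (disk_bytenr, extent_offset) derived from extent map fields,
    using the formula quoted in the text: disk_bytenr = block_start - extent_offset."""
    extent_offset = start - orig_start
    return block_start - extent_offset, extent_offset


# Extent map I as it exists after the merge and the splits.
wrong = logged_extent(start=2207744, orig_start=1200128, block_start=1107783680)

# Same file range, but with the orig_start that extent map E had before it
# was merged with D' (assumption made for illustration).
expected = logged_extent(start=2207744, orig_start=2035712, block_start=1107783680)

print(wrong)
print(expected)
```

The only input that differs between the two calls is orig_start, which shows that the bad file extent item is a direct consequence of the merged extent map carrying the orig_start of D'.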
After a log replay we end up getting two extent items in the extent tree with different lengths, one of 835584, which is correct and existed before the log replay, and another one of 1032192 which is wrong and is based on the logged file extent item:

  item 12 key (1106776064 EXTENT_ITEM 835584) itemoff 3406 itemsize 53
      refs 2 gen 15 flags DATA
      extent data backref root 5 objectid 271 offset 1200128 count 2
  item 13 key (1106776064 EXTENT_ITEM 1032192) itemoff 3353 itemsize 53
      refs 1 gen 22 flags DATA
      extent data backref root 5 objectid 271 offset 1200128 count 1

Obviously this leads to many problems and a filesystem check reports many errors:

  (...)
  checking extents
  Extent back ref already exists for 1106776064 parent 0 root 5 owner 271 offset 1200128 num_refs 1
  extent item 1106776064 has multiple extent items
  ref mismatch on [1106776064 835584] extent item 2, found 3
  Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 2 wanted 1 back 0x55b1d0ad7680
  Backref 1106776064 root 5 owner 271 offset 1200128 num_refs 0 not found in extent tree
  Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 1 wanted 0 back 0x55b1d0ad4e70
  Backref bytes do not match extent backref, bytenr=1106776064, ref bytes=835584, backref bytes=1032192
  backpointer mismatch on [1106776064 835584]
  checking free space cache
  block group 1103101952 has wrong amount of free space
  failed to load free space cache for block group 1103101952
  checking fs roots
  (...)

So fix this by logging the prealloc extents beyond the inode's i_size based on searches in the subvolume tree instead of the extent maps.

Fixes: 471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
CC: [email protected] # 4.14+
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>