path: root/fs
Age | Commit message | Author | Files | Lines
2022-09-26 | btrfs: log conflicting inodes without holding log mutex of the initial inode | Filipe Manana | 3 | -152/+196
When logging an inode, if we detect the inode has a reference that conflicts with some other inode that got renamed, we log that other inode while holding the log mutex of the current inode. We then find out if there are other inodes that conflict with the first conflicting inode, and log them while under the log mutex of the original inode. This is fine because the recursion can only happen once. For the upcoming work where we directly log delayed items without flushing them first to the subvolume tree, this recursion adds a lot of complexity and it's hard to keep lockdep happy about it. So collect a list of conflicting inodes and then log the inodes after unlocking the log mutex of the inode we started with. Also limit the maximum number of conflicting inodes we log to 10, to avoid spending too much time logging (and maybe allocating too many list elements too), as typically we don't have more than 1 or 2 conflicting inodes. If we go over the limit, simply fall back to a transaction commit. A user can intentionally create a very long list of conflicting inodes with a very long succession of renames like this:

  (...)
  rename E to F
  rename D to E
  rename C to D
  rename B to C
  rename A to B
  touch A (create a new file named A)
  fsync A

If that happened for a sequence of hundreds or thousands of renames, it could massively slow down the logging and cause secondary effects such as blocking other fsync operations and transaction commits for a very long time (assuming it wouldn't run into -ENOSPC or -ENOMEM first). Such cases are very uncommon in practice, but it's better to be prepared for them and avoid chaos. Such a long sequence of conflicting inodes could be created before this change. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
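The collect-then-log approach with a cap can be sketched in userspace C as follows. This is only an illustration under stated assumptions: `struct conflict_list`, `add_conflict_inode()` and `MAX_CONFLICT_INODES` are hypothetical names, and a fixed array stands in for the allocated list elements the message mentions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cap, mirroring the limit of 10 described in the message. */
#define MAX_CONFLICT_INODES 10

/* Hypothetical container for the inode numbers found to conflict. */
struct conflict_list {
	uint64_t ino[MAX_CONFLICT_INODES];
	size_t count;
};

/*
 * Record a conflicting inode while the log mutex is still held; the
 * inodes themselves would be logged later, after the mutex is dropped.
 * Returns false once the cap is reached, in which case the caller is
 * expected to fall back to a transaction commit.
 */
static bool add_conflict_inode(struct conflict_list *list, uint64_t ino)
{
	if (list->count >= MAX_CONFLICT_INODES)
		return false;
	list->ino[list->count++] = ino;
	return true;
}
```

A caller would start with `count` set to 0, add inodes as conflicts are detected, and treat a `false` return as the signal to stop logging.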
2022-09-26 | btrfs: move log_new_dir_dentries() above btrfs_log_inode() | Filipe Manana | 1 | -167/+167
The static function log_new_dir_dentries() is currently defined below btrfs_log_inode(), but in an upcoming patch a new function is introduced that is called by btrfs_log_inode() and this new function needs to call log_new_dir_dentries(). So move log_new_dir_dentries() to a location between btrfs_log_inode() and need_log_inode() (the latter is called by log_new_dir_dentries()). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: move need_log_inode() to above log_conflicting_inodes() | Filipe Manana | 1 | -35/+35
The static function need_log_inode() is defined below btrfs_log_inode() and log_conflicting_inodes(), but in the next patches in the series we will need to call need_log_inode() in a couple new functions that will be used by btrfs_log_inode(). So move its definition to a location above log_conflicting_inodes(). Also make its arguments 'const', since they are not supposed to be modified. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: search for last logged dir index if it's not cached in the inode | Filipe Manana | 1 | -45/+74
The key offset of the last dir index item that was logged is stored in the inode's last_dir_index_offset field. However that field is not persisted in the inode item or elsewhere, so if the inode gets evicted and reloaded, it gets a value of (u64)-1. In that case, when logging dir index items, we check for each one whether it was logged before, to avoid attempts to insert duplicated keys and having to fall back to a transaction commit. Improve on this by searching for the last dir index that was logged when we start logging a directory if the inode's last_dir_index_offset is not set (has a value of (u64)-1) and it was logged before. This avoids checking if each dir index item we find was already logged before, and simplifies the logging of dir index items (process_dir_items_leaf()). This will also be needed for an incoming change where we start logging delayed items directly, without flushing them first. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: shrink the size of struct btrfs_delayed_item | Filipe Manana | 2 | -24/+25
Currently struct btrfs_delayed_item has a base size of 96 bytes, but its size can be decreased by doing the following 2 tweaks: 1) Change data_len from u32 to u16. Our maximum possible leaf size is 64K, so the data_len can never be larger than that, and in fact it is always much smaller than that. The max length for a dentry's name is ensured at the VFS level (PATH_MAX, 4096 bytes) and in struct btrfs_inode_ref and btrfs_dir_item we use a u16 to store the name's length; 2) Change 'ins_or_del' to a 1 bit enum, which is all we need since it can only have 2 values. After this there's also no longer the need to BUG_ON() before using 'ins_or_del' in several places. Also rename the field from 'ins_or_del' to 'type', which is more clear. These two tweaks decrease the size of struct btrfs_delayed_item from 96 bytes down to 88 bytes. A previous patch already reduced the size of this structure by 16 bytes, but an upcoming change will increase its size by 16 bytes (adding a struct list_head element). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
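The effect of the two tweaks can be illustrated with a pair of made-up structures. This is a sketch only: the field names and layout below are hypothetical and do not reproduce struct btrfs_delayed_item, and the exact sizes depend on the compiler and ABI.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "before" layout: 32-bit length, full int for a 2-valued field. */
struct item_before {
	uint64_t index;
	uint64_t bytes_reserved;
	void *owner;
	int refs;
	uint32_t data_len;
	int ins_or_del;
};

/* Hypothetical "after" layout: 16-bit length and a 1-bit field share one word. */
struct item_after {
	uint64_t index;
	uint64_t bytes_reserved;
	void *owner;
	int refs;
	uint16_t data_len;
	unsigned int type : 1;
};
```

On a typical 64-bit ABI the second structure is expected to lose one 8-byte slot of size and padding, which is the same kind of saving the message describes.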
2022-09-26 | btrfs: remove unused logic when looking up delayed items | Filipe Manana | 1 | -42/+3
All callers pass NULL to the 'prev' and 'next' arguments of the function __btrfs_lookup_delayed_item(), so remove these arguments. Also, remove the unnecessary wrapper __btrfs_lookup_delayed_insertion_item(), making btrfs_delete_delayed_insertion_item() directly call __btrfs_lookup_delayed_item(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: store index number instead of key in struct btrfs_delayed_item | Filipe Manana | 2 | -53/+56
All delayed items are for dir index keys, so there's really no point of having an embedded struct btrfs_key in struct btrfs_delayed_item, which makes the structure use more space than necessary (and adds a hole of 7 bytes). So replace the key field with an index number (u64), which reduces the size of struct btrfs_delayed_item from 112 bytes down to 96 bytes. Some upcoming work will increase the structure size by 16 bytes, so this change compensates for that future size increase. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: remove root argument from btrfs_delayed_item_reserve_metadata() | Filipe Manana | 1 | -5/+3
The root argument of btrfs_delayed_item_reserve_metadata() is used only to get the fs_info object, but we already have a transaction handle, which we can use to get the fs_info. So remove the root argument. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: avoid memory allocation at log_new_dir_dentries() for common case | Filipe Manana | 1 | -17/+12
At log_new_dir_dentries() we always start by allocating a list element for the starting inode and then do a while loop with the condition being a list emptiness check. This however is not needed, we can avoid allocating this initial list element and then just check for the list emptiness at the end of the loop's body. So just do that to save one memory allocation from the kmalloc-32 slab. This allows for not doing any memory allocation when we don't have any subdirectory to log, which is a very common case. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
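The loop shape described above (no list element for the starting directory, emptiness check at the end of the loop body) can be sketched like this. It is a userspace illustration only; `struct dir_node`, `struct pending_dir` and `walk_dirs()` are hypothetical and model a directory tree as a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_CHILDREN 4

/* Hypothetical directory node: child directory ids, terminated by -1. */
struct dir_node {
	int children[MAX_CHILDREN];
};

/* Hypothetical list element, one per subdirectory still to be visited. */
struct pending_dir {
	int id;
	struct pending_dir *next;
};

/*
 * Visit `start` and every directory below it. Returns the number of
 * directories visited, or -1 on allocation failure. *allocs counts the
 * list elements that were allocated.
 */
static int walk_dirs(const struct dir_node *dirs, int start, int *allocs)
{
	struct pending_dir *head = NULL;
	int cur = start;
	int visited = 0;

	*allocs = 0;
	for (;;) {
		visited++;
		/* Only subdirectories get a list element, never the start. */
		for (int i = 0; i < MAX_CHILDREN && dirs[cur].children[i] >= 0; i++) {
			struct pending_dir *p = malloc(sizeof(*p));

			if (!p) {
				/* Release every queued element on error. */
				while (head) {
					struct pending_dir *next = head->next;

					free(head);
					head = next;
				}
				return -1;
			}
			p->id = dirs[cur].children[i];
			p->next = head;
			head = p;
			(*allocs)++;
		}
		/* Emptiness check at the end of the loop body. */
		if (!head)
			break;
		/* Pop the next directory and free its element right away. */
		struct pending_dir *done = head;

		cur = done->id;
		head = done->next;
		free(done);
	}
	return visited;
}
```

With a starting directory that has no subdirectories, the loop body runs once and no allocation happens, which is the common case the message targets.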
2022-09-26 | btrfs: free list element sooner at log_new_dir_dentries() | Filipe Manana | 1 | -22/+30
At log_new_dir_dentries(), there's no need to keep the current list element allocated while processing the leaves with directory items for the current directory, and while logging other inodes. Plus in case we find a subdirectory, we also end up allocating a new list element while the current one is still allocated, temporarily using more memory than necessary. So free the current list element early on, before processing leaves. Also make the removal and release of all list elements in case of an error more simple by eliminating the label and goto, adding an explicit loop to release all list elements in case an error happens. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: update stale comment for log_new_dir_dentries() | Filipe Manana | 1 | -4/+4
The comment refers to the function log_dir_items() in order to check why the inodes of new directory entries need to be logged, but the relevant comments are no longer at log_dir_items(), they were moved to the function process_dir_items_leaf() in commit eb10d85ee77f09 ("btrfs: factor out the copying loop of dir items from log_dir_items()"). So update it with the current function name. Also replace references to i_mutex with "VFS lock", since the inode lock has not been a mutex since 2016 (it's now a rw semaphore). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: remove the root argument from log_new_dir_dentries() | Filipe Manana | 1 | -3/+3
There's no point in passing a root argument to log_new_dir_dentries() because it always corresponds to the root of the given inode. So remove it and extract the root from the given inode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: don't drop dir index range items when logging a directory | Filipe Manana | 1 | -5/+1
When logging a directory that was previously logged in the current transaction, we drop all the range items (BTRFS_DIR_LOG_INDEX_KEY key type). This is because we will process all leaves in the subvolume's tree that were changed in the current transaction and then add range items for covering new dir index items and deleted dir index items, which could cover now a larger range than before. We used to fail if we tried to insert a range item key that already exists, so we dropped all range items to avoid failing. However nowadays, since commit 750ee454908e90 ("btrfs: fix assertion failure when logging directory key range item"), we simply update any range item that already exists, increasing its range's last dir index if needed. Since the range covered by a range item can never decrease, due to the fact that dir index values come from a monotonically increasing counter and are never reused, we can stop dropping all range items before we start logging a directory. By not dropping the items we can avoid having occasional tree rebalance operations. This will also be needed for an incoming change where we start logging delayed items directly, without flushing them first. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
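The "update an existing range item instead of failing" behaviour relies on the covered range only ever growing. A minimal sketch of that invariant, assuming a hypothetical `struct dir_log_range` and helper name rather than the actual btrfs item layout:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a logged dir index range [first, last]. */
struct dir_log_range {
	uint64_t first;
	uint64_t last;
};

/*
 * Update an existing range item in place. The end of the range is only
 * ever extended, because dir index values come from a monotonically
 * increasing counter and are never reused.
 */
static void extend_range_item(struct dir_log_range *item, uint64_t new_last)
{
	if (new_last > item->last)
		item->last = new_last;
}
```

Because an update can never shrink the range, keeping the old items around between two loggings of the same directory is safe in this model.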
2022-09-26 | btrfs: scrub: use larger block size for data extent scrub | Qu Wenruo | 1 | -1/+7
[PROBLEM] The existing scrub code for data extents always limits the block size to sectorsize. This causes many extra scrub_block structures to be allocated (there is a data extent at logical bytenr 298844160, length 64KiB):

  alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1
  alloc_scrub_block: new block: logical=298848256 physical=298848256 mirror=1
  alloc_scrub_block: new block: logical=298852352 physical=298852352 mirror=1
  alloc_scrub_block: new block: logical=298856448 physical=298856448 mirror=1
  alloc_scrub_block: new block: logical=298860544 physical=298860544 mirror=1
  alloc_scrub_block: new block: logical=298864640 physical=298864640 mirror=1
  alloc_scrub_block: new block: logical=298868736 physical=298868736 mirror=1
  alloc_scrub_block: new block: logical=298872832 physical=298872832 mirror=1
  alloc_scrub_block: new block: logical=298876928 physical=298876928 mirror=1
  alloc_scrub_block: new block: logical=298881024 physical=298881024 mirror=1
  alloc_scrub_block: new block: logical=298885120 physical=298885120 mirror=1
  alloc_scrub_block: new block: logical=298889216 physical=298889216 mirror=1
  alloc_scrub_block: new block: logical=298893312 physical=298893312 mirror=1
  alloc_scrub_block: new block: logical=298897408 physical=298897408 mirror=1
  alloc_scrub_block: new block: logical=298901504 physical=298901504 mirror=1
  alloc_scrub_block: new block: logical=298905600 physical=298905600 mirror=1
  ...
  scrub_block_put: free block: logical=298844160 physical=298844160 len=4096 mirror=1
  scrub_block_put: free block: logical=298848256 physical=298848256 len=4096 mirror=1
  scrub_block_put: free block: logical=298852352 physical=298852352 len=4096 mirror=1
  scrub_block_put: free block: logical=298856448 physical=298856448 len=4096 mirror=1
  scrub_block_put: free block: logical=298860544 physical=298860544 len=4096 mirror=1
  scrub_block_put: free block: logical=298864640 physical=298864640 len=4096 mirror=1
  scrub_block_put: free block: logical=298868736 physical=298868736 len=4096 mirror=1
  scrub_block_put: free block: logical=298872832 physical=298872832 len=4096 mirror=1
  scrub_block_put: free block: logical=298876928 physical=298876928 len=4096 mirror=1
  scrub_block_put: free block: logical=298881024 physical=298881024 len=4096 mirror=1
  scrub_block_put: free block: logical=298885120 physical=298885120 len=4096 mirror=1
  scrub_block_put: free block: logical=298889216 physical=298889216 len=4096 mirror=1
  scrub_block_put: free block: logical=298893312 physical=298893312 len=4096 mirror=1
  scrub_block_put: free block: logical=298897408 physical=298897408 len=4096 mirror=1
  scrub_block_put: free block: logical=298901504 physical=298901504 len=4096 mirror=1
  scrub_block_put: free block: logical=298905600 physical=298905600 len=4096 mirror=1

This behavior wastes a lot of memory, especially after we have moved several members from scrub_sector to scrub_block. [FIX] To reduce the allocation of scrub_block, and to reduce memory usage, use BTRFS_STRIPE_LEN instead of sectorsize as the block size to scrub data extents. This results in only one scrub_block being allocated for the above data extent:

  alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1
  scrub_block_put: free block: logical=298844160 physical=298844160 len=65536 mirror=1

This greatly reduces the memory usage (even if it's just transient) when scrubbing larger data extents.
For the above example, the memory usage would be:

  Old: num_sectors * (sizeof(scrub_block) + sizeof(scrub_sector)) = 16 * (408 + 96) = 8064
  New: sizeof(scrub_block) + num_sectors * sizeof(scrub_sector) = 408 + 16 * 96 = 1944

A good reduction of 75.9%. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
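The two formulas can be checked with a few lines of C. The helper names below are hypothetical; the structure sizes (408 and 96 bytes) and the sector count (16) are the ones quoted in the message, passed in as plain numbers rather than taken from the real structures.

```c
#include <assert.h>
#include <stddef.h>

/* Old scheme: one scrub_block plus one scrub_sector per sector. */
static size_t scrub_mem_old(size_t num_sectors, size_t block_sz, size_t sector_sz)
{
	return num_sectors * (block_sz + sector_sz);
}

/* New scheme: a single scrub_block shared by all sectors of the extent. */
static size_t scrub_mem_new(size_t num_sectors, size_t block_sz, size_t sector_sz)
{
	return block_sz + num_sectors * sector_sz;
}

/* Saving in tenths of a percent, to stay in integer arithmetic. */
static size_t scrub_mem_saving_permille(size_t old_bytes, size_t new_bytes)
{
	return (old_bytes - new_bytes) * 1000 / old_bytes;
}
```

For 16 sectors this gives 8064 bytes before and 1944 bytes after, a saving of about 75.9%.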
2022-09-26 | btrfs: scrub: move logical/physical/dev/mirror_num from scrub_sector to scrub_block | Qu Wenruo | 1 | -73/+92
Currently we store the following members in scrub_sector: - logical - physical - physical_for_dev_replace - dev - mirror_num However the current scrub code has ensured that scrub_blocks never cross a stripe boundary. This is guaranteed by the entry functions (scrub_simple_mirror, scrub_simple_stripe), thus every scrub_block will not cross a stripe boundary. This makes it possible to move those members into scrub_block rather than putting them into scrub_sector. This should save quite some memory, as a scrub_block can be as large as 64 sectors, even for metadata it's 16 sectors by default. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: scrub: remove scrub_sector::page and use scrub_block::pages instead | Qu Wenruo | 1 | -32/+67
Although scrub currently works for subpage (PAGE_SIZE > sectorsize) cases, it will allocate one page for each scrub_sector, which can cause unnecessary memory usage. Utilize scrub_block::pages[] instead of allocating a page for each scrub_sector; this allows us to integrate larger extents while using less memory. For example, if our page size is 64K, sectorsize is 4K, and we have a 32K sized extent, we will only allocate one page for scrub_block, and all 8 scrub sectors will point to that page. To do that properly, here we introduce several small helpers:

- scrub_page_get_logical()
  Get the logical bytenr of a page. We store the logical bytenr of the page range into page::private. But for 32bit systems, their (void *) is not large enough to contain a u64, so in that case we will need to allocate extra memory for it. For 64bit systems, we can use page::private directly.
- scrub_block_get_logical()
  Just get the logical bytenr of the first page.
- scrub_sector_get_page()
  Return the page which the scrub_sector points to.
- scrub_sector_get_page_offset()
  Return the offset inside the page which the scrub_sector points to.
- scrub_sector_get_kaddr()
  Return the address which the scrub_sector points to. Just a wrapper using scrub_sector_get_page() and scrub_sector_get_page_offset().
- bio_add_scrub_sector()

Please note that, even with this patch, we're still allocating one page for one sector for data extents. This is because in scrub_extent() we split the data extent using sectorsize. The memory usage reduction will need extra work to make scrub work like data read and only use the correct sector(s). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
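The mapping from a sector to its page and in-page offset can be sketched with plain arithmetic. The helper names below are hypothetical and only model the idea behind scrub_sector_get_page() and scrub_sector_get_page_offset(); they assume sectors are laid out back to back inside the block's pages.

```c
#include <assert.h>

/* Index into the block's page array for the given sector number. */
static unsigned int sector_page_index(unsigned int sector_nr,
				      unsigned int sectorsize,
				      unsigned int page_size)
{
	return (sector_nr * sectorsize) / page_size;
}

/* Byte offset of the given sector inside its page. */
static unsigned int sector_page_offset(unsigned int sector_nr,
				       unsigned int sectorsize,
				       unsigned int page_size)
{
	return (sector_nr * sectorsize) % page_size;
}
```

With a 64K page and 4K sectors, all 8 sectors of a 32K extent land in page 0 at offsets 0, 4096, 8192 and so on; with a 4K page each sector gets its own page at offset 0.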
2022-09-26 | btrfs: scrub: introduce scrub_block::pages for more efficient memory usage for subpage | Qu Wenruo | 1 | -22/+116
[BACKGROUND] Currently for scrub, we allocate one page for one sector, this is fine for PAGE_SIZE == sectorsize support, but can waste extra memory for subpage support. [CODE CHANGE] Make scrub_block contain all the pages, so if we're scrubbing an extent sized 64K, and our page size is also 64K, we only need to allocate one page. [LIFESPAN CHANGE] Since now scrub_sector no longer holds a page, but is using scrub_block::pages[] instead, we have to ensure scrub_block has a longer lifespan for write bio. The lifespan for read bio is already large enough. Now scrub_block will only be released after the write bio finished. [COMING NEXT] Currently we only added scrub_block::pages[] for this purpose, but scrub_sector is still utilizing the old scrub_sector::page. The switch will happen in the next patch. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: scrub: factor out allocation and initialization of scrub_sector into helper | Qu Wenruo | 1 | -31/+29
The allocation and initialization is shared by 3 call sites, and we're going to change the initialization of some members in the upcoming patches. So factor out the allocation and initialization of scrub_sector into a helper, alloc_scrub_sector(), which will do the following work:

- Allocate the memory for scrub_sector
- Allocate a page for scrub_sector::page
- Initialize scrub_sector::refs to 1
- Attach the allocated scrub_sector to scrub_block
  The attachment is bidirectional, which means scrub_block::sectorv[] will be updated and scrub_sector::sblock will also be updated.
- Update scrub_block::sector_count and do extra sanity check on it

Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: scrub: factor out initialization of scrub_block into helper | Qu Wenruo | 1 | -23/+23
Although there are only two callers, we are going to add some members for scrub_block in the incoming patches. Factoring out the initialization code will make later expansion easier. One thing to note is that, even though scrub_handle_errored_block() doesn't utilize scrub_block::refs, we still use alloc_scrub_block() to initialize sblock::refs, allowing us to use scrub_block_put() to do cleanup. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: scrub: use pointer array to replace sblocks_for_recheck | Qu Wenruo | 1 | -46/+53
In function scrub_handle_errored_block(), we use the @sblocks_for_recheck pointer to hold one scrub_block for each mirror, and use kcalloc() to allocate an array. But this one pointer for an array is not readable, because the member offsets are computed by addition and not with []. Change this pointer to struct scrub_block *[BTRFS_MAX_MIRRORS], which will slightly increase the stack memory usage. Since function scrub_handle_errored_block() won't get iterative calls, this extra cost is completely acceptable. And since we're here, also set sblock->refs and use scrub_block_put() to clean them up, as later we will add extra members in scrub_block, which need scrub_block_put() to clean them up. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: send: add support for fs-verity | Boris Burkov | 5 | -7/+122
Preserve the fs-verity status of a btrfs file across send/recv. There is no facility for installing the Merkle tree contents directly on the receiving filesystem, so we package up the parameters used to enable verity found in the verity descriptor. This gives the receive side enough information to properly enable verity again. Note that this means that receive will have to re-compute the whole Merkle tree, similar to how compression worked before encoded_write. Since the file becomes read-only after verity is enabled, it is important that verity is added to the send stream after any file writes. Therefore, when we process a verity item, merely note that it happened, then actually create the command in the send stream during 'finish_inode_if_needed'. This also creates V3 of the send stream format, without any format changes besides adding the new commands and attributes. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: use atomic_try_cmpxchg in free_extent_buffer | Uros Bizjak | 1 | -4/+2
Use `atomic_try_cmpxchg(ptr, &old, new)` instead of `atomic_cmpxchg(ptr, old, new) == old` in free_extent_buffer. This has two benefits:

- The x86 cmpxchg instruction returns success in the ZF flag, so this change saves a compare after cmpxchg, as well as a related move instruction in front of cmpxchg.
- atomic_try_cmpxchg implicitly assigns the *ptr value to old when cmpxchg fails, enabling further code simplifications.

This patch has no functional change. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
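The "expected value is refreshed on failure" contract has a userspace analogue in C11's atomic_compare_exchange functions, which can be used to sketch the resulting loop shape. This is not the kernel code: `dec_ref_above()` is a hypothetical helper, and C11 atomics stand in for the kernel's atomic_t API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Decrement *refs only while it is above `floor`.
 * Returns true if the decrement happened.
 */
static bool dec_ref_above(atomic_int *refs, int floor)
{
	int old = atomic_load(refs);

	while (old > floor) {
		/*
		 * On failure the call stores the current value of *refs into
		 * `old`, so no separate reload or comparison is needed
		 * before retrying.
		 */
		if (atomic_compare_exchange_weak(refs, &old, old - 1))
			return true;
	}
	return false;
}
```

The loop needs neither an explicit `== old` comparison nor a reload at the top of each iteration, which is the simplification the message refers to.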
2022-09-26 | btrfs: scrub: remove impossible sanity checks | Qu Wenruo | 1 | -25/+9
There are several sanity checks which are no longer possible to trigger inside btrfs_scrub_dev(), since we have a mount time check against the super block nodesize/sectorsize, and our fixed macro is hardcoded to handle even the worst combination. Thus those sanity checks are no longer needed and can be easily removed. But this patch still uses some ASSERT()s as a safety net, just in case we change some features in the future and trigger those impossible combinations. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: delete btrfs_wait_space_cache_v1_finished | Josef Bacik | 2 | -2/+2
We used to use this in a few spots, but now we only use it directly inside of block-group.c, so remove the helper and just open code where we were using it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: remove lock protection for BLOCK_GROUP_FLAG_RELOCATING_REPAIR | Josef Bacik | 1 | -3/+0
Before when this was modifying the bit field we had to protect it with the bg->lock, however now we're using bit helpers so we can stop using the bg->lock. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: remove BLOCK_GROUP_FLAG_HAS_CACHING_CTL | Josef Bacik | 2 | -25/+21
This is used mostly to determine if we need to look at the caching ctl list and clean up any references to this block group. However we never clear this flag, specifically because we need to know if we have to remove a caching ctl we have for this block group still. This is in the remove block group path which isn't a fast path, so the optimization doesn't really matter, simplify this logic and remove the flag. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: simplify block group traversal in btrfs_put_block_group_cache | Josef Bacik | 1 | -27/+15
We're breaking out and re-searching for the next block group while evicting any of the block group cache inodes. This is not needed, the block groups aren't disappearing here, we can simply loop through the block groups like normal and iput any inode that we find. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: remove lock protection for BLOCK_GROUP_FLAG_TO_COPY | Josef Bacik | 3 | -9/+0
We use this during device replace for zoned devices, we were simply taking the lock because it was in a bit field and we needed the lock to be safe with other modifications in the bitfield. With the bit helpers we no longer require that locking. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: convert block group bit field to use bit helpers | Josef Bacik | 9 | -56/+71
We use a bit field in the btrfs_block_group for different flags, however this is awkward because we have to hold the block_group->lock for any modification of any of these fields, and makes the code clunky for a few of these flags. Convert these to a proper flags setup so we can utilize the bit helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
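The difference between a C bit field (where neighbouring flags share a word and need a lock for safe updates) and atomic bit helpers can be sketched in userspace with C11 atomics. Everything here is an assumption made for illustration: the enum values and the `bg_*_flag()` helpers are hypothetical and only mimic the role of the kernel's set_bit(), clear_bit() and test_bit().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag numbers; the real values are not given in the message. */
enum bg_flag {
	BG_FLAG_TO_COPY = 0,
	BG_FLAG_RELOCATING_REPAIR = 1,
};

/* Atomically set one flag without disturbing the others. */
static void bg_set_flag(atomic_ulong *flags, unsigned int nr)
{
	atomic_fetch_or(flags, 1UL << nr);
}

/* Atomically clear one flag without disturbing the others. */
static void bg_clear_flag(atomic_ulong *flags, unsigned int nr)
{
	atomic_fetch_and(flags, ~(1UL << nr));
}

/* Read one flag. */
static bool bg_test_flag(atomic_ulong *flags, unsigned int nr)
{
	return (atomic_load(flags) & (1UL << nr)) != 0;
}
```

Since each update is a single atomic read-modify-write on the flags word, two threads changing different flags cannot lose each other's update, which is what removes the need for the lock in this model.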
2022-09-26 | btrfs: handle space_info setting of bg in btrfs_add_bg_to_space_info | Josef Bacik | 3 | -28/+13
We previously had the pattern of

  btrfs_update_space_info(all, the, bg, fields, &space_info);
  link_block_group(bg);
  bg->space_info = space_info;

Now that we're passing the bg into btrfs_add_bg_to_space_info we can do the linking in that function, transforming this to simply

  btrfs_add_bg_to_space_info(fs_info, bg);

and put the link_block_group() and bg->space_info assignment directly in btrfs_add_bg_to_space_info. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: simplify arguments of btrfs_update_space_info and rename | Josef Bacik | 3 | -36/+28
This function has grown a bunch of new arguments, and it just boils down to passing in all the block group fields as arguments. Simplify this by passing in the block group itself and updating the space_info fields based on the block group fields directly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: use btrfs_fs_closing for background bg work | Josef Bacik | 1 | -0/+6
For both unused bg deletion and async balance work we'll happily run if the fs is closing. However I want to move these to their own worker thread, and they can be long running jobs, so add a check to see if we're closing and simply bail. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: rename btrfs_insert_file_extent() to btrfs_insert_hole_extent() | Omar Sandoval | 5 | -30/+20
btrfs_insert_file_extent() is only ever used to insert holes, so rename it and remove the redundant parameters. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: sysfs: use sysfs_streq for string matching | David Sterba | 1 | -20/+1
We have our own string matching helper that duplicates what sysfs_streq does, with the slight difference that it skips initial whitespace. So far this is used for the drive allocation policy. Initial whitespace in written sysfs values should be discouraged, and we should use a standard helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
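The matching behaviour being adopted (equal strings, optionally with a single trailing newline on either side, and no skipping of leading whitespace) can be approximated in userspace as below. This is a sketch written from the described semantics, not a copy of the kernel helper; `streq_trailing_newline()` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Compare two strings, treating one trailing newline on either side as
 * insignificant. Leading whitespace is NOT skipped.
 */
static bool streq_trailing_newline(const char *s1, const char *s2)
{
	/* Advance over the common prefix. */
	while (*s1 && *s1 == *s2) {
		s1++;
		s2++;
	}
	if (*s1 == *s2)
		return true;	/* both strings ended together */
	if (!*s1 && *s2 == '\n' && !s2[1])
		return true;	/* s2 only has an extra trailing newline */
	if (*s1 == '\n' && !s1[1] && !*s2)
		return true;	/* s1 only has an extra trailing newline */
	return false;
}
```

A value written with `echo`, which appends a newline, still matches its keyword, while a value with leading spaces no longer does.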
2022-09-26 | btrfs: scrub: try to fix super block errors | Qu Wenruo | 1 | -0/+36
[BUG] The following script shows that, although scrub can detect super block errors, it never tries to fix them:

  mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2
  xfs_io -c "pwrite 67108864 4k" $dev2
  mount $dev1 $mnt
  btrfs scrub start -B $dev2
  btrfs scrub start -Br $dev2
  umount $mnt

The first scrub reports the super error correctly:

  scrub done for f3289218-abd3-41ac-a630-202f766c0859
  Scrub started: Tue Aug 2 14:44:11 2022
  Status: finished
  Duration: 0:00:00
  Total to scrub: 1.26GiB
  Rate: 0.00B/s
  Error summary: super=1
  Corrected: 0
  Uncorrectable: 0
  Unverified: 0

But the second read-only scrub still reports the same super error:

  Scrub started: Tue Aug 2 14:44:11 2022
  Status: finished
  Duration: 0:00:00
  Total to scrub: 1.26GiB
  Rate: 0.00B/s
  Error summary: super=1
  Corrected: 0
  Uncorrectable: 0
  Unverified: 0

[CAUSE] The comment already says that a super block error is expected to be fixed by the next transaction commit:

  /*
   * If we find an error in a super block, we just report it.
   * They will get written with the next transaction commit
   * anyway
   */

But that assumption is not always true, and since scrub should try to repair every error it finds (except for read-only scrub), we should actively commit a transaction to fix this. [FIX] Just commit a transaction if we found any super block errors, after everything else is done. We cannot do this just after scrub_supers(), as btrfs_commit_transaction() will try to pause and wait for the running scrub, thus we cannot call it with scrub_lock held. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 | btrfs: scrub: properly report super block errors in system log | Qu Wenruo | 1 | -21/+12
[PROBLEM] Unlike data/metadata corruption, if scrub detected some error in the super block, the only error message is from the updated device status:

  BTRFS info (device dm-1): scrub: started on devid 2
  BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0

This is not helpful at all. [CAUSE] Unlike data/metadata error reporting, there is no visible report in kernel dmesg for super block errors. In fact, the return value of scrub_checksum_super() is intentionally skipped, thus scrub_handle_errored_block() will never be called for super blocks. [FIX] Make super block errors output an error message, so that the full dmesg now looks like this:

  BTRFS info (device dm-1): scrub: started on devid 2
  BTRFS warning (device dm-1): super block error on device /dev/mapper/test-scratch2, physical 67108864
  BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
  BTRFS info (device dm-1): scrub: started on devid 2

This fix involves:

- Move the super_errors reporting to scrub_handle_errored_block()
  This allows the device status message to show after the super block error message. But now we no longer distinguish super block corruption and generation mismatch, now all counted as corruption.
- Properly check the return value from scrub_checksum_super()
- Add extra super block error reporting for scrub_print_warning().

Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: fix alignment of VMA for memory mapped files on THPAlexander Zhu1-0/+1
With CONFIG_READ_ONLY_THP_FOR_FS, the Linux kernel supports using THPs for read-only mmapped files, such as shared libraries. However, the kernel makes no attempt to actually align those mappings on 2MB boundaries, which makes it impossible to use those THPs most of the time. This issue applies to general file mapping THP as well as existing setups using CONFIG_READ_ONLY_THP_FOR_FS.

This is easily fixed by using thp_get_unmapped_area for the unmapped_area function in btrfs, which is what ext2, ext4, fuse, and xfs all use.

Initially btrfs had been left out in commit 8c07fc452ac0 ("btrfs: fix alignment of VMA for memory mapped files on THP") as btrfs does not support DAX. However, commit 1854bc6e2420 ("mm/readahead: Align file mappings for non-DAX") removed the DAX requirement. We should now be able to call thp_get_unmapped_area() for btrfs.

The problem can be seen in /proc/PID/smaps, where THPeligible is set to 0 on mappings to eligible shared object files, as shown below.

Before this patch:

  7fc6a7e18000-7fc6a80cc000 r-xp 00000000 00:1e 199856 /usr/lib64/libcrypto.so.1.1.1k
  Size: 2768 kB
  THPeligible: 0
  VmFlags: rd ex mr mw me

With this patch the library is mapped at a 2MB aligned address:

  7fbdfe200000-7fbdfe4b4000 r-xp 00000000 00:1e 199856 /usr/lib64/libcrypto.so.1.1.1k
  Size: 2768 kB
  THPeligible: 1
  VmFlags: rd ex mr mw me

This fixes the alignment of VMAs for any mmap of a file that has the rd and ex permissions and size >= 2MB. The VMA alignment and THPeligible field for anonymous memory are handled separately and are thus not affected by this change.

CC: stable@vger.kernel.org # 5.18+
Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
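The alignment claim in the smaps output can be checked with a little arithmetic. The following is a minimal user-space sketch, not kernel code: the helper names are hypothetical, and the 2 MiB huge page size is the value the commit message uses. The start of the second mapping is derived from its end address (7fbdfe4b4000) and the reported Size of 2768 kB.

```c
#include <stdbool.h>
#include <stdint.h>

/* 2 MiB, the huge page size named in the commit message. */
#define THP_SIZE (2ULL * 1024 * 1024)

/* True when addr sits exactly on a 2 MiB boundary. */
static bool thp_aligned(uint64_t addr)
{
	return (addr & (THP_SIZE - 1)) == 0;
}

/* Start address of a mapping, derived from its end address and size in kB. */
static uint64_t mapping_start(uint64_t end, uint64_t size_kb)
{
	return end - size_kb * 1024;
}
```

With these helpers, thp_aligned(0x7fc6a7e18000) is false for the mapping before the patch, while mapping_start(0x7fbdfe4b4000, 2768) evaluates to 0x7fbdfe200000, which is 2 MiB aligned.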
2022-09-26btrfs: add lockdep annotations for the ordered extents wait eventIoannis Angelakopoulos4-0/+33
This wait event is very similar to the pending ordered wait event in the sense that it occurs in a different context than the condition signaling for the event. The signaling occurs in btrfs_remove_ordered_extent() while the wait event is implemented in btrfs_start_ordered_extent() in fs/btrfs/ordered-data.c.

However, in this case a thread must not acquire the lockdep map for the ordered extents wait event when the ordered extent is related to a free space inode. That is because lockdep creates dependencies between locks acquired both in execution paths related to normal inodes and paths related to free space inodes, thus leading to false positives.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: change the lockdep class of free space inode's invalidate_lockIoannis Angelakopoulos1-0/+10
Reinitialize the class of the lockdep map for struct inode's mapping->invalidate_lock in the load_free_space_cache() function in fs/btrfs/free-space-cache.c. This will prevent lockdep from producing false positives related to execution paths that make use of free space inodes and paths that make use of normal inodes.

Specifically, with this change lockdep will create separate lock dependencies that include the invalidate_lock, in the case that free space inodes are used and in the case that normal inodes are used.

The lockdep class for this lock was first initialized in inode_init_always() in fs/inode.c.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: add lockdep annotations for pending_ordered wait eventIoannis Angelakopoulos4-0/+6
In contrast to the num_writers and num_extwriters wait events, the condition for the pending ordered wait event is signaled in a different context from the wait event itself. The condition signaling occurs in btrfs_remove_ordered_extent() in fs/btrfs/ordered-data.c while the wait event is implemented in btrfs_commit_transaction() in fs/btrfs/transaction.c.

Thus the thread signaling the condition has to acquire the lockdep map as a reader at the start of btrfs_remove_ordered_extent() and release it after it has signaled the condition. In this case some dependencies might be left out due to the placement of the annotation, but it is better than no annotation at all.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: add lockdep annotations for transaction states wait eventsIoannis Angelakopoulos3-10/+83
Add lockdep annotations for the transaction states that have wait events:

1) TRANS_STATE_COMMIT_START
2) TRANS_STATE_UNBLOCKED
3) TRANS_STATE_SUPER_COMMITTED
4) TRANS_STATE_COMPLETED

The new macros introduced here to annotate the transaction states wait events have the same effect as the generic lockdep annotation macros.

With the exception of the lockdep annotation for TRANS_STATE_COMMIT_START, the transaction thread has to acquire the lockdep maps for the transaction states as reader after the lockdep map for num_writers is released so that lockdep does not complain.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: add lockdep annotations for num_extwriters wait eventIoannis Angelakopoulos3-0/+15
Similarly to the num_writers wait event in fs/btrfs/transaction.c, add a lockdep annotation for the num_extwriters wait event.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: add lockdep annotations for num_writers wait eventIoannis Angelakopoulos3-5/+41
Annotate the num_writers wait event in fs/btrfs/transaction.c with lockdep in order to catch deadlocks involving this wait event.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: add macros for annotating wait events with lockdepIoannis Angelakopoulos1-0/+45
Introduce four macros that are used to annotate wait events in btrfs code with lockdep:

1) btrfs_lockdep_init_map
2) btrfs_lockdep_acquire
3) btrfs_lockdep_release
4) btrfs_might_wait_for_event

The btrfs_lockdep_init_map macro is used to initialize a lockdep map.

The btrfs_lockdep_<acquire,release> macros are used by threads to take the lockdep map as readers (shared lock) and release it, respectively.

The btrfs_might_wait_for_event macro is used by threads to take the lockdep map as writers (exclusive lock) and release it.

In general, the lockdep annotation for wait events works as follows:

The condition for a wait event can be modified and signaled at the same time by multiple threads. These threads hold the lockdep map as readers when they enter a context in which blocking would prevent signaling the condition. Frequently, this occurs when a thread violates a condition (lockdep map acquire), before restoring it and signaling it at a later point (lockdep map release).

The threads that block on the wait event take the lockdep map as writers (exclusive lock). These threads have to block until all the threads that hold the lockdep map as readers signal the condition for the wait event and release the lockdep map.

The lockdep annotation is used to warn about potential deadlock scenarios that involve the threads that modify and signal the wait event condition and threads that block on the wait event. A simple example is illustrated below:

Without lockdep:

  TA                                   TB
  cond = false
                                       lock(A)
                                       wait_event(w, cond)
                                       unlock(A)
  lock(A)
  cond = true
  signal(w)
  unlock(A)

With lockdep:

  TA                                   TB
  rwsem_acquire_read(lockdep_map)
  cond = false
                                       lock(A)
                                       rwsem_acquire(lockdep_map)
                                       rwsem_release(lockdep_map)
                                       wait_event(w, cond)
                                       unlock(A)
  lock(A)
  cond = true
  signal(w)
  unlock(A)
  rwsem_release(lockdep_map)

In the second case, with the lockdep annotation, lockdep would warn about an ABBA deadlock, while the first case would just deadlock at some point.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: dump extra info if one free space cache has more bitmaps than it shouldQu Wenruo1-0/+6
There is an internal report on hitting the following ASSERT() in recalculate_thresholds():

  ASSERT(ctl->total_bitmaps <= max_bitmaps);

Above, @max_bitmaps is calculated using the following variables:

- bytes_per_bg
  8 * 4096 * 4096 (128M) for x86_64/x86.

- block_group->length
  The length of the block group.

@max_bitmaps is the rounded up value of block_group->length / 128M.

Normally one free space cache should not have more bitmaps than the above value, but when it does, the ASSERT() can be triggered if CONFIG_BTRFS_ASSERT is also enabled.

But the ASSERT() itself won't provide enough info to know what is going wrong. Is the bg so small that it only allows one bitmap? Or is there something else wrong?

So although I haven't found extra reports or a crash dump to investigate further, add the extra info to make debugging easier.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
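The @max_bitmaps calculation described above can be sketched in user space as follows. This is only an illustration of the arithmetic stated in the commit message (round up block_group->length / 128M); the function names are hypothetical and any further adjustments the kernel applies are not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* 8 * 4096 * 4096 bytes (128M), the bytes_per_bg value quoted for x86_64/x86. */
#define BYTES_PER_BG (8ULL * 4096 * 4096)

/* Rounded-up value of bg_length / 128M, as the commit message describes. */
static uint64_t max_bitmaps_for(uint64_t bg_length)
{
	return (bg_length + BYTES_PER_BG - 1) / BYTES_PER_BG;
}

/* The condition the ASSERT() checks: total bitmaps must not exceed the limit. */
static bool bitmaps_within_limit(uint64_t total_bitmaps, uint64_t bg_length)
{
	return total_bitmaps <= max_bitmaps_for(bg_length);
}
```

For a 1 GiB block group the limit works out to 8 bitmaps, while a 64 MiB block group allows only one, which is the "bg too small" case the message asks about.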
2022-09-25Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds5-181/+154
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
"Regression and bug fixes:

 - Performance regression fix from 5.18 on a Raspberry Pi

 - Fix extent parsing bug which triggers a BUG_ON when a (corrupted) extent tree has a non-root node with zero entries.

 - Fix a livelock where in the right (wrong) circumstances a large number of nfsd threads can try to write to a nearly full file system, and retry for hours(!)"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: limit the number of retries after discarding preallocations blocks
  ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0
  ext4: use buckets for cr 1 block scan instead of rbtree
  ext4: use locality group preallocation for small closed files
  ext4: make directory inode spreading reflect flexbg size
  ext4: avoid unnecessary spreading of allocations among groups
  ext4: make mballoc try target group first even with mb_optimize_scan
2022-09-25Merge tag 'dax-and-nvdimm-fixes-v6.0-final' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull NVDIMM and DAX fixes from Dan Williams:
"A recently discovered one-line fix for devdax that further addresses a v5.5 regression, and (a bit embarrassing) a small batch of fixes that have been sitting in my fixes tree for weeks.

 The older fixes have soaked in linux-next during that time and address an fsdax infinite loop and some other minor fixups.

 - Fix an infinite loop bug in fsdax

 - Fix memory-type detection for devdax (EINJ regression)

 - Small cleanups"

* tag 'dax-and-nvdimm-fixes-v6.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  devdax: Fix soft-reservation memory description
  fsdax: Fix infinite loop in dax_iomap_rw()
  nvdimm/namespace: drop nested variable in create_namespace_pmem()
  ndtest: Cleanup all of blk namespace specific code
  pmem: fix a name collision
2022-09-24Merge branch 'for-6.0/dax' into libnvdimm-fixesDan Williams588-28298/+18925
Pick up another "Soft Reservation" fix for v6.0-final on top of some straggling nvdimm fixes that missed v5.19.
2022-09-22ext4: limit the number of retries after discarding preallocations blocksTheodore Ts'o1-1/+3
This patch avoids threads live-locking for hours when a large number of threads are competing over the last few free extents as blocks get added to and removed from preallocation pools. From our bug reporter:

   A reliable way for triggering this has multiple writers continuously write() to files when the filesystem is full, while small amounts of space are freed (e.g. by truncating a large file -1MiB at a time). In the local filesystem, this can be done by simply not checking the return code of write (0) and/or the error (ENOSPACE) that is set. Over NFS with an async mount, even clients with proper error checking will behave this way since the linux NFS client implementation will not propagate the server errors [the write syscalls immediately return success] until the file handle is closed. This leads to a situation where NFS clients send a continuous stream of WRITE rpcs which result in ERRNOSPACE -- but since the client isn't seeing this, the stream of writes continues at maximum network speed.

   When some space does appear, multiple writers will all attempt to claim it for their current write. For NFS, we may see dozens to hundreds of threads that do this.

   The real-world scenario of this is database backup tooling (in particular, github.com/mdkent/percona-xtrabackup) which may write large files (>1TiB) to NFS for safe keeping. Some temporary files are written, rewound, and read back -- all before closing the file handle (the temp file is actually unlinked, to trigger automatic deletion on close/crash.) An application like this operating on an async NFS mount will not see an error code until TiB have been written/read.

   The lockup was observed when running this database backup on large filesystems (64 TiB in this case) with a high number of block groups and no free space. Fragmentation is generally not a factor in this filesystem (~thousands of large files, mostly contiguous except for the parts written while the filesystem is at capacity.)
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
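The idea of bounding the retry loop can be sketched in user space as below. This is an illustration only, not the ext4 code: the struct, the callbacks, and the limit of 3 attempts are all assumptions made for the example, since the commit message does not state the actual limit.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical cap on allocation attempts; the real limit is not given above. */
#define MAX_ALLOC_ATTEMPTS 3

/* Hypothetical callbacks standing in for the allocator and the discard step. */
struct alloc_ops {
	bool (*try_alloc)(void *ctx);        /* true when blocks were allocated */
	bool (*discard_prealloc)(void *ctx); /* true when anything was freed */
};

/*
 * Try to allocate; on failure discard preallocations and retry, but give up
 * with -ENOSPC after MAX_ALLOC_ATTEMPTS instead of looping indefinitely.
 */
static int alloc_bounded(const struct alloc_ops *ops, void *ctx)
{
	int attempts = 0;

	for (;;) {
		if (ops->try_alloc(ctx))
			return 0;
		if (++attempts >= MAX_ALLOC_ATTEMPTS || !ops->discard_prealloc(ctx))
			return -ENOSPC;
	}
}

/* Demo stubs: every attempt loses the race, yet discarding always frees space. */
static bool never_succeeds(void *ctx)
{
	++*(int *)ctx; /* count attempts */
	return false;
}

static bool always_frees(void *ctx)
{
	(void)ctx;
	return true;
}
```

With the two stubs, which model a writer that keeps losing the freed space to competitors, alloc_bounded() returns -ENOSPC after three attempts; without the cap the same inputs would loop forever, which is the live-lock the patch describes.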
2022-09-22ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0Luís Henriques1-0/+4
When walking through an inode's extents, the ext4_ext_binsearch_idx() function assumes that the extent header has been previously validated. However, there are no checks that verify that the number of entries (eh->eh_entries) is non-zero when depth is > 0. This leads to problems because EXT_FIRST_INDEX() and EXT_LAST_INDEX() will return garbage and result in this:

[ 135.245946] ------------[ cut here ]------------
[ 135.247579] kernel BUG at fs/ext4/extents.c:2258!
[ 135.249045] invalid opcode: 0000 [#1] PREEMPT SMP
[ 135.250320] CPU: 2 PID: 238 Comm: tmp118 Not tainted 5.19.0-rc8+ #4
[ 135.252067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
[ 135.255065] RIP: 0010:ext4_ext_map_blocks+0xc20/0xcb0
[ 135.256475] Code:
[ 135.261433] RSP: 0018:ffffc900005939f8 EFLAGS: 00010246
[ 135.262847] RAX: 0000000000000024 RBX: ffffc90000593b70 RCX: 0000000000000023
[ 135.264765] RDX: ffff8880038e5f10 RSI: 0000000000000003 RDI: ffff8880046e922c
[ 135.266670] RBP: ffff8880046e9348 R08: 0000000000000001 R09: ffff888002ca580c
[ 135.268576] R10: 0000000000002602 R11: 0000000000000000 R12: 0000000000000024
[ 135.270477] R13: 0000000000000000 R14: 0000000000000024 R15: 0000000000000000
[ 135.272394] FS: 00007fdabdc56740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
[ 135.274510] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 135.276075] CR2: 00007ffc26bd4f00 CR3: 0000000006261004 CR4: 0000000000170ea0
[ 135.277952] Call Trace:
[ 135.278635] <TASK>
[ 135.279247] ? preempt_count_add+0x6d/0xa0
[ 135.280358] ? percpu_counter_add_batch+0x55/0xb0
[ 135.281612] ? _raw_read_unlock+0x18/0x30
[ 135.282704] ext4_map_blocks+0x294/0x5a0
[ 135.283745] ? xa_load+0x6f/0xa0
[ 135.284562] ext4_mpage_readpages+0x3d6/0x770
[ 135.285646] read_pages+0x67/0x1d0
[ 135.286492] ? folio_add_lru+0x51/0x80
[ 135.287441] page_cache_ra_unbounded+0x124/0x170
[ 135.288510] filemap_get_pages+0x23d/0x5a0
[ 135.289457] ? path_openat+0xa72/0xdd0
[ 135.290332] filemap_read+0xbf/0x300
[ 135.291158] ? _raw_spin_lock_irqsave+0x17/0x40
[ 135.292192] new_sync_read+0x103/0x170
[ 135.293014] vfs_read+0x15d/0x180
[ 135.293745] ksys_read+0xa1/0xe0
[ 135.294461] do_syscall_64+0x3c/0x80
[ 135.295284] entry_SYSCALL_64_after_hwframe+0x46/0xb0

This patch simply adds an extra check in __ext4_ext_check(), verifying that eh_entries is not 0 when eh_depth is > 0.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215941
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216283
Cc: Baokun Li <libaokun1@huawei.com>
Cc: stable@kernel.org
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20220822094235.2690-1-lhenriques@suse.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
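The added validation amounts to rejecting a header that claims to be an index node (depth > 0) yet holds no entries. A minimal user-space sketch of that condition follows; the struct is a simplified stand-in carrying only the two fields named in the commit message, and the function name is hypothetical, so this is not the actual __ext4_ext_check() code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for an extent header: only the two fields discussed. */
struct ext_hdr_sketch {
	uint16_t eh_entries; /* number of valid entries in this node */
	uint16_t eh_depth;   /* 0 for a leaf, > 0 for an index node */
};

/*
 * An index node must contain at least one entry; otherwise the first/last
 * index lookups have nothing valid to point at.
 */
static bool ext_hdr_entries_ok(const struct ext_hdr_sketch *eh)
{
	return !(eh->eh_entries == 0 && eh->eh_depth > 0);
}
```

A header with eh_entries == 0 and eh_depth == 1 fails the check, while an empty leaf (eh_depth == 0) still passes, matching the condition the patch describes.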