aboutsummaryrefslogtreecommitdiff
path: root/fs/btrfs/extent-io-tree.h
AgeCommit message (Collapse)AuthorFilesLines
2023-08-21btrfs: make find_first_extent_bit() return a booleanFilipe Manana1-3/+3
Currently find_first_extent_bit() returns a 0 if it found a range in the given io tree and 1 if it didn't find any. There's no need to return any errors, so make the return value a boolean and invert the logic to make more sense: return true if it found a range and false if it didn't find any range. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: drop gfp from parameter extent state helpersDavid Sterba1-7/+5
Now that all extent state bit helpers effectively take the GFP_NOFS mask (and GFP_NOWAIT is encoded in the bits) we can remove the parameter. This reduces stack consumption in many functions and simplifies a lot of code. Net effect on module on a release build: text data bss dec hex filename 1250432 20985 16088 1287505 13a551 pre/btrfs.ko 1247074 20985 16088 1284147 139833 post/btrfs.ko DELTA: -3358 Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: pass NOWAIT for set/clear extent bits as another bitDavid Sterba1-0/+9
The only flags we now pass to set_extent_bit/__clear_extent_bit are GFP_NOFS and GFP_NOWAIT (a few functions handling mappings). This requires an extra parameter to be passed everywhere but is almost always the same. Encode the GFP_NOWAIT as an artificial extent bit and extract the real bits and gfp mask in the lowest level helpers. Now the passed gfp mask is not actually used and can be removed. Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: open code set_extent_bitsDavid Sterba1-6/+0
This helper calls set_extent_bit with two more parameters set to default values, but otherwise it's purpose is not clear. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: open code set_extent_bits_nowaitDavid Sterba1-6/+0
The helper only passes GFP_NOWAIT as gfp flags and is used two times. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: open code set_extent_dirtyDavid Sterba1-6/+0
The helper is used a few times, that it's setting the DIRTY extent bit is still clear. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: open code set_extent_newDavid Sterba1-6/+0
The helper is used only once. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: open code set_extent_delallocDavid Sterba1-9/+0
The helper is used once in fs code and a few times in the self test code. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: open code set_extent_defragDavid Sterba1-8/+0
The helper is used only once. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: remove the io_failure_record infrastructureChristoph Hellwig1-1/+0
struct io_failure_record and the io_failure_tree tree are unused now, so remove them. This in turn makes struct btrfs_inode smaller by 16 bytes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: allow passing a cached state record to count_range_bits()Filipe Manana1-1/+2
An inode's io_tree can be quite large and there are cases where due to delalloc it can have thousands of extent state records, which makes the red black tree have a depth of 10 or more, making the operation of count_range_bits() slow if we repeatedly call it for a range that starts where, or after, the previous one we called it for. Such use cases are when searching for delalloc in a file range that corresponds to a hole or a prealloc extent, which is done during lseek SEEK_HOLE/DATA and fiemap. So introduce a cached state parameter to count_range_bits() which we use to store the last extent state record we visited, and then allow the caller to pass it again on its next call to count_range_bits(). The next patches in the series will make fiemap and lseek use the new parameter. This change is part of a patchset that has the goal to make performance better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to iterate over the extents of a file. Two examples are the cp program from coreutils 9.0+ and the tar program (when using its --sparse / -S option). A sample test and results are listed in the changelog of the last patch in the series: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_treeFilipe Manana1-7/+0
We don't need to set the EXTENT_UPDATE bit in an inode's io_tree to mark a range as uptodate, we rely on the pages themselves being uptodate - page reading is not triggered for already uptodate pages. Recently we removed most use of the EXTENT_UPTODATE for buffered IO with commit 52b029f42751 ("btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path"), but there were a few leftovers, namely when reading from holes and successfully finishing read repair. These leftovers are unnecessarily making an inode's tree larger and deeper, slowing down searches on it. So remove all the leftovers. This change is part of a patchset that has the goal to make performance better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to iterate over the extents of a file. Two examples are the cp program from coreutils 9.0+ and the tar program (when using its --sparse / -S option). A sample test and results are listed in the changelog of the last patch in the series: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: switch extent_io_tree::private_data to btrfs_inode and renameDavid Sterba1-1/+2
The extent_io_tree::private_data was meant to be a preparatory work for the metadata inode rework but that never materialized. Now it's used only for an inode so it's better to change the appropriate type and rename it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: drop private_data parameter from extent_io_tree_initDavid Sterba1-2/+1
All callers except one pass NULL, so the parameter can be dropped and the inode::io_tree initialization can be open coded. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: remove unused unlock_extent_atomicJosef Bacik1-7/+0
As of "btrfs: do not use GFP_ATOMIC in the read endio" we no longer have any users of unlock_extent_atomic, remove it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: convert EXTENT_* bits to enumsDavid Sterba1-33/+38
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: cache the failed state when locking extentsJosef Bacik1-1/+2
Currently if we fail to lock a range we'll return the start of the range that we failed to lock. We'll then search down to this range and wait on any extent states in this range. However we can avoid this search altogether if we simply cache the extent_state that had the contention. We can pass this into wait_extent_bit() and start from that extent_state without doing the search. In the most optimistic case we can avoid all searches, more likely we'll avoid the initial search and have to perform the search after we wait on the failed state, or worst case we must search both times which is what currently happens. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: add a cached_state to try_lock_extentJosef Bacik1-1/+2
With nowait becoming more pervasive throughout our codebase go ahead and add a cached_state to try_lock_extent(). This allows us to be faster about clearing the locked area if we have contention, and then gives us the same optimization for unlock if we are able to lock the range. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: stop tracking failed reads in the I/O treeChristoph Hellwig1-1/+0
There is a separate I/O failure tree to track the fail reads, so remove the extra EXTENT_DAMAGED bit in the I/O tree as it's set but never used. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: replace delete argument with EXTENT_CLEAR_ALL_BITSJosef Bacik1-10/+18
Instead of taking up a whole argument to indicate we're clearing everything in a range, simply add another EXTENT bit to control this, and then update all the callers to drop this argument from the clear_extent_bit variants. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: get rid of extent_io_tree::dirty_bytesJosef Bacik1-1/+0
This was used as an optimization for count_range_bits(EXTENT_DIRTY), which was used by the failed record code. However this was removed in this series by patch "btrfs: convert the io_failure_tree to a plain rb_tree" which was the last user of this optimization. Remove the ->dirty_bytes as nobody cares anymore. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: remove extent_io_tree::track_uptodateJosef Bacik1-1/+0
Since commit 78361f64ff42 ("btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path") we no longer check ->track_uptodate, remove it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: unify the lock/unlock extent variantsJosef Bacik1-16/+6
We have two variants of lock/unlock extent, one set that takes a cached state, another that does not. This is slightly annoying, and generally speaking there are only a few places where we don't have a cached state. Simplify this by making lock_extent/unlock_extent the only variant and make it take a cached state, then convert all the callers appropriately. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: drop extent_changeset from set_extent_bitJosef Bacik1-10/+8
The only places that set extent_changeset is set_record_extent_bits, everywhere else sets it to NULL. Drop this argument from set_extent_bit. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: remove failed_start argument from set_extent_bitJosef Bacik1-13/+9
This is only used for internal locking related helpers, everybody else just passes in NULL. I've changed set_extent_bit to __set_extent_bit and made it static, removed failed_start from set_extent_bit and have it call __set_extent_bit with a NULL failed_start, and I've moved some code down below the now static __set_extent_bit. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: remove the wake argument from clear_extent_bitsJosef Bacik1-20/+14
This is only used in the case that we are clearing EXTENT_LOCKED, so infer this value from the bits passed in instead of taking it as an argument. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: drop exclusive_bits from set_extent_bitJosef Bacik1-10/+10
This is only ever set if we have EXTENT_LOCKED set, so simply push this into the function itself and remove the function argument. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: move extent io tree unrelated prototypes to their appropriate headerJosef Bacik1-8/+0
These prototypes have nothing to do with the extent_io_tree helpers, move them to their appropriate header. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: unexport all the temporary exports for extent-io-tree.cJosef Bacik1-47/+0
Now that we've moved everything we can unexport all the temporary exports, move the random helpers, and mark everything as static again. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: unexport btrfs_debug_check_extent_io_rangeJosef Bacik1-10/+0
We no longer need to export this as all users are in extent-io-tree.c, remove it from the header and put it into extent-io-tree.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: temporarily export and then move extent state helpersJosef Bacik1-0/+26
In order to avoid moving all of the related code at once temporarily export all of the extent state related helpers. Then move these helpers into extent-io-tree.c. We will clean up the exports and make them static in followup patches. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: temporarily export and move core extent_io_tree tree functionsJosef Bacik1-0/+13
A lot of the various internals of extent_io_tree call these two functions for insert or searching the rb tree for entries, so temporarily export them and then move them to extent-io-tree.c. We can't move tree_search() without renaming it, and I don't want to introduce a bunch of churn just to do that, so move these functions first and then we can move a few big functions and then the remaining users of tree_search(). Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: move btrfs_debug_check_extent_io_range into extent-io-tree.cJosef Bacik1-0/+10
This helper is used by a lot of the core extent_io_tree helpers, so temporarily export it and move it into extent-io-tree.c in order to make it straightforward to migrate the helpers in batches. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: export wait_extent_bitJosef Bacik1-0/+1
This is used by the subpage code in addition to lock_extent_bits, so export it so we can move it out of extent_io.c Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: move simple extent bit helpers out of extent_io.cJosef Bacik1-5/+15
These are just variants and wrappers around the actual work horses of the extent state. Extract these out of extent_io.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: move extent state init and alloc functions to their own fileJosef Bacik1-0/+5
Start cleaning up extent_io.c by moving the extent state code out of it. This patch starts with the extent state allocation code and the extent_io_tree init code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: temporarily export alloc_extent_state helpersJosef Bacik1-0/+3
We're going to move this code in stages, but while we're doing that we need to export these helpers so we can more easily move the code into the new file. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: separate out the extent state and extent buffer init codeJosef Bacik1-2/+2
In order to help separate the extent buffer from the extent io tree code we need to break up the init functions. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: convert the io_failure_tree to a plain rb_treeJosef Bacik1-3/+0
We still have this oddity of stashing the io_failure_record in the extent state for the io_failure_tree, which is leftover from when we used to stuff private pointers in extent_io_trees. However this doesn't make a lot of sense for the io failure records, we can simply use a normal rb_tree for this. This will allow us to further simplify the extent_io_tree code by removing the io_failure_rec pointer from the extent state. Convert the io_failure_tree to an rb tree + spinlock in the inode, and then use our rb tree simple helpers to insert and find failed records. This greatly cleans up this code and makes it easier to separate out the extent_io_tree code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: unexport internal failrec functionsJosef Bacik1-6/+0
These are internally used functions and are not used outside of extent_io.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: rename clean_io_failure and remove extraneous argsJosef Bacik1-4/+2
This is exported, so rename it to btrfs_clean_io_failure. Additionally we are passing in the io tree's and such from the inode, so instead of doing all that simply pass in the inode itself and get all the components we need directly inside of btrfs_clean_io_failure. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: move btrfs_bio allocation to volumes.cChristoph Hellwig1-3/+0
volumes.c is the place that implements the storage layer using the btrfs_bio structure, so move the bio_set and allocation helpers there as well. To make up for the new initialization boilerplate, merge the two init/exit helpers in extent_io.c into a single one. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O pathEthan Lien1-2/+2
After we copied data to page cache in buffered I/O, we 1. Insert a EXTENT_UPTODATE state into inode's io_tree, by endio_readpage_release_extent(), set_extent_delalloc() or set_extent_defrag(). 2. Set page uptodate before we unlock the page. But the only place we check io_tree's EXTENT_UPTODATE state is in btrfs_do_readpage(). We know we enter btrfs_do_readpage() only when we have a non-uptodate page, so it is unnecessary to set EXTENT_UPTODATE. For example, when performing a buffered random read: fio --rw=randread --ioengine=libaio --direct=0 --numjobs=4 \ --filesize=32G --size=4G --bs=4k --name=job \ --filename=/mnt/file --name=job Then check how many extent_state in io_tree: cat /proc/slabinfo | grep btrfs_extent_state | awk '{print $2}' w/o this patch, we got 640567 btrfs_extent_state. w/ this patch, we got 204 btrfs_extent_state. Maintaining such a big tree brings overhead since every I/O needs to insert EXTENT_LOCKED, insert EXTENT_UPTODATE, then remove EXTENT_LOCKED. And in every insert or remove, we need to lock io_tree, do tree search, alloc or dealloc extent states. By removing unnecessary EXTENT_UPTODATE, we keep io_tree in a minimal size and reduce overhead when performing buffered I/O. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Ethan Lien <ethanlien@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-15btrfs: Convert from invalidatepage to invalidate_folioMatthew Wilcox (Oracle)1-2/+2
A lot of the underlying infrastructure in btrfs needs to be switched over to folios, but this at least documents that invalidatepage can't be passed a tail page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2020-12-08btrfs: use fixed width int type for extent_state::stateQu Wenruo1-17/+16
Currently the type is unsigned int which could change its width depending on the architecture. We need up to 32 bits so make it explicit. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: merge __set_extent_bit and set_extent_bitNikolay Borisov1-8/+13
There are only 2 direct calls to set_extent_bit outside of extent-io - in btrfs_find_new_delalloc_bytes and btrfs_truncate_block, the rest are thin wrappers around __set_extent_bit. This adds unnecessary indirection and just makes it more annoying when looking at the various extent bit manipulation functions. This patch renames __set_extent_bit to set_extent_bit effectively removing a level of indirection. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat and remove __must_check ] Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: update the number of bytes used by an inode atomicallyFilipe Manana1-1/+15
There are several occasions where we do not update the inode's number of used bytes atomically, resulting in a concurrent stat(2) syscall to report a value of used blocks that does not correspond to a valid value, that is, a value that does not match neither what we had before the operation nor what we get after the operation completes. In extreme cases it can result in stat(2) reporting zero used blocks, which can cause problems for some userspace tools where they can consider a file with a non-zero size and zero used blocks as completely sparse and skip reading data, as reported/discussed a long time ago in some threads like the following: https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html The cases where this can happen are the following: -> Case 1 If we do a write (buffered or direct IO) against a file region for which there is already an allocated extent (or multiple extents), then we have a short time window where we can report a number of used blocks to stat(2) that does not take into account the file region being overwritten. This short time window happens when completing the ordered extent(s). This happens because when we drop the extents in the write range we decrement the inode's number of bytes and later on when we insert the new extent(s) we increment the number of bytes in the inode, resulting in a short time window where a stat(2) syscall can get an incorrect number of used blocks. If we do writes that overwrite an entire file, then we have a short time window where we report 0 used blocks to stat(2). Example reproducer: $ cat reproducer-1.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi stat_loop() { trap "wait; exit" SIGTERM local filepath=$1 local expected=$2 local got while :; do got=$(stat -c %b $filepath) if [ $got -ne $expected ]; then echo -n "ERROR: unexpected used blocks" echo " (got: $got expected: $expected)" fi done } mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f $DEV > /dev/null # mkfs.ext4 -F $DEV > /dev/null # mkfs.f2fs -f $DEV > /dev/null # mkfs.reiserfs -f $DEV > /dev/null mount $DEV $MNT xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null expected=$(stat -c %b $MNT/foobar) # Create a process to keep calling stat(2) on the file and see if the # reported number of blocks used (disk space used) changes, it should # not because we are not increasing the file size nor punching holes. stat_loop $MNT/foobar $expected & loop_pid=$! for ((i = 0; i < 50000; i++)); do xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null done kill $loop_pid &> /dev/null wait umount $DEV $ ./reproducer-1.sh ERROR: unexpected used blocks (got: 0 expected: 128) ERROR: unexpected used blocks (got: 0 expected: 128) (...) Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times. -> Case 2 If we do a buffered write against a file region that does not have any allocated extents, like a hole or beyond EOF, then during ordered extent completion we have a short time window where a concurrent stat(2) syscall can report a number of used blocks that does not correspond to the value before or after the write operation, a value that is actually larger than the value after the write completes. This happens because once we start a buffered write into an unallocated file range we increment the inode's 'new_delalloc_bytes', to make sure any stat(2) call gets a correct used blocks value before delalloc is flushed and completes. However at ordered extent completion, after we inserted the new extent, we increment the inode's number of bytes used with the size of the new extent, and only later, when clearing the range in the inode's iotree, we decrement the inode's 'new_delalloc_bytes' counter with the size of the extent. So this results in a short time window where a concurrent stat(2) syscall can report a number of used blocks that accounts for the new extent twice. Example reproducer: $ cat reproducer-2.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi stat_loop() { trap "wait; exit" SIGTERM local filepath=$1 local expected=$2 local got while :; do got=$(stat -c %b $filepath) if [ $got -ne $expected ]; then echo -n "ERROR: unexpected used blocks" echo " (got: $got expected: $expected)" fi done } mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f $DEV > /dev/null # mkfs.ext4 -F $DEV > /dev/null # mkfs.f2fs -f $DEV > /dev/null # mkfs.reiserfs -f $DEV > /dev/null mount $DEV $MNT touch $MNT/foobar write_size=$((64 * 1024)) for ((i = 0; i < 16384; i++)); do offset=$(($i * $write_size)) xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null blocks_used=$(stat -c %b $MNT/foobar) # Fsync the file to trigger writeback and keep calling stat(2) on it # to see if the number of blocks used changes. stat_loop $MNT/foobar $blocks_used & loop_pid=$! xfs_io -c "fsync" $MNT/foobar kill $loop_pid &> /dev/null wait $loop_pid done umount $DEV $ ./reproducer-2.sh ERROR: unexpected used blocks (got: 265472 expected: 265344) ERROR: unexpected used blocks (got: 284032 expected: 283904) (...) Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times. -> Case 3 Another case where such problems happen is during other operations that replace extents in a file range with other extents. Those operations are extent cloning, deduplication and fallocate's zero range operation. The cause of the problem is similar to the first case. When we drop the extents from a range, we decrement the inode's number of bytes, and later on, after inserting the new extents we increment it. Since this is not done atomically, a concurrent stat(2) call can see and return a number of used blocks that is smaller than it should be, does not match the number of used blocks before or after the clone/deduplication/zero operation. Like for the first case, when doing a clone, deduplication or zero range operation against an entire file, we end up having a time window where we can report 0 used blocks to a stat(2) call. Example reproducer: $ cat reproducer-3.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f -m reflink=1 $DEV > /dev/null mount $DEV $MNT extent_size=$((64 * 1024)) num_extents=16384 file_size=$(($extent_size * $num_extents)) # File foo has many small extents. xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \ > /dev/null # File bar has much less extents and has exactly the same data as foo. xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null expected=$(stat -c %b $MNT/foo) # Now deduplicate bar into foo. While the deduplication is in progres, # the number of used blocks/file size reported by stat should not change xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null & dedupe_pid=$! while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do used=$(stat -c %b $MNT/foo) if [ $used -ne $expected ]; then echo "Unexpected blocks used: $used (expected: $expected)" fi done umount $DEV $ ./reproducer-3.sh Unexpected blocks used: 2076800 (expected: 2097152) Unexpected blocks used: 2097024 (expected: 2097152) Unexpected blocks used: 2079872 (expected: 2097152) (...) Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times. So fix this by: 1) Making btrfs_drop_extents() not decrement the VFS inode's number of bytes, and instead return the number of bytes; 2) Making any code that drops extents and adds new extents update the inode's number of bytes atomically, while holding the btrfs inode's spinlock, which is also used by the stat(2) callback to get the inode's number of bytes; 3) For ranges in the inode's iotree that are marked as 'delalloc new', corresponding to previously unallocated ranges, increment the inode's number of bytes when clearing the 'delalloc new' bit from the range, in the same critical section that decrements the inode's 'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock. An alternative would be to have btrfs_getattr() wait for any IO (ordered extents in progress) and locking the whole range (0 to (u64)-1) while it it computes the number of blocks used. But that would mean blocking stat(2), which is a very used syscall and expected to be fast, waiting for writes, clone/dedupe, fallocate, page reads, fiemap, etc. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: sink the failed_start parameter to set_extent_bitQu Wenruo1-10/+7
The @failed_start parameter is only paired with @exclusive_bits, and those parameters are only used for EXTENT_LOCKED bit, which have their own wrappers lock_extent_bits(). Thus for regular set_extent_bit() calls, the failed_start makes no sense, just sink the parameter. Also, since @failed_start and @exclusive_bits are used in pairs, add an assert to make it obvious. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: remove struct extent_io_opsNikolay Borisov1-1/+0
It's no longer used just remove the function and any related code which was initialising it for inodes. No functional changes. Removing 8 bytes from extent_io_tree in turn reduces size of other structures where it is embedded, notably btrfs_inode where it reduces size by 24 bytes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: use own btree inode io_tree owner idQu Wenruo1-0/+1
Btree inode is special compared to all other inode extent io_trees, although it has a btrfs inode, it doesn't have the track_uptodate bit at all. This means a lot of things like extent locking doesn't even need to be applied to btree io tree. Since it's so special, adds a new owner value for it to make debuging a little easier. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>