blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2023-02-13	btrfs: send: initialize all the red black trees earlier	Filipe Manana	1	-6/+6
	After we allocate the send context object and before we initialize all the red black trees, we can jump to the 'out' label if some errors happen, and then under the 'out' label we use RB_EMPTY_ROOT() against some of the those trees, which we have not yet initialized. This happens to work out ok because the send context object was initialized to zeroes with kzalloc and the RB_ROOT initializer just happens to have the following definition: #define RB_ROOT (struct rb_root) { NULL, } But it's really neither clean nor a good practice as RB_ROOT is supposed to be opaque and in case it changes or we change those red black trees to some other data structure, it leaves us in a precarious situation. So initialize all the red black trees immediately after allocating the send context and before any jump into the 'out' label. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: send: iterate waiting dir move rbtree only once when processing refs	Filipe Manana	1	-5/+2
	When processing the new references for an inode, we unnecessarily iterate twice the waiting dir moves rbtree, once with is_waiting_for_move() and if we found an entry in the rbtree, we iterate it again with a call to get_waiting_dir_move(). This is pointless, we can make this simpler and more efficient by calling only get_waiting_dir_move(), so just do that. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: send: reduce searches on parent root when checking if dir can be removed	Filipe Manana	1	-6/+59
	During an incremental send, every time we remove a reference (dentry) for an inode and the parent directory does not exists anymore in the send root, we go check if we can remove the directory by making a call to can_rmdir(). This helper can only return true (value 1) if all dentries were already removed, and for that it always does a search on the parent root for dir index keys - if it finds any dentry referring to an inode with a number higher then the inode currently being processed, then the directory can not be removed and it must return false (value 0). However that means if a directory that was deleted had 1000 dentries, and each one pointed to an inode with a number higher then the number of the directory's inode, we end up doing 1000 searches on the parent root. Typically files are created in a directory after the directory was created and therefore they get an higher inode number than the directory. It's also common to have the each dentry pointing to an inode with a higher number then the inodes the previous dentries point to, for example when creating a series of files inside a directory, a very common pattern. So improve on that by having the first call to can_rmdir() for a directory to check the number of the inode that the last dentry points to and cache that inode number in the orphan dir structure. Then every subsequent call to can_rmdir() can avoid doing a search on the parent root if the number of the inode currently being processed is smaller than cached inode number at the directory's orphan dir structure. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()	Filipe Manana	1	-4/+7
	At can_rmdir() we start by searching the orphan dirs rbtree for an orphan dir object for the target directory. Later when iterating over the dir index keys, if we find that any dir entry points to inode for which there is a pending dir move or the inode was not yet processed, we exit because we can't remove the directory yet. However we end up always calling add_orphan_dir_info(), which will iterate again the rbtree and if there is already an orphan dir object (created by the first call to can_rmdir()), it returns the existing object. This is unnecessary work because in case there is already an existing orphan dir object, we got a reference to it at the start of can_rmdir(). So skip the call to add_orphan_dir_info() if we already have a reference for an orphan dir object. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: send: avoid duplicated orphan dir allocation and initialization	Filipe Manana	1	-15/+12
	At can_rmdir() we are allocating and initializing an orphan dir object twice. This can be deduplicated outside of the loop that iterates over the dir index keys. So deduplicate that code, even because other patch in the series will need to add more initialization code and another one will add one more condition. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: send: remove send_progress argument from can_rmdir()	Filipe Manana	1	-8/+5
	All callers of can_rmdir() pass sctx->cur_ino as the value for the send_progress argument, so remove the argument and directly use sctx->cur_ino. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: send: avoid extra b+tree searches when checking reference overrides	Filipe Manana	1	-32/+29
	During an incremental send, when processing the new references of an inode (either it's a new inode or an existing one renamed/moved), he will search the b+tree of the send or parent roots in order to find out the inode item of the parent directory and extract its generation. However we are doing that search twice, once with is_inode_existent() -> get_cur_inode_state() and then again at did_overwrite_ref() or will_overwrite_ref(). So avoid that and get the generation at get_cur_inode_state() and then propagate it up to did_overwrite_ref() and will_overwrite_ref(). This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: send: directly return from will_overwrite_ref() and simplify it	Filipe Manana	1	-22/+16
	There are no resources to release before will_overwrite_ref() returns, so we don't really need the 'out' label and jumping to it when conditions are met - we can directly return and get rid of the label and jumps. Also we can deal with -ENOENT and other errors in a single if-else logic, as it's more straightforward. This helps the next patch in the series to be more simple as well. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: send: avoid unnecessary generation search at did_overwrite_ref()	Filipe Manana	1	-9/+22
	At did_overwrite_ref() we always call get_inode_gen() to find out the generation of the inode 'ow_inode'. However we don't always need to use that generation, and in fact it's very common to not use it, so we end up doing a b+tree search on the send root, allocating a path, etc, for nothing. So improve on this by getting the generation only if we need to use it. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: send: directly return from did_overwrite_ref() and simplify it	Filipe Manana	1	-25/+18
	There are no resources to release before did_overwrite_ref() returns, so we don't really need the 'out' label and jumping to it when conditions are met - we can directly return and get rid of the label and jumps. Also we can deal with -ENOENT and other errors in a single if-else logic, as it's more straightforward. This helps the next patch in the series to be more simple as well. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: sysfs: update fs features directory asynchronously	Qu Wenruo	6	-23/+27
	[BUG] Since the introduction of per-fs feature sysfs interface (/sys/fs/btrfs/<UUID>/features/), the content of that directory is never updated. Thus for the following case, that directory will not show the new features like RAID56: # mkfs.btrfs -f $dev1 $dev2 $dev3 # mount $dev1 $mnt # btrfs balance start -f -mconvert=raid5 $mnt # ls /sys/fs/btrfs/$uuid/features/ extended_iref free_space_tree no_holes skinny_metadata While after unmount and mount, we got the correct features: # umount $mnt # mount $dev1 $mnt # ls /sys/fs/btrfs/$uuid/features/ extended_iref free_space_tree no_holes raid56 skinny_metadata [CAUSE] Because we never really try to update the content of per-fs features/ directory. We had an attempt to update the features directory dynamically in commit 14e46e04958d ("btrfs: synchronize incompat feature bits with sysfs files"), but unfortunately it get reverted in commit e410e34fad91 ("Revert "btrfs: synchronize incompat feature bits with sysfs files""). The problem in the original patch is, in the context of btrfs_create_chunk(), we can not afford to update the sysfs group. The exported but never utilized function, btrfs_sysfs_feature_update() is the leftover of such attempt. As even if we go sysfs_update_group(), new files will need extra memory allocation, and we have no way to specify the sysfs update to go GFP_NOFS. [FIX] This patch will address the old problem by doing asynchronous sysfs update in the cleaner thread. This involves the following changes: - Make __btrfs_(set\|clear)_fs_(incompat\|compat_ro) helpers to set BTRFS_FS_FEATURE_CHANGED flag when needed - Update btrfs_sysfs_feature_update() to use sysfs_update_group() And drop unnecessary arguments. - Call btrfs_sysfs_feature_update() in cleaner_kthread If we have the BTRFS_FS_FEATURE_CHANGED flag set. - Wake up cleaner_kthread in btrfs_commit_transaction if we have BTRFS_FS_FEATURE_CHANGED flag By this, all the previously dangerous call sites like btrfs_create_chunk() need no new changes, as above helpers would have already set the BTRFS_FS_FEATURE_CHANGED flag. The real work happens at cleaner_kthread, thus we pay the cost of delaying the update to sysfs directory, but the delayed time should be small enough that end user can not distinguish though it might get delayed if the cleaner thread is busy with removing subvolumes or defrag. CC: [email protected] # 4.14+ Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: remove duplicate include header in extent-tree.c	ye xingchen	1	-1/+0
	extent-tree.h is included more than once, added in a0231804affe ("btrfs: move extent-tree helpers into their own header file"). Signed-off-by: ye xingchen <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: scrub: improve tree block error reporting	Qu Wenruo	1	-9/+40
	[BUG] When debugging a scrub related metadata error, it turns out that our metadata error reporting is not ideal. The only 3 error messages are: - BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 0, gen 1 Showing we have metadata generation mismatch errors. - BTRFS error (device dm-2): unable to fixup (regular) error at logical 7110656 on dev /dev/mapper/test-scratch1 Showing which tree blocks are corrupted. - BTRFS warning (device dm-2): checksum/header error at logical 24772608 on dev /dev/mapper/test-scratch2, physical 3801088: metadata node (level 1) in tree 5 Showing which physical range the corrupted metadata is at. We have to combine the above 3 to know we have a corrupted metadata with generation mismatch. And this is already the better case, if we have other problems, like fsid mismatch, we can not even know the cause. [CAUSE] The problem is caused by the fact that, scrub_checksum_tree_block() never outputs any error message. It just return two bits for scrub: sblock->header_error, and sblock->generation_error. And later we report error in scrub_print_warning(), but unfortunately we only have two bits, there is not really much thing we can done to print any detailed errors. [FIX] This patch will do the following to enhance the error reporting of metadata scrub: - Add extra warning (ratelimited) for every error we hit This can help us to distinguish the different types of errors. Some errors can help us to know what's going wrong immediately, like bytenr mismatch. - Re-order the checks Currently we check bytenr first, then immediately generation. This can lead to false generation mismatch reports, while the fsid mismatches. Here is the new output for the bug I'm debugging (we forgot to writeback tree blocks for commit roots): BTRFS warning (device dm-2): tree block 24117248 mirror 1 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc BTRFS warning (device dm-2): tree block 24117248 mirror 0 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc Now we can immediately know it's some tree blocks didn't even get written back, other than the original confusing generation mismatch. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: don't use size classes for zoned file systems	Boris Burkov	3	-3/+13
	When a file system has ZNS devices which are constrained by a maximum number of active block groups, then not being able to use all the block groups for every allocation is not ideal, and could cause us to loop a ton with mixed size allocations. In general, since zoned doesn't write into gaps behind where block groups are writing, it is not susceptible to the same sort of fragmentation that size classes are designed to solve, so we can skip size classes for zoned file systems in general, even though there would probably be no harm for SMR devices. Signed-off-by: Boris Burkov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: load block group size class when caching	Boris Burkov	1	-0/+148
	Since the size class is an artifact of an arbitrary anti fragmentation strategy, it doesn't really make sense to persist it. Furthermore, most of the size class logic assumes fresh block groups. That is of course not a reasonable assumption -- we will be upgrading kernels with existing filesystems whose block groups are not classified. To work around those issues, implement logic to compute the size class of the block groups as we cache them in. To perfectly assess the state of a block group, we would have to read the entire extent tree (since the free space cache mashes together contiguous extent items) which would be prohibitively expensive for larger file systems with more extents. We can do it relatively cheaply by implementing a simple heuristic of sampling a handful of extents and picking the smallest one we see. In the happy case where the block group was classified, we will only see extents of the correct size. In the unhappy case, we will hopefully find one of the smaller extents, but there is no perfect answer anyway. Autorelocation will eventually churn up the block group if there is significant freeing anyway. There was no regression in mount performance at end state of the fsperf test suite, and the delay until the block group is marked cached is minimized by the constant number of extent samples. Signed-off-by: Boris Burkov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: introduce size class to block group allocator	Boris Burkov	4	-51/+148
	The aim of this patch is to reduce the fragmentation of block groups under certain unhappy workloads. It is particularly effective when the size of extents correlates with their lifetime, which is something we have observed causing fragmentation in the fleet at Meta. This patch categorizes extents into size classes: - x < 128KiB: "small" - 128KiB < x < 8MiB: "medium" - x > 8MiB: "large" and as much as possible reduces allocations of extents into block groups that don't match the size class. This takes advantage of any (possible) correlation between size and lifetime and also leaves behind predictable re-usable gaps when extents are freed; small writes don't gum up bigger holes. Size classes are implemented in the following way: - Mark each new block group with a size class of the first allocation that goes into it. - Add two new passes to ffe: "unset size class" and "wrong size class". First, try only matching block groups, then try unset ones, then allow allocation of new ones, and finally allow mismatched block groups. - Filtering is done just by skipping inappropriate ones, there is no special size class indexing. Other solutions I considered were: - A best fit allocator with an rb-tree. This worked well, as small writes didn't leak big holes from large freed extents, but led to regressions in ffe and write performance due to lock contention on the rb-tree with every allocation possibly updating it in parallel. Perhaps something clever could be done to do the updates in the background while being "right enough". - A fixed size "working set". This prevents freeing an extent drastically changing where writes currently land, and seems like a good option too. Doesn't take advantage of size in any way. - The same size class idea, but implemented with xarray marks. This turned out to be slower than looping the linked list and skipping wrong block groups, and is also less flexible since we must have only 3 size classes (max #marks). With the current approach we can have as many as we like. Performance testing was done via: https://github.com/josefbacik/fsperf Of particular relevance are the new fragmentation specific tests. A brief summary of the testing results: - Neutral results on existing tests. There are some minor regressions and improvements here and there, but nothing that truly stands out as notable. - Improvement on new tests where size class and extent lifetime are correlated. Fragmentation in these cases is completely eliminated and write performance is generally a little better. There is also significant improvement where extent sizes are just a bit larger than the size class boundaries. - Regression on one new tests: where the allocations are sized intentionally a hair under the borders of the size classes. Results are neutral on the test that intentionally attacks this new scheme by mixing extent size and lifetime. The full dump of the performance results can be found here: https://bur.io/fsperf/size-class-2022-11-15.txt (there are ANSI escape codes, so best to curl and view in terminal) Here is a snippet from the full results for a new test which mixes buffered writes appending to a long lived set of files and large short lived fallocates: bufferedappendvsfallocate results metric baseline current stdev diff ====================================================================================== avg_commit_ms 31.13 29.20 2.67 -6.22% bg_count 14 15.60 0 11.43% commits 11.10 12.20 0.32 9.91% elapsed 27.30 26.40 2.98 -3.30% end_state_mount_ns 11122551.90 10635118.90 851143.04 -4.38% end_state_umount_ns 1.36e+09 1.35e+09 12248056.65 -1.07% find_free_extent_calls 116244.30 114354.30 964.56 -1.63% find_free_extent_ns_max 599507.20 1047168.20 103337.08 74.67% find_free_extent_ns_mean 3607.19 3672.11 101.20 1.80% find_free_extent_ns_min 500 512 6.67 2.40% find_free_extent_ns_p50 2848 2876 37.65 0.98% find_free_extent_ns_p95 4916 5000 75.45 1.71% find_free_extent_ns_p99 20734.49 20920.48 1670.93 0.90% frag_pct_max 61.67 0 8.05 -100.00% frag_pct_mean 43.59 0 6.10 -100.00% frag_pct_min 25.91 0 16.60 -100.00% frag_pct_p50 42.53 0 7.25 -100.00% frag_pct_p95 61.67 0 8.05 -100.00% frag_pct_p99 61.67 0 8.05 -100.00% fragmented_bg_count 6.10 0 1.45 -100.00% max_commit_ms 49.80 46 5.37 -7.63% sys_cpu 2.59 2.62 0.29 1.39% write_bw_bytes 1.62e+08 1.68e+08 17975843.50 3.23% write_clat_ns_mean 57426.39 54475.95 2292.72 -5.14% write_clat_ns_p50 46950.40 42905.60 2101.35 -8.62% write_clat_ns_p99 148070.40 143769.60 2115.17 -2.90% write_io_kbytes 4194304 4194304 0 0.00% write_iops 2476.15 2556.10 274.29 3.23% write_lat_ns_max 2101667.60 2251129.50 370556.59 7.11% write_lat_ns_mean 59374.91 55682.00 2523.09 -6.22% write_lat_ns_min 17353.10 16250 1646.08 -6.36% There are some mixed improvements/regressions in most metrics along with an elimination of fragmentation in this workload. On the balance, the drastic 1->0 improvement in the happy cases seems worth the mix of regressions and improvements we do observe. Some considerations for future work: - Experimenting with more size classes - More hinting/search ordering work to approximate a best-fit allocator Signed-off-by: Boris Burkov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: add more find_free_extent tracepoints	Boris Burkov	2	-0/+7
	find_free_extent is a complicated function. It consists (at least) of: - a hint that jumps into the middle of a for loop macro - a middle loop trying every raid level - an outer loop ascending through ffe loop levels - complicated logic for skipping some of those ffe loop levels - multiple underlying in-bg allocators (zoned, cluster, no cluster) Which is all to say that more tracing is helpful for debugging its behavior. Add two new tracepoints: at the entrance to the block_groups loop (hit for every raid level and every ffe_ctl loop) and at the point we seriously consider a block_group for allocation. This way we can see the whole path through the algorithm, including hints, multiple loops, etc. Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Boris Burkov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: pass find_free_extent_ctl to allocator tracepoints	Boris Burkov	3	-86/+82
	The allocator tracepoints currently have a pile of values from ffe_ctl. In modifying the allocator and adding more tracepoints, I found myself adding to the already long argument list of the tracepoints. It makes it a lot simpler to just send in the ffe_ctl itself. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Boris Burkov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: remove the wait argument to btrfs_start_ordered_extent	Christoph Hellwig	5	-21/+18
	Given that wait is always set to 1, so remove the argument. Last use of wait with 0 was in 0c304304feab ("Btrfs: remove csum_bytes_left"). Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: use a single variable to track return value for log_dir_items()	Filipe Manana	1	-19/+18
	We currently use 'ret' and 'err' to track the return value for log_dir_items(), which is confusing and likely the cause for previous bugs where log_dir_items() did not return an error when it should, fixed in previous patches. So change this and use only a single variable, 'ret', to track the return value. This is simpler and makes it similar to most of the existing code. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: use a negative value for BTRFS_LOG_FORCE_COMMIT	Filipe Manana	2	-9/+10
	Currently we use the value 1 for BTRFS_LOG_FORCE_COMMIT, but that value has a few inconveniences: 1) If it's ever used by btrfs_log_inode(), or any function down the call chain, we have to remember to btrfs_set_log_full_commit(), which is repetitive and has a chance to be forgotten in future use cases. btrfs_log_inode_parent() only calls btrfs_set_log_full_commit() when it gets a negative value from btrfs_log_inode(); 2) Down the call chain of btrfs_log_inode(), we may have functions that need to force a log commit, but can return either an error (negative value), false (0) or true (1). So they are forced to return some random negative to force a log commit - using BTRFS_LOG_FORCE_COMMIT would make the intention more clear. Currently the only example is flush_dir_items_batch(). So turn BTRFS_LOG_FORCE_COMMIT into a negative value. The chosen value is -(MAX_ERRNO + 1), so that it does not overlap any errno value and makes it easier to debug. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: use PAGE_{ALIGN, ALIGNED, ALIGN_DOWN} macro	Yushan Zhou	6	-9/+8
	The header file linux/mm.h provides PAGE_ALIGN, PAGE_ALIGNED, PAGE_ALIGN_DOWN macros. Use these macros to make code more concise. Signed-off-by: Yushan Zhou <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: go to matching label when cleaning em in btrfs_submit_direct	Peng Hao	1	-1/+1
	When btrfs_get_chunk_map fails to allocate a new em the cleanup does not need to be done so the goto target is out_err, which is consistent with current coding style. Signed-off-by: Peng Hao <[email protected]> Reviewed-by: David Sterba <[email protected]> [ update changelog ] Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: turn on -Wmaybe-uninitialized	Josef Bacik	1	-1/+2
	We had a recent bug that would have been caught by a newer compiler with -Wmaybe-uninitialized and would have saved us a month of failing tests that I didn't have time to investigate. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: fix uninitialized variable warning in run_one_async_start	Josef Bacik	1	-0/+4
	With -Wmaybe-uninitialized compiler complains about ret being possibly uninitialized, which isn't possible as the WQ_ constants are set only from our code, however we can handle the default case and get rid of the warning. The value is set to BLK_STS_IOERR so it does not issue any IO and could be potentially detected, but this is basically a "cannot happen" error. To catch any problems during development use the assert. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> [ set the error in default: ] Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: zoned: fix uninitialized variable warning in btrfs_get_dev_zones	Naohiro Aota	1	-2/+5
	Fix an uninitialized warning we get with -Wmaybe-uninitialized where it thought zno may have been uninitialized, in both cases it depends on zinfo->zone_cache but we know the value won't change between checks. Reported-by: Josef Bacik <[email protected]> Link: https://lore.kernel.org/linux-btrfs/af6c527cbd8bdc782e50bd33996ee83acc3a16fb.1671221596.git.josef@toxicpanda.com/ Signed-off-by: Naohiro Aota <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: fix uninitialized variable warning in btrfs_sb_log_location	Josef Bacik	1	-1/+1
	We only have 3 possible mirrors, and we have ASSERT()'s to make sure we're not passing in an invalid super mirror into this function, so technically this value isn't uninitialized. However -Wmaybe-uninitialized will complain, so set it to U64_MAX so if we don't have ASSERT()'s turned on it'll error out later on when it see's the zone is beyond our maximum zones. Reviewed-by: Naohiro Aota <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: fix uninitialized variable warnings in __set_extent_bit and ↵	Josef Bacik	1	-4/+4
	convert_extent_bit We will pass in the parent and p pointer into our tree_search function to avoid doing a second search when inserting a new extent state into the tree. However because this is conditional upon passing in these pointers the compiler seems to think these values can be uninitialized if we're using -Wmaybe-uninitialized. Fix this by initializing these values. Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: fix uninitialized variable warning in btrfs_update_block_group	Josef Bacik	1	-1/+1
	reclaim isn't set in the alloc case, however we only care about reclaim in the !alloc case. This isn't an actual problem, however -Wmaybe-uninitialized will complain, so initialize reclaim to quiet the compiler. Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: fix uninitialized variable warning in get_inode_gen	Josef Bacik	1	-5/+3
	Anybody that calls get_inode_gen() can have an uninitialized gen if there's an error. This isn't a big deal because all the users just exit if they get an error, however it makes -Wmaybe-uninitialized complain, so fix this up to always initialize the passed in gen, this quiets all of the uninitialized warnings in send.c. Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: fix uninitialized variable warning in btrfs_cleanup_ordered_extents	Josef Bacik	1	-1/+1
	We can conditionally pass in a locked page, and then we'll use that page range to skip marking errors as that will happen in another layer. However this causes the compiler to complain because it doesn't understand we only use these values when we have the page. Make the compiler stop complaining by setting these values to 0. Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: move btrfs_abort_transaction to transaction.c	Josef Bacik	4	-64/+60
	While trying to sync messages.[ch] I ended up with this dependency on messages.h in the rest of btrfs-progs code base because it's where btrfs_abort_transaction() was now held. We want to keep messages.[ch] limited to the kernel code, and the btrfs_abort_transaction() code better fits in the transaction code and not in messages. Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> [ move the __cold attributes ] Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: directly pass in fs_info to btrfs_merge_delayed_refs	Johannes Thumshirn	3	-5/+4
	Now that none of the functions called by btrfs_merge_delayed_refs() needs a btrfs_trans_handle, directly pass in a btrfs_fs_info to btrfs_merge_delayed_refs(). Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: drop trans parameter of insert_delayed_ref	Johannes Thumshirn	1	-4/+3
	Now that drop_delayed_ref() doesn't need a btrfs_trans_handle, drop it from insert_delayed_ref() as well. Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: remove trans parameter of merge_ref	Johannes Thumshirn	1	-3/+2
	Now that drop_delayed_ref() doesn't get the btrfs_trans_handle passed in anymore, we can get rid of it in merge_ref() as well. Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	btrfs: drop unused trans parameter of drop_delayed_ref	Johannes Thumshirn	1	-5/+4
	drop_delayed_ref() doesn't use the btrfs_trans_handle it gets passed in, so remove it. Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-02-13	fs: ecryptfs: Use crypto_wait_req	Herbert Xu	1	-27/+3
	This patch replaces the custom crypto completion function with crypto_req_done. Signed-off-by: Herbert Xu <[email protected]> Acked-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Herbert Xu <[email protected]>
2023-02-13	zonefs: make kobj_type structure constant	Thomas Weißschuh	1	-1/+1
	Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definition to prevent modification at runtime. Signed-off-by: Thomas Weißschuh <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Damien Le Moal <[email protected]>
2023-02-13	xfs: refactor the filestreams allocator pick functions	Dave Chinner	2	-132/+145
	Now that the filestreams allocator is largely rewritten, restructure the main entry point and pick function to seperate out the different operations cleanly. The MRU lookup function should not handle the start AG selection on MRU lookup failure, and nor should the pick function handle building the association that is inserted into the MRU. This leaves the filestreams allocator fairly clean and easy to understand, returning to the caller with an active perag reference and a target block to allocate at. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>
2023-02-13	xfs: return a referenced perag from filestreams allocator	Dave Chinner	2	-47/+85
	Now that the filestreams AG selection tracks active perags, we need to return an active perag to the core allocator code. This is because the file allocation the filestreams code will run are AG specific allocations and so need to pin the AG until the allocations complete. We cannot rely on the filestreams item reference to do this - the filestreams association can be torn down at any time, hence we need to have a separate reference for the allocation process to pin the AG after it has been selected. This means there is some perag juggling in allocation failure fallback paths as they will do all AG scans in the case the AG specific allocation fails. Hence we need to track the perag reference that the filestream allocator returned to make sure we don't leak it on repeated allocation failure. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>
2023-02-13	xfs: pass perag to filestreams tracing	Dave Chinner	3	-42/+25
	Pass perags instead of raw ag numbers, avoiding the need for the special peek function for the tracing code. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>
2023-02-13	xfs: use for_each_perag_wrap in xfs_filestream_pick_ag	Dave Chinner	1	-60/+41
	xfs_filestream_pick_ag() is now ready to rework to use for_each_perag_wrap() for iterating the perags during the AG selection scan. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>
2023-02-13	xfs: track an active perag reference in filestreams	Dave Chinner	1	-57/+43
	Rather than just track the agno of the reference, track a referenced perag pointer instead. This will allow active filestreams to prevent AGs from going away until the filestreams have been torn down. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>
2023-02-13	xfs: factor out MRU hit case in xfs_filestream_select_ag	Dave Chinner	1	-50/+83
	Because it now stands out like a sore thumb. Factoring out this case starts the process of simplifying xfs_filestream_select_ag() again. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>
2023-02-13	xfs: remove xfs_filestream_select_ag() longest extent check	Dave Chinner	1	-17/+1
	Picking a new AG checks the longest free extent in the AG is valid, so there's no need to repeat the check in xfs_filestream_select_ag(). Remove it. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>
2023-02-13	xfs: merge new filestream AG selection into xfs_filestream_select_ag()	Dave Chinner	1	-72/+40
	This is largely a wrapper around xfs_filestream_pick_ag() that repeats a lot of the lookups that we just merged back into xfs_filestream_select_ag() from the lookup code. Merge the xfs_filestream_new_ag() code back into _select_ag() to get rid of all the unnecessary logic. Indeed, this makes it obvious that if we have no parent inode, the filestreams allocator always selects AG 0 regardless of whether it is fit for purpose or not. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>
2023-02-13	xfs: merge filestream AG lookup into xfs_filestream_select_ag()	Dave Chinner	1	-114/+70
	The lookup currently either returns the cached filestream AG or it calls xfs_filestreams_select_lengths() to looks up a new AG. This has verify the AG that is selected, so we end up doing "select a new AG loop in a couple of places when only one really is needed. Merge the initial lookup functionality with the length selection so that we only need to do a single pick loop on lookup or verification failure. This undoes a lot of the factoring that enabled the selection to be moved over to the filestreams code. It makes xfs_filestream_select_ag() an awful messier, but it has to be made worse before it can get better in future patches... Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>
2023-02-13	xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.c	Dave Chinner	4	-87/+115
	xfs_bmap_btalloc_filestreams() calls two filestreams functions to select the AG to allocate from. Both those functions end up in the same selection function that iterates all AGs multiple times. Worst case, xfs_bmap_btalloc_filestreams() can iterate all AGs 4 times just to select the initial AG to allocate in. Move the AG selection to fs/xfs/xfs_filestreams.c as a single interface so that the inefficient AG interation is contained entirely within the filestreams code. This will allow the implementation to be simplified and made more efficient in future patches. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>
2023-02-13	xfs: use xfs_bmap_longest_free_extent() in filestreams	Dave Chinner	3	-15/+11
	The code in xfs_bmap_longest_free_extent() is open coded in xfs_filestream_pick_ag(). Export xfs_bmap_longest_free_extent and call it from the filestreams code instead. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>
2023-02-13	xfs: get rid of notinit from xfs_bmap_longest_free_extent	Dave Chinner	1	-48/+37
	It is only set if reading the AGF gets a EAGAIN error. Just return the EAGAIN error and handle that error in the callers. This means we can remove the not_init parameter from xfs_bmap_select_minlen(), too, because the use of not_init there is pessimistic. If we can't read the agf, it won't increase blen. The only time we actually care whether we checked all the AGFs for contiguous free space is when the best length is less than the minimum allocation length. If not_init is set, then we ignore blen and set the minimum alloc length to the absolute minimum, not the best length we know already is present. However, if blen is less than the minimum we're going to ignore it anyway, regardless of whether we scanned all the AGFs or not. Hence not_init can go away, because we only use if blen is good from the scanned AGs otherwise we ignore it altogether and use minlen. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>