aboutsummaryrefslogtreecommitdiff
path: root/fs/xfs/libxfs/xfs_btree.c
AgeCommit message (Collapse)AuthorFilesLines
2024-07-04xfs: Avoid races with cnt_btree lastrec updatesZizhi Wo1-51/+0
A concurrent file creation and little writing could unexpectedly return -ENOSPC error since there is a race window that the allocator could get the wrong agf->agf_longest. Write file process steps: 1) Find the entry that best meets the conditions, then calculate the start address and length of the remaining part of the entry after allocation. 2) Delete this entry and update the -current- agf->agf_longest. 3) Insert the remaining unused parts of this entry based on the calculations in 1), and update the agf->agf_longest again if necessary. Create file process steps: 1) Check whether there are free inodes in the inode chunk. 2) If there is no free inode, check whether there has space for creating inode chunks, perform the no-lock judgment first. 3) If the judgment succeeds, the judgment is performed again with agf lock held. Otherwire, an error is returned directly. If the write process is in step 2) but not go to 3) yet, the create file process goes to 2) at this time, it may be mistaken for no space, resulting in the file system still has space but the file creation fails. We have sent two different commits to the community in order to fix this problem[1][2]. Unfortunately, both solutions have flaws. In [2], I discussed with Dave and Darrick, realized that a better solution to this problem requires the "last cnt record tracking" to be ripped out of the generic btree code. And surprisingly, Dave directly provided his fix code. This patch includes appropriate modifications based on his tmp-code to address this issue. The entire fix can be roughly divided into two parts: 1) Delete the code related to lastrec-update in the generic btree code. 2) Place the process of updating longest freespace with cntbt separately to the end of the cntbt modifications. Move the cursor to the rightmost firstly, and update the longest free extent based on the record. Note that we can not update the longest with xfs_alloc_get_rec() after find the longest record, as xfs_verify_agbno() may not pass because pag->block_count is updated on the outside. Therefore, use xfs_btree_get_rec() as a replacement. [1] https://lore.kernel.org/all/20240419061848.1032366-2-yebin10@huawei.com [2] https://lore.kernel.org/all/20240604071121.3981686-1-wozizhi@huawei.com Reported by: Ye Bin <yebin10@huawei.com> Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-22xfs: support in-memory btreesDarrick J. Wong1-37/+219
Adapt the generic btree cursor code to be able to create a btree whose buffers come from a (presumably in-memory) buftarg with a header block that's specific to in-memory btrees. We'll connect this to other parts of online scrub in the next patches. Note that in-memory btrees always have a block size matching the system memory page size for efficiency reasons. There are also a few things we need to do to finalize a btree update; that's covered in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: add a xfs_btree_ptrs_equal helperChristoph Hellwig1-13/+17
This only has a single caller and thus might be a bit questionable, but I think it really improves the readability of xfs_btree_visit_block. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: move and rename xfs_btree_read_buflChristoph Hellwig1-30/+0
Despite its name, xfs_btree_read_bufl doesn't contain any btree-related functionaliy and isn't used by the btree code. Move it to xfs_bmap.c, hard code the refval and ops arguments and rename it to xfs_bmap_read_buf. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: remove xfs_btree_reada_bufsChristoph Hellwig1-28/+10
xfs_btree_reada_bufl just wraps xfs_btree_readahead and a agblock to daddr conversion. Just open code it's three callsites in the two callers (One of which isn't even btree related). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: remove xfs_btree_reada_buflChristoph Hellwig1-24/+6
xfs_btree_reada_bufl just wraps xfs_btree_readahead and a fsblock to daddr conversion. Just open code it's two callsites in the only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: factor out a __xfs_btree_check_lblock_hdr helperChristoph Hellwig1-7/+23
This will allow sharing code with the in-memory block checking helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: rename btree helpers that depends on the block number representationChristoph Hellwig1-31/+33
All these helpers hardcode fsblocks or agblocks and not just the pointer size. Rename them so that the names are still fitting when we add the long format in-memory blocks and adjust the checks when calling them to check the btree types and not just pointer length. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: consolidate btree block verificationChristoph Hellwig1-42/+30
Add a __xfs_btree_check_block helper that can be called by the scrub code to validate a btree block of any form, and move the duplicate error handling code from xfs_btree_check_sblock and xfs_btree_check_lblock into xfs_btree_check_block and thus remove these two helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: tighten up validation of root block in inode forksChristoph Hellwig1-3/+13
Check that root blocks that sit in the inode fork and thus have a NULL bp don't have siblings. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: remove the crc variable in __xfs_btree_check_lblockChristoph Hellwig1-2/+1
crc is only used once, just use the xfs_has_crc check directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: misc cleanups for __xfs_btree_check_sblockChristoph Hellwig1-8/+4
Remove the local crc variable that is only used once and remove the bp NULL checking as it can't ever be NULL for short form blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: consolidate btree ptr checkingChristoph Hellwig1-31/+29
Merge xfs_btree_check_sptr and xfs_btree_check_lptr into a single __xfs_btree_check_ptr that can be shared between xfs_btree_check_ptr and the scrub code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: simplify xfs_btree_check_lblock_siblingsChristoph Hellwig1-16/+6
Stop using xfs_btree_check_lptr in xfs_btree_check_lblock_siblings, as it only duplicates the xfs_verify_fsbno call in the other leg of if / else besides adding a tautological level check. With this the cur and level arguments can be removed as they are now unused. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: simplify xfs_btree_check_sblock_siblingsChristoph Hellwig1-13/+6
Stop using xfs_btree_check_sptr in xfs_btree_check_sblock_siblings, as it only duplicates the xfs_verify_agbno call in the other leg of if / else besides adding a tautological level check. With this the cur and level arguments can be removed as they are now unused. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: remove xfs_btnum_tChristoph Hellwig1-2/+2
The last checks for bc_btnum can be replaced with helpers that check the btree ops. This allows adding new btrees to XFS without having to update a global enum. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: complete the ops predicates] Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: add a name field to struct xfs_btree_opsChristoph Hellwig1-4/+4
The btnum in struct xfs_btree_ops is often used for printing a symbolic name for the btree. Add a name field to the ops structure and use that directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: don't override bc_ops for staging btreesChristoph Hellwig1-16/+59
Add a few conditionals for staging btrees to the core btree code instead of overloading the bc_ops vector. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: add a xfs_btree_init_ptr_from_curChristoph Hellwig1-4/+23
Inode-rooted btrees don't need to initialize the root pointer in the ->init_ptr_from_cur method as the root is found by the xfs_btree_get_iroot method later. Make ->init_ptr_from_cur option for inode rooted btrees by providing a helper that does the right thing for the given btree type and also documents the semantics. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: create predicate to determine if cursor is at inode root levelDarrick J. Wong1-30/+22
Create a predicate to decide if the given cursor and level point to the root block in the inode immediate area instead of a disk block, and get rid of the open-coded logic everywhere. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: split the per-btree union in struct xfs_btree_curChristoph Hellwig1-1/+1
Split up the union that encodes btree-specific fields in struct xfs_btree_cur. Most fields in there are specific to the btree type encoded in xfs_btree_ops.type, and we can use the obviously named union for that. But one field is specific to the bmapbt and two are shared by the refcount and rtrefcountbt. Move those to a separate union to make the usage clear and not need a separate struct for the refcount-related fields. This will also make unnecessary some very awkward btree cursor refc/rtrefc switching logic in the rtrefcount patchset. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: split out a btree type from the btree ops geometry flagsChristoph Hellwig1-30/+36
Two of the btree cursor flags are always used together and encode the fundamental btree type. There currently are two such types: 1) an on-disk AG-rooted btree with 32-bit pointers 2) an on-disk inode-rooted btree with 64-bit pointers and we're about to add: 3) an in-memory btree with 64-bit pointers Introduce a new enum and a new type field in struct xfs_btree_geom to encode this type directly instead of using flags and change most code to switch on this enum. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: make the pointer lengths explicit] Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: store the btree pointer length in struct xfs_btree_opsDarrick J. Wong1-33/+24
Make the pointer length an explicit field in the btree operations structure so that the next patch (which introduces an explicit btree type enum) doesn't have to play a bunch of awkward games with inferring the pointer length from the enumeration. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: factor out a btree block owner checkDarrick J. Wong1-5/+28
Hoist the btree block owner check into a separate helper so that we don't have an ugly multiline if statement. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: factor out a xfs_btree_owner helperDarrick J. Wong1-14/+11
Split out a helper to calculate the owner for a given btree instead of duplicating the logic in two places. While we're at it, make the bc_ag/bc_ino switch logic depend on the correct geometry flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: break this up into two patches for the owner check] Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: move lru refs to the btree ops structureDarrick J. Wong1-22/+2
Move the btree buffer LRU refcount to the btree ops structure so that we can eliminate the last bc_btnum switch in the generic btree code. We're about to create repair-specific btree types, and we don't want that stuff cluttering up libxfs. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: set btree block buffer ops in _init_bufDarrick J. Wong1-0/+1
Set the btree block buffer ops in xfs_btree_init_buf since we already have access to that information through the btree ops. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: remove the unnecessary daddr paramter to _init_blockDarrick J. Wong1-3/+16
Now that all of the callers pass XFS_BUF_DADDR_NULL as the daddr parameter, we can elide that too. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: btree convert xfs_btree_init_block to xfs_btree_init_buf callsDarrick J. Wong1-2/+1
Convert any place we call xfs_btree_init_block with a buffer to use the _init_buf function. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: rename btree block/buffer init functionsDarrick J. Wong1-4/+4
Rename xfs_btree_init_block_int to xfs_btree_init_block, and xfs_btree_init_block to xfs_btree_init_buf so that the name suggests the type that caller are supposed to pass in. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: initialize btree blocks using btree_ops structureDarrick J. Wong1-34/+23
Notice now that the btree ops structure encodes btree geometry flags and the magic number through the buffer ops. Refactor the btree block initialization functions to use the btree ops so that we no longer have to open code all that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: remove bc_ino.flagsChristoph Hellwig1-1/+1
Just move the two flags into bc_flags where there is plenty of space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22xfs: encode the btree geometry flags in the btree ops structureDarrick J. Wong1-55/+55
Certain btree flags never change for the life of a btree cursor because they describe the geometry of the btree itself. Encode these in the btree ops structure and reduce the amount of code required in each btree type's init_cursor functions. This also frees up most of the bits in bc_flags. A previous version of this patch also converted the open-coded flags logic to helpers. This was removed due to the pending refactoring (that follows this patch) to eliminate most of the state flags. Conversion script: sed \ -e 's/XFS_BTREE_LONG_PTRS/XFS_BTGEO_LONG_PTRS/g' \ -e 's/XFS_BTREE_ROOT_IN_INODE/XFS_BTGEO_ROOT_IN_INODE/g' \ -e 's/XFS_BTREE_LASTREC_UPDATE/XFS_BTGEO_LASTREC_UPDATE/g' \ -e 's/XFS_BTREE_OVERLAPPING/XFS_BTGEO_OVERLAPPING/g' \ -e 's/cur->bc_flags & XFS_BTGEO_/cur->bc_ops->geom_flags \& XFS_BTGEO_/g' \ -i $(git ls-files fs/xfs/*.[ch] fs/xfs/libxfs/*.[ch] fs/xfs/scrub/*.[ch]) Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: drop XFS_BTREE_CRC_BLOCKSDarrick J. Wong1-4/+4
All existing btree types set XFS_BTREE_CRC_BLOCKS when running against a V5 filesystem. All currently proposed btree types are V5 only and use the richer XFS_BTREE_CRC_BLOCKS format. Therefore, we can drop this flag and change the conditional to xfs_has_crc. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: consolidate btree block allocation tracepointsDarrick J. Wong1-3/+17
Don't waste tracepoint segment memory on per-btree block allocation tracepoints when we can do it from the generic btree code. With this patch applied, two tracepoints are collapsed into one tracepoint, with the following effects on objdump -hx xfs.ko output: Before: 10 __tracepoints_ptrs 00000b38 0000000000000000 0000000000000000 001412f0 2**2 14 __tracepoints_strings 00005433 0000000000000000 0000000000000000 001689a0 2**5 29 __tracepoints 00010d30 0000000000000000 0000000000000000 0023fe00 2**5 After: 10 __tracepoints_ptrs 00000b34 0000000000000000 0000000000000000 001417b0 2**2 14 __tracepoints_strings 00005413 0000000000000000 0000000000000000 00168e80 2**5 29 __tracepoints 00010cd0 0000000000000000 0000000000000000 00240760 2**5 Column 3 is the section size in bytes; removing these two tracepoints reduces the size of the ELF segments by 132 bytes. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: consolidate btree block freeing tracepointsDarrick J. Wong1-0/+2
Don't waste memory on extra per-btree block freeing tracepoints when we can do it from the generic btree code. With this patch applied, two tracepoints are collapsed into one tracepoint, with the following effects on objdump -hx xfs.ko output: Before: 10 __tracepoints_ptrs 00000b3c 0000000000000000 0000000000000000 00140eb0 2**2 14 __tracepoints_strings 00005453 0000000000000000 0000000000000000 00168540 2**5 29 __tracepoints 00010d90 0000000000000000 0000000000000000 0023f5e0 2**5 After: 10 __tracepoints_ptrs 00000b38 0000000000000000 0000000000000000 001412f0 2**2 14 __tracepoints_strings 00005433 0000000000000000 0000000000000000 001689a0 2**5 29 __tracepoints 00010d30 0000000000000000 0000000000000000 0023fe00 2**5 Column 3 is the section size in bytes; removing these two tracepoints reduces the size of the ELF segments by 132 bytes. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: report XFS_IS_CORRUPT errors to the health systemDarrick J. Wong1-1/+13
Whenever we encounter XFS_IS_CORRUPT failures, we should report that to the health monitoring system for later reporting. I started with this semantic patch and massaged everything until it built: @@ expression mp, test; @@ - if (XFS_IS_CORRUPT(mp, test)) return -EFSCORRUPTED; + if (XFS_IS_CORRUPT(mp, test)) { xfs_btree_mark_sick(cur); return -EFSCORRUPTED; } @@ expression mp, test; identifier label, error; @@ - if (XFS_IS_CORRUPT(mp, test)) { error = -EFSCORRUPTED; goto label; } + if (XFS_IS_CORRUPT(mp, test)) { xfs_btree_mark_sick(cur); error = -EFSCORRUPTED; goto label; } Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: report btree block corruption errors to the health systemDarrick J. Wong1-3/+22
Whenever we encounter corrupt btree blocks, we should report that to the health monitoring system for later reporting. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-13xfs: convert remaining kmem_free() to kfree()Dave Chinner1-1/+1
The remaining callers of kmem_free() are freeing heap memory, so we can convert them directly to kfree() and get rid of kmem_free() altogether. This conversion was done with: $ for f in `git grep -l kmem_free fs/xfs`; do > sed -i s/kmem_free/kfree/ $f > done $ Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-12-15xfs: repair refcount btreesDarrick J. Wong1-0/+26
Reconstruct the refcount data from the rmap btree. Link: https://docs.kernel.org/filesystems/xfs-online-fsck-design.html#case-study-rebuilding-the-space-reference-counts Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-15xfs: read leaf blocks when computing keys for bulkloading into node blocksDarrick J. Wong1-1/+1
When constructing a new btree, xfs_btree_bload_node needs to read the btree blocks for level N to compute the keyptrs for the blocks that will be loaded into level N+1. The level N blocks must be formatted at that point. A subsequent patch will change the btree bulkloader to write new btree blocks in 256K chunks to moderate memory consumption if the new btree is very large. As a consequence of that, it's possible that the buffers for lower level blocks might have been reclaimed by the time the node builder comes back to the block. Therefore, change xfs_btree_bload_node to read the lower level blocks to handle the reclaimed buffer case. As a side effect, the read will increase the LRU refs, which will bias towards keeping new btree buffers in memory after the new btree commits. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-04-11xfs: implement masked btree key comparisons for _has_records scansDarrick J. Wong1-4/+20
For keyspace fullness scans, we want to be able to mask off the parts of the key that we don't care about. For most btree types we /do/ want the full keyspace, but for checking that a given space usage also has a full complement of rmapbt records (even if different/multiple owners) we need this masking so that we only track sparseness of rm_startblock, not the whole keyspace (which is extremely sparse). Augment the ->diff_two_keys and ->keys_contiguous helpers to take a third union xfs_btree_key argument, and wire up xfs_rmap_has_records to pass this through. This third "mask" argument should contain a nonzero value in each structure field that should be used in the key comparisons done during the scan. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11xfs: replace xfs_btree_has_record with a general keyspace scannerDarrick J. Wong1-13/+95
The current implementation of xfs_btree_has_record returns true if it finds /any/ record within the given range. Unfortunately, that's not sufficient for scrub. We want to be able to tell if a range of keyspace for a btree is devoid of records, is totally mapped to records, or is somewhere in between. By forcing this to be a boolean, we conflated sparseness and fullness, which caused scrub to return incorrect results. Fix the API so that we can tell the caller which of those three is the current state. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11xfs: refactor ->diff_two_keys callsitesDarrick J. Wong1-33/+24
Create wrapper functions around ->diff_two_keys so that we don't have to remember what the return values mean, and adjust some of the code comments to reflect the longtime code behavior. We're going to introduce more uses of ->diff_two_keys in the next patch, so reduce the cognitive load for readers by doing this refactoring now. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11xfs: refactor converting btree irec to btree keyDarrick J. Wong1-8/+15
We keep doing these conversions to support btree queries, so refactor this into a helper. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-02-11xfs: t_firstblock is tracking AGs not blocksDave Chinner1-1/+1
The tp->t_firstblock field is now raelly tracking the highest AG we have locked, not the block number of the highest allocation we've made. It's purpose is to prevent AGF locking deadlocks, so rename it to "highest AG" and simplify the implementation to just track the agno rather than a fsbno. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05xfs: don't use BMBT btree split workers for IO completionDave Chinner1-2/+16
When we split a BMBT due to record insertion, we offload it to a worker thread because we can be deep in the stack when we try to allocate a new block for the BMBT. Allocation can use several kilobytes of stack (full memory reclaim, swap and/or IO path can end up on the stack during allocation) and we can already be several kilobytes deep in the stack when we need to split the BMBT. A recent workload demonstrated a deadlock in this BMBT split offload. It requires several things to happen at once: 1. two inodes need a BMBT split at the same time, one must be unwritten extent conversion from IO completion, the other must be from extent allocation. 2. there must be a no available xfs_alloc_wq worker threads available in the worker pool. 3. There must be sustained severe memory shortages such that new kworker threads cannot be allocated to the xfs_alloc_wq pool for both threads that need split work to be run 4. The split work from the unwritten extent conversion must run first. 5. when the BMBT block allocation runs from the split work, it must loop over all AGs and not be able to either trylock an AGF successfully, or each AGF is is able to lock has no space available for a single block allocation. 6. The BMBT allocation must then attempt to lock the AGF that the second task queued to the rescuer thread already has locked before it finds an AGF it can allocate from. At this point, we have an ABBA deadlock between tasks queued on the xfs_alloc_wq rescuer thread and a locked AGF. i.e. The queued task holding the AGF lock can't be run by the rescuer thread until the task the rescuer thread is runing gets the AGF lock.... This is a highly improbably series of events, but there it is. There's a couple of ways to fix this, but the easiest way to ensure that we only punt tasks with a locked AGF that holds enough space for the BMBT block allocations to the worker thread. This works for unwritten extent conversion in IO completion (which doesn't have a locked AGF and space reservations) because we have tight control over the IO completion stack. It is typically only 6 functions deep when xfs_btree_split() is called because we've already offloaded the IO completion work to a worker thread and hence we don't need to worry about stack overruns here. The other place we can be called for a BMBT split without a preceeding allocation is __xfs_bunmapi() when punching out the center of an existing extent. We don't remove extents in the IO path, so these operations don't tend to be called with a lot of stack consumed. Hence we don't really need to ship the split off to a worker thread in these cases, either. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-01-03xfs: fix off-by-one error in xfs_btree_space_to_heightDarrick J. Wong1-1/+6
Lately I've been stress-testing extreme-sized rmap btrees by using the (new) xfs_db bmap_inflate command to clone bmbt mappings billions of times and then using xfs_repair to build new rmap and refcount btrees. This of course is /much/ faster than actually FICLONEing a file billions of times. Unfortunately, xfs_repair fails in xfs_btree_bload_compute_geometry with EOVERFLOW, which indicates that xfs_mount.m_rmap_maxlevels is not sufficiently large for the test scenario. For a 1TB filesystem (~67 million AG blocks, 4 AGs) the btheight command reports: $ xfs_db -c 'btheight -n 4400801200 -w min rmapbt' /dev/sda rmapbt: worst case per 4096-byte block: 84 records (leaf) / 45 keyptrs (node) level 0: 4400801200 records, 52390491 blocks level 1: 52390491 records, 1164234 blocks level 2: 1164234 records, 25872 blocks level 3: 25872 records, 575 blocks level 4: 575 records, 13 blocks level 5: 13 records, 1 block 6 levels, 53581186 blocks total The AG is sufficiently large to build this rmap btree. Unfortunately, m_rmap_maxlevels is 5. Augmenting the loop in the space->height function to report height, node blocks, and blocks remaining produces this: ht 1 node_blocks 45 blockleft 67108863 ht 2 node_blocks 2025 blockleft 67108818 ht 3 node_blocks 91125 blockleft 67106793 ht 4 node_blocks 4100625 blockleft 67015668 final height: 5 The goal of this function is to compute the maximum height btree that can be stored in the given number of ondisk fsblocks. Starting with the top level of the tree, each iteration through the loop adds the fanout factor of the next level down until we run out of blocks. IOWs, maximum height is achieved by using the smallest fanout factor that can apply to that level. However, the loop setup is not correct. Top level btree blocks are allowed to contain fewer than minrecs items, so the computation is incorrect because the first time through the loop it should be using a fanout factor of 2. With this corrected, the above becomes: ht 1 node_blocks 2 blockleft 67108863 ht 2 node_blocks 90 blockleft 67108861 ht 3 node_blocks 4050 blockleft 67108771 ht 4 node_blocks 182250 blockleft 67104721 ht 5 node_blocks 8201250 blockleft 66922471 final height: 6 Fixes: 9ec691205e7d ("xfs: compute the maximum height of the rmap btree when reflink enabled") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-07-09xfs: convert XFS_IFORK_PTR to a static inline helperDarrick J. Wong1-2/+2
We're about to make this logic do a bit more, so convert the macro to a static inline function for better typechecking and fewer shouty macros. No functional changes here. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-07-07xfs: Pre-calculate per-AG agbno geometryDave Chinner1-15/+10
There is a lot of overhead in functions like xfs_verify_agbno() that repeatedly calculate the geometry limits of an AG. These can be pre-calculated as they are static and the verification context has a per-ag context it can quickly reference. In the case of xfs_verify_agbno(), we now always have a perag context handy, so we can store the AG length and the minimum valid block in the AG in the perag. This means we don't have to calculate it on every call and it can be inlined in callers if we move it to xfs_ag.h. Move xfs_ag_block_count() to xfs_ag.c because it's really a per-ag function and not an XFS type function. We need a little bit of rework that is specific to xfs_initialise_perag() to allow growfs to calculate the new perag sizes before we've updated the primary superblock during the grow (chicken/egg situation). Note that we leave the original xfs_verify_agbno in place in xfs_types.c as a static function as other callers in that file do not have per-ag contexts so still need to go the long way. It's been renamed to xfs_verify_agno_agbno() to indicate it takes both an agno and an agbno to differentiate it from new function. Future commits will make similar changes for other per-ag geometry validation functions. Further: $ size --totals fs/xfs/built-in.a text data bss dec hex filename before 1483006 329588 572 1813166 1baaae (TOTALS) after 1482185 329588 572 1812345 1ba779 (TOTALS) This rework reduces the binary size by ~820 bytes, indicating that much less work is being done to bounds check the agbno values against on per-ag geometry information. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>