path: root/fs/xfs/libxfs
2020-07-28  xfs: improve ondisk dquot flags checking  (Darrick J. Wong)  2 files, -3/+10
Create an XFS_DQTYPE_ANY mask for ondisk dquot flags, and use that to ensure that we never accept any garbage flags when we're loading dquots. While we're at it, restructure the quota type flag checking to use the proper masking. Note that I plan to add y2038 support soon, which will require a new xfs_dqtype_t flag for extended timestamp support, hence all the work to make the type masking work correctly. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
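As a rough illustration of the mask-based check described above (the bit assignments and the helper below are an illustrative sketch, not the committed code):

  #include <stdbool.h>
  #include <stdint.h>

  /* Sketch: known ondisk dquot type bits and the "any" mask built from them. */
  #define XFS_DQTYPE_USER   (1u << 0)
  #define XFS_DQTYPE_PROJ   (1u << 1)
  #define XFS_DQTYPE_GROUP  (1u << 2)
  #define XFS_DQTYPE_ANY    (XFS_DQTYPE_USER | XFS_DQTYPE_PROJ | XFS_DQTYPE_GROUP)

  static bool dqtype_is_valid(uint8_t d_type)
  {
          /* reject garbage bits outside the known mask */
          if (d_type & ~XFS_DQTYPE_ANY)
                  return false;
          /* exactly one quota type bit must be set */
          switch (d_type & XFS_DQTYPE_ANY) {
          case XFS_DQTYPE_USER:
          case XFS_DQTYPE_PROJ:
          case XFS_DQTYPE_GROUP:
                  return true;
          default:
                  return false;
          }
  }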
2020-07-28  xfs: create xfs_dqtype_t to represent quota types  (Darrick J. Wong)  3 files, -14/+20
Create a new type (xfs_dqtype_t) to represent the type of an incore dquot (user, group, project, or none). Rename the incore dquot's dq_flags field to q_type. This allows us to replace all the "uint type" arguments to the quota functions with "xfs_dqtype_t type", to make it obvious when we're passing a quota type argument into a function. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2020-07-28  xfs: rename XFS_DQ_{USER,GROUP,PROJ} to XFS_DQTYPE_*  (Darrick J. Wong)  3 files, -11/+13
We're going to split up the incore dquot state flags from the ondisk dquot flags (eventually renaming this "type") so start by renaming the three flags and the bitmask that are going to participate in this. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2020-07-28  xfs: drop the type parameter from xfs_dquot_verify  (Darrick J. Wong)  2 files, -10/+6
xfs_qm_reset_dqcounts (aka quotacheck) is the only xfs_dqblk_verify caller that actually knows the specific quota type that it's looking for. Since everything else just passes in type==0 (including the buffer verifier), drop the parameter and open-code the check like xfs_dquot_from_disk already does. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2020-07-28  xfs: remove qcore from incore dquots  (Darrick J. Wong)  1 file, -4/+3
Now that we've stopped using qcore entirely, drop it from the incore dquot. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Chandan Babu R <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2020-07-28  xfs: make XFS_DQUOT_CLUSTER_SIZE_FSB part of the ondisk format  (Darrick J. Wong)  1 file, -0/+16
Move the dquot cluster size #define to xfs_format.h. It is an important part of the ondisk format because the ondisk dquot record size is not an even power of two, which means that the buffer size we use is significant here because the kernel leaves slack space at the end of the buffer to avoid having to deal with a dquot record crossing a block boundary. This is also an excuse to fix one of the longstanding discrepancies between kernel and userspace libxfs headers. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Chandan Babu R <[email protected]>
2020-07-28  xfs: rename dquot incore state flags  (Darrick J. Wong)  1 file, -5/+5
Rename the existing incore dquot "dq_flags" field to "q_flags" to match everything else in the structure, then move the two actual dquot state flags to the XFS_DQFLAG_ namespace from XFS_DQ_. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Chandan Babu R <[email protected]>
2020-07-28  xfs: fix inode allocation block res calculation precedence  (Brian Foster)  1 file, -1/+1
The block reservation calculation for inode allocation is supposed to consist of the blocks required for the inode chunk plus (maxlevels-1) of the inode btree multiplied by the number of inode btrees in the fs (2 when finobt is enabled, 1 otherwise). Instead, the macro returns (ialloc_blocks + 2) due to a precedence error in the calculation logic. This leads to block reservation overruns via generic/531 on small block filesystems with finobt enabled. Add braces to fix the calculation and reserve the appropriate number of blocks. Fixes: 9d43b180af67 ("xfs: update inode allocation/free transaction reservations for finobt") Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
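The precedence pitfall is easy to demonstrate in isolation; the macros below are a simplified, hypothetical sketch of the bug shape described above, not the actual XFS macro:

  #include <stdio.h>

  /* Broken: '*' binds tighter than '?:', so the multiplication only applies to
   * the "no finobt" branch; with finobt enabled the whole term evaluates to 2
   * and the reservation degenerates to ialloc_blks + 2. */
  #define INODE_RES_BROKEN(ialloc_blks, has_finobt, maxlevels) \
          ((ialloc_blks) + ((has_finobt) ? 2 : 1 * ((maxlevels) - 1)))

  /* Fixed: braces force "(2 or 1) * (maxlevels - 1)" as intended. */
  #define INODE_RES_FIXED(ialloc_blks, has_finobt, maxlevels) \
          ((ialloc_blks) + (((has_finobt) ? 2 : 1) * ((maxlevels) - 1)))

  int main(void)
  {
          /* e.g. 16-block inode chunk, finobt enabled, 3 inobt levels */
          printf("broken: %d\n", INODE_RES_BROKEN(16, 1, 3)); /* 18 = 16 + 2     */
          printf("fixed:  %d\n", INODE_RES_FIXED(16, 1, 3));  /* 20 = 16 + 2 * 2 */
          return 0;
  }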
2020-07-14  xfs: get rid of unnecessary xfs_perag_{get,put} pairs  (Gao Xiang)  7 files, -64/+23
In the course of some operations, we look up the perag from the mount multiple times to get or change perag information. These are often very short pieces of code, so while the lookup cost is generally low, the cost of the lookup is far higher than the cost of the operation we are doing on the perag. Since we changed buffers to hold references to the perag they are cached in, many modification contexts already hold active references to the perag that are held across these operations. This is especially true for any operation that is serialised by an allocation group header buffer. In these cases, we can just use the buffer's reference to the perag to avoid needing to do lookups to access the perag. This means that many operations don't need to do perag lookups at all to access the perag because they've already looked up objects that own persistent references and hence can use that reference instead. Cc: Dave Chinner <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Signed-off-by: Gao Xiang <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
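The pattern described reduces to something like the sketch below (the stub types and the b_pag field name are assumptions made for illustration; the real structures and lookup helpers live in the kernel):

  #include <stdint.h>

  /* Stand-ins for the kernel structures involved. */
  struct xfs_perag { uint32_t pagf_freeblks; };
  struct xfs_buf   { struct xfs_perag *b_pag; };  /* buffer caches a perag reference */

  /* Declarations standing in for the real lookup interfaces. */
  struct xfs_perag *xfs_perag_get(void *mp, uint32_t agno);  /* lookup + take reference */
  void xfs_perag_put(struct xfs_perag *pag);                 /* drop reference */

  /* Before: every short operation does its own perag lookup. */
  static void ag_freeblks_add_old(void *mp, uint32_t agno, uint32_t len)
  {
          struct xfs_perag *pag = xfs_perag_get(mp, agno);
          pag->pagf_freeblks += len;
          xfs_perag_put(pag);
  }

  /* After: a caller that already holds the AG header buffer reuses the
   * buffer's cached perag reference and skips the lookup entirely. */
  static void ag_freeblks_add_new(struct xfs_buf *agbp, uint32_t len)
  {
          agbp->b_pag->pagf_freeblks += len;
  }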
2020-07-07  xfs: remove xfs_inobp_check()  (Dave Chinner)  2 files, -30/+0
This debug code is called on every xfs_iflush() call, which then checks every inode in the buffer for non-zero unlinked list field. Hence it checks every inode in the cluster buffer every time a single inode on that cluster is flushed. This is resulting in:

  -   38.91%     5.33%  [kernel]  [k] xfs_iflush
     - 17.70% xfs_iflush
        - 9.93% xfs_inobp_check
             4.36% xfs_buf_offset

10% of the CPU time spent flushing inodes is repeatedly checking unlinked fields in the buffer. We don't need to do this. The other place we call xfs_inobp_check() is xfs_iunlink_update_dinode(), and this is after we've done this assert for the agino we are about to write into that inode:

  ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));

which means we've already checked that the agino we are about to write is not 0 on debug kernels. The inode buffer verifiers do everything else we need, so let's just remove this debug code. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-07-07  xfs: attach inodes to the cluster buffer when dirtied  (Dave Chinner)  1 file, -3/+6
Rather than attach inodes to the cluster buffer just when we are doing IO, attach the inodes to the cluster buffer when they are dirtied. This means the buffer always carries a list of dirty inodes that reference it, and we can use that list to make more fundamental changes to inode writeback that aren't otherwise possible. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-07-07  xfs: pin inode backing buffer to the inode log item  (Dave Chinner)  2 files, -7/+49
When we dirty an inode, we are going to have to write it to disk at some point in the near future. This requires the inode cluster backing buffer to be present in memory. Unfortunately, under severe memory pressure we can reclaim the inode backing buffer while the inode is dirty in memory, resulting in stalling the AIL pushing because it has to do a read-modify-write cycle on the cluster buffer. When we have no memory available, the read of the cluster buffer blocks the AIL pushing process, and this causes all sorts of issues for memory reclaim as it requires inode writeback to make forwards progress. Allocating a cluster buffer causes more memory pressure, and results in more cluster buffers being reclaimed, resulting in more RMW cycles to be done in the AIL context and everything then backs up on AIL progress. Only the synchronous inode cluster writeback in the inode reclaim code provides some level of forwards progress guarantees that prevent OOM-killer rampages in this situation. Fix this by pinning the inode backing buffer to the inode log item when the inode is first dirtied (i.e. in xfs_trans_log_inode()). This may mean the first modification of an inode that has been held in cache for a long time may block on a cluster buffer read, but we can do that in transaction context and block safely until the buffer has been allocated and read. Once we have the cluster buffer, the inode log item takes a reference to it, pinning it in memory, and attaches it to the log item for future reference. This means we can always grab the cluster buffer from the inode log item when we need it. When the inode is finally cleaned and removed from the AIL, we can drop the reference the inode log item holds on the cluster buffer. Once all inodes on the cluster buffer are clean, the cluster buffer will be unpinned and it will be available for memory reclaim to reclaim again. This avoids the issues with needing to do RMW cycles in the AIL pushing context, and hence allows complete non-blocking inode flushing to be performed by the AIL pushing context. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-07-06  xfs: add an inode item lock  (Dave Chinner)  1 file, -26/+26
The inode log item is kind of special in that it can be aggregating new changes in memory at the same time existing changes are being written back to disk. This means there are fields in the log item that are accessed concurrently from contexts that don't share any locking at all. e.g. ili_last_fields is updated at flush time under the ILOCK_EXCL and the flush lock, cleared under the flush lock at IO completion time, and read under the ILOCK_EXCL when the inode is logged. Hence there is no actual serialisation between reading the field during logging of the inode in transactions vs clearing the field in IO completion. We currently get away with this by the fact that we are only clearing fields in IO completion, and nothing bad happens if we accidentally log more of the inode than we actually modify. Worst case is we consume a tiny bit more memory and log bandwidth. However, if we want to do more complex state manipulations on the log item that require updates at all three of these potential locations, we need to have some mechanism of serialising those operations. To do this, introduce a spinlock into the log item to serialise internal state. This could be done via the xfs_inode i_flags_lock, but this then leads to potential lock inversion issues where inode flag updates need to occur inside locks that best nest inside the inode log item locks (e.g. marking inodes stale during inode cluster freeing). Using a separate spinlock avoids these sorts of problems and simplifies future code. This does not touch the use of ili_fields in the item formatting code - that is entirely protected by the ILOCK_EXCL at this point in time, so it remains untouched. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-07-06  xfs: Don't allow logging of XFS_ISTALE inodes  (Dave Chinner)  1 file, -0/+2
In tracking down a problem in this patchset, I discovered we are reclaiming dirty stale inodes. This wasn't discovered until inodes were always attached to the cluster buffer and then the rcu callback that freed inodes was assert failing because the inode still had an active pointer to the cluster buffer after it had been reclaimed. Debugging the issue indicated that this was a pre-existing issue resulting from the way the inodes are handled in xfs_inactive_ifree. When we free a cluster buffer from xfs_ifree_cluster, all the inodes in cache are marked XFS_ISTALE. Those that are clean have nothing else done to them and so eventually get cleaned up by background reclaim. i.e. it is assumed we'll never dirty/relog an inode marked XFS_ISTALE. On journal commit, dirty stale inodes are handled by both the buffer and inode log items: they run through xfs_istale_done() and are removed from the AIL (buffer log item commit), or the inode log item simply unpins the inode because the buffer log item will clean it. What happens to any specific inode is entirely dependent on which log item wins the commit race, but the result is the same - stale inodes are clean, not attached to the cluster buffer, and not in the AIL. Hence inode reclaim can just free these inodes without further care. However, if the stale inode is relogged, it gets dirtied again and relogged into the CIL. Most of the time this isn't an issue, because relogging simply changes the inode's location in the current checkpoint. Problems arise, however, when the CIL checkpoints between two transactions in the xfs_inactive_ifree() deferops processing. This results in the XFS_ISTALE inode being redirtied and inserted into the CIL without any of the other stale cluster buffer infrastructure being in place. Hence on journal commit, it simply gets unpinned, so it remains dirty in memory. Everything in inode writeback avoids XFS_ISTALE inodes so it can't be written back, and it is not tracked in the AIL so there's not even a trigger to attempt to clean the inode. Hence the inode just sits dirty in memory until inode reclaim comes along, sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming of a dirty inode caused use-after-free bugs, list corruption and other nasty issues later in this patchset. Hence this patch addresses a violation of the "never log XFS_ISTALE inodes" rule caused by the deferops processing rolling a transaction and relogging a stale inode in xfs_inactive_ifree(). It also adds a bunch of asserts to catch this problem in debug kernels so that we don't reintroduce this problem in the future. Reproducer for this issue was generic/558 on a v4 filesystem. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-07-06  xfs: redesign the reflink remap loop to fix blkres depletion crash  (Darrick J. Wong)  1 file, -4/+9
The existing reflink remapping loop has some structural problems that need addressing: The biggest problem is that we create one transaction for each extent in the source file without accounting for the number of mappings there are for the same range in the destination file. In other words, we don't know the number of remap operations that will be necessary and we therefore cannot guess the block reservation required. On highly fragmented filesystems (e.g. ones with active dedupe) we guess wrong, run out of block reservation, and fail. The second problem is that we don't actually use the bmap intents to their full potential -- instead of calling bunmapi directly and having to deal with its backwards operation, we could call the deferred ops xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead. This makes the frontend loop much simpler. Solve all of these problems by refactoring the remapping loops so that we only perform one remapping operation per transaction, and each operation only tries to remap a single extent from source to dest. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reported-by: Edwin Török <[email protected]> Tested-by: Edwin Török <[email protected]>
2020-07-06  xfs: rename xfs_bmap_is_real_extent to is_written_extent  (Darrick J. Wong)  2 files, -2/+2
The name of this predicate is a little misleading -- it decides if the extent mapping is allocated and written. Change the name to be more direct, as we're going to add a new predicate in the next patch. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2020-07-06  xfs: preserve rmapbt swapext block reservation from freed blocks  (Brian Foster)  1 file, -0/+1
The rmapbt extent swap algorithm remaps individual extents between the source inode and the target to trigger reverse mapping metadata updates. If either inode straddles a format or other bmap allocation boundary, the individual unmap and map cycles can trigger repeated bmap block allocations and frees as the extent count bounces back and forth across the boundary. While net block usage is bound across the swap operation, this behavior can prematurely exhaust the transaction block reservation because it continuously drains as the transaction rolls. Each allocation accounts against the reservation and each free returns to global free space on transaction roll. The previous workaround to this problem attempted to detect this boundary condition and provide surplus block reservation to accommodate it. This is insufficient because more remaps can occur than implied by the extent counts; if start offset boundaries are not aligned between the two inodes, for example. To address this problem more generically and dynamically, add a transaction accounting mode that returns freed blocks to the transaction reservation instead of the superblock counters on transaction roll and use it when the rmapbt based algorithm is active. This allows the chain of remap transactions to preserve the block reservation based on its own frees and prevent premature exhaustion regardless of the remap pattern. Note that this is only safe for superblocks with lazy sb accounting, but the latter is required for v5 supers and the rmap feature depends on v5. Fixes: b3fed434822d0 ("xfs: account format bouncing into rmapbt swapext tx reservation") Root-caused-by: Darrick J. Wong <[email protected]> Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-07-06  xfs: Couple of typo fixes in comments  (Keyur Patel)  1 file, -3/+3
./xfs/libxfs/xfs_inode_buf.c:56: unnecssary ==> unnecessary ./xfs/libxfs/xfs_inode_buf.c:59: behavour ==> behaviour ./xfs/libxfs/xfs_inode_buf.c:206: unitialized ==> uninitialized Signed-off-by: Keyur Patel <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-06-15  writeback: Drop I_DIRTY_TIME_EXPIRE  (Jan Kara)  1 file, -2/+2
The only use of I_DIRTY_TIME_EXPIRE is to detect in __writeback_single_inode() that the inode got there because the flush worker decided it's time to write back the dirty inode timestamps (either because we are syncing or because of age). However we can detect this directly in __writeback_single_inode() and there's no need for the strange propagation with the I_DIRTY_TIME_EXPIRE flag. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jan Kara <[email protected]>
2020-05-27  xfs: more lockdep whackamole with kmem_alloc*  (Darrick J. Wong)  1 file, -1/+1
Dave Airlie reported the following lockdep complaint:

> ======================================================
> WARNING: possible circular locking dependency detected
> 5.7.0-0.rc5.20200515git1ae7efb38854.1.fc33.x86_64 #1 Not tainted
> ------------------------------------------------------
> kswapd0/159 is trying to acquire lock:
> ffff9b38d01a4470 (&xfs_nondir_ilock_class){++++}-{3:3},
>   at: xfs_ilock+0xde/0x2c0 [xfs]
>
> but task is already holding lock:
> ffffffffbbb8bd00 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (fs_reclaim){+.+.}-{0:0}:
>        fs_reclaim_acquire+0x34/0x40
>        __kmalloc+0x4f/0x270
>        kmem_alloc+0x93/0x1d0 [xfs]
>        kmem_alloc_large+0x4c/0x130 [xfs]
>        xfs_attr_copy_value+0x74/0xa0 [xfs]
>        xfs_attr_get+0x9d/0xc0 [xfs]
>        xfs_get_acl+0xb6/0x200 [xfs]
>        get_acl+0x81/0x160
>        posix_acl_xattr_get+0x3f/0xd0
>        vfs_getxattr+0x148/0x170
>        getxattr+0xa7/0x240
>        path_getxattr+0x52/0x80
>        do_syscall_64+0x5c/0xa0
>        entry_SYSCALL_64_after_hwframe+0x49/0xb3
>
> -> #0 (&xfs_nondir_ilock_class){++++}-{3:3}:
>        __lock_acquire+0x1257/0x20d0
>        lock_acquire+0xb0/0x310
>        down_write_nested+0x49/0x120
>        xfs_ilock+0xde/0x2c0 [xfs]
>        xfs_reclaim_inode+0x3f/0x400 [xfs]
>        xfs_reclaim_inodes_ag+0x20b/0x410 [xfs]
>        xfs_reclaim_inodes_nr+0x31/0x40 [xfs]
>        super_cache_scan+0x190/0x1e0
>        do_shrink_slab+0x184/0x420
>        shrink_slab+0x182/0x290
>        shrink_node+0x174/0x680
>        balance_pgdat+0x2d0/0x5f0
>        kswapd+0x21f/0x510
>        kthread+0x131/0x150
>        ret_from_fork+0x3a/0x50
>
> other info that might help us debug this:
>
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(fs_reclaim);
>                                lock(&xfs_nondir_ilock_class);
>                                lock(fs_reclaim);
>   lock(&xfs_nondir_ilock_class);
>
>  *** DEADLOCK ***
>
> 4 locks held by kswapd0/159:
>  #0: ffffffffbbb8bd00 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
>  #1: ffffffffbbb7cef8 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
>  #2: ffff9b39f07a50e8 (&type->s_umount_key#56){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
>  #3: ffff9b39f077f258 (&pag->pag_ici_reclaim_lock){+.+.}-{3:3}, at: xfs_reclaim_inodes_ag+0x82/0x410 [xfs]

This is a known false positive because inodes cannot simultaneously be getting reclaimed and the target of a getxattr operation, but lockdep doesn't know that. We can (selectively) shut up lockdep until either it gets smarter or we change inode reclaim not to require the ILOCK by applying a stupid GFP_NOLOCKDEP bandaid. Reported-by: Dave Airlie <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]> Tested-by: Dave Airlie <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2020-05-27  xfs: force writes to delalloc regions to unwritten  (Darrick J. Wong)  1 file, -12/+17
When writing to a delalloc region in the data fork, commit the new allocations (of the da reservation) as unwritten so that the mappings are only marked written once writeback completes successfully. This fixes the problem of stale data exposure if the system goes down during targeted writeback of a specific region of a file, as tested by generic/042. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2020-05-27  xfs: always return -ENOSPC on project quota reservation failure  (Eric Sandeen)  1 file, -1/+0
XFS project quota treats project hierarchies as "mini filesystems" and so rather than -EDQUOT, the intent is to return -ENOSPC when a quota reservation fails, but this behavior is not consistent. The only place we make a decision between -EDQUOT and -ENOSPC returns based on quota type is in xfs_trans_dqresv(). This behavior is currently controlled by whether or not the XFS_QMOPT_ENOSPC flag gets passed into the quota reservation. However, its use is not consistent; paths such as xfs_create() and xfs_symlink() don't set the flag, so a reservation failure will return -EDQUOT for project quota reservation failures rather than -ENOSPC for these sorts of operations, even for project quota:

  # mkdir mnt/project
  # xfs_quota -x -c "project -s -p mnt/project 42" mnt
  # xfs_quota -x -c 'limit -p isoft=2 ihard=3 42' mnt
  # touch mnt/project/file{1,2,3}
  touch: cannot touch ‘mnt/project/file3’: Disk quota exceeded

We can make this consistent by not requiring the flag to be set at the top of the callchain; instead we can simply test whether we are reserving a project quota with XFS_QM_ISPDQ in xfs_trans_dqresv and if so, return -ENOSPC for that failure. This removes the need for XFS_QMOPT_ENOSPC altogether and simplifies the code a fair bit. Signed-off-by: Eric Sandeen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
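A minimal sketch of the decision point this describes (the stub structure and the helper name are illustrative; only the XFS_QM_ISPDQ-style test and the errno choice come from the commit message):

  #include <errno.h>
  #include <stdbool.h>

  /* Stand-in for the dquot; only the "is this a project quota" predicate matters. */
  struct dquot_stub { bool is_project; };
  #define QM_ISPDQ(dqp)  ((dqp)->is_project)

  /* Map a failed reservation to the right errno based on quota type, instead of
   * requiring callers to pass an XFS_QMOPT_ENOSPC-style flag down the callchain. */
  static int dqresv_failure_errno(const struct dquot_stub *dqp)
  {
          if (QM_ISPDQ(dqp))
                  return -ENOSPC;  /* project quota behaves like a full mini filesystem */
          return -EDQUOT;          /* user/group quota exceeded */
  }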
2020-05-19  xfs: cleanup xfs_idestroy_fork  (Christoph Hellwig)  4 files, -29/+14
Move freeing the dynamically allocated attr and COW fork, as well as zeroing the pointers where actually needed into the callers, and just pass the xfs_ifork structure to xfs_idestroy_fork. Also simplify the kmem_free calls by not checking for NULL first. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Chandan Babu R <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-05-19  xfs: move the fork format fields into struct xfs_ifork  (Christoph Hellwig)  11 files, -141/+115
Both the data and attr fork have a format that is stored in the legacy icdinode. Move it into the xfs_ifork structure instead, where it uses up padding. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Chandan Babu R <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-05-19  xfs: move the per-fork nextents fields into struct xfs_ifork  (Christoph Hellwig)  8 files, -98/+85
There are three extent counters per inode, one for each of the forks. Two are in the legacy icdinode and one is directly in struct xfs_inode. Switch to a single counter in the xfs_ifork structure where it uses up padding at the end of the structure. This simplifies various bits of code that just want the extent count, as they can now dereference it directly. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Chandan Babu R <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
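The "uses up padding" point is purely a structure-layout effect; a small self-contained illustration (the field names are invented, not the real xfs_ifork layout):

  #include <stdint.h>
  #include <stdio.h>

  struct ifork_before {
          int64_t  bytes;
          void    *root;
          int      height;       /* 4 bytes of tail padding follow on LP64 */
  };

  struct ifork_after {
          int64_t  bytes;
          void    *root;
          int      height;
          uint32_t nextents;     /* slots into the former padding */
  };

  int main(void)
  {
          /* both print 24 on typical LP64 ABIs: the new counter is "free" */
          printf("%zu %zu\n", sizeof(struct ifork_before), sizeof(struct ifork_after));
          return 0;
  }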
2020-05-19  xfs: remove the XFS_DFORK_Q macro  (Christoph Hellwig)  2 files, -6/+5
Just checking di_forkoff directly is a little easier to follow. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Chandan Babu R <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-05-19  xfs: remove the NULL fork handling in xfs_bmapi_read  (Christoph Hellwig)  1 file, -17/+5
Now that we fully verify the inode forks before they are added to the inode cache, the crash reported in https://bugzilla.kernel.org/show_bug.cgi?id=204031 can't happen anymore, as we'll never let an inode that has inconsistent nextents counts vs the presence of an in-core attr fork leak into the inactivate code path. So remove the work around to try to handle the case, and just return an error and warn if the fork is not present. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-05-19  xfs: remove the special COW fork handling in xfs_bmapi_read  (Christoph Hellwig)  1 file, -12/+1
We don't call xfs_bmapi_read for the COW fork anymore, so remove the special casing. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-05-19  xfs: improve local fork verification  (Christoph Hellwig)  1 file, -1/+7
Call the data/attr local fork verifiers as soon as we are ready for them. This keeps them close to the code setting up the forks, and avoids a few branches later on. Also open code xfs_inode_verify_forks in the only remaining caller. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-05-19  xfs: refactor xfs_inode_verify_forks  (Christoph Hellwig)  2 files, -19/+36
The split between xfs_inode_verify_forks and the two helpers implementing the actual functionality is a little strange. Reshuffle it so that xfs_inode_verify_forks verifies if the data and attr forks are actually in local format and only call the low-level helpers if that is the case. Handle the actual error reporting in the low-level handlers to streamline the caller. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-05-19  xfs: remove xfs_ifork_ops  (Christoph Hellwig)  2 files, -27/+7
xfs_ifork_ops adds up to two indirect calls per inode read and flush, despite just having a single instance in the kernel. In xfsprogs, phase6 of xfs_repair overrides the verify_dir method to deal with inodes that do not have a valid parent, but that can be fixed pretty easily by ensuring they always have a valid looking parent. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-05-19  xfs: remove xfs_iread  (Christoph Hellwig)  2 files, -75/+0
There is not much point in the xfs_iread function, as it has a single caller and not a whole lot of code. Move it into the only caller, and trim down the overdocumentation to just documenting the important "why" instead of a lot of redundant "what". Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-05-19  xfs: don't reset i_delayed_blks in xfs_iread  (Christoph Hellwig)  1 file, -2/+0
i_delayed_blks is set to 0 in xfs_inode_alloc and can't have anything assigned to it until the inode is visible to the VFS. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-05-19  xfs: call xfs_dinode_verify from xfs_inode_from_disk  (Christoph Hellwig)  1 file, -10/+8
Keep the code dealing with the dinode together, and also ensure we verify the dinode in the owner change log recovery case as well. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-05-19  xfs: handle unallocated inodes in xfs_inode_from_disk  (Christoph Hellwig)  1 file, -36/+14
Handle inodes with a 0 di_mode in xfs_inode_from_disk, instead of partially duplicating inode reading in xfs_iread. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-05-19  xfs: split xfs_iformat_fork  (Christoph Hellwig)  3 files, -106/+103
xfs_iformat_fork is a weird catchall. Split it into one helper for the data fork and one for the attr fork, and then call both helpers as well as the COW fork initialization from xfs_inode_from_disk. Order the COW fork initialization after the attr fork initialization, given that it can't fail, to simplify the error handling. Note that the newly split helpers are moved down the file in xfs_inode_fork.c to avoid the need for forward declarations. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-05-19  xfs: call xfs_iformat_fork from xfs_inode_from_disk  (Christoph Hellwig)  2 files, -4/+5
We always need to fill out the fork structures when reading the inode, so call xfs_iformat_fork from the tail of xfs_inode_from_disk. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-05-19  xfs: xfs_bmapi_read doesn't take a fork id as the last argument  (Christoph Hellwig)  1 file, -1/+1
The last argument to xfs_bmapi_read contains XFS_BMAPI_* flags, not the fork. Given that XFS_DATA_FORK evaluates to 0 no real harm is done, but let's fix this anyway. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-05-19  xfs: fix the warning message in xfs_validate_sb_common()  (Kaixu Xia)  1 file, -1/+1
Fix this error message to complain about project and group quota flag bits instead of "PUOTA" and "QUOTA". Signed-off-by: Kaixu Xia <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-05-19  xfs: use ordered buffers to initialize dquot buffers during quotacheck  (Darrick J. Wong)  1 file, -1/+9
While QAing the new xfs_repair quotacheck code, I uncovered a quota corruption bug resulting from a bad interaction between dquot buffer initialization and quotacheck. The bug can be reproduced with the following sequence:

  # mkfs.xfs -f /dev/sdf
  # mount /dev/sdf /opt -o usrquota
  # su nobody -s /bin/bash -c 'touch /opt/barf'
  # sync
  # xfs_quota -x -c 'report -ahi' /opt
  User quota on /opt (/dev/sdf)
                          Inodes
  User ID      Used   Soft   Hard Warn/Grace
  ---------- ---------------------------------
  root            3      0      0  00 [------]
  nobody          1      0      0  00 [------]

  # xfs_io -x -c 'shutdown' /opt
  # umount /opt
  # mount /dev/sdf /opt -o usrquota
  # touch /opt/man2
  # xfs_quota -x -c 'report -ahi' /opt
  User quota on /opt (/dev/sdf)
                          Inodes
  User ID      Used   Soft   Hard Warn/Grace
  ---------- ---------------------------------
  root            1      0      0  00 [------]
  nobody          1      0      0  00 [------]

  # umount /opt

Notice how the initial quotacheck set the root dquot icount to 3 (rootino, rbmino, rsumino), but after shutdown -> remount -> recovery, xfs_quota reports that the root dquot has only 1 icount. We haven't deleted anything from the filesystem, which means that quota is now under-counting. This behavior is not limited to icount or the root dquot, but this is the shortest reproducer. I traced the cause of this discrepancy to the way that we handle ondisk dquot updates during quotacheck vs. regular fs activity. Normally, when we allocate a disk block for a dquot, we log the buffer as a regular (dquot) buffer. Subsequent updates to the dquots backed by that block are done via separate dquot log item updates, which means that they depend on the logged buffer update being written to disk before the dquot items. Because individual dquots have their own LSN fields, that initial dquot buffer must always be recovered. However, the story changes for quotacheck, which can cause dquot block allocations but persists the final dquot counter values via a delwri list. Because recovery doesn't gate dquot buffer replay on an LSN, this means that the initial dquot buffer can be replayed over the (newer) contents that were delwritten at the end of quotacheck. In effect, this re-initializes the dquot counters after they've been updated. If the log does not contain any other dquot items to recover, the obsolete dquot contents will not be corrected by log recovery. Because quotacheck uses a transaction to log the setting of the CHKD flags in the superblock, we skip quotacheck during the second mount call, which allows the incorrect icount to remain. Fix this by changing the ondisk dquot initialization function to use ordered buffers to write out fresh dquot blocks if it detects that we're running quotacheck. If the system goes down before quotacheck can complete, the CHKD flags will not be set in the superblock and the next mount will run quotacheck again, which can fix uninitialized dquot buffers. This requires amending the defer code to maintain ordered buffer state across defer rolls for the sake of the dquot allocation code. For regular operations we preserve the current behavior since the dquot items require properly initialized ondisk dquot records. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2020-05-19  xfs: don't fail verifier on empty attr3 leaf block  (Brian Foster)  1 file, -8/+7
The attr fork can transition from shortform to leaf format while empty if the first xattr doesn't fit in shortform. While this empty leaf block state is intended to be transient, it is technically not due to the transactional implementation of the xattr set operation. We historically have a couple of bandaids to work around this problem. The first is to hold the buffer after the format conversion to prevent premature writeback of the empty leaf buffer and the second is to bypass the xattr count check in the verifier during recovery. The latter assumes that the xattr set is also in the log and will be recovered into the buffer soon after the empty leaf buffer is reconstructed. This is not guaranteed, however. If the filesystem crashes after the format conversion but before the xattr set that induced it, only the format conversion may exist in the log. When recovered, this creates a latent corrupted state on the inode as any subsequent attempts to read the buffer fail due to verifier failure. This includes further attempts to set xattrs on the inode or attempts to destroy the attr fork, which prevents the inode from ever being removed from the unlinked list. To avoid this condition, accept that an empty attr leaf block is a valid state and remove the count check from the verifier. This means that on rare occasions an attr fork might exist in an unexpected state, but is otherwise consistent and functional. Note that we retain the logic to avoid racing with metadata writeback to reduce the window where this can occur. Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2020-05-13  xfs: Use the correct style for SPDX License Identifier  (Nishad Kamdar)  20 files, -20/+20
This patch corrects the SPDX License Identifier style in header files related to XFS File System support. For C header files Documentation/process/license-rules.rst mandates C-like comments (as opposed to C source files, where C++ style should be used). Changes made by using a script provided by Joe Perches here: https://lkml.org/lkml/2019/2/7/46. Suggested-by: Joe Perches <[email protected]> Signed-off-by: Nishad Kamdar <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-05-13  xfs: Replace zero-length array with flexible-array  (Gustavo A. R. Silva)  1 file, -1/+1
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
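For reference, the conversion pattern applied here looks like the following before/after pair (illustrative structure names, not the actual XFS structure touched by this patch):

  /* Before: GNU zero-length array extension. */
  struct record_old {
          int   count;
          char  name[0];  /* sizeof evaluates to 0; misuse is silent */
  };

  /* After: C99 flexible array member. */
  struct record_new {
          int   count;
          char  name[];   /* must be the last member; sizeof is a constraint violation */
  };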
2020-05-08  xfs: move log recovery buffer cancellation code to xfs_buf_item_recover.c  (Darrick J. Wong)  1 file, -2/+0
Move the helpers that handle incore buffer cancellation records to xfs_buf_item_recover.c since they're not directly related to the main log recovery machinery. No functional changes. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Chandan Babu R <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2020-05-08  xfs: refactor releasing finished intents during log recovery  (Darrick J. Wong)  1 file, -0/+3
Replace the open-coded AIL item walking with a proper helper when we're trying to release an intent item that has been finished. We add a new ->iop_match method to decide if an intent item matches a supplied ID. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Chandan Babu R <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2020-05-08  xfs: refactor log recovery buffer item dispatch for pass2 commit functions  (Darrick J. Wong)  1 file, -0/+23
Move the log buffer item pass2 commit code into the per-item source code files and use the dispatch function to call it. We do these one at a time because there's a lot of code to move. No functional changes. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Chandan Babu R <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2020-05-08  xfs: refactor log recovery item dispatch for pass1 commit functions  (Darrick J. Wong)  1 file, -0/+4
Move the pass1 commit code into the per-item source code files and use the dispatch function to call them. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Chandan Babu R <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2020-05-08  xfs: refactor log recovery item dispatch for pass2 readhead functions  (Darrick J. Wong)  1 file, -0/+6
Move the pass2 readhead code into the per-item source code files and use the dispatch function to call them. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Chandan Babu R <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2020-05-08  xfs: refactor log recovery item sorting into a generic dispatch structure  (Darrick J. Wong)  1 file, -2/+43
Create a generic dispatch structure to delegate recovery of different log item types into various code modules. This will enable us to move code specific to a particular log item type out of xfs_log_recover.c and into the log item source. The first operation we virtualize is the log item sorting. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Chandan Babu R <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2020-05-08  xfs: convert xfs_log_recover_item_t to struct xfs_log_recover_item  (Darrick J. Wong)  1 file, -2/+2
Remove the old typedefs. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Chandan Babu R <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>