path: root/fs/xfs
Age | Commit message | Author | Files | Lines
2019-06-28xfs: remove the never used _XBF_COMPOUND flagChristoph Hellwig1-3/+1
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2019-06-28xfs: remove the no-op spinlock_destroy stubChristoph Hellwig2-4/+0
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2019-06-28xfs: move xfs_ino_geometry to xfs_shared.hDarrick J. Wong31-41/+71
The inode geometry structure isn't related to ondisk format; it's support for the mount structure. Move it to xfs_shared.h. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2019-06-17block: return from __bio_try_merge_page if merging occured in the same pageChristoph Hellwig1-3/+8
We currently have an input same_page parameter to __bio_try_merge_page to prohibit merging in the same page. The rationale for that is that some callers need to account for every page added to a bio. Instead of letting these callers call twice into the merge code to account for the new vs existing page cases, just turn the parameter into an output one that returns if a merge in the same page occurred and let them act accordingly. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
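The output-parameter pattern this message describes can be sketched roughly as below. This is only an illustration under stated assumptions: `struct seg_sketch`, `seg_try_merge`, and `SKETCH_PAGE_SIZE` are hypothetical names, pages are modeled as plain indices, and none of this is the real `__bio_try_merge_page` signature or logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the illustration */

/* Hypothetical stand-in for the last segment of a bio; len must be > 0. */
struct seg_sketch {
	uint64_t page;	/* index of the page where the segment starts */
	uint32_t off;	/* byte offset within that page */
	uint32_t len;	/* segment length in bytes */
};

/*
 * Try to append [page, off, len) to the existing segment. Returns true if
 * the data was merged. *same_page is an output: it is set to true only when
 * the merged data starts in the page that already holds the segment's last
 * byte, so the caller can tell a "new page" merge from a "same page" merge
 * with a single call.
 */
static bool seg_try_merge(struct seg_sketch *last, uint64_t page,
			  uint32_t off, uint32_t len, bool *same_page)
{
	uint64_t last_end = last->page * SKETCH_PAGE_SIZE + last->off + last->len;
	uint64_t new_start = page * SKETCH_PAGE_SIZE + off;

	assert(same_page != NULL);
	*same_page = false;
	if (new_start != last_end)
		return false;	/* not contiguous, nothing merged */

	*same_page = ((last_end - 1) / SKETCH_PAGE_SIZE == page);
	last->len += len;
	return true;
}
```

A caller that accounts per page would bump its page count only when the call returns true with `*same_page` left false, or when it has to add a fresh segment.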
2019-06-12xfs: remove unused flag argumentsEric Sandeen10-48/+35
There are several functions which take a flag argument that is only ever passed as "0," so remove these arguments. Signed-off-by: Eric Sandeen <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Bill O'Donnell <[email protected]> Reviewed-by: Allison Collins <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2019-06-12xfs: remove the debug-only q_transp field from struct xfs_dquotChristoph Hellwig3-16/+0
The field is only used for a few assertions. Shrink the dquot structure instead, similarly to what commit f3ca87389dbf ("xfs: remove i_transp") did for the xfs_inode. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2019-06-12xfs: merge xfs_buf_zero and xfs_buf_iomoveChristoph Hellwig2-30/+6
xfs_buf_zero is the only caller of xfs_buf_iomove. Remove support for copying from or to the buffer in xfs_buf_iomove and merge the two functions. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2019-06-12xfs: remove unused flags arg from getsb interfacesEric Sandeen7-22/+11
The flags value is always passed as 0 so remove the argument. Signed-off-by: Eric Sandeen <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2019-06-12xfs: include WARN, REPAIR build options in XFS_BUILD_OPTIONSEric Sandeen1-0/+14
The XFS_BUILD_OPTIONS string, shown at module init time and in modinfo output, does not currently include all available build options. So, add in CONFIG_XFS_WARN and CONFIG_XFS_REPAIR. It has been suggested in some quarters that this is not enough. Well ... Anybody who would like to see this in a sysfs file can send a patch. :) Signed-off-by: Eric Sandeen <[email protected]> Reviewed-by: Bill O'Donnell <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2019-06-12xfs: finish converting to inodes_per_clusterDarrick J. Wong2-9/+4
Finish converting all the old inode_cluster_size >> inopblog users to inodes_per_cluster. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2019-06-12xfs: fix inode_cluster_size rounding mayhemDarrick J. Wong5-15/+28
inode_cluster_size is supposed to represent the size (in bytes) of an inode cluster buffer. We avoid having to handle multiple clusters per filesystem block on filesystems with large blocks by openly rounding this value up to 1 FSB when necessary. However, we never reset inode_cluster_size to reflect this new rounded value, which adds to the potential for mistakes in calculating geometries. Fix this by setting inode_cluster_size to reflect the rounded-up size if needed, and special-case the few places in the sparse inodes code where we actually need the smaller value to validate on-disk metadata. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
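The rounding fix described above might look something like the following minimal sketch. The structure and function names (`struct geom_sketch`, `geom_sketch_set_cluster_size`) and the extra "raw" field are assumptions made for illustration; they are not the actual xfs_ino_geometry layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the inode geometry; not the real structure. */
struct geom_sketch {
	uint32_t block_size;		/* filesystem block size, bytes */
	uint32_t inode_cluster_size;	/* cluster buffer size after rounding */
	uint32_t inode_cluster_size_raw;/* unrounded size, for on-disk checks */
};

/*
 * Record the cluster size so that the main field always reflects the
 * rounded-up value (at least one filesystem block), while the raw value
 * stays available for the few places that validate on-disk metadata.
 */
static void geom_sketch_set_cluster_size(struct geom_sketch *g,
					 uint32_t block_size,
					 uint32_t raw_cluster_size)
{
	assert(block_size > 0 && raw_cluster_size > 0);
	g->block_size = block_size;
	g->inode_cluster_size_raw = raw_cluster_size;
	g->inode_cluster_size = raw_cluster_size < block_size ?
				block_size : raw_cluster_size;
}
```

With a 64k block size and an 8k raw cluster size, the rounded field becomes 64k while the raw field stays at 8k; with 4k blocks the two fields agree.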
2019-06-12xfs: refactor inode geometry setup routinesDarrick J. Wong4-144/+101
Migrate all of the inode geometry setup code from xfs_mount.c into a single libxfs function that we can share with xfsprogs. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2019-06-12xfs: separate inode geometryDarrick J. Wong18-161/+208
Separate the inode geometry information into a distinct structure. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2019-06-09xfs: use file_modified() helperAmir Goldstein1-14/+1
Note that by using the helper, the order of calling file_remove_privs() after file_update_mtime() in xfs_file_aio_write_checks() has changed. Signed-off-by: Amir Goldstein <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2019-06-06Merge tag 'xfs-5.2-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2-3/+11
Pull xfs fixes from Darrick Wong: "Here are a couple more bug fixes for 5.2. Changes since last update: - Fix some forgotten strings in a log debugging function - Fix incorrect unit conversion in online fsck code" * tag 'xfs-5.2-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: inode btree scrubber should calculate im_boffset correctly xfs: fix broken log reservation debugging
2019-06-03xfs: inode btree scrubber should calculate im_boffset correctlyDarrick J. Wong1-1/+2
The im_boffset field is in units of bytes, whereas XFS_INO_OFFSET returns a value in units of inodes. Convert the units so that scrub on a 64k-block filesystem works correctly. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]>
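The unit mismatch is easy to see in a small sketch. The helper below is hypothetical (it is not the scrub code or the XFS_INO_OFFSET macro); it only assumes that inode sizes are powers of two, so that an inode index converts to a byte offset with a shift by log2 of the inode size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert an inode's index within its block (units of inodes) into a byte
 * offset within that block. inode_log is log2 of the inode size in bytes,
 * e.g. 9 for 512-byte inodes. Hypothetical helper for illustration only.
 */
static uint32_t inode_index_to_byte_offset(uint32_t index_in_block,
					   unsigned int inode_log)
{
	assert(inode_log < 32);
	return index_in_block << inode_log;
}
```

For example, the fourth inode (index 3) in a block of 512-byte inodes starts 1536 bytes into the block, not 3.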
2019-05-24xfs: fix broken log reservation debuggingDarrick J. Wong1-2/+9
xlog_print_tic_res() is supposed to print a human readable string for each element of the log ticket reservation array. Unfortunately, I forgot to update the string array when we added rmap & reflink support, so the debug message prints "region[3]: (null) - 352 bytes" which isn't useful at all. Add the missing elements and add a build check so that we don't forget again to add a string when adding a new XLOG_REG_TYPE. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]>
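A build-time check of this kind can be sketched with a C11 static assertion. The enum and string table below are made up for illustration (they are not the real XLOG_REG_TYPE values or the kernel's build-check macro); the point is only that adding an enum value without a matching string fails to compile.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical region types; the final entry counts the real ones. */
enum region_type_sketch {
	REG_SKETCH_HEADER,
	REG_SKETCH_INODE,
	REG_SKETCH_RMAP,
	REG_SKETCH_REFLINK,
	REG_SKETCH_COUNT
};

/* One human-readable name per region type, in enum order. */
static const char *const region_names[] = {
	"header",
	"inode",
	"rmap",
	"reflink",
};

/* Compilation fails if a type is added without a matching string. */
static_assert(sizeof(region_names) / sizeof(region_names[0]) == REG_SKETCH_COUNT,
	      "region_names must have one entry per region type");

/* Look up a name, falling back to a fixed string for out-of-range input. */
static const char *region_name(unsigned int type)
{
	return type < REG_SKETCH_COUNT ? region_names[type] : "unknown";
}
```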
2019-05-23Merge tag 'xfs-5.2-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds3-0/+27
Pull xfs fix from Darrick Wong: "Fix an accounting mistake where we included the log space when calculating the reserve space for metadata expansion" * tag 'xfs-5.2-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: don't reserve per-AG space for an internal log
2019-05-21treewide: Add SPDX license identifier - Makefile/KconfigThomas Gleixner1-0/+1
Add SPDX license identifiers to all Make/Kconfig files which: - Have no license information of any form These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2019-05-20xfs: don't reserve per-AG space for an internal logDarrick J. Wong3-0/+27
It turns out that the log can consume nearly all the space in an AG, and when this happens it's possible that there will be less free space in the AG than the reservation would try to hide. On a debug kernel this can trigger an ASSERT in xfs/250: XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 319 The log is permanently allocated, so we know we're never going to have to expand the btrees to hold any records associated with the log space. We therefore can treat the space as if it doesn't exist. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Eric Sandeen <[email protected]>
2019-05-18treewide: prefix header search paths with $(srctree)/Masahiro Yamada1-2/+2
Currently, the Kbuild core manipulates header search paths in a crazy way [1]. To fix this mess, I want all Makefiles to add explicit $(srctree)/ to the search paths in the srctree. Some Makefiles are already written in that way, but not all. The goal of this work is to make the notation consistent, and finally get rid of the gross hacks. Having whitespaces after -I does not matter since commit 48f6e3cf5bc6 ("kbuild: do not drop -I without parameter"). [1]: https://patchwork.kernel.org/patch/9632347/ Signed-off-by: Masahiro Yamada <[email protected]>
2019-05-07Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-blockLinus Torvalds3-12/+2
Pull block updates from Jens Axboe: "Nothing major in this series, just fixes and improvements all over the map. This contains: - Series of fixes for sed-opal (David, Jonas) - Fixes and performance tweaks for BFQ (via Paolo) - Set of fixes for bcache (via Coly) - Set of fixes for md (via Song) - Enabling multi-page for passthrough requests (Ming) - Queue release fix series (Ming) - Device notification improvements (Martin) - Propagate underlying device rotational status in loop (Holger) - Removal of mtip32xx trim support, which has been disabled for years (Christoph) - Improvement and cleanup of nvme command handling (Christoph) - Add block SPDX tags (Christoph) - Cleanup/hardening of bio/bvec iteration (Christoph) - A few NVMe pull requests (Christoph) - Removal of CONFIG_LBDAF (Christoph) - Various little fixes here and there" * tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits) block: fix mismerge in bvec_advance block: don't drain in-progress dispatch in blk_cleanup_queue() blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release blk-mq: always free hctx after request queue is freed blk-mq: split blk_mq_alloc_and_init_hctx into two parts blk-mq: free hw queue's resource in hctx's release handler blk-mq: move cancel of requeue_work into blk_mq_release blk-mq: grab .q_usage_counter when queuing request from plug code path block: fix function name in comment nvmet: protect discovery change log event list iteration nvme: mark nvme_core_init and nvme_core_exit static nvme: move command size checks to the core nvme-fabrics: check more command sizes nvme-pci: check more command sizes nvme-pci: remove an unneeded variable initialization nvme-pci: unquiesce admin queue on shutdown nvme-pci: shutdown on timeout during deletion nvme-pci: fix psdt field for single segment sgls nvme-multipath: don't print ANA group state by default nvme-multipath: split bios with the ns_head bio_set before submitting ...
2019-05-01xfs: change some error-less functions to void typesEric Sandeen11-58/+26
There are several functions which have no opportunity to return an error, and don't contain any ASSERTs which could be argued to be better constructed as error cases. So, make them voids to simplify the callers. Signed-off-by: Eric Sandeen <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2019-04-30block: remove the i argument to bio_for_each_segment_allChristoph Hellwig1-2/+1
We only have two callers that need the integer loop iterator, and they can easily maintain it themselves. Suggested-by: Matthew Wilcox <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Acked-by: David Sterba <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Acked-by: Coly Li <[email protected]> Reviewed-by: Matthew Wilcox <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-04-30xfs: add online scrub for superblock countersDarrick J. Wong11-3/+461
Teach online scrub how to check the filesystem summary counters. We use the incore delalloc block counter along with the incore AG headers to compute expected values for fdblocks, icount, and ifree, and then check that the percpu counter is within a certain threshold of the expected value. This is done to avoid having to freeze or otherwise lock the filesystem, which means that we're only checking that the counters are fairly close, not that they're exactly correct. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Brian Foster <[email protected]>
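The "fairly close, not exactly correct" comparison could be expressed along these lines. This is a sketch only: the function name and the idea of passing the tolerance as a parameter are assumptions, and the commit message does not say what threshold the scrubber actually uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if an observed counter value is within `threshold` of the
 * value computed from the AG headers. Unsigned arithmetic is used so the
 * absolute difference cannot overflow. Hypothetical helper.
 */
static bool counter_within_threshold(uint64_t expected, uint64_t observed,
				     uint64_t threshold)
{
	uint64_t diff = observed > expected ? observed - expected
					    : expected - observed;

	return diff <= threshold;
}
```

Because the filesystem keeps running during the check, a small difference is treated as acceptable and only a large one would be flagged.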
2019-04-30xfs: don't parse the mtpt mount optionChristoph Hellwig1-5/+1
The text isn't really any more useful than the default unknown option handling. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2019-04-30xfs: always rejoin held resources during defer rollDarrick J. Wong4-37/+31
During testing of xfs/141 on a V4 filesystem, I observed some inconsistent behavior with regards to resources that are held (i.e. remain locked) across a defer roll. The transaction roll always gives the defer roll function a new transaction, even if committing the old transaction fails. However, the defer roll function only rejoins the held resources if the transaction commit succeeded. This means that callers of defer roll have to figure out whether the held resources are attached to the transaction being passed back. Worse yet, if the defer roll was part of a defer finish call, we have a third possibility: the defer finish could pass back a dirty transaction with dirty held resources and an error code. The only sane way to handle all of these scenarios is to require that the code that held the resource either cancel the transaction before unlocking and releasing the resources, or use functions that detach resources from a transaction properly (e.g. xfs_trans_brelse) if they need to drop the reference before committing or cancelling the transaction. In order to make this so, change the defer roll code to join held resources to the new transaction unconditionally and fix all the bhold callers to release the held buffers correctly. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2019-04-26xfs: add missing error check in xfs_prepare_shift()Brian Foster1-0/+2
xfs_prepare_shift() fails to check the error return from xfs_flush_unmap_range(). If the latter fails, that could lead to an insert/collapse range operation over a delalloc range, which is not supported. Add an error check and return appropriately. This is reproduced rarely by generic/475. Fixes: 7f9f71be84bc ("xfs: extent shifting doesn't fully invalidate page cache") Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Allison Collins <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2019-04-26xfs: scrub should check incore counters against ondisk headersDarrick J. Wong1-0/+20
In theory, the incore per-AG structure counters should match the ones on disk, so check that. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2019-04-26xfs: allow scrubbers to pause background reclaimDarrick J. Wong4-0/+23
The forthcoming summary counter patch races with regular filesystem activity to compute rough expected values for the counters. This design was chosen to avoid having to freeze the entire filesystem to check the counters, but while that's running we'd prefer to minimize background reclamation activity to reduce the perturbations to the incore free block count. Therefore, provide a way for scrubbers to disable background posteof and cowblock reclamation. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2019-04-26xfs: rename the speculative block allocation reclaim toggle functionsDarrick J. Wong4-9/+9
"reclaim" is used throughout the icache code to mean reclamation of incore inode structures. It's also used for two helper functions that toggle background deletion of speculative preallocations. Separate the second of the two uses to make things less confusing. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2019-04-26xfs: track delayed allocation reservations across the filesystemDarrick J. Wong4-3/+51
Add a percpu counter to track the number of blocks directly reserved for delayed allocations on the data device. This counter (in contrast to i_delayed_blks) does not track allocated CoW staging extents or anything going on with the realtime device. It will be used in the upcoming summary counter scrub function to check the free block counts without having to freeze the filesystem or walk all the inodes to find the delayed allocations. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2019-04-26xfs: fix broken bhold behavior in xrep_roll_ag_transDarrick J. Wong1-17/+8
In xrep_roll_ag_trans, the transaction roll will always set sc->tp to the new transaction, even if committing the old one fails. A bare transaction roll leaves the buffer(s) locked but not joined to the new transaction, so it's not necessary to release the hold if the roll fails. Remove the incorrect xfs_trans_bhold_release calls. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2019-04-23xfs: unlock inode when xfs_ioctl_setattr_get_trans can't get transactionDarrick J. Wong1-1/+1
We passed an inode into xfs_ioctl_setattr_get_trans with join_flags indicating which locks are held on that inode. If we can't allocate a transaction then we need to unlock the inode before we bail out, like all the other error paths do. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2019-04-23xfs: kill the xfs_dqtrx_t typedefDarrick J. Wong2-16/+16
There's only a few uses left, so just kill the typedef while we're at it. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2019-04-23xfs: widen inode delalloc block counter to 64-bitsDarrick J. Wong2-2/+3
Widen the incore inode's i_delayed_blks counter to be a 64-bit integer. This is necessary to fix an integer overflow problem that can be reproduced easily now that we use the counter to track blocks that are assigned to the inode in memory but not on disk. This includes actual delalloc reservations as well as real extents in the COW fork that are waiting to be remapped into the data fork. These 'delayed mapping' blocks can easily exceed 2^32 blocks if one creates a very large sparse file of size approximately 2^33 bytes with one byte written every 2^23 bytes, sets a very large COW extent size hint of 2^23 blocks, reflinks the first file into a second file, and then writes a single byte every 2^23 blocks in the original file. When this happens, we'll try to create approximately 1024 2^23 extent reservations in the COW fork, which will overflow the counter and cause problems. Note that on x64 we end up filling a 4-byte gap in the structure so this doesn't increase the incore size. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Allison Collins <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
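The arithmetic behind the overflow: roughly 1024 reservations of 2^23 blocks each add up to 2^33 blocks, which does not fit in 32 bits. The sketch below assumes the old counter behaved like an unsigned 32-bit integer; the function names are hypothetical and nothing here is kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Sum `nr` reservations of `each` blocks in a 32-bit counter. */
static uint32_t total_blocks_32(uint32_t nr, uint32_t each)
{
	uint32_t total = 0;

	assert(nr > 0);
	for (uint32_t i = 0; i < nr; i++)
		total += each;	/* unsigned arithmetic wraps modulo 2^32 */
	return total;
}

/* The same sum in a 64-bit counter, which has room for 2^33. */
static uint64_t total_blocks_64(uint32_t nr, uint32_t each)
{
	uint64_t total = 0;

	assert(nr > 0);
	for (uint32_t i = 0; i < nr; i++)
		total += each;
	return total;
}
```

With 1024 reservations of 2^23 blocks, the 32-bit sum wraps around to 0 while the 64-bit sum is 8589934592 (2^33).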
2019-04-23xfs: widen quota block counters to 64-bit integersDarrick J. Wong3-35/+34
Widen the incore quota transaction delta structure to treat block counters as 64-bit integers. This is a necessary addition so that we can widen the i_delayed_blks counter to be a 64-bit integer. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Allison Collins <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2019-04-23xfs: abort unaligned nowait directio earlyDarrick J. Wong1-3/+3
Dave Chinner noticed that xfs_file_dio_aio_write returns EAGAIN without dropping the IOLOCK when it's deciding not to wait, which means that we leak the IOLOCK there. Since we now make unaligned directio always wait, we have the opportunity to bail out before trying to take the lock, which should reduce the overhead of this never-gonna-work case considerably while also solving the dropped lock problem. Reported-by: Dave Chinner <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2019-04-23xfs: assert that we don't enter agfl freeing with a non-permanent transactionBrian Foster1-0/+3
Block allocation requires a permanent transaction for deferred AGFL frees. Add an assert in the block allocation path to make explicit and obvious to future callers the requirement of a transaction with a permanent reservation. Reported-by: Darrick J. Wong <[email protected]> Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> [darrick: split this out from the previous patch per hch request] Signed-off-by: Darrick J. Wong <[email protected]>
2019-04-22xfs: make tr_growdata a permanent transactionBrian Foster1-1/+5
The growdata transaction is used by growfs operations to increase the data size of the filesystem. Part of this sequence involves extending the size of the last preexisting AG in the fs, if necessary. This is implemented by freeing the newly available physical range to the AG. tr_growdata is not a permanent transaction, however, and block allocation transactions must be permanent to handle deferred frees of AGFL blocks. If the grow operation extends an existing AG that requires AGFL fixing, assert failures occur due to a populated dfops list on a non-permanent transaction and the AGFL free does not occur. This is reproduced (rarely) by xfs/104. Change tr_growdata to a permanent transaction with a default log count. This increases initial transaction reservation size, but growfs is an infrequent and non-performance critical operation and so should have minimal impact. Reported-by: Darrick J. Wong <[email protected]> Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> [darrick: add a comment to the assert] Signed-off-by: Darrick J. Wong <[email protected]>
2019-04-16xfs: merge adjacent io completions of the same typeDarrick J. Wong1-0/+86
It's possible for pagecache writeback to split up a large amount of work into smaller pieces for throttling purposes or to reduce the amount of time a writeback operation is pending. Whatever the reason, XFS can end up with a bunch of IO completions that call for the same operation to be performed on a contiguous extent mapping. Since mappings are extent based in XFS, we'd prefer to run fewer transactions when we can. When we're processing an ioend on the list of io completions, check to see if the next items on the list are both adjacent and of the same type. If so, we can merge the completions to reduce transaction overhead. On fast storage this doesn't seem to make much of a difference in performance, though the number of transactions for an overnight xfstests run seems to drop by ~5%. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]>
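The merge rule described here (same type, and the next completion starts exactly where the previous one ends) can be sketched over a plain array. This is an illustration under assumptions: `struct ioend_sketch` and both functions are hypothetical, the input is assumed to be sorted by offset already, and the real code works on kernel lists rather than arrays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified I/O completion record. */
struct ioend_sketch {
	int type;		/* kind of completion work to perform */
	uint64_t offset;	/* starting file offset, bytes */
	uint64_t size;		/* length, bytes */
};

/* Two completions merge if they are the same type and byte-adjacent. */
static bool ioend_sketch_can_merge(const struct ioend_sketch *a,
				   const struct ioend_sketch *b)
{
	return a->type == b->type && a->offset + a->size == b->offset;
}

/*
 * Merge runs of mergeable completions in an offset-sorted array, in place.
 * Returns the number of entries left, each of which would need one
 * transaction instead of one per original completion.
 */
static size_t ioend_sketch_merge(struct ioend_sketch *v, size_t n)
{
	size_t out = 0;

	assert(v != NULL || n == 0);
	if (n == 0)
		return 0;
	for (size_t i = 1; i < n; i++) {
		if (ioend_sketch_can_merge(&v[out], &v[i]))
			v[out].size += v[i].size;
		else
			v[++out] = v[i];
	}
	return out + 1;
}
```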
2019-04-16xfs: remove unused m_data_workqueueDarrick J. Wong2-10/+1
Now that we're no longer using m_data_workqueue, remove it. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2019-04-16xfs: implement per-inode writeback completion queuesDarrick J. Wong4-12/+48
When scheduling writeback of dirty file data in the page cache, XFS uses IO completion workqueue items to ensure that filesystem metadata only updates after the write completes successfully. This is essential for converting unwritten extents to real extents at the right time and performing COW remappings. Unfortunately, XFS queues each IO completion work item to an unbounded workqueue, which means that the kernel can spawn dozens of threads to try to handle the items quickly. These threads need to take the ILOCK to update file metadata, which results in heavy ILOCK contention if a large number of the work items target a single file, which is inefficient. Worse yet, the writeback completion threads get stuck waiting for the ILOCK while holding transaction reservations, which can use up all available log reservation space. When that happens, metadata updates to other parts of the filesystem grind to a halt, even if the filesystem could otherwise have handled it. Even worse, if one of the things grinding to a halt happens to be a thread in the middle of a defer-ops finish holding the same ILOCK and trying to obtain more log reservation having exhausted the permanent reservation, we now have an ABBA deadlock - writeback completion has a transaction reserved and wants the ILOCK, and someone else has the ILOCK and wants a transaction reservation. Therefore, we create a per-inode writeback io completion queue + work item. When writeback finishes, it can add the ioend to the per-inode queue and let the single worker item process that queue. This dramatically cuts down on the number of kworkers and ILOCK contention in the system, and seems to have eliminated an occasional deadlock I was seeing while running generic/476. Testing with a program that simulates a heavy random-write workload to a single file demonstrates that the number of kworkers drops from approximately 120 threads per file to 1, without dramatically changing write bandwidth or pagecache access latency. 
Note that we leave the xfs-conv workqueue's max_active alone because we still want to be able to run ioend processing for as many inodes as the system can handle. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2019-04-16xfs: scrub should only cross-reference with healthy btreesDarrick J. Wong3-5/+77
Skip cross-referencing with a btree if the health report tells us that it's known to be bad. This should reduce the dmesg spew considerably. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2019-04-16xfs: scrub/repair should update filesystem metadata healthDarrick J. Wong5-0/+200
Now that we have the ability to track sick metadata in-core, make scrub and repair update those health assessments after doing work. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2019-04-16xfs: hoist the already_fixed variable to the scrub contextDarrick J. Wong4-11/+10
Now that we no longer memset the scrub context, we can move the already_fixed variable into the scrub context's state flags instead of passing around pointers to separate stack variables. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2019-04-16xfs: collapse scrub bool state flags into a single unsigned intDarrick J. Wong6-12/+17
Combine all the boolean state flags in struct xfs_scrub into a single unsigned int, because we're going to be adding more state flags soon. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2019-04-16xfs: refactor scrub context initializationDarrick J. Wong1-13/+18
It's a little silly how the memset in scrub context initialization forces us to declare stack variables to preserve context variables across a retry. Since the teardown functions already null out most of the ephemeral state (buffer pointers, btree cursors, etc.), just skip the memset and move the initialization as needed. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2019-04-14xfs: report inode health via bulkstatDarrick J. Wong4-1/+50
Use space in the bulkstat ioctl structure to report any problems observed with the inode. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2019-04-14xfs: report AG health via AG geometry ioctlDarrick J. Wong4-1/+52
Use the AG geometry info ioctl to report health status too. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]>