path: root/fs
Age  Commit message  Author  Files  Lines
2021-06-22  virtiofs: propagate sync() to file server  (Greg Kurz, 3 files, -0/+44)
Even if POSIX doesn't mandate it, Linux users legitimately expect sync() to flush all data and metadata to physical storage when it is located on the same system. This isn't happening with virtiofs though: sync() inside the guest returns right away even though data still needs to be flushed from the host page cache.

This is easily demonstrated by doing the following in the guest:

  $ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync
  5120+0 records in
  5120+0 records out
  5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s
  sync()                                  = 0 <0.024068>

and start the following in the host when the 'dd' command completes in the guest:

  $ strace -T -e fsync /usr/bin/sync virtiofs/foo
  fsync(3)                                = 0 <10.371640>

There are no good reasons not to honor the expected behavior of sync() actually: it gives an unrealistic impression that virtiofs is super fast and that data has safely landed on HW, which isn't the case obviously.

Implement a ->sync_fs() superblock operation that sends a new FUSE_SYNCFS request type for this purpose. Provision a 64-bit placeholder for possible future extensions. Since the file server cannot handle the wait == 0 case, we skip it to avoid a gratuitous roundtrip. Note that this is per-superblock: a FUSE_SYNCFS is sent for the root mount and for each submount.

Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for FUSE_SYNCFS in the file server is treated as permanent success. This ensures compatibility with older file servers: the client will get the current behavior of sync() not being propagated to the file server.

Note that such an operation allows the file server to DoS sync(). Since a typical FUSE file server is an untrusted piece of software running in userspace, this is disabled by default. Only enable it with virtiofs for now since virtiofsd is supposedly trusted by the guest kernel.

Reported-by: Robert Krawitz <[email protected]>
Signed-off-by: Greg Kurz <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
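For reference, the guest-visible effect being fixed is what syncfs(2) is meant to provide; a minimal userspace sketch that times it on a hypothetical /mnt mount point (path and timing output are invented for illustration, this is not part of the patch):

/* Time how long syncfs() takes for the filesystem backing /mnt, which is
 * what the guest's sync()/syncfs() ends up waiting on once FUSE_SYNCFS
 * is honoured by the file server. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt", O_RDONLY | O_DIRECTORY);	/* hypothetical mount point */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (syncfs(fd)) {				/* flush this filesystem only */
		perror("syncfs");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("syncfs took %.3f s\n",
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	close(fd);
	return 0;
}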
2021-06-22  fuse: reject internal errno  (Miklos Szeredi, 1 file, -1/+1)
Don't allow userspace to report errors that could be kernel-internal. Reported-by: Anatoly Trosinenko <[email protected]> Fixes: 334f485df85a ("[PATCH] FUSE - device functions") Cc: <[email protected]> # v2.6.14 Signed-off-by: Miklos Szeredi <[email protected]>
2021-06-22  fuse: check connected before queueing on fpq->io  (Miklos Szeredi, 1 file, -0/+9)
A request could end up on the fpq->io list after fuse_abort_conn() has reset fpq->connected and aborted requests on that list:

  Thread-1                      Thread-2
  ========                      ========
  ->fuse_simple_request()       ->shutdown
    ->__fuse_request_send()
      ->queue_request()         ->fuse_abort_conn()
  ->fuse_dev_do_read()            ->acquire(fpq->lock)
    ->wait_for(fpq->lock)         ->set err to all req's in fpq->io
                                  ->release(fpq->lock)
    ->acquire(fpq->lock)
    ->add req to fpq->io

After the userspace copy is done the request will be ended, but req->out.h.error will remain uninitialized. Also the copy might block despite being already aborted.

Fix both issues by not allowing the request to be queued on the fpq->io list after fuse_abort_conn() has processed this list.

Reported-by: Pradeep P V K <[email protected]>
Fixes: fd22d62ed0c3 ("fuse: no fc->lock for iqueue parts")
Cc: <[email protected]> # v4.2
Signed-off-by: Miklos Szeredi <[email protected]>
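The fix follows a common pattern: re-check the "connected" state under the same lock the abort path takes before adding the request to the list. A generic userspace sketch of that pattern (names and types are invented, this is not the fuse code):

/* Generic sketch: only enqueue while "connected" is still true, checked
 * under the same lock the shutdown path uses, so a request can never be
 * added after the abort code has walked and errored-out the list. */
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

struct req {
	struct req *next;
	int error;
};

struct queue {
	pthread_mutex_t lock;
	bool connected;
	struct req *head;	/* singly linked list of pending requests */
};

/* Returns false if the queue was already shut down; the caller then fails
 * the request instead of queueing it. */
bool queue_request(struct queue *q, struct req *r)
{
	bool ok;

	pthread_mutex_lock(&q->lock);
	ok = q->connected;
	if (ok) {
		r->next = q->head;
		q->head = r;
	}
	pthread_mutex_unlock(&q->lock);
	return ok;
}

/* Shutdown: flip the flag and error out everything queued so far. */
void abort_queue(struct queue *q)
{
	pthread_mutex_lock(&q->lock);
	q->connected = false;
	for (struct req *r = q->head; r; r = r->next)
		r->error = -ECONNABORTED;
	q->head = NULL;
	pthread_mutex_unlock(&q->lock);
}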
2021-06-21  cifs: Avoid field over-reading memcpy()  (Kees Cook, 1 file, -1/+4)
In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memcpy(), memmove(), and memset(), avoid intentionally reading across neighboring fields. Instead of using memcpy to read across multiple struct members, just perform per-member assignments as already done for other members. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Steve French <[email protected]>
2021-06-21  xfs: fix endianness issue in xfs_ag_shrink_space  (Darrick J. Wong, 1 file, -3/+4)
The AGI buffer is in big-endian format, so we must convert the endianness to CPU format to do any comparisons. Fixes: 46141dc891f7 ("xfs: introduce xfs_ag_shrink_space()") Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Gao Xiang <[email protected]>
2021-06-21  netfs: fix test for whether we can skip read when writing beyond EOF  (Jeff Layton, 1 file, -13/+36)
It's not sufficient to skip reading when the pos is beyond the EOF. There may be data at the head of the page that we need to fill in before the write.

Add a new helper function that corrects and clarifies the logic of when we can skip reads, and have it only zero out the part of the page that won't have data copied in for the write.

Finally, don't set the page Uptodate after zeroing. It's not up to date since the write data won't have been copied in yet.

[DH made the following changes:
 - Prefixed the new function with "netfs_".
 - Don't call zero_user_segments() for a full-page write.
 - Altered the beyond-last-page check to avoid a DIV instruction and got rid of then-redundant zero-length file check.]

Fixes: e1b1240c1ff5f ("netfs: Add write_begin helper")
Reported-by: Andrew W Elble <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]/
Link: https://lore.kernel.org/r/162367683365.460125.4467036947364047314.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162391826758.1173366.11794946719301590013.stgit@warthog.procyon.org.uk/ # v2
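The decision the new helper encodes can be stated as a small predicate; a simplified sketch of the idea (not the netfs helper itself, and it ignores the full-page-write shortcut mentioned above; all names and the fixed page size are assumptions):

/* Simplified model: a pre-write read of the page can only be skipped if no
 * byte of the page outside the write range holds data that exists on the
 * server, i.e. the page lies wholly beyond EOF, or the write plus the
 * region past EOF covers the whole page. */
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096ULL

bool can_skip_read(unsigned long long page_start,	/* file offset of the page */
		   unsigned long long write_pos,	/* start of the write */
		   size_t write_len,
		   unsigned long long i_size)		/* current EOF */
{
	unsigned long long page_end = page_start + PAGE_SIZE;
	unsigned long long write_end = write_pos + write_len;

	/* Page is entirely past EOF: nothing on the server to read,
	 * only the non-written parts need zeroing. */
	if (page_start >= i_size)
		return true;

	/* Otherwise every byte of the page that lies before EOF must come
	 * from the write itself: the write has to start at the page head
	 * and run to the end of the page or past EOF. */
	if (write_pos <= page_start &&
	    (write_end >= page_end || write_end >= i_size))
		return true;

	return false;	/* there is server data (e.g. at the head) to read first */
}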
2021-06-21  afs: Fix afs_write_end() to handle short writes  (David Howells, 1 file, -2/+9)
Fix afs_write_end() to correctly handle a short copy into the intended write region of the page. Two things are necessary: (1) If the page is not up to date, then we should just return 0 (ie. indicating a zero-length copy). The loop in generic_perform_write() will go around again, possibly breaking up the iterator into discrete chunks[1]. This is analogous to commit b9de313cf05fe08fa59efaf19756ec5283af672a for ceph. (2) The page should not have been set uptodate if it wasn't completely set up by netfs_write_begin() (this will be fixed in the next patch), so we need to set uptodate here in such a case. Also remove the assertion that was checking that the page was set uptodate since it's now set uptodate if it wasn't already a few lines above. The assertion was from when uptodate was set elsewhere. Changes: v3: Remove the handling of len exceeding the end of the page. Fixes: 3003bbd0697b ("afs: Use the netfs_write_begin() helper") Reported-by: Jeff Layton <[email protected]> Signed-off-by: David Howells <[email protected]> Acked-by: Jeff Layton <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> cc: Al Viro <[email protected]> cc: [email protected] Link: https://lore.kernel.org/r/[email protected]/ [1] Link: https://lore.kernel.org/r/162367682522.460125.5652091227576721609.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/162391825688.1173366.3437507255136307904.stgit@warthog.procyon.org.uk/ # v2
2021-06-21  xfs: remove dead stale buf unpin handling code  (Brian Foster, 1 file, -19/+2)
This code goes back to a time when transaction commits wrote directly to iclogs. The associated log items were pinned, written to the log, and then "uncommitted" if some part of the log write had failed. This uncommit sequence called an ->iop_unpin_remove() handler that was eventually folded into ->iop_unpin() via the remove parameter. The log subsystem has since changed significantly in that transactions commit to the CIL instead of direct to iclogs, though log items must still be aborted in the event of an eventual log I/O error. However, the context for a log item abort is now asynchronous from transaction commit, which means the committing transaction has been freed by this point in time and the transaction uncommit sequence of events is no longer relevant. Further, since stale buffers remain locked at transaction commit through unpin, we can be certain that the buffer is not associated with any transaction when the unpin callback executes. Remove this unused hunk of code and replace it with an assertion that the buffer is disassociated from transaction context. Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-06-21  xfs: hold buffer across unpin and potential shutdown processing  (Brian Foster, 1 file, -16/+21)
The special processing used to simulate a buffer I/O failure on fs shutdown has a difficult to reproduce race that can result in a use after free of the associated buffer. Consider a buffer that has been committed to the on-disk log and thus is AIL resident. The buffer lands on the writeback delwri queue, but is subsequently locked, committed and pinned by another transaction before submitted for I/O. At this point, the buffer is stuck on the delwri queue as it cannot be submitted for I/O until it is unpinned. A log checkpoint I/O failure occurs sometime later, which aborts the bli. The unpin handler is called with the aborted log item, drops the bli reference count, the pin count, and falls into the I/O failure simulation path. The potential problem here is that once the pin count falls to zero in ->iop_unpin(), xfsaild is free to retry delwri submission of the buffer at any time, before the unpin handler even completes. If delwri queue submission wins the race to the buffer lock, it observes the shutdown state and simulates the I/O failure itself. This releases both the bli and delwri queue holds and frees the buffer while xfs_buf_item_unpin() sits on xfs_buf_lock() waiting to run through the same failure sequence. This problem is rare and requires many iterations of fstest generic/019 (which simulates disk I/O failures) to reproduce. To avoid this problem, grab a hold on the buffer before the log item is unpinned if the associated item has been aborted and will require a simulated I/O failure. The hold is already required for the simulated I/O failure, so the ordering simply guarantees the unpin handler access to the buffer before it is unpinned and thus processed by the AIL. This particular ordering is required so long as the AIL does not acquire a reference on the bli, which is the long term solution to this problem. Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-06-21  xfs: force the log offline when log intent item recovery fails  (Darrick J. Wong, 2 files, -1/+7)
If any part of log intent item recovery fails, we should shut down the log immediately to stop the log from writing a clean unmount record to disk, because the metadata is not consistent. The inability to cancel a dirty transaction catches most of these cases, but there are a few things that have slipped through the cracks, such as ENOSPC from a transaction allocation, or runtime errors that result in cancellation of a non-dirty transaction. This solves some weird behaviors reported by customers where a system goes down, the first mount fails, the second succeeds, but then the fs goes down later because of inconsistent metadata. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2021-06-21  xfs: fix log intent recovery ENOSPC shutdowns when inactivating inodes  (Darrick J. Wong, 1 file, -1/+9)
During regular operation, the xfs_inactive operations create transactions with zero block reservation because in general we're freeing space, not asking for more. The per-AG space reservations created at mount time enable us to handle expansions of the refcount btree without needing to reserve blocks to the transaction. Unfortunately, log recovery doesn't create the per-AG space reservations when intent items are being recovered. This isn't an issue for intent item recovery itself because they explicitly request blocks, but any inode inactivation that can happen during log recovery uses the same xfs_inactive paths as regular runtime. If a refcount btree expansion happens, the transaction will fail due to blk_res_used > blk_res, and we shut down the filesystem unnecessarily. Fix this problem by making per-AG reservations temporarily so that we can handle the inactivations, and releasing them at the end. This brings the recovery environment closer to the runtime environment. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2021-06-21  xfs: shorten the shutdown messages to a single line  (Darrick J. Wong, 1 file, -8/+8)
Consolidate the shutdown messages to a single line containing the reason, the passed-in flags, the source of the shutdown, and the end result. This means we now only have one line to look for when debugging, which is useful when the fs goes down while something else is flooding dmesg. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Chandan Babu R <[email protected]>
2021-06-21  xfs: print name of function causing fs shutdown instead of hex pointer  (Darrick J. Wong, 1 file, -1/+1)
In xfs_do_force_shutdown, print the symbolic name of the function that called us to shut down the filesystem instead of a raw hex pointer. This makes debugging a lot easier:

  XFS (sda): xfs_do_force_shutdown(0x2) called from line 2440 of file fs/xfs/xfs_log.c. Return address = ffffffffa038bc38

becomes:

  XFS (sda): xfs_do_force_shutdown(0x2) called from line 2440 of file fs/xfs/xfs_log.c. Return address = xfs_trans_mod_sb+0x25

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Reviewed-by: Chandan Babu R <[email protected]>
2021-06-21  xfs: fix type mismatches in the inode reclaim functions  (Darrick J. Wong, 3 files, -9/+9)
It's currently unlikely that we will ever end up with more than 4 billion inodes waiting for reclamation, but the fs object code uses long int for object counts and we're certainly capable of generating that many. Instead of truncating the internal counters, widen them and report the object counts correctly. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Chandan Babu R <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2021-06-21  xfs: separate primary inode selection criteria in xfs_iget_cache_hit  (Darrick J. Wong, 1 file, -23/+16)
During review of the v6 deferred inode inactivation patchset[1], Dave commented that _cache_hit should have a clear separation between inode selection criteria and actions performed on a selected inode. Move a hunk to make this true, and compact the shrink cases in the function. [1] https://lore.kernel.org/linux-xfs/162310469340.3465262.504398465311182657.stgit@locust/T/#mca6d958521cb88bbc1bfe1a30767203328d410b5 Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2021-06-21  xfs: refactor the inode recycling code  (Darrick J. Wong, 2 files, -64/+83)
Hoist the code in xfs_iget_cache_hit that restores the VFS inode state to an xfs_inode that was previously vfs-destroyed. The next patch will add a new set of state flags, so we need the helper to avoid duplication. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2021-06-21  xfs: add iclog state trace events  (Dave Chinner, 3 files, -0/+88)
For the DEBUGS! Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-06-21  xfs: xfs_log_force_lsn isn't passed a LSN  (Dave Chinner, 13 files, -65/+56)
In doing an investigation into AIL push stalls, I was looking at the log force code to see if an async CIL push could be done instead. This led me to xfs_log_force_lsn() and looking at how it works.

xfs_log_force_lsn() is only called from inode synchronisation contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn value as the LSN to sync the log to. This gets passed to xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the journal, and then used by xfs_log_force_lsn() to flush the iclogs to the journal.

The problem is that ip->i_itemp->ili_last_lsn does not store a log sequence number. What it stores is passed to it from the ->iop_committing method, which is called by xfs_log_commit_cil(). The value this passes to the iop_committing method is the CIL context sequence number that the item was committed to.

As it turns out, xlog_cil_force_lsn() converts the sequence to an actual commit LSN for the related context and returns that to xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn" variable that contained a sequence with an actual LSN and then uses that to sync the iclogs.

This caused me some confusion for a while, even though I originally wrote all this code a decade ago. ->iop_committing is only used by a couple of log item types, and only inode items use the sequence number it is passed.

Let's clean up the API, CIL structures and inode log item to call it a sequence number, and make it clear that the high level code is using CIL sequence numbers and not on-disk LSNs for integrity synchronisation purposes.

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Allison Henderson <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
2021-06-21  xfs: Fix CIL throttle hang when CIL space used going backwards  (Dave Chinner, 3 files, -24/+49)
A hang with tasks stuck on the CIL hard throttle was reported and largely diagnosed by Donald Buczek, who discovered that it was a result of the CIL context space usage decrementing in committed transactions once the hard throttle limit had been hit and processes were already blocked. This resulted in the CIL push not waking up those waiters because the CIL context was no longer over the hard throttle limit.

The surprising aspect of this was the CIL space usage going backwards regularly enough to trigger this situation. Assumptions had been made in design that the relogging process would only increase the size of the objects in the CIL, and so that space would only increase.

This change and commit message fixes the issue and documents the result of an audit of the triggers that can cause the CIL space to go backwards, how large the backwards steps tend to be, the frequency in which they occur, and what the impact on the CIL accounting code is.

Even though the CIL ctx->space_used can go backwards, it will only do so if the log item is already logged to the CIL and contains a space reservation for its entire logged state. This is tracked by the shadow buffer state on the log item. If the item is not previously logged in the CIL it has no shadow buffer nor log vector, and hence the entire size of the logged item copied to the log vector is accounted to the CIL space usage. i.e. it will always go up in this case.

If the item has a log vector (i.e. already in the CIL) and the size decreases, then the existing log vector will be overwritten and the space usage will go down. This is the only condition where the space usage reduces, and it can only occur when an item is already tracked in the CIL. Hence we are safe from CIL space usage underruns as a result of log items decreasing in size when they are relogged.

Typically this reduction in CIL usage occurs from metadata blocks being freed, such as when a btree block merge occurs or a directory entry/xattr entry is removed and the da-tree is reduced in size. This generally results in a reduction in size of around a single block in the CIL, but also tends to increase the number of log vectors because the parent and sibling nodes in the tree need to be updated when a btree block is removed. If a multi-level merge occurs, then we see reduction in size of 2+ blocks, but again the log vector count goes up.

The other vector is inode fork size changes, which only log the current size of the fork and ignore the previously logged size when the fork is relogged. Hence if we are removing items from the inode fork (dir/xattr removal in shortform, extent record removal in extent form, etc) the relogged size of the inode fork can decrease.

No other log items can decrease in size either because they are a fixed size (e.g. dquots) or they cannot be relogged (e.g. relogging an intent actually creates a new intent log item and doesn't relog the old item at all.) Hence the only two vectors for CIL context size reduction are relogging inode forks and marking buffers active in the CIL as stale.

Long story short: the majority of the code does the right thing and handles the reduction in log item size correctly, and only the CIL hard throttle implementation is problematic and needs fixing. This patch makes that fix, as well as adds comments in the log item code that result in items shrinking in size when they are relogged as a clear reminder that this can and does happen frequently.

The throttle fix is based upon the change Donald proposed, though it goes further to ensure that once the throttle is activated, it captures all tasks until the CIL push issues a wakeup, regardless of whether the CIL space used has gone back under the throttle threshold. This ensures that we prevent tasks reducing the CIL slightly under the throttle threshold and then making more changes that push it well over the throttle limit. This is achieved by checking if the throttle wait queue is already active as a condition of throttling. Hence once we start throttling, we continue to apply the throttle until the CIL context push wakes everything on the wait queue.

We can use waitqueue_active() for the waitqueue manipulations and checks as they are all done under the ctx->xc_push_lock. Hence the waitqueue has external serialisation and we can safely peek inside the wait queue without holding the internal waitqueue locks.

Many thanks to Donald for his diagnostic and analysis work to isolate the cause of this hang.

Reported-and-tested-by: Donald Buczek <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Reviewed-by: Chandan Babu R <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Allison Henderson <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
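The throttle condition described here boils down to "block if we are over the limit, or if anyone is already blocked". A condensed sketch of that check (illustrative only; in the kernel waitqueue_active() plays the role of the waiters-present test, here it is a plain counter):

/* Once any task is throttled, keep throttling new arrivals until the push
 * wakes the queue, even if space_used has dipped back under the limit in
 * the meantime.  Names and types are invented for illustration. */
#include <stdbool.h>

struct cil_ctx_sketch {
	unsigned long space_used;	/* bytes accounted to this checkpoint */
	unsigned long hard_limit;	/* hard throttle threshold */
	int waiters;			/* tasks currently blocked on the throttle */
};

bool should_throttle(const struct cil_ctx_sketch *ctx)
{
	/* Over the limit, or someone is already waiting: join the queue. */
	return ctx->space_used >= ctx->hard_limit || ctx->waiters > 0;
}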
2021-06-21  xfs: journal IO cache flush reductions  (Dave Chinner, 4 files, -48/+43)
Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to guarantee the ordering requirements the journal has w.r.t. metadata writeback. The two ordering constraints are:

1. we cannot overwrite metadata in the journal until we guarantee that the dirty metadata has been written back in place and is stable.

2. we cannot write back dirty metadata until it has been written to the journal and guaranteed to be stable (and hence recoverable) in the journal.

The ordering guarantees of #1 are provided by REQ_PREFLUSH. This causes the journal IO to issue a cache flush and wait for it to complete before issuing the write IO to the journal. Hence all completed metadata IO is guaranteed to be stable before the journal overwrites the old metadata.

The ordering guarantees of #2 are provided by the REQ_FUA, which ensures the journal writes do not complete until they are on stable storage. Hence by the time the last journal IO in a checkpoint completes, we know that the entire checkpoint is on stable storage and we can unpin the dirty metadata and allow it to be written back.

This is the mechanism by which ordering was first implemented in XFS way back in 2002 by commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96 ("Add support for drive write cache flushing") in the xfs-archive tree.

A lot has changed since then, most notably we now use delayed logging to checkpoint the filesystem to the journal rather than write each individual transaction to the journal. Cache flushes on journal IO are necessary when individual transactions are wholly contained within a single iclog. However, CIL checkpoints are single transactions that typically span hundreds to thousands of individual journal writes, and so the requirements for device cache flushing have changed.

That is, the ordering rules I state above apply to ordering of atomic transactions recorded in the journal, not to the journal IO itself. Hence we need to ensure metadata is stable before we start writing a new transaction to the journal (guarantee #1), and we need to ensure the entire transaction is stable in the journal before we start metadata writeback (guarantee #2).

Hence we only need a REQ_PREFLUSH on the journal IO that starts a new journal transaction to provide #1, and it is not needed on any other journal IO done within the context of that journal transaction. The CIL checkpoint already issues a cache flush before it starts writing to the log, so we no longer need the iclog IO to issue a REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed to xlog_write(), we no longer need to mark the first iclog in the log write with REQ_PREFLUSH for this case. As an added bonus, this ordering mechanism works for both internal and external logs, meaning we can remove the explicit data device cache flushes from the iclog write code when using external logs.

Given the new ordering semantics of commit records for the CIL, we need iclogs containing commit records to issue a REQ_PREFLUSH. We also require unmount records to do this. Hence for both XLOG_COMMIT_TRANS and XLOG_UNMOUNT_TRANS xlog_write() calls we need to mark the first iclog being written with REQ_PREFLUSH.

For both commit records and unmount records, we also want them immediately on stable storage, so we want to also mark the iclogs that contain these records to be marked REQ_FUA. That means if a record is split across multiple iclogs, they are all marked REQ_FUA and not just the last one so that when the transaction is completed all the parts of the record are on stable storage.

And for external logs, unmount records need a pre-write data device cache flush similar to the CIL checkpoint cache pre-flush as the internal iclog write code does not do this implicitly anymore.

As an optimisation, when the commit record lands in the same iclog as the journal transaction starts, we don't need to wait for anything and can simply use REQ_FUA to provide guarantee #2. This means that for fsync() heavy workloads, the cache flush behaviour is completely unchanged and there is no degradation in performance as a result of optimising the multi-IO transaction case.

The most notable sign that there is less IO latency on my test machine (nvme SSDs) is that the "noiclogs" rate has dropped substantially. This metric indicates that the CIL push is blocking in xlog_get_iclog_space() waiting for iclog IO completion to occur. With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space() is blocking waiting for log IO. With the changes in this patch, this drops to 1 noiclog event for every 100 iclog writes. Hence it is clear that log IO is completing much faster than it was previously, but it is also clear that for large iclog sizes, this isn't the performance limiting factor on this hardware.

With smaller iclogs (32kB), however, there is a substantial difference. With the cache flush modifications, the journal is now running at over 4000 write IOPS, and the journal throughput is largely identical to the 256kB iclogs and the noiclog event rate stays low at about 1:50 iclog writes. The existing code tops out at about 2500 IOPS as the number of cache flushes dominate performance and latency. The noiclog event rate is about 1:4, and the performance variance is quite large as the journal throughput can fall to less than half the peak sustained rate when the cache flush rate prevents metadata writeback from keeping up and the log runs out of space and throttles reservations.

As a result:

           logbsize  fsmark create rate   rm -rf
  before     32kb    152851+/-5.3e+04     5m28s
  patched    32kb    221533+/-1.1e+04     5m24s

  before    256kb    220239+/-6.2e+03     4m58s
  patched   256kb    228286+/-9.2e+03     5m06s

The rm -rf times are included because I ran them, but the differences are largely noise. This workload is largely metadata read IO latency bound and the changes to the journal cache flushing doesn't really make any noticeable difference to behaviour apart from a reduction in noiclog events from background CIL pushing.

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Chandan Babu R <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Allison Henderson <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
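The resulting per-iclog flag policy can be summarised as a small decision function. The sketch below uses locally defined stand-in values rather than the kernel's REQ_* constants, and does not model the single-iclog fast path mentioned above:

/* Flag policy described above: only the iclog that starts a checkpoint
 * needs a cache flush ahead of it, and iclogs carrying commit or unmount
 * records need both a preceding flush and FUA so the whole checkpoint is
 * stable once the last of them completes. */
#define SKETCH_PREFLUSH  0x1u	/* flush the volatile cache before this write */
#define SKETCH_FUA       0x2u	/* the write itself must reach stable storage */

unsigned int iclog_write_flags(int starts_checkpoint,
			       int carries_commit_or_unmount)
{
	unsigned int flags = 0;

	if (starts_checkpoint)
		flags |= SKETCH_PREFLUSH;			/* guarantee #1 */
	if (carries_commit_or_unmount)
		flags |= SKETCH_PREFLUSH | SKETCH_FUA;		/* guarantee #2 */
	return flags;
}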
2021-06-21  xfs: remove need_start_rec parameter from xlog_write()  (Dave Chinner, 3 files, -25/+25)
The CIL push is the only call to xlog_write that sets this variable to true. The other callers don't need a start rec, and they tell xlog_write what to do by passing the type of ophdr they need written in the flags field. The need_start_rec parameter essentially tells xlog_write to write an extra ophdr with a XLOG_START_TRANS type, so get rid of the variable to do this and pass XLOG_START_TRANS as the flag value into xlog_write() from the CIL push.

  $ size fs/xfs/xfs_log.o*
     text    data     bss     dec     hex  filename
    27595     560       8   28163    6e03  fs/xfs/xfs_log.o.orig
    27454     560       8   28022    6d76  fs/xfs/xfs_log.o.patched

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Chandan Babu R <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Allison Henderson <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
2021-06-21  xfs: CIL checkpoint flushes caches unconditionally  (Dave Chinner, 1 file, -4/+21)
Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to guarantee the ordering requirements the journal has w.r.t. metadata writeback. The two ordering constraints are:

1. we cannot overwrite metadata in the journal until we guarantee that the dirty metadata has been written back in place and is stable.

2. we cannot write back dirty metadata until it has been written to the journal and guaranteed to be stable (and hence recoverable) in the journal.

These rules apply to the atomic transactions recorded in the journal, not to the journal IO itself. Hence we need to ensure metadata is stable before we start writing a new transaction to the journal (guarantee #1), and we need to ensure the entire transaction is stable in the journal before we start metadata writeback (guarantee #2).

The ordering guarantees of #1 are currently provided by REQ_PREFLUSH being added to every iclog IO. This causes the journal IO to issue a cache flush and wait for it to complete before issuing the write IO to the journal. Hence all completed metadata IO is guaranteed to be stable before the journal overwrites the old metadata.

However, for long running CIL checkpoints that might do a thousand journal IOs, we don't need every single one of these iclog IOs to issue a cache flush - the cache flush done before the first iclog is submitted is sufficient to cover the entire range in the log that the checkpoint will overwrite because the CIL space reservation guarantees the tail of the log (completed metadata) is already beyond the range of the checkpoint write.

Hence we only need a full cache flush between closing off the CIL checkpoint context (i.e. when the push switches it out) and issuing the first journal IO. Rather than plumbing this through to the journal IO, we can start this cache flush the moment the CIL context is owned exclusively by the push worker. The cache flush can be in progress while we process the CIL ready for writing, hence reducing the latency of the initial iclog write. This is especially true for large checkpoints, where we might have to process hundreds of thousands of log vectors before we issue the first iclog write. In these cases, it is likely the cache flush has already been completed by the time we have built the CIL log vector chain.

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Chandan Babu R <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Reviewed-by: Allison Henderson <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
2021-06-21  xfs: async blkdev cache flush  (Dave Chinner, 2 files, -0/+37)
The new checkpoint cache flush mechanism requires us to issue an unconditional cache flush before we start a new checkpoint. We don't want to block for this if we can help it, and we have a fair chunk of CPU work to do between starting the checkpoint and issuing the first journal IO. Hence it makes sense to amortise the latency cost of the cache flush by issuing it asynchronously and then waiting for it only when we need to issue the first IO in the transaction. To do this, we need async cache flush primitives to submit the cache flush bio and to wait on it. The block layer has no such primitives for filesystems, so roll our own for the moment. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Allison Henderson <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-06-21  xfs: remove xfs_blkdev_issue_flush  (Dave Chinner, 5 files, -13/+5)
It's a one line wrapper around blkdev_issue_flush(). Just replace it with direct calls to blkdev_issue_flush(). Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Chandan Babu R <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Allison Henderson <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-06-21  nfs: fix acl memory leak of posix_acl_create()  (Gao Xiang, 1 file, -2/+2)
When looking into another nfs xfstests report, I found acl and default_acl in nfs3_proc_create() and nfs3_proc_mknod() error paths are possibly leaked. Fix them in advance. Fixes: 013cdf1088d7 ("nfs: use generic posix ACL infrastructure for v3 Posix ACLs") Cc: Trond Myklebust <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Joseph Qi <[email protected]> Signed-off-by: Gao Xiang <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2021-06-21  btrfs: inline wait_current_trans_commit_start in its caller  (David Sterba, 1 file, -13/+7)
Function wait_current_trans_commit_start is now fairly trivial so it can be inlined in its only caller. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: sink wait_for_unblock parameter to async commit  (David Sterba, 3 files, -25/+4)
There's only one caller left btrfs_ioctl_start_sync that passes 0, so we can remove the switch in btrfs_commit_transaction_async. A cleanup 9babda9f33fd ("btrfs: Remove async_transid from btrfs_mksubvol/create_subvol/create_snapshot") removed calls that passed 1, so this is a followup. As this removes last call of wait_current_trans_commit_start_and_unblock, remove the function as well. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: remove total_data_size variable in btrfs_batch_insert_items()  (Nathan Chancellor, 1 file, -2/+1)
clang warns:

  fs/btrfs/delayed-inode.c:684:6: warning: variable 'total_data_size' set but not used [-Wunused-but-set-variable]
          int total_data_size = 0, total_size = 0;
              ^
  1 warning generated.

This variable's value has been unused since commit fc0d82e103c7 ("btrfs: sink total_data parameter in setup_items_for_insert"). Eliminate it.

Link: https://github.com/ClangBuiltLinux/linux/issues/1391
Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: eliminate insert label in add_falloc_range  (Nikolay Borisov, 1 file, -12/+11)
By way of inverting the list_empty conditional the insert label can be eliminated, making the function's flow entirely linear. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: subpage: fix a rare race between metadata endio and eb freeing  (Qu Wenruo, 3 files, -26/+32)
[BUG]
There is a very rare ASSERT() triggering during full fstests run for subpage rw support. No other reproducer so far.

The ASSERT() gets triggered for metadata read in btrfs_page_set_uptodate() inside end_page_read().

[CAUSE]
There is still a small race window for metadata only, the race could happen like this:

                T1                      |              T2
  --------------------------------------+------------------------------------
  end_bio_extent_readpage()             |
  |- btrfs_validate_metadata_buffer()   |
  |  |- free_extent_buffer()            |
  |     Still have 2 refs               |
  |- end_page_read()                    |
     |- if (unlikely(PagePrivate())     |
     |  The page still has Private      |
     |                                  | free_extent_buffer()
     |                                  | |  Only one ref 1, will be
     |                                  | |  released
     |                                  | |- detach_extent_buffer_page()
     |                                  |    |- btrfs_detach_subpage()
     |- btrfs_set_page_uptodate()       |
        The page no longer has Private  |
        >>> ASSERT() triggered <<<      |

This race window is super small, thus pretty hard to hit, even with so many runs of fstests. But the race window is still there, we have to go another way to solve it other than relying on random PagePrivate() check.

Data path is not affected, as it will lock the page before reading, while unlocking the page after the last read has finished, thus no race window.

[FIX]
This patch will fix the bug by repurposing btrfs_subpage::readers.

Now btrfs_subpage::readers will be a member shared by both metadata and data. For metadata path, we don't do the page unlock as metadata only relies on extent locking.

At the same time, teach page_range_has_eb() to take btrfs_subpage::readers into consideration. So that even if the last eb of a page gets freed, page::private won't be detached as long as there still are pending end_page_read() calls.

By this we eliminate the race window. This will slightly increase the metadata memory usage, as the page may not be released as frequently as usual. But it should not be a big deal.

The code got introduced in ("btrfs: submit read time repair only for each corrupted sector"), but the fix is in a separate patch to keep the problem description and the crash is rare so it should not hurt bisectability.

Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: don't clear page extent mapped if we're not invalidating the full page  (Qu Wenruo, 1 file, -1/+13)
[BUG]
With current btrfs subpage rw support, the following script can lead to fs hang:

  $ mkfs.btrfs -f -s 4k $dev
  $ mount $dev -o nospace_cache $mnt
  $ fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt

The fs will hang at btrfs_start_ordered_extent().

[CAUSE]
In above test case, btrfs_invalidate() will be called with the following parameters:

  offset = 0  length = 53248  page dirty = 1  subpage dirty bitmap = 0x2000

Since @offset is 0, btrfs_invalidate() will try to invalidate the full page, and finally call clear_page_extent_mapped() which will detach subpage structure from the page.

And since the page no longer has subpage structure, the subpage dirty bitmap will be cleared, preventing the dirty range from being written back, thus no way to wake up the ordered extent.

[FIX]
Just follow other filesystems, only to invalidate the page if the range covers the full page.

There are cases like truncate_setsize() which can call btrfs_invalidatepage() with offset == 0 and length != 0 for the last page of an inode.

Although the old code will still try to invalidate the full page, we are still safe to just wait for ordered extent to finish. So it shouldn't cause extra problems.

Tested-by: Ritesh Harjani <[email protected]> # [ppc64]
Tested-by: Anand Jain <[email protected]> # [aarch64]
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()  (Qu Wenruo, 1 file, -1/+12)
[BUG]
With current subpage RW support, the following script can hang the fs with 64K page size.

  # mkfs.btrfs -f -s 4k $dev
  # mount $dev -o nospace_cache $mnt
  # fsstress -w -n 50 -p 1 -s 1607749395 -d $mnt

The kernel will do an infinite loop in btrfs_punch_hole_lock_range().

[CAUSE]
In btrfs_punch_hole_lock_range() we:

- Truncate page cache range
- Lock extent io tree
- Wait any ordered extents in the range.

We exit the loop until we meet all the following conditions:

- No ordered extent in the lock range
- No page is in the lock range

The latter condition has a pitfall, it only works for sector size == PAGE_SIZE case. While can't handle the following subpage case:

  0       32K     64K     96K     128K
  |       |///////||//////|       ||

  lockstart=32K
  lockend=96K - 1

In this case, although the range crosses 2 pages, truncate_pagecache_range() will invalidate no page at all, but only zero the [32K, 96K) range of the two pages.

Thus filemap_range_has_page(32K, 96K-1) will always return true, thus we will never meet the loop exit condition.

[FIX]
Fix the problem by doing page alignment for the lock range.

Function filemap_range_has_page() has already handled lend < lstart case, we only need to round up @lockstart, and round_down @lockend for truncate_pagecache_range().

This modification should not change any thing for sector size == PAGE_SIZE case, as in that case our range is already page aligned.

Tested-by: Ritesh Harjani <[email protected]> # [ppc64]
Tested-by: Anand Jain <[email protected]> # [aarch64]
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
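The alignment described in the fix shrinks the checked range inwards to whole-page boundaries, so partially covered pages (which truncate_pagecache_range() never drops) no longer keep the loop spinning. A sketch of the arithmetic, with a stand-in 64K page size matching the example:

/* Shrink [start, end] (inclusive) inwards to whole-page boundaries before
 * asking "is any page still present?".  For the example above this turns
 * [32K, 96K-1] into an empty range, which filemap_range_has_page() already
 * treats as "no page". */
#define PAGE_SIZE 65536ULL

void page_align_lock_range(unsigned long long *start,
			   unsigned long long *end)	/* inclusive */
{
	*start = (*start + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);	/* round up   */
	*end   = ((*end + 1) & ~(PAGE_SIZE - 1)) - 1;		/* round down */
}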
2021-06-21  btrfs: reflink: make copy_inline_to_page() to be subpage compatible  (Qu Wenruo, 1 file, -5/+9)
The modifications are: - Page copy destination For subpage case, one page can contain multiple sectors, thus we can no longer expect the memcpy_to_page()/btrfs_decompress() to copy data into page offset 0. The correct offset is offset_in_page(file_offset) now, which should handle both regular sectorsize and subpage cases well. - Page status update Now we need to use subpage helper to handle the page status update. Tested-by: Ritesh Harjani <[email protected]> # [ppc64] Tested-by: Anand Jain <[email protected]> # [aarch64] Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: make btrfs_page_mkwrite() to be subpage compatible  (Qu Wenruo, 1 file, -2/+2)
Only set_page_dirty() and SetPageUptodate() are not subpage compatible. Convert them to subpage helpers, so that __extent_writepage_io() can submit page content correctly.

Tested-by: Ritesh Harjani <[email protected]> # [ppc64]
Tested-by: Anand Jain <[email protected]> # [aarch64]
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: make btrfs_truncate_block() to be subpage compatible  (Qu Wenruo, 1 file, -1/+1)
btrfs_truncate_block() itself is already mostly subpage compatible, the only missing part is the page dirtying code. Currently if we have a sector that needs to be truncated, we set the sector aligned range delalloc, then set the full page dirty. The problem is, current subpage code requires subpage dirty bit to be set, or __extent_writepage_io() won't submit bio, thus leads to ordered extent never to finish. So this patch will make btrfs_truncate_block() to call btrfs_page_set_dirty() helper to replace set_page_dirty() to fix the problem. Tested-by: Ritesh Harjani <[email protected]> # [ppc64] Tested-by: Anand Jain <[email protected]> # [aarch64] Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: make __extent_writepage_io() only submit dirty range for subpage  (Qu Wenruo, 1 file, -5/+75)
__extent_writepage_io() function originally just iterates through all the extent maps of a page, and submits any regular extents. This is fine for sectorsize == PAGE_SIZE case, as if a page is dirty, we need to submit the only sector contained in the page. But for subpage case, one dirty page can contain several clean sectors with at least one dirty sector. If __extent_writepage_io() still submit all regular extent maps, it can submit data which is already written to disk. And since such already written data won't have corresponding ordered extents, it will trigger a BUG_ON() in btrfs_csum_one_bio(). Change the behavior of __extent_writepage_io() by finding the first dirty byte in the page, and only submit the dirty range other than the full extent. Since we're also here, also modify the following calls to be subpage compatible: - SetPageError() - end_page_writeback() Tested-by: Ritesh Harjani <[email protected]> # [ppc64] Tested-by: Anand Jain <[email protected]> # [aarch64] Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>
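Conceptually the writepage path now consults the per-page dirty bitmap to find the range it actually needs to submit. A simplified model of that lookup (16 sectors of 4K in a 64K page; names, types and the bitmap layout are invented, btrfs tracks this in struct btrfs_subpage):

/* Given a per-page bitmap with one bit per 4K sector, return the byte
 * range [*start, *end) spanning the first..last dirty sector so that only
 * that part of the 64K page is submitted for writeback. */
#include <stdbool.h>
#include <stdint.h>

#define SECTORSIZE       4096u
#define SECTORS_PER_PAGE   16u	/* 64K page / 4K sectors */

bool find_dirty_range(uint16_t dirty_bitmap, unsigned int *start, unsigned int *end)
{
	int first = -1, last = -1;

	for (unsigned int i = 0; i < SECTORS_PER_PAGE; i++) {
		if (dirty_bitmap & (1u << i)) {
			if (first < 0)
				first = i;
			last = i;
		}
	}
	if (first < 0)
		return false;		/* nothing dirty, nothing to submit */

	*start = (unsigned int)first * SECTORSIZE;
	*end   = (unsigned int)(last + 1) * SECTORSIZE;
	return true;
}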
2021-06-21  btrfs: make btrfs_set_range_writeback() subpage compatible  (Qu Wenruo, 3 files, -7/+10)
Function btrfs_set_range_writeback() currently just sets the page writeback unconditionally. Change it to call the subpage helper so that we can handle both cases well. Since the subpage helpers needs btrfs_fs_info, also change the parameter to accept btrfs_inode. Tested-by: Ritesh Harjani <[email protected]> # [ppc64] Tested-by: Anand Jain <[email protected]> # [aarch64] Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig()  (Qu Wenruo, 1 file, -1/+15)
In cow_file_range(), after we have succeeded creating an inline extent, we unlock the page with extent_clear_unlock_delalloc() by passing locked_page == NULL.

For sectorsize == PAGE_SIZE case, this is just making the page lock and unlock harder to grab. But for incoming subpage case, it can be a big problem.

For incoming subpage case, page locking has two entry points:

- __process_pages_contig()
  In that case, we know exactly the range we want to lock (which only requires sector alignment). To handle the subpage requirement, we introduce btrfs_subpage::writers to page::private, and will update it in __process_pages_contig().

- Other directly lock/unlock_page() call sites
  Those won't touch btrfs_subpage::writers at all.

This means, page locked by __process_pages_contig() can only be unlocked by __process_pages_contig(). Thankfully we already have the existing infrastructure in the form of @locked_page in various call sites.

Unfortunately, extent_clear_unlock_delalloc() in cow_file_range() after creating an inline extent is the exception. It intentionally calls extent_clear_unlock_delalloc() with locked_page == NULL, to also unlock current page (and clear its dirty/writeback bits).

To co-operate with incoming subpage modifications, and make the page lock/unlock pair easier to understand, this patch will still call extent_clear_unlock_delalloc() with locked_page, and only unlock the page in __extent_writepage().

Tested-by: Ritesh Harjani <[email protected]> # [ppc64]
Tested-by: Anand Jain <[email protected]> # [aarch64]
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: update locked page dirty/writeback/error bits in __process_pages_contig  (Qu Wenruo, 1 file, -4/+4)
When __process_pages_contig() gets called for extent_clear_unlock_delalloc(), if we hit the locked page, only Private2 bit is updated, but dirty/writeback/error bits are all skipped. There are several call sites that call extent_clear_unlock_delalloc() with locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK - cow_file_range() - run_delalloc_nocow() - cow_file_range_async() All for their error handling branches. For those call sites, since we skip the locked page for dirty/error/writeback bit update, the locked page will still have its subpage dirty bit remaining. Normally it's the call sites which locked the page to handle the locked page, but it won't hurt if we also do the update. Especially there are already other call sites doing the same thing by manually passing NULL as locked_page. Tested-by: Ritesh Harjani <[email protected]> # [ppc64] Tested-by: Anand Jain <[email protected]> # [aarch64] Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: make page Ordered bit to be subpage compatible  (Qu Wenruo, 3 files, -8/+18)
This involves the following modification: - Ordered extent creation This is done in process_one_page(), now PAGE_SET_ORDERED will call subpage helper to do the work. - endio functions This is done in btrfs_mark_ordered_io_finished(). - btrfs_invalidatepage() - btrfs_cleanup_ordered_extents() Use the subpage page helper, and add an extra branch to exit if the locked page have covered the full range. Now the usage of page Ordered flag for ordered extent accounting is fully subpage compatible. Tested-by: Ritesh Harjani <[email protected]> # [ppc64] Tested-by: Anand Jain <[email protected]> # [aarch64] Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: introduce helpers for subpage ordered status  (Qu Wenruo, 2 files, -0/+33)
This patch introduces the following functions to handle btrfs subpage ordered (Private2) status: - btrfs_subpage_set_ordered() - btrfs_subpage_clear_ordered() - btrfs_subpage_test_ordered() These helpers can only be called when the range is ensured to be inside the page. - btrfs_page_set_ordered() - btrfs_page_clear_ordered() - btrfs_page_test_ordered() These helpers can handle both regular sector size and subpage without problem. These functions are here to coordinate btrfs_invalidatepage() with btrfs_writepage_endio_finish_ordered(), to make sure only one of those functions can finish the ordered extent. Tested-by: Ritesh Harjani <[email protected]> # [ppc64] Tested-by: Anand Jain <[email protected]> # [aarch64] Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>
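A toy model of what these helpers track: one bit per sector within the page, set/cleared/tested for a byte range. The sector size, the 16-bit bitmap and the function names below are stand-ins, not the btrfs_subpage layout, and the real helpers also take the subpage spinlock and interact with the page flag:

/* Per-sector "ordered" (Private2) state inside one page, assuming the
 * range is sector aligned and fits in the page. */
#include <stdbool.h>
#include <stdint.h>

#define SECTORSIZE 4096u

static uint16_t range_bits(unsigned int start, unsigned int len)
{
	unsigned int first = start / SECTORSIZE;
	unsigned int nbits = len / SECTORSIZE;

	return (uint16_t)(((1u << nbits) - 1) << first);
}

void subpage_set_ordered(uint16_t *bitmap, unsigned int start, unsigned int len)
{
	*bitmap |= range_bits(start, len);
}

void subpage_clear_ordered(uint16_t *bitmap, unsigned int start, unsigned int len)
{
	*bitmap &= (uint16_t)~range_bits(start, len);
}

bool subpage_test_ordered(uint16_t bitmap, unsigned int start, unsigned int len)
{
	/* True if any sector in the range still has the bit set. */
	return (bitmap & range_bits(start, len)) != 0;
}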
2021-06-21  btrfs: make process_one_page() to handle subpage locking  (Qu Wenruo, 3 files, -15/+94)
Introduce a new data inodes specific subpage member, writers, to record how many sectors are under page lock for delalloc writing. This member acts pretty much the same as readers, except it's only for delalloc writes. This is important for delalloc code to trace which page can really be freed, as we have cases like run_delalloc_nocow() where we may exit processing nocow range inside a page, but need to exit to do cow half way. In that case, we need a way to determine if we can really unlock a full page. With the new btrfs_subpage::writers, there is a new requirement: - Page locked by process_one_page() must be unlocked by process_one_page() There are still tons of call sites manually lock and unlock a page, without updating btrfs_subpage::writers. So if we lock a page through process_one_page() then it must be unlocked by process_one_page() to keep btrfs_subpage::writers consistent. This will be handled in next patch. Tested-by: Ritesh Harjani <[email protected]> # [ppc64] Tested-by: Anand Jain <[email protected]> # [aarch64] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>
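The writers member behaves like a per-page count of sectors locked for delalloc writing: locking takes it up, each finished sector takes it down, and only the call that brings it back to zero really unlocks the page. A rough sketch of that rule (names invented; the real code stores this in btrfs_subpage::writers, under the subpage lock/atomics):

#include <stdbool.h>

struct subpage_sketch {
	int writers;		/* sectors currently locked for delalloc writing */
};

void subpage_start_writer(struct subpage_sketch *sp, int nsectors)
{
	sp->writers += nsectors;
}

/* Returns true when the caller is the one who should unlock the page. */
bool subpage_end_writer(struct subpage_sketch *sp, int nsectors)
{
	sp->writers -= nsectors;
	return sp->writers == 0;
}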
2021-06-21  btrfs: make end_bio_extent_writepage() to be subpage compatible  (Qu Wenruo, 1 file, -1/+2)
Now in end_bio_extent_writepage(), the only subpage incompatible code is the end_page_writeback(). Just call the subpage helpers. Tested-by: Ritesh Harjani <[email protected]> # [ppc64] Tested-by: Anand Jain <[email protected]> # [aarch64] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status  (Qu Wenruo, 1 file, -8/+16)
For __process_pages_contig() and process_one_page(), to handle subpage we only need to pass bytenr in and call subpage helpers to handle dirty/error/writeback status.

Tested-by: Ritesh Harjani <[email protected]> # [ppc64]
Tested-by: Anand Jain <[email protected]> # [aarch64]
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: make btrfs_dirty_pages() to be subpage compatible  (Qu Wenruo, 1 file, -2/+5)
Since the extent io tree operations in btrfs_dirty_pages() are already subpage compatible, we only need to make the page status update to use subpage helpers. Tested-by: Ritesh Harjani <[email protected]> # [ppc64] Tested-by: Anand Jain <[email protected]> # [aarch64] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: only require sector size alignment for end_bio_extent_writepage()  (Qu Wenruo, 1 file, -17/+12)
Just like read page, for subpage support we only require sector size alignment. So change the error message condition to only require sector alignment. This should not affect existing code, as for regular sectorsize == PAGE_SIZE case, we are still requiring page alignment. Tested-by: Ritesh Harjani <[email protected]> # [ppc64] Tested-by: Anand Jain <[email protected]> # [aarch64] Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: provide btrfs_page_clamp_*() helpers  (Qu Wenruo, 2 files, -0/+48)
In the coming subpage RW support, there are a lot of page status update calls which need to be converted to subpage compatible version, which needs @start and @len.

Some call sites already have such @start/@len and are already in page range, like various endio functions.

But there are also call sites which need to clamp the range for subpage case, like btrfs_dirty_pages() and __process_pages_contig().

Here we introduce new helpers, btrfs_page_clamp_*(), to do and only do the clamp for subpage version. Although in theory all existing btrfs_page_*() calls can be converted to use btrfs_page_clamp_*() directly, that would make us do unnecessary clamp operations.

Tested-by: Ritesh Harjani <[email protected]> # [ppc64]
Tested-by: Anand Jain <[email protected]> # [aarch64]
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
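The clamp is just an intersection of the requested range with the page's own byte range. A sketch of the arithmetic (the function name, types and stand-in 64K page size are illustrative, not the btrfs_page_clamp_*() wrappers themselves):

/* Intersect the caller's [start, start+len) range with the byte range
 * covered by this page, so the subpage bitmap is only updated for sectors
 * that actually live in the page. */
#include <stddef.h>

#define PAGE_SIZE 65536ULL

void clamp_to_page(unsigned long long page_offset,	/* file offset of the page */
		   unsigned long long *start, size_t *len)
{
	unsigned long long page_end = page_offset + PAGE_SIZE;
	unsigned long long end = *start + *len;

	if (*start < page_offset)
		*start = page_offset;
	if (end > page_end)
		end = page_end;
	*len = (end > *start) ? (size_t)(end - *start) : 0;
}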
2021-06-21  btrfs: refactor page status update into process_one_page()  (Qu Wenruo, 1 file, -97/+109)
In __process_pages_contig() we update page status according to page_ops. That update process is a bunch of 'if' branches, which lie inside two loops, this makes it pretty hard to expand for later subpage operations. So this patch will extract these operations into its own function, process_one_pages(). Also since we're refactoring __process_pages_contig(), also move the new helper and __process_pages_contig() before the first caller of them, to remove the forward declaration. Tested-by: Ritesh Harjani <[email protected]> # [ppc64] Tested-by: Anand Jain <[email protected]> # [aarch64] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-21  btrfs: pass bytenr directly to __process_pages_contig()  (Qu Wenruo, 1 file, -20/+37)
As a preparation for incoming subpage support, we need bytenr passed to __process_pages_contig() directly, not the current page index. So change the parameter and all callers to pass bytenr in. With the modification, here we need to replace the old @index_ret with @processed_end for __process_pages_contig(), but this brings a small problem. Normally we follow the inclusive return value, meaning @processed_end should be the last byte we processed. If parameter @start is 0, and we failed to lock any page, then we would return @processed_end as -1, causing more problems for __unlock_for_delalloc(). So here for @processed_end, we use two different return value patterns. If we have locked any page, @processed_end will be the last byte of locked page. Or it will be @start otherwise. This change will impact lock_delalloc_pages(), so it needs to check @processed_end to only unlock the range if we have locked any. Tested-by: Ritesh Harjani <[email protected]> # [ppc64] Tested-by: Anand Jain <[email protected]> # [aarch64] Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>
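The @processed_end convention is easiest to see as a small sketch: it holds the last byte of the last page that was locked, or @start if none was, and the caller only unlocks when something was actually locked. The types, stand-in page size and the try_lock callback below are invented for illustration, not the btrfs function:

#include <stdbool.h>

#define PAGE_SIZE 65536ULL

typedef bool (*try_lock_fn)(unsigned long long page_offset);

int lock_pages_sketch(unsigned long long start, unsigned long long end,
		      try_lock_fn try_lock, unsigned long long *processed_end)
{
	*processed_end = start;		/* "nothing locked yet" */

	for (unsigned long long off = start & ~(PAGE_SIZE - 1);
	     off <= end; off += PAGE_SIZE) {
		if (!try_lock(off))
			return -1;	/* caller unlocks only if *processed_end > start */
		*processed_end = off + PAGE_SIZE - 1;	/* last byte of locked page */
	}
	return 0;
}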
2021-06-21  btrfs: fix hang when run_delalloc_range() failed  (Qu Wenruo, 1 file, -0/+21)
[BUG]
When running subpage preparation patches on x86, btrfs/125 will hang forever with one ordered extent never finished.

[CAUSE]
The test case btrfs/125 itself will always fail as the fix is never merged.

When the test fails at balance, btrfs needs to cleanup the ordered extent in btrfs_cleanup_ordered_extents() for data reloc inode.

The problem is in the sequence how we cleanup the page Order bit.

Currently it works like:

  btrfs_cleanup_ordered_extents()
  |- find_get_page();
  |- btrfs_page_clear_ordered(page);
  |  Now the page doesn't have Ordered bit anymore.
  |  !!! This also includes the first (locked) page !!!
  |
  |- offset += PAGE_SIZE
  |  This is to skip the first page
  |- __endio_write_update_ordered()
     |- btrfs_mark_ordered_io_finished(NULL)
        Except the first page, all ordered extents are finished.

Then the locked page is cleaned up in __extent_writepage():

  __extent_writepage()
  |- If (PageError(page))
     |- end_extent_writepage()
        |- btrfs_mark_ordered_io_finished(page)
           |- if (btrfs_test_page_ordered(page))
              |- !!! The page gets skipped !!!
                 The ordered extent is not decreased as the page doesn't
                 have ordered bit anymore.

This leaves the ordered extent with bytes_left == sectorsize, thus never finish.

[FIX]
The fix is to ensure we never clear page Ordered bit without running the ordered extent accounting.

Here we choose to skip the locked page in btrfs_cleanup_ordered_extents() so that later end_extent_writepage() can properly finish the ordered extent.

Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>