aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2021-08-16iomap: switch iomap_fiemap to use iomap_iterChristoph Hellwig1-41/+29
Rewrite the ->fiemap implementation based on iomap_iter. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16iomap: switch __iomap_dio_rw to use iomap_iterChristoph Hellwig2-85/+84
Switch __iomap_dio_rw to use iomap_iter. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16iomap: switch iomap_page_mkwrite to use iomap_iterChristoph Hellwig1-22/+17
Switch iomap_page_mkwrite to use iomap_iter. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16iomap: switch iomap_zero_range to use iomap_iterChristoph Hellwig1-18/+18
Switch iomap_zero_range to use iomap_iter. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16iomap: switch iomap_file_unshare to use iomap_iterChristoph Hellwig1-17/+18
Switch iomap_file_unshare to use iomap_iter. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16iomap: switch iomap_file_buffered_write to use iomap_iterChristoph Hellwig1-24/+25
Switch iomap_file_buffered_write to use iomap_iter. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16iomap: switch readahead and readpage to use iomap_iterChristoph Hellwig1-43/+37
Switch the page cache read functions to use iomap_iter instead of iomap_apply. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16iomap: add the new iomap_iter modelChristoph Hellwig2-2/+109
The iomap_iter struct provides a convenient way to package up and maintain all the arguments to the various mapping and operation functions. It is operated on using the iomap_iter() function that is called in loop until the whole range has been processed. Compared to the existing iomap_apply() function this avoid an indirect call for each iteration. For now iomap_iter() calls back into the existing ->iomap_begin and ->iomap_end methods, but in the future this could be further optimized to avoid indirect calls entirely. Based on an earlier patch from Matthew Wilcox <[email protected]>. Signed-off-by: Christoph Hellwig <[email protected]> [djwong: add to apply.c to preserve git history of iomap loop control] Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2021-08-16iomap: fix the iomap_readpage_actor return value for inline dataChristoph Hellwig1-2/+2
The actor should never return a larger value than the length that was passed in. The current code handles this gracefully, but the opcoming iter model will be more picky. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16iomap: mark the iomap argument to iomap_read_page_sync constChristoph Hellwig1-1/+1
iomap_read_page_sync never modifies the passed in iomap, so mark it const. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16iomap: mark the iomap argument to iomap_read_inline_data constChristoph Hellwig1-1/+1
iomap_read_inline_data never modifies the passed in iomap, so mark it const. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16fsdax: mark the iomap argument to dax_iomap_sector as constChristoph Hellwig1-1/+1
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16fs: mark the iomap argument to __block_write_begin_int constChristoph Hellwig2-4/+4
__block_write_begin_int never modifies the passed in iomap, so mark it const. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16iomap: remove the iomap arguments to ->page_{prepare,done}Christoph Hellwig2-6/+5
These aren't actually used by the only instance implementing the methods. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16iomap: fix a trivial comment typo in trace.hChristoph Hellwig1-1/+1
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16iomap: pass writeback errors to the mappingDarrick J. Wong1-1/+1
Modern-day mapping_set_error has the ability to squash the usual negative error code into something appropriate for long-term storage in a struct address_space -- ENOSPC becomes AS_ENOSPC, and everything else becomes EIO. iomap squashes /everything/ to EIO, just as XFS did before that, but this doesn't make sense. Fix this by making it so that we can pass ENOSPC to userspace when writeback fails due to space problems. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
2021-08-16xfs: move the CIL workqueue to the CILDave Chinner4-18/+19
We only use the CIL workqueue in the CIL, so it makes no sense to hang it off the xfs_mount and have to walk multiple pointers back up to the mount when we have the CIL structures right there. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16xfs: CIL work is serialised, not pipelinedDave Chinner3-40/+48
Because we use a single work structure attached to the CIL rather than the CIL context, we can only queue a single work item at a time. This results in the CIL being single threaded and limits performance when it becomes CPU bound. The design of the CIL is that it is pipelined and multiple commits can be running concurrently, but the way the work is currently implemented means that it is not pipelining as it was intended. The critical work to switch the CIL context can take a few milliseconds to run, but the rest of the CIL context flush can take hundreds of milliseconds to complete. The context switching is the serialisation point of the CIL, once the context has been switched the rest of the context push can run asynchrnously with all other context pushes. Hence we can move the work to the CIL context so that we can run multiple CIL pushes at the same time and spread the majority of the work out over multiple CPUs. We can keep the per-cpu CIL commit state on the CIL rather than the context, because the context is pinned to the CIL until the switch is done and we aggregate and drain the per-cpu state held on the CIL during the context switch. However, because we no longer serialise the CIL work, we can have effectively unlimited CIL pushes in progress. We don't want to do this - not only does it create contention on the iclogs and the state machine locks, we can run the log right out of space with outstanding pushes. Instead, limit the work concurrency to 4 concurrent works being processed at a time. This is enough concurrency to remove the CIL from being a CPU bound bottleneck but not enough to create new contention points or unbound concurrency issues. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16xfs: AIL needs asynchronous CIL forcingDave Chinner8-29/+91
The AIL pushing is stalling on log forces when it comes across pinned items. This is happening on removal workloads where the AIL is dominated by stale items that are removed from AIL when the checkpoint that marks the items stale is committed to the journal. This results is relatively few items in the AIL, but those that are are often pinned as directories items are being removed from are still being logged. As a result, many push cycles through the CIL will first issue a blocking log force to unpin the items. This can take some time to complete, with tracing regularly showing push delays of half a second and sometimes up into the range of several seconds. Sequences like this aren't uncommon: .... 399.829437: xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20 <wanted 20ms, got 270ms delay> 400.099622: xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0 400.099623: xfsaild: first lsn 0x11002f3600 400.099679: xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50 <wanted 50ms, got 500ms delay> 400.589348: xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0 400.589349: xfsaild: first lsn 0x1100305000 400.589595: xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50 <wanted 50ms, got 460ms delay> 400.950341: xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0 400.950343: xfsaild: first lsn 0x1100317c00 400.950436: xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20 <wanted 20ms, got 200ms delay> 401.142333: xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0 401.142334: xfsaild: first lsn 0x110032e600 401.142535: xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10 <wanted 10ms, got 10ms delay> 401.154323: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000 401.154328: xfsaild: first lsn 0x1100353000 401.154389: xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20 <wanted 20ms, got 300ms delay> 401.451525: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 401.451526: xfsaild: first lsn 0x1100353000 401.451804: xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50 <wanted 50ms, got 500ms delay> 401.933581: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 .... In each of these cases, every AIL pass saw 101 log items stuck on the AIL (pinned) with very few other items being found. Each pass, a log force was issued, and delay between last/first is the sleep time + the sync log force time. Some of these 101 items pinned the tail of the log. The tail of the log does slowly creep forward (first lsn), but the problem is that the log is actually out of reservation space because it's been running so many transactions that stale items that never reach the AIL but consume log space. Hence we have a largely empty AIL, with long term pins on items that pin the tail of the log that don't get pushed frequently enough to keep log space available. The problem is the hundreds of milliseconds that we block in the log force pushing the CIL out to disk. The AIL should not be stalled like this - it needs to run and flush items that are at the tail of the log with minimal latency. What we really need to do is trigger a log flush, but then not wait for it at all - we've already done our waiting for stuff to complete when we backed off prior to the log force being issued. Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we still do a blocking flush of the CIL and that is what is causing the issue. Hence we need a new interface for the CIL to trigger an immediate background push of the CIL to get it moving faster but not to wait on that to occur. While the CIL is pushing, the AIL can also be pushing. We already have an internal interface to do this - xlog_cil_push_now() - but we need a wrapper for it to be used externally. xlog_cil_force_seq() can easily be extended to do what we need as it already implements the synchronous CIL push via xlog_cil_push_now(). Add the necessary flags and "push current sequence" semantics to xlog_cil_force_seq() and convert the AIL pushing to use it. One of the complexities here is that the CIL push does not guarantee that the commit record for the CIL checkpoint is written to disk. The current log force ensures this by submitting the current ACTIVE iclog that the commit record was written to. We need the CIL to actually write this commit record to disk for an async push to ensure that the checkpoint actually makes it to disk and unpins the pinned items in the checkpoint on completion. Hence we need to pass down to the CIL push that we are doing an async flush so that it can switch out the commit_iclog if necessary to get written to disk when the commit iclog is finally released. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Allison Henderson <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16xfs: order CIL checkpoint start recordsDave Chinner3-13/+58
Because log recovery depends on strictly ordered start records as well as strictly ordered commit records. This is a zero day bug in the way XFS writes pipelined transactions to the journal which is exposed by fixing the zero day bug that prevents the CIL from pipelining checkpoints. This re-introduces explicit concurrent commits back into the on-disk journal and hence out of order start records. The XFS journal commit code has never ordered start records and we have relied on strict commit record ordering for correct recovery ordering of concurrently written transactions. Unfortunately, root cause analysis uncovered the fact that log recovery uses the LSN of the start record for transaction commit processing. Hence, whilst the commits are processed in strict order by recovery, the LSNs associated with the commits can be out of order and so recovery may stamp incorrect LSNs into objects and/or misorder intents in the AIL for later processing. This can result in log recovery failures and/or on disk corruption, sometimes silent. Because this is a long standing log recovery issue, we can't just fix log recovery and call it good. This still leaves older kernels susceptible to recovery failures and corruption when replaying a log from a kernel that pipelines checkpoints. There is also the issue that in-memory ordering for AIL pushing and data integrity operations are based on checkpoint start LSNs, and if the start LSN is incorrect in the journal, it is also incorrect in memory. Hence there's really only one choice for fixing this zero-day bug: we need to strictly order checkpoint start records in ascending sequence order in the log, the same way we already strictly order commit records. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16xfs: attach iclog callbacks in xlog_cil_set_ctx_write_state()Dave Chinner3-74/+70
Now that we have a mechanism to guarantee that the callbacks attached to an iclog are owned by the context that attaches them until they drop their reference to the iclog via xlog_state_release_iclog(), we can attach callbacks to the iclog at any time we have an active reference to the iclog. xlog_state_get_iclog_space() always guarantees that the commit record will fit in the iclog it returns, so we can move this IO callback setting to xlog_cil_set_ctx_write_state(), record the commit iclog in the context and remove the need for the commit iclog to be returned by xlog_write() altogether. This, in turn, allows us to move the wakeup for ordered commit record writes up into xlog_cil_set_ctx_write_state(), too, because we have been guaranteed that this commit record will be physically located in the iclog before any waiting commit record at a higher sequence number will be granted iclog space. This further cleans up the post commit record write processing in the CIL push code, especially as xlog_state_release_iclog() will now clean up the context when shutdown errors occur. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16xfs: factor out log write ordering from xlog_cil_push_work()Dave Chinner1-36/+51
So we can use it for start record ordering as well as commit record ordering in future. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16xfs: pass a CIL context to xlog_write()Dave Chinner3-25/+52
Pass the CIL context to xlog_write() rather than a pointer to a LSN variable. Only the CIL checkpoint calls to xlog_write() need to know about the start LSN of the writes, so rework xlog_write to directly write the LSNs into the CIL context structure. This removes the commit_lsn variable from xlog_cil_push_work(), so now we only have to issue the commit record ordering wakeup from there. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16xfs: move xlog_commit_record to xfs_log_cil.cDave Chinner3-34/+34
It is only used by the CIL checkpoints, and is the counterpart to start record formatting and writing that is already local to xfs_log_cil.c. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16xfs: log head and tail aren't reliable during shutdownDave Chinner1-24/+27
I'm seeing assert failures from xlog_space_left() after a shutdown has begun that look like: XFS (dm-0): log I/O error -5 XFS (dm-0): xfs_do_force_shutdown(0x2) called from line 1338 of file fs/xfs/xfs_log.c. Return address = xlog_ioend_work+0x64/0xc0 XFS (dm-0): Log I/O Error Detected. XFS (dm-0): Shutting down filesystem. Please unmount the filesystem and rectify the problem(s) XFS (dm-0): xlog_space_left: head behind tail XFS (dm-0): tail_cycle = 6, tail_bytes = 2706944 XFS (dm-0): GH cycle = 6, GH bytes = 1633867 XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 1310 ------------[ cut here ]------------ Call Trace: xlog_space_left+0xc3/0x110 xlog_grant_push_threshold+0x3f/0xf0 xlog_grant_push_ail+0x12/0x40 xfs_log_reserve+0xd2/0x270 ? __might_sleep+0x4b/0x80 xfs_trans_reserve+0x18b/0x260 ..... There are two things here. Firstly, after a shutdown, the log head and tail can be out of whack as things abort and release (or don't release) resources, so checking them for sanity doesn't make much sense. Secondly, xfs_log_reserve() can race with shutdown and so it can still fail like this even though it has already checked for a log shutdown before calling xlog_grant_push_ail(). So, before ASSERT failing in xlog_space_left(), make sure we haven't already shut down.... Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16xfs: don't run shutdown callbacks on active iclogsDave Chinner2-12/+38
When the log is shutdown, it currently walks all the iclogs and runs callbacks that are attached to the iclogs, regardless of whether the iclog is queued for IO completion or not. This creates a problem for contexts attaching callbacks to iclogs in that a racing shutdown can run the callbacks even before the attaching context has finished processing the iclog and releasing it for IO submission. If the callback processing of the iclog frees the structure that is attached to the iclog, then this leads to an UAF scenario that can only be protected against by holding the icloglock from the point callbacks are attached through to the release of the iclog. While we currently do this, it is not practical or sustainable. Hence we need to make shutdown processing the responsibility of the context that holds active references to the iclog. We know that the contexts attaching callbacks to the iclog must have active references to the iclog, and that means they must be in either ACTIVE or WANT_SYNC states. xlog_state_do_callback() will skip over iclogs in these states -except- when the log is shut down. xlog_state_do_callback() checks the state of the iclogs while holding the icloglock, therefore the reference count/state change that occurs in xlog_state_release_iclog() after the callbacks are atomic w.r.t. shutdown processing. We can't push the responsibility of callback cleanup onto the CIL context because we can have ACTIVE iclogs that have callbacks attached that have already been released. Hence we really need to internalise the cleanup of callbacks into xlog_state_release_iclog() processing. Indeed, we already have that internalisation via: xlog_state_release_iclog drop last reference ->SYNCING xlog_sync xlog_write_iclog if (log_is_shutdown) xlog_state_done_syncing() xlog_state_do_callback() <process shutdown on iclog that is now in SYNCING state> The problem is that xlog_state_release_iclog() aborts before doing anything if the log is already shut down. It assumes that the callbacks have already been cleaned up, and it doesn't need to do any cleanup. Hence the fix is to remove the xlog_is_shutdown() check from xlog_state_release_iclog() so that reference counts are correctly released from the iclogs, and when the reference count is zero we always transition to SYNCING if the log is shut down. Hence we'll always enter the xlog_sync() path in a shutdown and eventually end up erroring out the iclog IO and running xlog_state_do_callback() to process the callbacks attached to the iclog. This allows us to stop processing referenced ACTIVE/WANT_SYNC iclogs directly in the shutdown code, and in doing so gets rid of the UAF vector that currently exists. This then decouples the adding of callbacks to the iclogs from xlog_state_release_iclog() as we guarantee that xlog_state_release_iclog() will process the callbacks if the log has been shut down before xlog_state_release_iclog() has been called. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16xfs: separate out log shutdown callback processingDave Chinner1-15/+38
The iclog callback processing done during a forced log shutdown has different logic to normal runtime IO completion callback processing. Separate out the shutdown callbacks into their own function and call that from the shutdown code instead. We don't need this shutdown specific logic in the normal runtime completion code - we'll always run the shutdown version on shutdown, and it will do what shutdown needs regardless of whether there are racing IO completion callbacks scheduled or in progress. Hence we can also simplify the normal IO completion callpath and only abort if shutdown occurred while we actively were processing callbacks. Further, separating out the IO completion logic from the shutdown logic avoids callback race conditions from being triggered by log IO completion after a shutdown. IO completion will now only run callbacks on iclogs that are in the correct state for a callback to be run, avoiding the possibility of running callbacks on a referenced iclog that hasn't yet been submitted for IO. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16xfs: rework xlog_state_do_callback()Dave Chinner1-43/+53
Clean it up a bit by factoring and rearranging some of the code. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16xfs: make forced shutdown processing atomicDave Chinner3-89/+76
The running of a forced shutdown is a bit of a mess. It does racy checks for XFS_MOUNT_SHUTDOWN in xfs_do_force_shutdown(), then does more racy checks in xfs_log_force_unmount() before finally setting XFS_MOUNT_SHUTDOWN and XLOG_IO_ERROR under the log->icloglock. Move the checking and setting of XFS_MOUNT_SHUTDOWN into xfs_do_force_shutdown() so we only process a shutdown once and once only. Serialise this with the mp->m_sb_lock spinlock so that the state change is atomic and won't race. Move all the mount specific shutdown state changes from xfs_log_force_unmount() to xfs_do_force_shutdown() so they are done atomically with setting XFS_MOUNT_SHUTDOWN. Then get rid of the racy xlog_is_shutdown() check from xlog_force_shutdown(), and gate the log shutdown on the test_and_set_bit(XLOG_IO_ERROR) test under the icloglock. This means that the log is shutdown once and once only, and code that needs to prevent races with shutdown can do so by holding the icloglock and checking the return value of xlog_is_shutdown(). This results in a predictable shutdown execution process - we set the shutdown flags once and process the shutdown once rather than the current "as many concurrent shutdowns as can race to the flag setting" situation we have now. Also, now that shutdown is atomic, alway emit a stack trace when the error level for the filesystem is high enough. This means that we always get a stack trace when trying to diagnose the cause of shutdowns in the field, rather than just for SHUTDOWN_CORRUPT_INCORE cases. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16xfs: convert log flags to an operational state fieldDave Chinner5-51/+50
log->l_flags doesn't actually contain "flags" as such, it contains operational state information that can change at runtime. For the shutdown state, this at least should be an atomic bit because it is read without holding locks in many places and so using atomic bitops for the state field modifications makes sense. This allows us to use things like test_and_set_bit() on state changes (e.g. setting XLOG_TAIL_WARN) to avoid races in setting the state when we aren't holding locks. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16xfs: move recovery needed state updates to xfs_log_mount_finishDave Chinner2-68/+56
xfs_log_mount_finish() needs to know if recovery is needed or not to make decisions on whether to flush the log and AIL. Move the handling of the NEED_RECOVERY state out to this function rather than needing a temporary variable to store this state over the call to xlog_recover_finish(). Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16xfs: XLOG_STATE_IOERROR must dieDave Chinner4-83/+32
We don't need an iclog state field to tell us the log has been shut down. We can just check the xlog_is_shutdown() instead. The avoids the need to have shutdown overwrite the current iclog state while being active used by the log code and so having to ensure that every iclog state check handles XLOG_STATE_IOERROR appropriately. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16xfs: convert XLOG_FORCED_SHUTDOWN() to xlog_is_shutdown()Dave Chinner5-30/+30
Make it less shouty and a static inline before adding more calls through the log code. Also convert internal log code that uses XFS_FORCED_SHUTDOWN(mount) to use xlog_is_shutdown(log) as well. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-16squashfs: use bvec_virtChristoph Hellwig6-9/+8
Use bvec_virt instead of open coding it. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-16block: ensure the bdi is freed after inode_detach_wbChristoph Hellwig1-3/+4
inode_detach_wb references the "main" bdi of the inode. With the recent change to move the bdi from the request_queue to the gendisk this causes a guaranteed use after free when using certain cgroup configurations. The big itself is older through as any non-default inode reference (e.g. an open file descriptor) could have injected this use after free even before that. Fixes: 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks") Reported-by: Qian Cai <[email protected]> Reported-by: syzbot <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-16block: free the extended dev_t minor laterChristoph Hellwig1-0/+5
The dev_t is used as the inode hash, so we should only released it once then block device inode is gone from the inode cache. Move it to bdev_free_inode to ensure that. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-16btrfs: prevent rename2 from exchanging a subvol with a directory from ↵NeilBrown1-2/+8
different parents Cross-rename lacks a check when that would prevent exchanging a directory and subvolume from different parent subvolume. This causes data inconsistencies and is caught before commit by tree-checker, turning the filesystem to read-only. Calling the renameat2 with RENAME_EXCHANGE flags like renameat2(AT_FDCWD, namesrc, AT_FDCWD, namedest, (1 << 1)) on two paths: namesrc = dir1/subvol1/dir2 namedest = subvol2/subvol3 will cause key order problem with following write time tree-checker report: [1194842.307890] BTRFS critical (device loop1): corrupt leaf: root=5 block=27574272 slot=10 ino=258, invalid previous key objectid, have 257 expect 258 [1194842.322221] BTRFS info (device loop1): leaf 27574272 gen 8 total ptrs 11 free space 15444 owner 5 [1194842.331562] BTRFS info (device loop1): refs 2 lock_owner 0 current 26561 [1194842.338772] item 0 key (256 1 0) itemoff 16123 itemsize 160 [1194842.338793] inode generation 3 size 16 mode 40755 [1194842.338801] item 1 key (256 12 256) itemoff 16111 itemsize 12 [1194842.338809] item 2 key (256 84 2248503653) itemoff 16077 itemsize 34 [1194842.338817] dir oid 258 type 2 [1194842.338823] item 3 key (256 84 2363071922) itemoff 16043 itemsize 34 [1194842.338830] dir oid 257 type 2 [1194842.338836] item 4 key (256 96 2) itemoff 16009 itemsize 34 [1194842.338843] item 5 key (256 96 3) itemoff 15975 itemsize 34 [1194842.338852] item 6 key (257 1 0) itemoff 15815 itemsize 160 [1194842.338863] inode generation 6 size 8 mode 40755 [1194842.338869] item 7 key (257 12 256) itemoff 15801 itemsize 14 [1194842.338876] item 8 key (257 84 2505409169) itemoff 15767 itemsize 34 [1194842.338883] dir oid 256 type 2 [1194842.338888] item 9 key (257 96 2) itemoff 15733 itemsize 34 [1194842.338895] item 10 key (258 12 256) itemoff 15719 itemsize 14 [1194842.339163] BTRFS error (device loop1): block=27574272 write time tree block corruption detected [1194842.339245] ------------[ cut here ]------------ [1194842.443422] WARNING: CPU: 6 PID: 26561 at fs/btrfs/disk-io.c:449 csum_one_extent_buffer+0xed/0x100 [btrfs] [1194842.511863] CPU: 6 PID: 26561 Comm: kworker/u17:2 Not tainted 5.14.0-rc3-git+ #793 [1194842.511870] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008 [1194842.511876] Workqueue: btrfs-worker-high btrfs_work_helper [btrfs] [1194842.511976] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs] [1194842.512068] RSP: 0018:ffffa2c284d77da0 EFLAGS: 00010282 [1194842.512074] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffff928867bd9978 [1194842.512078] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff928867bd9970 [1194842.512081] RBP: ffff92876b958000 R08: 0000000000000001 R09: 00000000000c0003 [1194842.512085] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 [1194842.512088] R13: ffff92875f989f98 R14: 0000000000000000 R15: 0000000000000000 [1194842.512092] FS: 0000000000000000(0000) GS:ffff928867a00000(0000) knlGS:0000000000000000 [1194842.512095] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1194842.512099] CR2: 000055f5384da1f0 CR3: 0000000102fe4000 CR4: 00000000000006e0 [1194842.512103] Call Trace: [1194842.512128] ? run_one_async_free+0x10/0x10 [btrfs] [1194842.631729] btree_csum_one_bio+0x1ac/0x1d0 [btrfs] [1194842.631837] run_one_async_start+0x18/0x30 [btrfs] [1194842.631938] btrfs_work_helper+0xd5/0x1d0 [btrfs] [1194842.647482] process_one_work+0x262/0x5e0 [1194842.647520] worker_thread+0x4c/0x320 [1194842.655935] ? process_one_work+0x5e0/0x5e0 [1194842.655946] kthread+0x135/0x160 [1194842.655953] ? set_kthread_struct+0x40/0x40 [1194842.655965] ret_from_fork+0x1f/0x30 [1194842.672465] irq event stamp: 1729 [1194842.672469] hardirqs last enabled at (1735): [<ffffffffbd1104f5>] console_trylock_spinning+0x185/0x1a0 [1194842.672477] hardirqs last disabled at (1740): [<ffffffffbd1104cc>] console_trylock_spinning+0x15c/0x1a0 [1194842.672482] softirqs last enabled at (1666): [<ffffffffbdc002e1>] __do_softirq+0x2e1/0x50a [1194842.672491] softirqs last disabled at (1651): [<ffffffffbd08aab7>] __irq_exit_rcu+0xa7/0xd0 The corrupted data will not be written, and filesystem can be unmounted and mounted again (all changes since the last commit will be lost). Add the missing check for new_ino so that all non-subvolumes must reside under the same parent subvolume. There's an exception allowing to exchange two subvolumes from any parents as the directory representing a subvolume is only a logical link and does not have any other structures related to the parent subvolume, unlike files, directories etc, that are always in the inode namespace of the parent subvolume. Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: [email protected] # 4.7+ Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: NeilBrown <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-14Merge tag 'libnvdimm-fixes-5.14-rc6' of ↵Linus Torvalds2-5/+3
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm fixes from Dan Williams: "A couple of fixes for long standing bugs, a warning fixup, and some miscellaneous dax cleanups. The bugs were recently found due to new platforms looking to use the ACPI NFIT "virtual" device definition, and new error injection capabilities to trigger error responses to label area requests. Ira's cleanups have been long pending, I neglected to send them earlier, and see no harm in including them now. This has all appeared in -next with no reported issues. Summary: - Fix support for NFIT "virtual" ranges (BIOS-defined memory disks) - Fix recovery from failed label storage areas on NVDIMM devices - Miscellaneous cleanups from Ira's investigation of dax_direct_access paths preparing for stray-write protection" * tag 'libnvdimm-fixes-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: tools/testing/nvdimm: Fix missing 'fallthrough' warning libnvdimm/region: Fix label activation vs errors ACPI: NFIT: Fix support for virtual SPA ranges dax: Ensure errno is returned from dax_direct_access fs/dax: Clarify nr_pages to dax_direct_access() fs/fuse: Remove unneeded kaddr parameter
2021-08-14io_uring: only assign io_uring_enter() SQPOLL error in actual error caseJens Axboe1-2/+3
If an SQPOLL based ring is newly created and an application issues an io_uring_enter(2) system call on it, then we can return a spurious -EOWNERDEAD error. This happens because there's nothing to submit, and if the caller doesn't specify any other action, the initial error assignment of -EOWNERDEAD never gets overwritten. This causes us to return it directly, even if it isn't valid. Move the error assignment into the actual failure case instead. Cc: [email protected] Fixes: d9d05217cb69 ("io_uring: stop SQPOLL submit on creator's death") Reported-by: Sherlock Holo [email protected] Link: https://github.com/axboe/liburing/issues/413 Signed-off-by: Jens Axboe <[email protected]>
2021-08-14Merge tag 'configfs-5.14' of git://git.infradead.org/users/hch/configfsLinus Torvalds1-12/+6
Pull configfs fix from Christoph Hellwig: - fix to revert to the historic write behavior (Bart Van Assche) * tag 'configfs-5.14' of git://git.infradead.org/users/hch/configfs: configfs: restore the kernel v5.13 text attribute write behavior
2021-08-13Merge tag '5.14-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds6-33/+80
Pull cifs fixes from Steve French: "Four CIFS/SMB3 Fixes, all for stable, two relating to deferred close, and one for the 'modefromsid' mount option (when 'idsfromsid' not specified)" * tag '5.14-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: Call close synchronously during unlink/rename/lease break. cifs: Handle race conditions during rename cifs: use the correct max-length for dentry_path_raw() cifs: create sd context must be a multiple of 8
2021-08-13Merge tag 'io_uring-5.14-2021-08-13' of git://git.kernel.dk/linux-blockLinus Torvalds2-36/+48
Pull io_uring fixes from Jens Axboe: "A bit bigger than the previous weeks, but mostly just a few stable bound fixes. In detail: - Followup fixes to patches from last week for io-wq, turns out they weren't complete (Hao) - Two lockdep reported fixes out of the RT camp (me) - Sync the io_uring-cp example with liburing, as a few bug fixes never made it to the kernel carried version (me) - SQPOLL related TIF_NOTIFY_SIGNAL fix (Nadav) - Use WRITE_ONCE() when writing sq flags (Nadav) - io_rsrc_put_work() deadlock fix (Pavel)" * tag 'io_uring-5.14-2021-08-13' of git://git.kernel.dk/linux-block: tools/io_uring/io_uring-cp: sync with liburing example io_uring: fix ctx-exit io_rsrc_put_work() deadlock io_uring: drop ctx->uring_lock before flushing work item io-wq: fix IO_WORKER_F_FIXED issue in create_io_worker() io-wq: fix bug of creating io-wokers unconditionally io_uring: rsrc ref lock needs to be IRQ safe io_uring: Use WRITE_ONCE() when writing to sq_flags io_uring: clear TIF_NOTIFY_SIGNAL when running task work
2021-08-13fat: Add KUnit tests for checksums and timestampsDavid Gow5-1/+219
Add some basic sanity-check tests for the fat_checksum() function and the fat_time_unix2fat() and fat_time_fat2unix() functions. These unit tests verify these functions return correct output for a number of test inputs. These tests were inspired by -- and serve a similar purpose to -- the timestamp parsing KUnit tests in ext4[1]. Note that, unlike fat_time_unix2fat, fat_time_fat2unix wasn't previously exported, so this patch exports it as well. This is required for the case where we're building the fat and fat_test as modules. Fixed minor checkpatch coding style errors and typos in commit log: Shuah Khan <[email protected]> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/ext4/inode-test.c Signed-off-by: David Gow <[email protected]> Acked-by: OGAWA Hirofumi <[email protected]> Reviewed-by: Brendan Higgins <[email protected]> Tested-by: Daniel Latypov <[email protected]> Signed-off-by: Shuah Khan <[email protected]>
2021-08-13f2fs: correct comment in segment.hYangtao Li1-2/+2
s/two/three Signed-off-by: Yangtao Li <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2021-08-13f2fs: improve sbi status info in debugfs/f2fs/statusYangtao Li1-5/+26
Do not use numbers but strings to improve readability when flag is set. Signed-off-by: Yangtao Li <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2021-08-13fs/ntfs3: Add NTFS3 in fs/Kconfig and fs/MakefileKonstantin Komarov2-0/+2
This adds NTFS3 in fs/Kconfig and fs/Makefile Signed-off-by: Konstantin Komarov <[email protected]>
2021-08-13fs/ntfs3: Add Kconfig, Makefile and docKonstantin Komarov2-0/+82
This adds Kconfig, Makefile and doc Signed-off-by: Konstantin Komarov <[email protected]>
2021-08-13fs/ntfs3: Add NTFS journalKonstantin Komarov1-0/+5182
This adds NTFS journal Signed-off-by: Konstantin Komarov <[email protected]>
2021-08-13fs/ntfs3: Add compressionKonstantin Komarov6-0/+2000
This patch adds different types of NTFS-applicable compressions: - lznt - lzx - xpress Latter two (lzx, xpress) implement Windows Compact OS feature and were taken from ntfs-3g system comression plugin authored by Eric Biggers (https://github.com/ebiggers/ntfs-3g-system-compression) which were ported to ntfs3 and adapted to Linux Kernel environment. Signed-off-by: Konstantin Komarov <[email protected]>
2021-08-13fs/ntfs3: Add attrib operationsKonstantin Komarov3-0/+3680
This adds attrib operations Signed-off-by: Konstantin Komarov <[email protected]>