aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2022-11-29mm: Do not enable PG_arch_2 for all 64-bit architecturesCatalin Marinas1-1/+1
Commit 4beba9486abd ("mm: Add PG_arch_2 page flag") introduced a new page flag for all 64-bit architectures. However, even if an architecture is 64-bit, it may still have limited spare bits in the 'flags' member of 'struct page'. This may happen if an architecture enables SPARSEMEM without SPARSEMEM_VMEMMAP as is the case with the newly added loongarch. This architecture port needs 19 more bits for the sparsemem section information and, while it is currently fine with PG_arch_2, adding any more PG_arch_* flags will trigger build-time warnings. Add a new CONFIG_ARCH_USES_PG_ARCH_X option which can be selected by architectures that need more PG_arch_* flags beyond PG_arch_1. Select it on arm64. Signed-off-by: Catalin Marinas <[email protected]> [[email protected]: fix build with CONFIG_ARM64_MTE disabled] Signed-off-by: Peter Collingbourne <[email protected]> Reported-by: kernel test robot <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Steven Price <[email protected]> Reviewed-by: Steven Price <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-28fsverity: stop using PG_error to track error statusEric Biggers4-65/+72
As a step towards freeing the PG_error flag for other uses, change ext4 and f2fs to stop using PG_error to track verity errors. Instead, if a verity error occurs, just mark the whole bio as failed. The coarser granularity isn't really a problem since it isn't any worse than what the block layer provides, and errors from a multi-page readahead aren't reported to applications unless a single-page read fails too. f2fs supports compression, which makes the f2fs changes a bit more complicated than desired, but the basic premise still works. Note: there are still a few uses of PageError in f2fs, but they are on the write path, so they are unrelated and this patch doesn't touch them. Reviewed-by: Chao Yu <[email protected]> Acked-by: Jaegeuk Kim <[email protected]> Signed-off-by: Eric Biggers <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-28afs: Fix fileserver probe RTT handlingDavid Howells1-2/+2
The fileserver probing code attempts to work out the best fileserver to use for a volume by retrieving the RTT calculated by AF_RXRPC for the probe call sent to each server and comparing them. Sometimes, however, no RTT estimate is available and rxrpc_kernel_get_srtt() returns false, leading good fileservers to be given an RTT of UINT_MAX and thus causing the rotation algorithm to ignore them. Fix afs_select_fileserver() to ignore rxrpc_kernel_get_srtt()'s return value and just take the estimated RTT it provides - which will be capped at 1 second. Fixes: 1d4adfaf6574 ("rxrpc: Make rxrpc_kernel_get_srtt() indicate validity") Signed-off-by: David Howells <[email protected]> Reviewed-by: Marc Dionne <[email protected]> Tested-by: Marc Dionne <[email protected]> cc: [email protected] Link: https://lore.kernel.org/r/166965503999.3392585.13954054113218099395.stgit@warthog.procyon.org.uk/ Signed-off-by: Linus Torvalds <[email protected]>
2022-11-28xfs: add debug knob to slow down write for funDarrick J. Wong4-3/+60
Add a new error injection knob so that we can arbitrarily slow down pagecache writes to test for race conditions and aberrant reclaim behavior if the writeback mechanisms are slow to issue writeback. This will enable functional testing for the ifork sequence counters introduced in commit 304a68b9c63b ("xfs: use iomap_valid method to detect stale cached iomaps") that fixes write racing with reclaim writeback. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2022-11-28xfs: add debug knob to slow down writeback for funDarrick J. Wong6-3/+90
Add a new error injection knob so that we can arbitrarily slow down writeback to test for race conditions and aberrant reclaim behavior if the writeback mechanisms are slow to issue writeback. This will enable functional testing for the ifork sequence counters introduced in commit 745b3f76d1c8 ("xfs: maintain a sequence count for inode fork manipulations"). Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2022-11-28Merge tag 'xfs-iomap-stale-fixes' of ↵Darrick J. Wong12-106/+425
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-6.2-mergeB xfs, iomap: fix data corruption due to stale cached iomaps This patch series fixes a data corruption that occurs in a specific multi-threaded write workload. The workload combined racing unaligned adjacent buffered writes with low memory conditions that caused both writeback and memory reclaim to race with the writes. The result of this was random partial blocks containing zeroes instead of the correct data. The underlying problem is that iomap caches the write iomap for the duration of the write() operation, but it fails to take into account that the extent underlying the iomap can change whilst the write is in progress. The short story is that an iomap can span mutliple folios, and so under low memory writeback can be cleaning folios the write() overlaps. Whilst the overlapping data is cached in memory, this isn't a problem, but because the folios are now clean they can be reclaimed. Once reclaimed, the write() does the wrong thing when re-instantiating partial folios because the iomap no longer reflects the underlying state of the extent. e.g. it thinks the extent is unwritten, so it zeroes the partial range, when in fact the underlying extent is now written and so it should have read the data from disk. This is how we get random zero ranges in the file instead of the correct data. The gory details of the race condition can be found here: https://lore.kernel.org/linux-xfs/[email protected]/ Fixing the problem has two aspects. The first aspect of the problem is ensuring that iomap can detect a stale cached iomap during a write in a race-free manner. We already do this stale iomap detection in the writeback path, so we have a mechanism for detecting that the iomap backing the data range may have changed and needs to be remapped. In the case of the write() path, we have to ensure that the iomap is validated at a point in time when the page cache is stable and cannot be reclaimed from under us. We also need to validate the extent before we start performing any modifications to the folio state or contents. Combine these two requirements together, and the only "safe" place to validate the iomap is after we have looked up and locked the folio we are going to copy the data into, but before we've performed any initialisation operations on that folio. If the iomap fails validation, we then mark it stale, unlock the folio and end the write. This effectively means a stale iomap results in a short write. Filesystems should already be able to handle this, as write operations can end short for many reasons and need to iterate through another mapping cycle to be completed. Hence the iomap changes needed to detect and handle stale iomaps during write() operations is relatively simple... However, the assumption is that filesystems should already be able to handle write failures safely, and that's where the second (first?) part of the problem exists. That is, handling a partial write is harder than just "punching out the unused delayed allocation extent". This is because mmap() based faults can race with writes, and if they land in the delalloc region that the write allocated, then punching out the delalloc region can cause data corruption. This data corruption problem is exposed by generic/346 when iomap is converted to detect stale iomaps during write() operations. Hence write failure in the filesytems needs to handle the fact that the write() in progress doesn't necessarily own the data in the page cache over the range of the delalloc extent it just allocated. As a result, we can't just truncate the page cache over the range the write() didn't reach and punch all the delalloc extent. We have to walk the page cache over the untouched range and skip over any dirty data region in the cache in that range. Which is .... non-trivial. That is, iterating the page cache has to handle partially populated folios (i.e. block size < page size) that contain data. The data might be discontiguous within a folio. Indeed, there might be *multiple* discontiguous data regions within a single folio. And to make matters more complex, multi-page folios mean we just don't know how many sub-folio regions we might have to iterate to find all these regions. All the corner cases between the conversions and rounding between filesystem block size, folio size and multi-page folio size combined with unaligned write offsets kept breaking my brain. However, if we convert the code to track the processed write regions by byte ranges instead of fileystem block or page cache index, we could simply use mapping_seek_hole_data() to find the start and end of each discrete data region within the range we needed to scan. SEEK_DATA finds the start of the cached data region, SEEK_HOLE finds the end of the region. These are byte based interfaces that understand partially uptodate folio regions, and so can iterate discrete sub-folio data regions directly. This largely solved the problem of discovering the dirty regions we need to keep the delalloc extent over. However, to use mapping_seek_hole_data() without needing to export it, we have to move all the delalloc extent cleanup to the iomap core and so now the iomap core can clean up delayed allocation extents in a safe, sane and filesystem neutral manner. With all this done, the original data corruption never occurs anymore, and we now have a generic mechanism for ensuring that page cache writes do not do the wrong thing when writeback and reclaim change the state of the physical extent and/or page cache contents whilst the write is in progress. Signed-off-by: Dave Chinner <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]> * tag 'xfs-iomap-stale-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: drop write error injection is unfixable, remove it xfs: use iomap_valid method to detect stale cached iomaps iomap: write iomap validity checks xfs: xfs_bmap_punch_delalloc_range() should take a byte range iomap: buffered write failure should not truncate the page cache xfs,iomap: move delalloc punching to iomap xfs: use byte ranges for write cleanup ranges xfs: punching delalloc extents on write failure is racy xfs: write page faults in iomap are not buffered writes
2022-11-29xfs: drop write error injection is unfixable, remove itDave Chinner3-23/+25
With the changes to scan the page cache for dirty data to avoid data corruptions from partial write cleanup racing with other page cache operations, the drop writes error injection no longer works the same way it used to and causes xfs/196 to fail. This is because xfs/196 writes to the file and populates the page cache before it turns on the error injection and starts failing -overwrites-. The result is that the original drop-writes code failed writes only -after- overwriting the data in the cache, followed by invalidates the cached data, then punching out the delalloc extent from under that data. On the surface, this looks fine. The problem is that page cache invalidation *doesn't guarantee that it removes anything from the page cache* and it doesn't change the dirty state of the folio. When block size == page size and we do page aligned IO (as xfs/196 does) everything happens to align perfectly and page cache invalidation removes the single page folios that span the written data. Hence the followup delalloc punch pass does not find cached data over that range and it can punch the extent out. IOWs, xfs/196 "works" for block size == page size with the new code. I say "works", because it actually only works for the case where IO is page aligned, and no data was read from disk before writes occur. Because the moment we actually read data first, the readahead code allocates multipage folios and suddenly the invalidate code goes back to zeroing subfolio ranges without changing dirty state. Hence, with multipage folios in play, block size == page size is functionally identical to block size < page size behaviour, and drop-writes is manifestly broken w.r.t to this case. Invalidation of a subfolio range doesn't result in the folio being removed from the cache, just the range gets zeroed. Hence after we've sequentially walked over a folio that we've dirtied (via write data) and then invalidated, we end up with a dirty folio full of zeroed data. And because the new code skips punching ranges that have dirty folios covering them, we end up leaving the delalloc range intact after failing all the writes. Hence failed writes now end up writing zeroes to disk in the cases where invalidation zeroes folios rather than removing them from cache. This is a fundamental change of behaviour that is needed to avoid the data corruption vectors that exist in the old write fail path, and it renders the drop-writes injection non-functional and unworkable as it stands. As it is, I think the error injection is also now unnecessary, as partial writes that need delalloc extent are going to be a lot more common with stale iomap detection in place. Hence this patch removes the drop-writes error injection completely. xfs/196 can remain for testing kernels that don't have this data corruption fix, but those that do will report: xfs/196 3s ... [not run] XFS error injection drop_writes unknown on this kernel. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>
2022-11-29xfs: use iomap_valid method to detect stale cached iomapsDave Chinner5-27/+87
Now that iomap supports a mechanism to validate cached iomaps for buffered write operations, hook it up to the XFS buffered write ops so that we can avoid data corruptions that result from stale cached iomaps. See: https://lore.kernel.org/linux-xfs/[email protected]/ or the ->iomap_valid() introduction commit for exact details of the corruption vector. The validity cookie we store in the iomap is based on the type of iomap we return. It is expected that the iomap->flags we set in xfs_bmbt_to_iomap() is not perturbed by the iomap core and are returned to us in the iomap passed via the .iomap_valid() callback. This ensures that the validity cookie is always checking the correct inode fork sequence numbers to detect potential changes that affect the extent cached by the iomap. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>
2022-11-29iomap: write iomap validity checksDave Chinner2-2/+46
A recent multithreaded write data corruption has been uncovered in the iomap write code. The core of the problem is partial folio writes can be flushed to disk while a new racing write can map it and fill the rest of the page: writeback new write allocate blocks blocks are unwritten submit IO ..... map blocks iomap indicates UNWRITTEN range loop { lock folio copyin data ..... IO completes runs unwritten extent conv blocks are marked written <iomap now stale> get next folio } Now add memory pressure such that memory reclaim evicts the partially written folio that has already been written to disk. When the new write finally gets to the last partial page of the new write, it does not find it in cache, so it instantiates a new page, sees the iomap is unwritten, and zeros the part of the page that it does not have data from. This overwrites the data on disk that was originally written. The full description of the corruption mechanism can be found here: https://lore.kernel.org/linux-xfs/[email protected]/ To solve this problem, we need to check whether the iomap is still valid after we lock each folio during the write. We have to do it after we lock the page so that we don't end up with state changes occurring while we wait for the folio to be locked. Hence we need a mechanism to be able to check that the cached iomap is still valid (similar to what we already do in buffered writeback), and we need a way for ->begin_write to back out and tell the high level iomap iterator that we need to remap the remaining write range. The iomap needs to grow some storage for the validity cookie that the filesystem provides to travel with the iomap. XFS, in particular, also needs to know some more information about what the iomap maps (attribute extents rather than file data extents) to for the validity cookie to cover all the types of iomaps we might need to validate. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>
2022-11-29xfs: xfs_bmap_punch_delalloc_range() should take a byte rangeDave Chinner4-21/+15
All the callers of xfs_bmap_punch_delalloc_range() jump through hoops to convert a byte range to filesystem blocks before calling xfs_bmap_punch_delalloc_range(). Instead, pass the byte range to xfs_bmap_punch_delalloc_range() and have it do the conversion to filesystem blocks internally. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>
2022-11-29iomap: buffered write failure should not truncate the page cacheDave Chinner1-15/+180
iomap_file_buffered_write_punch_delalloc() currently invalidates the page cache over the unused range of the delalloc extent that was allocated. While the write allocated the delalloc extent, it does not own it exclusively as the write does not hold any locks that prevent either writeback or mmap page faults from changing the state of either the page cache or the extent state backing this range. Whilst xfs_bmap_punch_delalloc_range() already handles races in extent conversion - it will only punch out delalloc extents and it ignores any other type of extent - the page cache truncate does not discriminate between data written by this write or some other task. As a result, truncating the page cache can result in data corruption if the write races with mmap modifications to the file over the same range. generic/346 exercises this workload, and if we randomly fail writes (as will happen when iomap gets stale iomap detection later in the patchset), it will randomly corrupt the file data because it removes data written by mmap() in the same page as the write() that failed. Hence we do not want to punch out the page cache over the range of the extent we failed to write to - what we actually need to do is detect the ranges that have dirty data in cache over them and *not punch them out*. To do this, we have to walk the page cache over the range of the delalloc extent we want to remove. This is made complex by the fact we have to handle partially up-to-date folios correctly and this can happen even when the FSB size == PAGE_SIZE because we now support multi-page folios in the page cache. Because we are only interested in discovering the edges of data ranges in the page cache (i.e. hole-data boundaries) we can make use of mapping_seek_hole_data() to find those transitions in the page cache. As we hold the invalidate_lock, we know that the boundaries are not going to change while we walk the range. This interface is also byte-based and is sub-page block aware, so we can find the data ranges in the cache based on byte offsets rather than page, folio or fs block sized chunks. This greatly simplifies the logic of finding dirty cached ranges in the page cache. Once we've identified a range that contains cached data, we can then iterate the range folio by folio. This allows us to determine if the data is dirty and hence perform the correct delalloc extent punching operations. The seek interface we use to iterate data ranges will give us sub-folio start/end granularity, so we may end up looking up the same folio multiple times as the seek interface iterates across each discontiguous data region in the folio. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>
2022-11-28Merge tag 'fuse-fixes-6.1-rc8' of ↵Linus Torvalds1-21/+16
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fix from Miklos Szeredi: "Fix a regression introduced in -rc4" * tag 'fuse-fixes-6.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: lock inode unconditionally in fuse_fallocate()
2022-11-28f2fs: introduce discard_urgent_util sysfs nodeYangtao Li3-1/+12
Through this node, you can control the background discard to run more aggressively or not aggressively when reach the utilization rate of the space. Signed-off-by: Yangtao Li <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2022-11-28f2fs: define MIN_DISCARD_GRANULARITY macroYangtao Li3-3/+6
Do cleanup in f2fs_tuning_parameters() and __init_discard_policy(), let's use macro instead of number. Suggested-by: Chao Yu <[email protected]> Signed-off-by: Yangtao Li <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2022-11-28f2fs: init discard policy after thread wakeupYangtao Li1-11/+9
Under the current logic, after the discard thread wakes up, it will not run according to the expected policy, but will use the expected policy before sleep. Move the strategy selection to after the thread wakes up, so that the running state of the thread meets expectations. Signed-off-by: Yangtao Li <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2022-11-28f2fs: avoid victim selection from previous victim sectionYonggil Song1-2/+3
When f2fs chooses GC victim in large section & LFS mode, next_victim_seg[gc_type] is referenced first. After segment is freed, next_victim_seg[gc_type] has the next segment number. However, next_victim_seg[gc_type] still has the last segment number even after the last segment of section is freed. In this case, when f2fs chooses a victim for the next GC round, the last segment of previous victim section is chosen as a victim. Initialize next_victim_seg[gc_type] to NULL_SEGNO for the last segment in large section. Fixes: e3080b0120a1 ("f2fs: support subsectional garbage collection") Signed-off-by: Yonggil Song <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2022-11-28f2fs: truncate blocks in batch in __complete_revoke_list()Chao Yu1-7/+2
Use f2fs_do_truncate_blocks() to truncate all blocks in-batch in __complete_revoke_list(). Signed-off-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2022-11-28f2fs: make __queue_discard_cmd() return voidYangtao Li1-5/+6
Since __queue_discard_cmd() never returns an error, let's make it return void. Signed-off-by: Yangtao Li <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2022-11-28f2fs: move set_file_temperature into f2fs_new_inodeSheng Yong1-33/+29
Since the file name has already passed to f2fs_new_inode(), let's move set_file_temperature() into f2fs_new_inode(). Signed-off-by: Sheng Yong <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2022-11-28f2fs: fix to enable compress for newly created file if extension matchesSheng Yong2-167/+164
If compress_extension is set, and a newly created file matches the extension, the file could be marked as compression file. However, if inline_data is also enabled, there is no chance to check its extension since f2fs_should_compress() always returns false. This patch moves set_compress_inode(), which do extension check, in f2fs_should_compress() to check extensions before setting inline data flag. Fixes: 7165841d578e ("f2fs: fix to check inline_data during compressed inode conversion") Signed-off-by: Sheng Yong <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2022-11-28ext4: silence the warning when evicting inode with dioread_nolockZhang Yi1-5/+5
When evicting an inode with default dioread_nolock, it could be raced by the unwritten extents converting kworker after writeback some new allocated dirty blocks. It convert unwritten extents to written, the extents could be merged to upper level and free extent blocks, so it could mark the inode dirty again even this inode has been marked I_FREEING. But the inode->i_io_list check and warning in ext4_evict_inode() missing this corner case. Fortunately, ext4_evict_inode() will wait all extents converting finished before this check, so it will not lead to inode use-after-free problem, every thing is OK besides this warning. The WARN_ON_ONCE was originally designed for finding inode use-after-free issues in advance, but if we add current dioread_nolock case in, it will become not quite useful, so fix this warning by just remove this check. ====== WARNING: CPU: 7 PID: 1092 at fs/ext4/inode.c:227 ext4_evict_inode+0x875/0xc60 ... RIP: 0010:ext4_evict_inode+0x875/0xc60 ... Call Trace: <TASK> evict+0x11c/0x2b0 iput+0x236/0x3a0 do_unlinkat+0x1b4/0x490 __x64_sys_unlinkat+0x4c/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7fa933c1115b ====== rm kworker ext4_end_io_end() vfs_unlink() ext4_unlink() ext4_convert_unwritten_io_end_vec() ext4_convert_unwritten_extents() ext4_map_blocks() ext4_ext_map_blocks() ext4_ext_try_to_merge_up() __mark_inode_dirty() check !I_FREEING locked_inode_to_wb_and_lock_list() iput() iput_final() evict() ext4_evict_inode() truncate_inode_pages_final() //wait release io_end inode_io_list_move_locked() ext4_release_io_end() trigger WARN_ON_ONCE() Cc: [email protected] Fixes: ceff86fddae8 ("ext4: Avoid freeing inodes on dirty list") Signed-off-by: Zhang Yi <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2022-11-28f2fs: set zstd compress level correctlySheng Yong1-1/+1
Fixes: cf30f6a5f0c6 ("lib: zstd: Add kernel-specific API") Signed-off-by: Sheng Yong <[email protected]> Reviewed-by: Chao Yu <[email protected]> Reviewed-by: Nick Terrell <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2022-11-28f2fs: change type for 'sbi->readdir_ra'Yuwei Guan4-3/+8
Before this patch, the varibale 'readdir_ra' takes effect if it's equal to '1' or not, so we can change type for it from 'int' to 'bool'. Signed-off-by: Yuwei Guan <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2022-11-28f2fs: cleanup for 'f2fs_tuning_parameters' functionYuwei Guan1-5/+3
A cleanup patch for 'f2fs_tuning_parameters' function. Signed-off-by: Yuwei Guan <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2022-11-28f2fs: fix to alloc_mode changed after remount on a small volume deviceYuwei Guan1-2/+5
The commit 84b89e5d943d8 ("f2fs: add auto tuning for small devices") add tuning for small volume device, now support to tune alloce_mode to 'reuse' if it's small size. But the alloc_mode will change to 'default' when do remount on this small size dievce. This patch fo fix alloc_mode changed when do remount for a small volume device. Signed-off-by: Yuwei Guan <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2022-11-28f2fs: remove submit label in __submit_discard_cmd()Yangtao Li1-4/+3
Complaint from Matthew Wilcox in another similar place: "submit? You don't submit anything at the 'submit' label. it should be called 'skip' or something. But I think this is just badly written and you don't need a goto at all." Let's remove submit label for readability. Signed-off-by: Yangtao Li <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2022-11-28f2fs: fix to do sanity check on i_extra_isize in is_alive()Chao Yu1-6/+12
syzbot found a f2fs bug: BUG: KASAN: slab-out-of-bounds in data_blkaddr fs/f2fs/f2fs.h:2891 [inline] BUG: KASAN: slab-out-of-bounds in is_alive fs/f2fs/gc.c:1117 [inline] BUG: KASAN: slab-out-of-bounds in gc_data_segment fs/f2fs/gc.c:1520 [inline] BUG: KASAN: slab-out-of-bounds in do_garbage_collect+0x386a/0x3df0 fs/f2fs/gc.c:1734 Read of size 4 at addr ffff888076557568 by task kworker/u4:3/52 CPU: 1 PID: 52 Comm: kworker/u4:3 Not tainted 6.1.0-rc4-syzkaller-00362-gfef7fd48922d #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 Workqueue: writeback wb_workfn (flush-7:0) Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:284 [inline] print_report+0x15e/0x45d mm/kasan/report.c:395 kasan_report+0xbb/0x1f0 mm/kasan/report.c:495 data_blkaddr fs/f2fs/f2fs.h:2891 [inline] is_alive fs/f2fs/gc.c:1117 [inline] gc_data_segment fs/f2fs/gc.c:1520 [inline] do_garbage_collect+0x386a/0x3df0 fs/f2fs/gc.c:1734 f2fs_gc+0x88c/0x20a0 fs/f2fs/gc.c:1831 f2fs_balance_fs+0x544/0x6b0 fs/f2fs/segment.c:410 f2fs_write_inode+0x57e/0xe20 fs/f2fs/inode.c:753 write_inode fs/fs-writeback.c:1440 [inline] __writeback_single_inode+0xcfc/0x1440 fs/fs-writeback.c:1652 writeback_sb_inodes+0x54d/0xf90 fs/fs-writeback.c:1870 wb_writeback+0x2c5/0xd70 fs/fs-writeback.c:2044 wb_do_writeback fs/fs-writeback.c:2187 [inline] wb_workfn+0x2dc/0x12f0 fs/fs-writeback.c:2227 process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289 worker_thread+0x665/0x1080 kernel/workqueue.c:2436 kthread+0x2e4/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 The root cause is that we forgot to do sanity check on .i_extra_isize in below path, result in accessing invalid address later, fix it. - gc_data_segment - is_alive - data_blkaddr - offset_in_addr Reported-by: [email protected] Link: https://lore.kernel.org/linux-f2fs-devel/[email protected]/T/#u Signed-off-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2022-11-28f2fs: introduce F2FS_IOC_START_ATOMIC_REPLACEDaeho Jeong4-7/+31
introduce a new ioctl to replace the whole content of a file atomically, which means it induces truncate and content update at the same time. We can start it with F2FS_IOC_START_ATOMIC_REPLACE and complete it with F2FS_IOC_COMMIT_ATOMIC_WRITE. Or abort it with F2FS_IOC_ABORT_ATOMIC_WRITE. Signed-off-by: Daeho Jeong <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2022-11-28nfsd: reorganize filecache.cJeff Layton2-57/+58
In a coming patch, we're going to rework how the filecache refcounting works. Move some code around in the function to reduce the churn in the later patches, and rename some of the functions with (hopefully) clearer names: nfsd_file_flush becomes nfsd_file_fsync, and nfsd_file_unhash_and_dispose is renamed to nfsd_file_unhash_and_queue. Also, the nfsd_file_put_final tracepoint is renamed to nfsd_file_free, to better match the name of the function from which it's called. Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: NeilBrown <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2022-11-28nfsd: remove the pages_flushed statistic from filecacheJeff Layton1-6/+1
We're counting mapping->nrpages, but not all of those are necessarily dirty. We don't really have a simple way to count just the dirty pages, so just remove this stat since it's not accurate. Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2022-11-28NFSD: Fix licensing header in filecache.cChuck Lever1-1/+2
Add a missing SPDX header. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Jeff Layton <[email protected]>
2022-11-28NFSD: Use rhashtable for managing nfs4_file objectsChuck Lever2-39/+63
fh_match() is costly, especially when filehandles are large (as is the case for NFSv4). It needs to be used sparingly when searching data structures. Unfortunately, with common workloads, I see multiple thousands of objects stored in file_hashtbl[], which has just 256 buckets, making its bucket hash chains quite lengthy. Walking long hash chains with the state_lock held blocks other activity that needs that lock. Sizable hash chains are a common occurrance once the server has handed out some delegations, for example -- IIUC, each delegated file is held open on the server by an nfs4_file object. To help mitigate the cost of searching with fh_match(), replace the nfs4_file hash table with an rhashtable, which can dynamically resize its bucket array to minimize hash chain length. The result of this modification is an improvement in the latency of NFSv4 operations, and the reduction of nfsd CPU utilization due to eliminating the cost of multiple calls to fh_match() and reducing the CPU cache misses incurred while walking long hash chains in the nfs4_file hash table. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: NeilBrown <[email protected]> Reviewed-by: Jeff Layton <[email protected]>
2022-11-28NFSD: Refactor find_file()Chuck Lever1-21/+15
find_file() is now the only caller of find_file_locked(), so just fold these two together. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: NeilBrown <[email protected]> Reviewed-by: Jeff Layton <[email protected]>
2022-11-28NFSD: Clean up find_or_add_file()Chuck Lever1-36/+28
Remove the call to find_file_locked() in insert_nfs4_file(). Tracing shows that over 99% of these calls return NULL. Thus it is not worth the expense of the extra bucket list traversal. insert_file() already deals correctly with the case where the item is already in the hash bucket. Since nfsd4_file_hash_insert() is now just a wrapper around insert_file(), move the meat of insert_file() into nfsd4_file_hash_insert() and get rid of it. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: NeilBrown <[email protected]>
2022-11-28NFSD: Add a nfsd4_file_hash_remove() helperChuck Lever1-1/+7
Refactor to relocate hash deletion operation to a helper function that is close to most other nfs4_file data structure operations. The "noinline" annotation will become useful in a moment when the hlist_del_rcu() is replaced with a more complex rhash remove operation. It also guarantees that hash remove operations can be traced with "-p function -l remove_nfs4_file_locked". This also simplifies the organization of forward declarations: the to-be-added rhashtable and its param structure will be defined /after/ put_nfs4_file(). Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: NeilBrown <[email protected]> Reviewed-by: Jeff Layton <[email protected]>
2022-11-28NFSD: Clean up nfsd4_init_file()Chuck Lever1-6/+4
Name this function more consistently. I'm going to use nfsd4_file_ and nfsd4_file_hash_ for these helpers. Change the @fh parameter to be const pointer for better type safety. Finally, move the hash insertion operation to the caller. This is typical for most other "init_object" type helpers, and it is where most of the other nfs4_file hash table operations are located. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: NeilBrown <[email protected]> Reviewed-by: Jeff Layton <[email protected]>
2022-11-28NFSD: Update file_hashtbl() helpersChuck Lever1-2/+2
Enable callers to use const pointers for type safety. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: NeilBrown <[email protected]> Reviewed-by: Jeff Layton <[email protected]>
2022-11-28NFSD: Use const pointers as parameters to fh_ helpersChuck Lever1-4/+6
Enable callers to use const pointers where they are able to. Signed-off-by: Chuck Lever <[email protected]> Tested-by: Jeff Layton <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: NeilBrown <[email protected]>
2022-11-28NFSD: Trace delegation revocationsChuck Lever2-0/+57
Delegation revocation is an exceptional event that is not otherwise visible externally (eg, no network traffic is emitted). Generate a trace record when it occurs so that revocation can be observed or other activity can be triggered. Example: nfsd-1104 [005] 1912.002544: nfsd_stid_revoke: client 633c9343:4e82788d stateid 00000003:00000001 ref=2 type=DELEG Trace infrastructure is provided for subsequent additional tracing related to nfs4_stid activity. Signed-off-by: Chuck Lever <[email protected]> Tested-by: Jeff Layton <[email protected]> Reviewed-by: Jeff Layton <[email protected]>
2022-11-28NFSD: Trace stateids returned via DELEGRETURNChuck Lever2-0/+2
Handing out a delegation stateid is recorded with the nfsd_deleg_read tracepoint, but there isn't a matching tracepoint for recording when the stateid is returned. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Jeff Layton <[email protected]>
2022-11-28NFSD: Clean up nfs4_preprocess_stateid_op() call sitesChuck Lever1-24/+7
Remove the lame-duck dprintk()s around nfs4_preprocess_stateid_op() call sites. Signed-off-by: Chuck Lever <[email protected]> Tested-by: Jeff Layton <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: NeilBrown <[email protected]>
2022-11-28NFSD: Flesh out a documenting comment for filecache.cChuck Lever1-0/+24
Record what we've learned recently about the NFSD filecache in a documenting comment so our future selves don't forget what all this is for. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Jeff Layton <[email protected]>
2022-11-28NFSD: Add an NFSD_FILE_GC flag to enable nfsd_file garbage collectionChuck Lever5-13/+64
NFSv4 operations manage the lifetime of nfsd_file items they use by means of NFSv4 OPEN and CLOSE. Hence there's no need for them to be garbage collected. Introduce a mechanism to enable garbage collection for nfsd_file items used only by NFSv2/3 callers. Note that the change in nfsd_file_put() ensures that both CLOSE and DELEGRETURN will actually close out and free an nfsd_file on last reference of a non-garbage-collected file. Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=394 Suggested-by: Trond Myklebust <[email protected]> Signed-off-by: Chuck Lever <[email protected]> Tested-by: Jeff Layton <[email protected]> Reviewed-by: NeilBrown <[email protected]> Reviewed-by: Jeff Layton <[email protected]>
2022-11-28NFSD: Revert "NFSD: NFSv4 CLOSE should release an nfsd_file immediately"Chuck Lever3-21/+2
This reverts commit 5e138c4a750dc140d881dab4a8804b094bbc08d2. That commit attempted to make files available to other users as soon as all NFSv4 clients were done with them, rather than waiting until the filecache LRU had garbage collected them. It gets the reference counting wrong, for one thing. But it also misses that DELEGRETURN should release a file in the same fashion. In fact, any nfsd_file_put() on an file held open by an NFSv4 client needs potentially to release the file immediately... Clear the way for implementing that idea. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: NeilBrown <[email protected]>
2022-11-28NFSD: Pass the target nfsd_file to nfsd_commit()Chuck Lever4-14/+25
In a moment I'm going to introduce separate nfsd_file types, one of which is garbage-collected; the other, not. The garbage-collected variety is to be used by NFSv2 and v3, and the non-garbage-collected variety is to be used by NFSv4. nfsd_commit() is invoked by both NFSv3 and NFSv4 consumers. We want nfsd_commit() to find and use the correct variety of cached nfsd_file object for the NFS version that is in use. Signed-off-by: Chuck Lever <[email protected]> Tested-by: Jeff Layton <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: NeilBrown <[email protected]>
2022-11-28nfsd: don't call nfsd_file_put from client states seqfile displayJeff Layton1-18/+33
We had a report of this: BUG: sleeping function called from invalid context at fs/nfsd/filecache.c:440 ...with a stack trace showing nfsd_file_put being called from nfs4_show_open. This code has always tried to call fput while holding a spinlock, but we recently changed this to use the filecache, and that started triggering the might_sleep() in nfsd_file_put. states_start takes and holds the cl_lock while iterating over the client's states, and we can't sleep with that held. Have the various nfs4_show_* functions instead hold the fi_lock instead of taking a nfsd_file reference. Fixes: 78599c42ae3c ("nfsd4: add file to display list of client's opens") Link: https://bugzilla.redhat.com/show_bug.cgi?id=2138357 Reported-by: Zhi Li <[email protected]> Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2022-11-28exportfs: use pr_debug for unreachable debug statementsDavid Disseldorp1-4/+4
expfs.c has a bunch of dprintk statements which are unusable due to: #define dprintk(fmt, args...) do{}while(0) Use pr_debug so that they can be enabled dynamically. Also make some minor changes to the debug statements to fix some incorrect types, and remove __func__ which can be handled by dynamic debug separately. Signed-off-by: David Disseldorp <[email protected]> Reviewed-by: NeilBrown <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2022-11-28nfsd: allow disabling NFSv2 at compile timeJeff Layton5-8/+27
rpc.nfsd stopped supporting NFSv2 a year ago. Take the next logical step toward deprecating it and allow NFSv2 support to be compiled out. Add a new CONFIG_NFSD_V2 option that can be turned off and rework the CONFIG_NFSD_V?_ACL option dependencies. Add a description that discourages enabling it. Also, change the description of CONFIG_NFSD to state that the always-on version is now 3 instead of 2. Finally, add an #ifdef around "case 2:" in __write_versions. When NFSv2 is disabled at compile time, this should make the kernel ignore attempts to disable it at runtime, but still error out when trying to enable it. Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Tom Talpey <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2022-11-28nfsd: move nfserrno() to vfs.cJeff Layton8-63/+68
nfserrno() is common to all nfs versions, but nfsproc.c is specifically for NFSv2. Move it to vfs.c, and the prototype to vfs.h. While we're in here, remove the #ifdef EDQUOT check in this function. It's apparently a holdover from the initial merge of the nfsd code in 1997. No other place in the kernel checks that that symbol is defined before using it, so I think we can dispense with it here. Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2022-11-28nfsd: ignore requests to disable unsupported versionsJeff Layton1-1/+3
The kernel currently errors out if you attempt to enable or disable a version that it doesn't recognize. Change it to ignore attempts to disable an unrecognized version. If we don't support it, then there is no harm in doing so. Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Tom Talpey <[email protected]> Signed-off-by: Chuck Lever <[email protected]>