path: root/fs
Age | Commit message | Author | Files | Lines
2020-12-09 | btrfs: move btrfs_find_highest_objectid/btrfs_find_free_objectid to disk-io.c | Nikolay Borisov | 4 | -58/+57
Those functions are going to be used even after inode cache is removed so move them to a more appropriate place. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-09 | btrfs: drop casts of bio bi_sector | David Sterba | 7 | -18/+16
Since commit 72deb455b5ec ("block: remove CONFIG_LBDAF") (5.2) the sector_t type is u64 on all arches and configs so we don't need to typecast it. It used to be unsigned long and the results of sector size shifts were not guaranteed to fit in the type. Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: David Sterba <[email protected]>
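The overflow that the old casts guarded against can be shown with a small arithmetic sketch. It assumes 512-byte sectors (a shift of 9) and models a 32-bit `unsigned long` by masking the result; the function names are made up for this illustration and are not kernel code.

```python
SECTOR_SHIFT = 9  # assumption: 512-byte sectors


def shift_in_u32(sector: int) -> int:
    """Model `sector << 9` evaluated in a 32-bit unsigned type (wraps on overflow)."""
    return (sector << SECTOR_SHIFT) & 0xFFFFFFFF


def shift_in_u64(sector: int) -> int:
    """Model the same shift evaluated in a 64-bit unsigned type."""
    return (sector << SECTOR_SHIFT) & 0xFFFFFFFFFFFFFFFF


# The sector that starts at byte offset 4 GiB.
sector = 1 << 23
wrapped = shift_in_u32(sector)   # the high bit is lost in 32 bits
correct = shift_in_u64(sector)   # fits comfortably in 64 bits
```

With a 64-bit `sector_t` everywhere, the second form is what the plain shift already computes, which is why the casts can go.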
2020-12-09 | btrfs: implement log-structured superblock for ZONED mode | Naohiro Aota | 6 | -12/+429
The superblock (and its copies) is the only data structure in btrfs that has a fixed location on a device. Since we cannot overwrite data in a sequential write required zone, we cannot place the superblock in such a zone.

One easy solution is to limit the superblock and its copies to conventional zones. However, this method has two downsides. The first is a reduced number of superblock copies: the second copy of the superblock is located at 256GB, which is in a sequential write required zone on typical devices on the market today, so the number of superblocks (primary plus copies) is limited to two. The second downside is that we cannot support devices that have no conventional zones at all.

To solve these two problems, we employ superblock log writing. It uses two adjacent zones as a circular buffer to write updated superblocks. Once the first zone is filled up, writing starts in the second one. Then, when both zones are filled up and before starting to write to the first zone again, the first zone is reset.

We can determine the position of the latest superblock by reading the write pointer information from the device. One corner case is when both zones are full. In this situation, we read the last superblock of each zone and compare them to determine which zone is older.

The following zones are reserved as the circular buffer on ZONED btrfs:

- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zone 1024 or the zone at 256GB, whichever is smaller, and the zone next to it

If these reserved zones are conventional, the superblock is written at a fixed location at the start of the zone without logging.

Signed-off-by: Naohiro Aota <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
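The write-pointer logic described in this commit message can be sketched roughly as follows. This is a simplified model, not the btrfs implementation: the `Zone` class, its fields, and `next_superblock_position()` are hypothetical names, and the generation stored per zone stands in for "read the last superblock of each zone and compare them".

```python
from dataclasses import dataclass


@dataclass
class Zone:
    start: int                # byte offset where the zone begins
    size: int                 # zone size in bytes
    wp: int                   # write pointer reported by the device
    last_generation: int = 0  # generation of the last superblock in the zone

    @property
    def empty(self) -> bool:
        return self.wp == self.start

    @property
    def full(self) -> bool:
        return self.wp >= self.start + self.size


def next_superblock_position(z0: Zone, z1: Zone) -> int:
    """Return the byte offset where the next superblock would be logged."""
    if z0.empty and z1.empty:
        # Nothing logged yet: start at the beginning of the first zone.
        return z0.start
    if z0.full and z1.full:
        # Corner case: both zones are full. The older zone is the one to
        # reset and reuse, so the log continues at its start.
        older = z0 if z0.last_generation < z1.last_generation else z1
        return older.start
    if not z0.full and (z1.empty or z1.full):
        # The first zone is the one currently being filled.
        return z0.wp
    if z0.full:
        # The first zone is full, so the second zone is being filled.
        return z1.wp
    raise ValueError("inconsistent write pointers in the superblock log zones")
```

For example, with two 8 KiB zones where the first holds one 4 KiB superblock and the second is empty, the function returns offset 4096, directly after the superblock already written.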
2020-12-09 | btrfs: disallow mixed-bg in ZONED mode | Naohiro Aota | 1 | -0/+6
Placing both data and metadata in a block group is impossible in ZONED mode. For data, we can allocate a space for it and write it immediately after the allocation. For metadata, however, we cannot do that, because the logical addresses are recorded in other metadata buffers to build up the trees. As a result, a data buffer can be placed after a metadata buffer, which is not written yet. Writing out the data buffer will break the sequential write rule. Check and disallow MIXED_BG with ZONED mode. Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-09 | btrfs: disable fallocate in ZONED mode | Naohiro Aota | 1 | -0/+4
fallocate() is implemented by reserving actual extent instead of reservations. This can result in exposing the sequential write constraint of host-managed zoned block devices to the application, which would break the POSIX semantic for the fallocated file. To avoid this, report fallocate() as not supported when in ZONED mode for now. In the future, we may be able to implement "in-memory" fallocate() in ZONED mode by utilizing space_info->bytes_may_use or similar, so this returns EOPNOTSUPP. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-09 | btrfs: disallow NODATACOW in ZONED mode | Naohiro Aota | 2 | -0/+18
NODATACOW implies overwriting the file data on a device, which is impossible in sequential required zones. Disable NODATACOW globally with mount option and per-file NODATACOW attribute by masking FS_NOCOW_FL. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-09 | btrfs: disallow space_cache in ZONED mode | Naohiro Aota | 3 | -2/+34
As updates to the space cache v1 are in-place, the space cache cannot be located over sequential zones and there are no guarantees that the device will have enough conventional zones to store this cache. Resolve this problem by completely disabling the space cache v1. This does not introduce any problems with sequential block groups: all the free space is located after the allocation pointer and there is no free space before the pointer. There is no need to have such a cache.

Note: we can technically use free-space-tree (space cache v2) in ZONED mode. But, since ZONED mode now always allocates extents in a block group sequentially regardless of the underlying device zone type, it's no use to enable and maintain the tree.

For the same reason, NODATACOW is also disabled.

In summary, ZONED will disable:

| Disabled features | Reason                                              |
|-------------------+-----------------------------------------------------|
| RAID/DUP          | Cannot handle two zone append writes to different   |
|                   | zones                                               |
|-------------------+-----------------------------------------------------|
| space_cache (v1)  | In-place updating                                   |
| NODATACOW         | In-place updating                                   |
|-------------------+-----------------------------------------------------|
| fallocate         | Reserved extent will be a write hole                |
|-------------------+-----------------------------------------------------|
| MIXED_BG          | Allocated metadata region will be write holes for   |
|                   | data writes                                         |

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Naohiro Aota <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2020-12-09 | btrfs: introduce max_zone_append_size | Naohiro Aota | 3 | -2/+19
The zone append write command has a restriction on the maximum I/O size it accepts. This is because a zone append write command cannot be split, as we ask the device to place the data into a specific target zone and the device responds with the actual written location of the data. Introduce max_zone_append_size to zone_info and fs_info to track the value, so we can limit all I/O to a zoned block device that we want to write using the zone append command to the device's limits. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
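As a rough illustration of what "limit all I/O to the device's limits" means, the sketch below cuts a write into pieces that each fit within a maximum zone append size. The function name and the (offset, length) representation are assumptions made for this example; the commit itself only introduces the tracked value, and how btrfs applies it is not shown here.

```python
def chunks_within_append_limit(length: int, max_zone_append_size: int):
    """Cut a write of `length` bytes into (offset, size) pieces, each no
    larger than `max_zone_append_size`, so that every piece can be issued
    as one unsplittable zone append command."""
    if max_zone_append_size <= 0:
        raise ValueError("max_zone_append_size must be positive")
    pieces = []
    offset = 0
    while offset < length:
        size = min(max_zone_append_size, length - offset)
        pieces.append((offset, size))
        offset += size
    return pieces
```

Each piece is sized up front because the device, not the submitter, chooses where a zone append lands, so a command cannot be split after submission.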
2020-12-09 | btrfs: check and enable ZONED mode | Naohiro Aota | 7 | -0/+143
Introduce function btrfs_check_zoned_mode() to check if ZONED flag is enabled on the file system and if the file system consists of zoned devices with equal zone size. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Signed-off-by: Damien Le Moal <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
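The condition stated above, that the file system consists of zoned devices with equal zone size, can be written as a small predicate. This is only a sketch: `devices_valid_for_zoned()` is a hypothetical name, and representing a non-zoned device by `None` is an assumption of this example, not how btrfs_check_zoned_mode() stores its data.

```python
from typing import Optional, Sequence


def devices_valid_for_zoned(zone_sizes: Sequence[Optional[int]]) -> bool:
    """`zone_sizes` holds one entry per device: the zone size in bytes,
    or None for a device that is not zoned (assumption of this sketch)."""
    if not zone_sizes:
        return False
    if any(size is None for size in zone_sizes):
        return False  # every device must be zoned
    return len(set(zone_sizes)) == 1  # and all zone sizes must match
```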
2020-12-09 | btrfs: get zone information of zoned block devices | Naohiro Aota | 7 | -3/+287
If a zoned block device is found, get its zone information (number of zones and zone size). To avoid costly run-time zone report commands to test the device zone types during block allocation, attach the seq_zones bitmap to the device structure to indicate whether a zone is sequential or accepts random writes. Also attach the empty_zones bitmap to indicate whether a zone is empty. This patch also introduces the helper function btrfs_dev_is_sequential() to test if the zone storing a block is a sequential write required zone and btrfs_dev_is_empty_zone() to test if the zone is an empty zone. Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Damien Le Moal <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
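The bitmap lookup described above can be modelled in a few lines. The sketch assumes a power-of-two zone size so that a byte position maps to a zone number with a shift, and it uses Python integers as bitmaps; the class and method names are illustrative and only loosely mirror btrfs_dev_is_sequential() and btrfs_dev_is_empty_zone().

```python
class ZoneInfo:
    """Per-device zone bookkeeping (illustrative model, not kernel code)."""

    def __init__(self, zone_size: int, nr_zones: int):
        if zone_size <= 0 or zone_size & (zone_size - 1):
            raise ValueError("this sketch assumes a power-of-two zone size")
        self.zone_size_shift = zone_size.bit_length() - 1
        self.nr_zones = nr_zones
        self.seq_zones = 0    # bit n set: zone n is sequential write required
        self.empty_zones = 0  # bit n set: zone n is empty

    def zone_number(self, pos: int) -> int:
        """Map a byte position on the device to its zone number."""
        return pos >> self.zone_size_shift

    def is_sequential(self, pos: int) -> bool:
        return bool((self.seq_zones >> self.zone_number(pos)) & 1)

    def is_empty_zone(self, pos: int) -> bool:
        return bool((self.empty_zones >> self.zone_number(pos)) & 1)


# Example: 4 zones of 1 MiB; zone 0 is conventional, zones 1-3 are sequential,
# and zones 2-3 have not been written yet.
info = ZoneInfo(zone_size=1 << 20, nr_zones=4)
info.seq_zones = 0b1110
info.empty_zones = 0b1100
```

Filling the bitmaps once at device scan time is what lets the allocator answer these questions without issuing a zone report command per lookup.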
2020-12-09 | fs/kernfs: remove the double check of dentry->inode | Hui Su | 1 | -2/+1
Both kernfs_node_from_dentry() and kernfs_dentry_node() check whether dentry->inode is NULL, which is superfluous. So remove the check in kernfs_node_from_dentry(). Acked-by: Tejun Heo <[email protected]> Signed-off-by: Hui Su <[email protected]> Link: https://lore.kernel.org/r/20201113132143.GA119541@rlk Signed-off-by: Greg Kroah-Hartman <[email protected]>
2020-12-09 | xfs: don't catch dax+reflink inodes as corruption in verifier | Eric Sandeen | 2 | -8/+0
We don't yet support dax on reflinked files, but that is in the works. Further, having the flag set does not automatically mean that the inode is actually "in the CPU direct access state," which depends on several other conditions in addition to the flag being set. As such, we should not catch this as corruption in the verifier - simply not actually enabling S_DAX on reflinked files is enough for now. Fixes: 4f435ebe7d04 ("xfs: don't mix reflink and DAX mode for now") Signed-off-by: Eric Sandeen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> [darrick: fix the scrubber too] Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-12-09 | xfs: fix the forward progress assertion in xfs_iwalk_run_callbacks | Darrick J. Wong | 1 | -1/+1
In commit 27c14b5daa82 we started tracking the last inode seen during an inode walk to avoid infinite loops if a corrupt inobt record happens to have a lower ir_startino than the record preceding it. Unfortunately, the assertion trips over the case where there are completely empty inobt records (which can happen quite easily on 64k page filesystems) because we advance the tracking cursor without actually putting the empty record into the processing buffer. Fix the assert to allow for this case. Reported-by: [email protected] Fixes: 27c14b5daa82 ("xfs: ensure inobt record walks always make forward progress") Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Zorro Lang <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2020-12-09 | xfs: remove unneeded return value check for *init_cursor() | Joseph Qi | 7 | -46/+0
Since *init_cursor() always returns a valid cursor, the NULL checks in the callers are unneeded. So clean them up. This also keeps the behavior consistent with other callers. Signed-off-by: Joseph Qi <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-12-09 | xfs: introduce xfs_validate_stripe_geometry() | Gao Xiang | 2 | -11/+69
Introduce a common helper to consolidate stripe validation process. Also make kernel code xfs_validate_sb_common() use it first. Signed-off-by: Gao Xiang <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-12-09 | xfs: show the proper user quota options | Kaixu Xia | 1 | -4/+6
The quota option 'usrquota' should be shown if both the XFS_UQUOTA_ACCT and XFS_UQUOTA_ENFD flags are set. The option 'uqnoenforce' should be shown when only the XFS_UQUOTA_ACCT flag is set. The current code logic seems wrong. Fix it and show the proper options. Signed-off-by: Kaixu Xia <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
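The intended mapping from flags to the displayed option can be sketched as below. The numeric flag values are placeholders chosen for this example, and `user_quota_option()` is a hypothetical helper; only the flag names and option strings come from the commit message.

```python
# Placeholder bit values for illustration only.
XFS_UQUOTA_ACCT = 0x1  # user quota accounting is on
XFS_UQUOTA_ENFD = 0x2  # user quota limits are enforced


def user_quota_option(qflags: int):
    """Return the mount option string to show, or None if user quota
    accounting is off."""
    if not qflags & XFS_UQUOTA_ACCT:
        return None
    if qflags & XFS_UQUOTA_ENFD:
        return "usrquota"     # accounting and enforcement
    return "uqnoenforce"      # accounting only
```

Testing the accounting flag first keeps the two options mutually exclusive, which matches the rule stated above.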
2020-12-09 | xfs: remove the unused XFS_B_FSB_OFFSET macro | Kaixu Xia | 1 | -1/+0
There are no callers of the XFS_B_FSB_OFFSET macro, so remove it. Signed-off-by: Kaixu Xia <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-12-09 | xfs: remove unnecessary null check in xfs_generic_create | Kaixu Xia | 1 | -4/+2
The function posix_acl_release() tests the passed-in argument and moves on only when it is non-null, so the null check in xfs_generic_create should be unnecessary. Signed-off-by: Kaixu Xia <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-12-09 | xfs: directly return if the delta equal to zero | Kaixu Xia | 1 | -14/+9
The xfs_trans_mod_dquot() function allocates a new tp->t_dqinfo if it is NULL and makes the changes in tp->t_dqinfo->dqs[XFS_QM_TRANS_{USR,GRP,PRJ}]. Nowadays it seems that none of the callers want to join the dquots to the transaction and push them to the device when the delta is zero. In fact, most of the time the caller checks the delta and goes on only when it is non-zero, so we should bail out when it is zero. Signed-off-by: Kaixu Xia <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-12-09 | xfs: check tp->t_dqinfo value instead of the XFS_TRANS_DQ_DIRTY flag | Kaixu Xia | 3 | -19/+3
Nowadays the only things the XFS_TRANS_DQ_DIRTY flag seems to do are to indicate that the tp->t_dqinfo->dqs[XFS_QM_TRANS_{USR,GRP,PRJ}] values changed and to be checked in xfs_trans_apply_dquot_deltas() and its unreserve variant xfs_trans_unreserve_and_mod_dquots(). We can use the tp->t_dqinfo value instead of the flag: since we allocate tp->t_dqinfo only when the qtrx values change, a non-NULL tp->t_dqinfo is equivalent to the XFS_TRANS_DQ_DIRTY flag being set. We therefore only need to check whether tp->t_dqinfo is NULL in xfs_trans_apply_dquot_deltas() and its unreserve variant to determine whether to lock all of the dquots and join them to the transaction. Signed-off-by: Kaixu Xia <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-12-09 | xfs: delete duplicated tp->t_dqinfo null check and allocation | Kaixu Xia | 1 | -7/+0
The function xfs_trans_mod_dquot_byino() wraps around xfs_trans_mod_dquot() to account for quotas, and also there is the function call chain xfs_trans_reserve_quota_bydquots -> xfs_trans_dqresv -> xfs_trans_mod_dquot, both of them do the duplicated null check and allocation. Thus we can delete the duplicated operation from them. Signed-off-by: Kaixu Xia <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2020-12-09 | xfs: rename xfs_fc_* back to xfs_fs_* | Darrick J. Wong | 1 | -13/+13
Get rid of this one-off namespace since we're done converting things to fscontext now. Suggested-by: Dave Chinner <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2020-12-09 | xfs: refactor file range validation | Darrick J. Wong | 8 | -5/+37
Refactor all the open-coded validation of file block ranges into a single helper, and teach the bmap scrubber to check the ranges. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2020-12-09 | xfs: refactor realtime volume extent validation | Darrick J. Wong | 5 | -20/+23
Refactor all the open-coded validation of realtime device extents into a single helper. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2020-12-09 | xfs: refactor data device extent validation | Darrick J. Wong | 8 | -50/+32
Refactor all the open-coded validation of non-static data device extents into a single helper. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2020-12-09 | xfs: scrub should mark a directory corrupt if any entries cannot be iget'd | Darrick J. Wong | 1 | -3/+18
It's possible that xfs_iget can return EINVAL for inodes that the inobt thinks are free, or ENOENT for inodes that look free. If this is the case, mark the directory corrupt immediately when we check ftype. Note that we already check the ftype of the '.' and '..' entries, so we can skip the iget part since we already know the inode type for '.' and we have a separate parent pointer scrubber for '..'. Fixes: a5c46e5e8912 ("xfs: scrub directory metadata") Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2020-12-09 | xfs: fix parent pointer scrubber bailing out on unallocated inodes | Darrick J. Wong | 1 | -5/+5
xfs_iget can return -ENOENT for a file that the inobt thinks is allocated but has zeroed mode. This currently causes scrub to exit with an operational error instead of flagging this as a corruption. The end result is that scrub mistakenly reports the ENOENT to the user instead of "directory parent pointer corrupt" like we do for EINVAL. Fixes: 5927268f5a04 ("xfs: flag inode corruption if parent ptr doesn't get us a real inode") Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2020-12-09 | xfs: detect overflows in bmbt records | Darrick J. Wong | 1 | -0/+5
Detect file block mappings with a blockcount that's either so large that integer overflows occur or are zero, because neither are valid in the filesystem. Worse yet, attempting directory modifications causes the iext code to trip over the bmbt key handling and takes the filesystem down. We can fix most of this by preventing the bad metadata from entering the incore structures in the first place. Found by setting blockcount=0 in a directory data fork mapping and watching the fireworks. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
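One way to catch both bad cases with a single comparison is to check that the end of the mapping lies strictly after its start. The sketch below assumes unsigned 64-bit arithmetic, modelled by masking, and uses a hypothetical function name; it illustrates the idea and is not a copy of the XFS validator.

```python
U64_MASK = (1 << 64) - 1  # assumption: offsets are unsigned 64-bit values


def mapping_range_is_sane(startoff: int, blockcount: int) -> bool:
    """Reject mappings whose block count is zero or whose end offset
    wraps around in unsigned 64-bit arithmetic."""
    end = (startoff + blockcount) & U64_MASK
    # blockcount == 0 gives end == startoff; an overflow gives end < startoff.
    return end > startoff
```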
2020-12-09 | xfs: trace log intent item recovery failures | Darrick J. Wong | 2 | -1/+22
Add a trace point so that we can capture when a recovered log intent item fails to recover. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2020-12-09 | xfs: validate feature support when recovering rmap/refcount intents | Darrick J. Wong | 2 | -0/+6
The rmap and refcount log intent items were added to support the rmap and reflink features. Because these features come with changes to the ondisk format, the log items aren't tied to a log incompat flag. However, the log recovery routines don't actually check for those feature flags. The kernel has no business replaying an intent item for a feature that isn't enabled, so check that as part of recovered log item validation. (Note that kernels pre-dating rmap and reflink already fail log recovery on the unknown log item type code.) Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2020-12-09 | xfs: improve the code that checks recovered extent-free intent items | Darrick J. Wong | 1 | -8/+7
The code that validates recovered extent-free intent items is kind of a mess -- it doesn't use the standard xfs type validators, and it doesn't check for things that it should. Fix the validator function to use the standard validation helpers and look for more types of obvious errors. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2020-12-09 | xfs: hoist recovered extent-free intent checks out of xfs_efi_item_recover | Darrick J. Wong | 1 | -8/+25
When we recover an extent-free intent from the log, we need to validate its contents before we try to replay them. Hoist the checking code into a separate function in preparation to refactor this code to use validation helpers. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2020-12-09 | xfs: improve the code that checks recovered refcount intent items | Darrick J. Wong | 1 | -12/+11
The code that validates recovered refcount intent items is kind of a mess -- it doesn't use the standard xfs type validators, and it doesn't check for things that it should. Fix the validator function to use the standard validation helpers and look for more types of obvious errors. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2020-12-09 | xfs: hoist recovered refcount intent checks out of xfs_cui_item_recover | Darrick J. Wong | 1 | -21/+38
When we recover a refcount intent from the log, we need to validate its contents before we try to replay them. Hoist the checking code into a separate function in preparation to refactor this code to use validation helpers. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2020-12-09 | xfs: improve the code that checks recovered rmap intent items | Darrick J. Wong | 1 | -12/+18
The code that validates recovered rmap intent items is kind of a mess -- it doesn't use the standard xfs type validators, and it doesn't check for things that it should. Fix the validator function to use the standard validation helpers and look for more types of obvious errors. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2020-12-09 | xfs: hoist recovered rmap intent checks out of xfs_rui_item_recover | Darrick J. Wong | 1 | -25/+42
When we recover an rmap intent from the log, we need to validate its contents before we try to replay them. Hoist the checking code into a separate function in preparation to refactor this code to use validation helpers. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2020-12-09 | xfs: improve the code that checks recovered bmap intent items | Darrick J. Wong | 1 | -13/+13
The code that validates recovered bmap intent items is kind of a mess -- it doesn't use the standard xfs type validators, and it doesn't check for things that it should. Fix the validator function to use the standard validation helpers and look for more types of obvious errors. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2020-12-09 | xfs: hoist recovered bmap intent checks out of xfs_bui_item_recover | Darrick J. Wong | 1 | -27/+47
When we recover a bmap intent from the log, we need to validate its contents before we try to replay them. Hoist the checking code into a separate function in preparation to refactor this code to use validation helpers. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]>
2020-12-09 | xfs: enable the needsrepair feature | Darrick J. Wong | 1 | -1/+2
Make it so that libxfs recognizes the needsrepair feature. Note that the kernel will still refuse to mount these. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Eric Sandeen <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
2020-12-09 | xfs: define a new "needrepair" feature | Darrick J. Wong | 2 | -0/+14
Define an incompat feature flag to indicate that the filesystem needs to be repaired. While libxfs will recognize this feature, the kernel will refuse to mount if the feature flag is set, and only xfs_repair will be able to clear the flag. The goal here is to force the admin to run xfs_repair to completion after upgrading the filesystem, or if we otherwise detect anomalies. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Eric Sandeen <[email protected]>
2020-12-09 | nfsd: Record NFSv4 pre/post-op attributes as non-atomic | Trond Myklebust | 4 | -2/+12
For the case of NFSv4, specify to the client that the pre/post-op attributes were not recorded atomically with the main operation. Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2020-12-09 | nfsd: Set PF_LOCAL_THROTTLE on local filesystems only | Trond Myklebust | 2 | -3/+13
Don't set PF_LOCAL_THROTTLE on remote filesystems like NFS, since they aren't expected to ever be subject to double buffering. Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2020-12-09 | nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE | Trond Myklebust | 1 | -4/+12
If the underlying filesystem times out, then we want knfsd to return NFSERR_JUKEBOX/DELAY rather than NFSERR_STALE. Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2020-12-09 | exportfs: Add a function to return the raw output from fh_to_dentry() | Trond Myklebust | 1 | -8/+24
In order to allow nfsd to accept return values that are not acceptable to overlayfs and others, add a new function. Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2020-12-09 | nfsd: close cached files prior to a REMOVE or RENAME that would replace target | Jeff Layton | 2 | -8/+10
It's not uncommon for some workloads to do a bunch of I/O to a file and delete it just afterward. If knfsd has a cached open file however, then the file may still be open when the dentry is unlinked. If the underlying filesystem is nfs, then that could trigger it to do a sillyrename. On a REMOVE or RENAME scan the nfsd_file cache for open files that correspond to the inode, and proactively unhash and put their references. This should prevent any delete-on-last-close activity from occurring, solely due to knfsd's open file cache. This must be done synchronously though so we use the variants that call flush_delayed_fput. There are deadlock possibilities if you call flush_delayed_fput while holding locks, however. In the case of nfsd_rename, we don't even do the lookups of the dentries to be renamed until we've locked for rename. Once we've figured out what the target dentry is for a rename, check to see whether there are cached open files associated with it. If there are, then unwind all of the locking, close them all, and then reattempt the rename. None of this is really necessary for "typical" filesystems though. It's mostly of use for NFS, so declare a new export op flag and use that to determine whether to close the files beforehand. Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Lance Shelton <[email protected]> Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2020-12-09 | nfsd: allow filesystems to opt out of subtree checking | Jeff Layton | 2 | -1/+7
When we start allowing NFS to be reexported, then we have some problems when it comes to subtree checking. In principle, we could allow it, but it would mean encoding parent info in the filehandles and there may not be enough space for that in a NFSv3 filehandle. To enforce this at export upcall time, we add a new export_ops flag that declares the filesystem ineligible for subtree checking. Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Lance Shelton <[email protected]> Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2020-12-09 | nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations | Jeff Layton | 4 | -3/+21
With NFSv3 nfsd will always attempt to send along WCC data to the client. This generally involves saving off the in-core inode information prior to doing the operation on the given filehandle, and then issuing a vfs_getattr to it after the op. Some filesystems (particularly clustered or networked ones) have an expensive ->getattr inode operation. Atomicity is also often difficult or impossible to guarantee on such filesystems. For those, we're best off not trying to provide WCC information to the client at all, and to simply allow it to poll for that information as needed with a GETATTR RPC. This patch adds a new flags field to struct export_operations, and defines a new EXPORT_OP_NOWCC flag that filesystems can use to indicate that nfsd should not attempt to provide WCC info in NFSv3 replies. It also adds a blurb about the new flags field and flag to the exporting documentation. The server will also now skip collecting this information for NFSv2 as well, since that info is never used there anyway. Note that this patch does not add this flag to any filesystem export_operations structures. This was originally developed to allow reexporting nfs via nfsd. Other filesystems may want to consider enabling this flag too. It's hard to tell however which ones have export operations to enable export via knfsd and which ones mostly rely on them for open-by-filehandle support, so I'm leaving that up to the individual maintainers to decide. I am cc'ing the relevant lists for those filesystems that I think may want to consider adding this though. Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Lance Shelton <[email protected]> Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2020-12-09 | Revert "nfsd4: support change_attr_type attribute" | J. Bruce Fields | 2 | -11/+0
This reverts commit a85857633b04d57f4524cca0a2bfaf87b2543f9f. We're still factoring ctime into our change attribute even in the IS_I_VERSION case. If someone sets the system time backwards, a client could see the change attribute go backwards. Maybe we can just say "well, don't do that", but there's some question whether that's good enough, or whether we need a better guarantee. Also, the client still isn't actually using the attribute. While we're still figuring this out, let's just stop returning this attribute. Signed-off-by: J. Bruce Fields <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2020-12-09 | nfsd4: don't query change attribute in v2/v3 case | J. Bruce Fields | 1 | -5/+9
inode_query_iversion() has side effects, and there's no point calling it when we're not even going to use it. We check whether we're currently processing a v4 request by checking fh_maxsize, which is arguably a little hacky; we could add a flag to svc_fh instead. Signed-off-by: J. Bruce Fields <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2020-12-09 | nfsd: minor nfsd4_change_attribute cleanup | J. Bruce Fields | 1 | -8/+5
Minor cleanup, no change in behavior. Also pull out a common helper that'll be useful elsewhere. Signed-off-by: J. Bruce Fields <[email protected]> Signed-off-by: Chuck Lever <[email protected]>