|
Coverity also complains about the way we calculate the offset
(starting from the address of a 4-byte array within the
header structure rather than from the beginning of the struct
plus 4 bytes) for SMB1 SetFileDisposition (which is used to
unlink a file by setting the delete on close flag). This
changeset doesn't change the address but makes it slightly
clearer.
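To illustrate why the two expressions name the same address, here is a
small self-contained sketch; the struct and field names are made up and
are not the real CIFS headers:

  #include <stddef.h>
  #include <stdio.h>

  struct hdr {
          unsigned int  len;        /* 4 bytes preceding the array     */
          unsigned char proto[4];   /* stand-in for the 4-byte array   */
          unsigned char body[56];
  };

  int main(void)
  {
          struct hdr h;
          /* Old style: start from the address of the array itself.    */
          unsigned char *a = h.proto;
          /* Clearer style: start from the beginning of the struct and
           * add the array's offset (4 bytes here).                    */
          unsigned char *b = (unsigned char *)&h + offsetof(struct hdr, proto);
          printf("%d\n", a == b);   /* prints 1: same address either way */
          return 0;
  }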
Addresses-Coverity: 711524 ("Out of bounds write")
Reviewed-by: Paulo Alcantara (SUSE) <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
Coverity also complains about the way we calculate the offset
(starting from the address of a 4-byte array within the header
structure rather than from the beginning of the struct plus
4 bytes) for setting the file size using SMB1. This changeset
doesn't change the address but makes it slightly clearer.
Addresses-Coverity: 711525 ("Out of bounds write")
Reviewed-by: Paulo Alcantara (SUSE) <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
When syncing the log, if we fail to allocate the root node for the log
root tree:
1) We are unlocking fs_info->tree_log_mutex, but at this point we have
not yet locked this mutex;
2) We have locked fs_info->tree_root->log_mutex, but we end up not
unlocking it;
So fix this by unlocking fs_info->tree_root->log_mutex instead of
fs_info->tree_log_mutex.
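A minimal sketch of the corrected error path (simplified, not the exact
upstream diff):

  if (ret) {
          /* Only tree_root->log_mutex is held at this point;
           * fs_info->tree_log_mutex has not been taken yet, so drop the
           * lock we actually own. */
          mutex_unlock(&fs_info->tree_root->log_mutex);  /* was: &fs_info->tree_log_mutex */
          goto out;
  }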
Fixes: e75f9fd194090e ("btrfs: zoned: move log tree node allocation out of log_root_tree->log_mutex")
CC: [email protected] # 5.13+
Reviewed-by: Nikolay Borisov <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
If we can't acquire the reclaim_bgs_lock on block group reclaim, we
block until it is free. This can potentially stall for a long time.
While reclaim of block groups is necessary for a good user experience on
a zoned file system, there still is no need to block as it is best
effort only, just like when we're deleting unused block groups.
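A minimal sketch of the best-effort variant, assuming reclaim_bgs_lock is
the mutex named above (not the exact upstream code):

  if (!mutex_trylock(&fs_info->reclaim_bgs_lock))
          return;         /* busy: skip this run, reclaim will be retried */

  /* ... relocate the block groups queued for reclaim ... */

  mutex_unlock(&fs_info->reclaim_bgs_lock);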
CC: [email protected] # 5.13
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Damien reported a test failure with btrfs/209. The test itself ran fine,
but the fsck run afterwards reported a corrupted filesystem.
The filesystem corruption happens because we're splitting an extent and
then writing the extent twice. We have to split the extent though, because
we're creating extents that are too large for a REQ_OP_ZONE_APPEND operation.
When dumping the extent tree, we can see two EXTENT_ITEMs at the same
start address but different lengths.
$ btrfs inspect dump-tree /dev/nullb1 -t extent
...
   item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
           refs 1 gen 7 flags DATA
           extent data backref root FS_TREE objectid 257 offset 786432 count 1
   item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
           refs 1 gen 7 flags DATA
           extent data backref root FS_TREE objectid 257 offset 786432 count 1
The duplicated EXTENT_ITEMs originally come from a wrongly split extent_map
in extract_ordered_extent(). Since extract_ordered_extent() uses
create_io_em() to split an existing extent_map, we will have
split->orig_start != split->start. Then, it will be logged with a non-zero
"extent data offset". Finally, the logged entries are replayed into
a duplicated EXTENT_ITEM.
Introduce and use a proper splitting function for extent_map. The function
is intended to be simple and specific to the usage in
extract_ordered_extent(), e.g. it does not support the compression case
(we do not allow splitting a compressed extent_map anyway).
There was a question raised by Qu, in summary why we want to split the
extent map (and not the bio):
The problem is not the limit on the zone end, which as you mention is
the same as the block group end. The problem is that data writes use zone
append (ZA) operations. ZA BIOs cannot be split, so a large extent may
need to be processed with multiple ZA BIOs. While that is also true for
regular writes, the major difference is that ZA operations are "nameless"
writes that give back the written sectors on completion. And ZA
operations may be reordered by the block layer (not intentionally
though). Combine both of these characteristics and you can see that the
data for a large extent may end up being shuffled when written, resulting
in data corruption and making it impossible to map the extent to a single
start sector.
To avoid this problem, zoned btrfs uses the principle "one data extent
== one ZA BIO". So large extents need to be split. This is unfortunate,
but we can revisit this later and optimize, e.g. merge back together the
fragments of an extent once written if they actually were written
sequentially in the zone.
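A hedged sketch of what the dedicated split helper has to get right,
going by the description above (variable names are illustrative):

  /* Each split piece becomes an independent extent written by its own
   * zone-append BIO, so its orig_start must match its start; otherwise
   * the file extent is logged with a non-zero "extent data offset" and
   * replay recreates overlapping EXTENT_ITEMs. */
  split_pre->start      = em->start;
  split_pre->orig_start = split_pre->start;
  split_pre->len        = pre_len;

  split_mid->start      = em->start + pre_len;
  split_mid->orig_start = split_mid->start;
  split_mid->len        = em->len - pre_len;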
Reported-by: Damien Le Moal <[email protected]>
Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent")
CC: [email protected] # 5.12+
CC: Johannes Thumshirn <[email protected]>
Signed-off-by: Naohiro Aota <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by and large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
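A hedged outline of the reworked first phase in pseudo-C; the function
names below are descriptive, not the real btrfs symbols:

  mutex_lock(&fs_info->chunk_mutex);
  check_and_reserve_system_chunk_space();    /* may allocate a new system chunk */
  map = create_chunk_mapping();              /* pick the device extents         */
  update_device_items_in_chunk_btree(map);   /* moved here from phase 2         */
  insert_chunk_item_in_chunk_btree(map);     /* moved here from phase 2         */
  mutex_unlock(&fs_info->chunk_mutex);

  /* Phase 2, at transaction commit time, now only creates the block
   * group items, so no system chunk space stays reserved until then. */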
CC: [email protected] # 5.12+
Tested-by: Shin'ichiro Kawasaki <[email protected]>
Tested-by: Naohiro Aota <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Tested-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
When a task attempting to allocate a new chunk verifies that there is not
currently enough free space in the system space_info and there is another
task that allocated a new system chunk but has not yet finished the
creation of the respective block group, it waits for that other task to
finish creating the block group. This is to avoid exhaustion of the system
chunk array in the superblock, which is limited, when we have a thundering
herd of tasks allocating new chunks. This problem was described and fixed
by commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations").
However there are two very similar scenarios where this can lead to a
deadlock:
1) Task B allocated a new system chunk and task A is waiting on task B
to finish creation of the respective system block group. However before
task B ends its transaction handle and finishes the creation of the
system block group, it attempts to allocate another chunk (like a data
chunk for an fallocate operation for a very large range). Task B will
be unable to progress and allocate the new chunk, because task A set
space_info->chunk_alloc to 1 and therefore it loops at
btrfs_chunk_alloc() waiting for task A to finish its chunk allocation
and set space_info->chunk_alloc to 0, but task A is waiting on task B
to finish creation of the new system block group, therefore resulting
in a deadlock;
2) Task B allocated a new system chunk and task A is waiting on task B to
finish creation of the respective system block group. By the time that
task B enters the final phase of block group allocation, which happens
at btrfs_create_pending_block_groups(), when it modifies the extent
tree, the device tree or the chunk tree to insert the items for some
new block group, it needs to allocate a new chunk, so it ends up at
btrfs_chunk_alloc() and keeps looping there because task A has set
space_info->chunk_alloc to 1, but task A is waiting for task B to
finish creation of the new system block group and release the reserved
system space, therefore resulting in a deadlock.
In short, the problem is if a task B needs to allocate a new chunk after
it previously allocated a new system chunk and if another task A is
currently waiting for task B to complete the allocation of the new system
chunk.
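A hedged sketch of the resulting circular wait, condensed from the two
scenarios above (pseudo-C, not the real code):

  /* Task A, allocating a chunk: */
  space_info->chunk_alloc = 1;
  wait_for_system_block_group_creation();   /* waits on task B */

  /* Task B, needing another chunk before it finishes that creation: */
  while (space_info->chunk_alloc)            /* stays 1 until task A moves on */
          cond_resched();                    /* neither task makes progress   */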
Unfortunately this deadlock scenario introduced by the previous fix for
the system chunk array exhaustion problem does not have a simple and short
fix, and requires a big change to rework the chunk allocation code so that
chunk btree updates are all made in the first phase of chunk allocation.
And since this deadlock regression is being frequently hit on zoned
filesystems and the system chunk array exhaustion problem is triggered
in more extreme cases (originally observed on PowerPC with a node size
of 64K when running the fallocate tests from stress-ng), revert the
changes from that commit. The next patch in the series, with a subject
of "btrfs: rework chunk allocation to avoid exhaustion of the system
chunk array" does the necessary changes to fix the system chunk array
exhaustion problem.
Reported-by: Naohiro Aota <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/
Fixes: eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations")
CC: [email protected] # 5.12+
Tested-by: Shin'ichiro Kawasaki <[email protected]>
Tested-by: Naohiro Aota <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Tested-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
When we're automatically reclaiming a zone, because its zone_unusable
value is above the reclaim threshold, we only log what percentage of
the zone's capacity is used, but not how much of the capacity is
unusable.
Also print the percentage of unusable space in the block group before
we reclaim it.
Example:
BTRFS info (device sdg): reclaiming chunk 230686720 with 13% used 86% unusable
CC: [email protected] # 5.13
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
The types in the calculation of the used percentage in the reclaim
messages are both u64, though bg->length is either 1GiB (non-zoned) or
the zone size in the zoned mode. The upper limit on the zone size is
8GiB, so this could theoretically overflow in the future; right now the
values fit.
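For reference, a hedged sketch of doing the percentage math entirely in
64 bits (the field names are assumptions, not necessarily the real ones):

  #include <linux/math64.h>

  u64 used = bg->used;
  u64 pct  = div64_u64(used * 100, bg->length);  /* used * 100 still fits in
                                                    u64 even for 8GiB zones */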
Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
CC: [email protected] # 5.13
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Otherwise, writeback falls into a loop, flushing dirty inodes forever,
before SBI_CLOSING is set.
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
In error cases the dentry may be NULL.
Before 20798dfe249a, the encoder also checked dentry and
d_really_is_positive(dentry), but that looks like overkill to me--zero
status should be enough to guarantee a positive dentry.
This isn't the first time we've seen an error-case NULL dereference
hidden in the initialization of a local variable in an xdr encoder. But
I went back through the other recent rewrites and didn't spot any
similar bugs.
Reported-by: JianHong Yin <[email protected]>
Reviewed-by: Chuck Lever III <[email protected]>
Fixes: 20798dfe249a ("NFSD: Update the NFSv3 GETACL result encoder...")
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
The double copy of the string is a mistake, plus __assign_str()
uses strlen(), which is wrong to do on a string that isn't
guaranteed to be NUL-terminated.
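An illustrative fragment of the hazard and the safe pattern (the names
here are made up, not the nfsd tracepoint code):

  /* Wrong: the directory entry name carries an explicit length and is
   * not guaranteed to be NUL-terminated, so strlen() may read past it. */
  /* len = strlen(name); */

  /* Right: copy exactly the length the caller handed us, once. */
  memcpy(entry_name, name, namlen);
  entry_name[namlen] = '\0';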
Fixes: 6019ce0742ca ("NFSD: Add a tracepoint to record directory entry encoding")
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
The pointer 'this' is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
When flushing out the unstable file writes as part of a COMMIT call, try
to perform most of the data writes and waits outside the semaphore.
This means that if the client is sending the COMMIT as part of a memory
reclaim operation, then it can continue performing I/O, with contention
for the lock occurring only once the data sync is finished.
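A hedged sketch of the reordering; nf->nf_rwsem is assumed from the
Fixes: commit below and the exact nfsd code differs:

  /* Push out the bulk of the dirty data before taking the semaphore ... */
  err = filemap_write_and_wait_range(file->f_mapping, offset, end);

  /* ... so the lock is only held for the final sync of anything that
   * raced in after the flush above. */
  down_write(&nf->nf_rwsem);
  err2 = vfs_fsync_range(file, offset, end, 0);
  up_write(&nf->nf_rwsem);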
Fixes: 5011af4c698a ("nfsd: Fix stable writes")
Signed-off-by: Trond Myklebust <[email protected]>
Tested-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Add a .h file containing xdr_stream-based XDR helpers common to both
NLMv3 and NLMv4.
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
To enable xdr_stream-based encoding and decoding, create a bespoke
RPC dispatch function for the lockd service.
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
I'm not even sure cl_xprt can change here, but we're getting "suspicious
RCU usage" warnings, and other rpc_peeraddr2str callers are taking the
rcu lock.
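A hedged sketch of the pattern; the surrounding seq_file code and the
client fields are illustrative:

  rcu_read_lock();
  seq_printf(m, "callback address: \"%s\"\n",
             rpc_peeraddr2str(clp->cl_cb_conn.cb_client, RPC_DISPLAY_ADDR));
  rcu_read_unlock();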
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Fix gcc W=1 warning:
fs/nfs_common/grace.c:91: warning: Function parameter or member 'net' not described in 'locks_in_grace'
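The warning goes away once the parameter is described in the kernel-doc
block, e.g. (the wording is a sketch, not necessarily the upstream text):

  /**
   * locks_in_grace - check whether the grace period is in effect
   * @net: network namespace to check the grace period for
   */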
Signed-off-by: ChenXiaoSong <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
'status' is overwritten to 0 after nfsd4_ssc_setup_dul(), which causes
0 to be returned in the vfs_kern_mount() error case. Fix it to return
nfserr_nodev on this error.
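A hedged sketch of the corrected error path (the label and mount flags
are illustrative):

  ss_mnt = vfs_kern_mount(type, 0, dev_name, raw_data);
  if (IS_ERR(ss_mnt)) {
          /* don't leak the 0 left over from nfsd4_ssc_setup_dul() */
          status = nfserr_nodev;
          goto out_free_devname;
  }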
Fixes: f4e44b393389 ("NFSD: delay unmount source's export after inter-server copy completed.")
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Wei Yongjun <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Fix by initializing the nfsd4_ssc_umount_item pointer with NULL instead
of 0, and change the return type of nfsd4_ssc_setup_dul() from int to
__be32.
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Dai Ngo <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
In addition to the client's address, display the callback channel
state and address in the 'info' file.
Signed-off-by: Dave Wysochanski <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
This was causing a "sleeping function called from invalid context"
warning.
I don't think we need the test_and_set_bit() here; clients move from
unconfirmed to confirmed only once, under the client_lock.
The (conf == unconf) check is a way to tell whether we're in that
confirming case; hopefully that's not too obscure.
Fixes: 472d155a0631 ("nfsd: report client confirmation status in "info" file")
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Fixes for virtiofs submounts
- Misc fixes and cleanups
* tag 'fuse-update-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
virtiofs: Fix spelling mistakes
fuse: use DIV_ROUND_UP helper macro for calculations
fuse: fix illegal access to inode with reused nodeid
fuse: allow fallocate(FALLOC_FL_ZERO_RANGE)
fuse: Make fuse_fill_super_submount() static
fuse: Switch to fc_mount() for submounts
fuse: Call vfs_get_tree() for submounts
fuse: add dedicated filesystem context ops for submounts
virtiofs: propagate sync() to file server
fuse: reject internal errno
fuse: check connected before queueing on fpq->io
fuse: ignore PG_workingset after stealing
fuse: Fix infinite loop in sget_fc()
fuse: Fix crash if superblock of submount gets killed early
fuse: Fix crash in fuse_dentry_automount() error path
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"A read-ahead adjustment and a fix.
The readahead adjustment was suggested by Matthew Wilcox and looks
like how I should have written it in the first place... the "df fix"
was suggested by Walt Ligon, some Orangefs users have been complaining
about whacky df output..."
* tag 'for-linus-5.14-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: fix orangefs df output.
orangefs: readahead adjustment
|