|
Perusal of /sys/kernel/debug/bcachefs/*/btree_transaction_stats shows
that the read path has been accumulating unneeded paths on the reflink
btree, which we don't want.
The solution is to call bch2_trans_begin(), which drops paths not used
on the previous loop iteration.
bch2_readahead:
Max mem used: 0
Transaction duration:
count: 194235
since mount recent
duration of events
min: 150 ns
max: 9 ms
total: 838 ms
mean: 4 us 6 us
stddev: 34 us 7 us
time between events
min: 10 ns
max: 15 h
mean: 2 s 12 s
stddev: 2 s 3 ms
Maximum allocated btree paths (193):
path: idx 2 ref 0:0 P btree=extents l=0 pos 270943112:392:U32_MAX locks 0
path: idx 3 ref 1:0 S btree=extents l=0 pos 270943112:24578:U32_MAX locks 1
path: idx 4 ref 0:0 P btree=reflink l=0 pos 0:24773509:0 locks 0
path: idx 5 ref 0:0 P S btree=reflink l=0 pos 0:24773631:0 locks 1
path: idx 6 ref 0:0 P S btree=reflink l=0 pos 0:24773759:0 locks 1
path: idx 7 ref 0:0 P S btree=reflink l=0 pos 0:24773887:0 locks 1
path: idx 8 ref 0:0 P S btree=reflink l=0 pos 0:24774015:0 locks 1
path: idx 9 ref 0:0 P S btree=reflink l=0 pos 0:24774143:0 locks 1
path: idx 10 ref 0:0 P S btree=reflink l=0 pos 0:24774271:0 locks 1
<many more reflink paths>
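The effect can be illustrated with a toy userspace model. This is not
bcachefs code: toy_trans, toy_path_get() and toy_trans_begin() are
hypothetical names, and "drop every path the previous iteration did not
touch" is a simplification of what bch2_trans_begin() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PATHS 16	/* hypothetical fixed table size */

struct toy_path {
	bool		allocated;
	bool		used;	/* touched since the last toy_trans_begin() */
	unsigned long	pos;
};

struct toy_trans {
	struct toy_path	paths[TOY_MAX_PATHS];
};

/* Find or allocate a path for pos; returns its index, or -1 if the table is full. */
static int toy_path_get(struct toy_trans *trans, unsigned long pos)
{
	int free_idx = -1;

	assert(trans);
	for (int i = 0; i < TOY_MAX_PATHS; i++) {
		struct toy_path *p = &trans->paths[i];

		if (p->allocated && p->pos == pos) {
			p->used = true;
			return i;
		}
		if (!p->allocated && free_idx < 0)
			free_idx = i;
	}
	if (free_idx >= 0)
		trans->paths[free_idx] = (struct toy_path) { true, true, pos };
	return free_idx;
}

/* Drop paths the previous iteration did not touch, then clear the marks. */
static void toy_trans_begin(struct toy_trans *trans)
{
	assert(trans);
	for (int i = 0; i < TOY_MAX_PATHS; i++) {
		struct toy_path *p = &trans->paths[i];

		if (p->allocated && !p->used)
			p->allocated = false;
		p->used = false;
	}
}

/* Number of paths currently held by the transaction. */
static size_t toy_nr_paths(const struct toy_trans *trans)
{
	size_t nr = 0;

	for (int i = 0; i < TOY_MAX_PATHS; i++)
		nr += trans->paths[i].allocated;
	return nr;
}
```

In this model, a loop that calls toy_trans_begin() at the top of every
iteration holds a bounded number of paths, while a loop that never calls
it keeps one path per position visited.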
Signed-off-by: Kent Overstreet <[email protected]>
|
|
If the label is not set, the Label tag in the superblock info shows '(none)'.
```
[Before]
Device index: 0
Label:
Version: 1.4: member_seq
[After]
Device index: 0
Label: (none)
Version: 1.4: member_seq
```
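A minimal sketch of the fallback, assuming the label is available as a
C string; label_or_none() and format_label_line() are hypothetical
helpers, not the functions touched by this patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Text to print for a label: missing or empty labels become "(none)". */
static const char *label_or_none(const char *label)
{
	return (label && label[0]) ? label : "(none)";
}

/* Format one "Label:" line into buf; returns what snprintf() returns. */
static int format_label_line(char *buf, size_t len, const char *label)
{
	assert(buf && len);
	return snprintf(buf, len, "Label: %s", label_or_none(label));
}
```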
Signed-off-by: Hongbo Li <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Unnecessary here, and this broke the rust bindings:
error[E0588]: packed type cannot transitively contain a `#[repr(align)]` type
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:29025:1
|
29025 | pub struct bkey_i_inode_v3 {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^
|
note: `bch_inode_v3` has a `#[repr(align)]` attribute
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:8949:1
|
8949 | pub struct bch_inode_v3 {
| ^^^^^^^^^^^^^^^^^^^^^^^
error[E0588]: packed type cannot transitively contain a `#[repr(align)]` type
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:32826:1
|
32826 | pub struct bkey_inode_buf {
| ^^^^^^^^^^^^^^^^^^^^^^^^^
|
note: `bch_inode_v3` has a `#[repr(align)]` attribute
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:8949:1
|
8949 | pub struct bch_inode_v3 {
| ^^^^^^^^^^^^^^^^^^^^^^^
note: `bkey_inode_buf` contains a field of type `bkey_i_inode_v3`
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:32827:9
|
32827 | pub inode: bkey_i_inode_v3,
| ^^^^^
note: ...which contains a field of type `bch_inode_v3`
--> /build/source/target/release/build/bch_bindgen-9445b24c90aca2a3/out/bcachefs.rs:29027:9
|
29027 | pub v: bch_inode_v3,
| ^
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Highly damaged filesystems, or filesystems that have been damaged,
repaired, and damaged again, may have sequence numbers we can't fully
trust - which in itself is something we need to debug.
Add a journal_seq fallback so that repair doesn't get stuck.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
This adds lockdep tracking for held btree locks with a single dep_map in
btree_trans, i.e. tracking all held btree locks as one object.
This is more practical and more useful than having lockdep track held
btree locks individually, because
- we can take more locks than lockdep can track (unbounded, now that we
have dynamically resizable btree paths)
- there's no lock ordering between btree locks for lockdep to track (we
do cycle detection)
- and this makes it easy to teach lockdep that btree locks are not safe
to hold while invoking memory reclaim.
The last rule is one that lockdep would never learn, because we only do
trylock() from within shrinkers - but we very much do not want to be
invoking memory reclaim while holding btree node locks.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Add a new helper to disable lockdep tracking entirely for a given class.
This is needed for bcachefs, which takes too many btree node locks for
lockdep to track. Instead, we have a single lockdep_map for "btree_trans
has any btree nodes locked", which makes more sense given that we have
centralized lock management and a cycle detector.
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
|
|
printing the raw values can occasionally be very useful
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Eliminate possible integer truncation bugs on 32 bit
Signed-off-by: Kent Overstreet <[email protected]>
|
|
We're not always mounting when we start the filesystem
Signed-off-by: Kent Overstreet <[email protected]>
|
|
This repurposes the promote path, which already knows how to call
data_update() after a read: we now automatically rewrite bad data when
we get a read error and then successfully retry from a different
replica.
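The read-then-repair behaviour can be sketched with a toy model. Nothing
here is bcachefs code: toy_replica and toy_read_and_heal() are
hypothetical, a "bad" flag stands in for a read error, and overwriting
in place stands in for the data update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_replica {
	int	data;
	bool	bad;	/* reading this copy returns an error */
};

/*
 * Try replicas in order. On the first good copy, return its data and
 * rewrite every replica that failed before it. Replicas after the good
 * one are never read, so they are left alone.
 * Returns 0 on success, -1 if every replica failed.
 */
static int toy_read_and_heal(struct toy_replica *r, size_t nr, int *out)
{
	assert(r && out);
	for (size_t i = 0; i < nr; i++) {
		if (r[i].bad)
			continue;

		*out = r[i].data;
		for (size_t j = 0; j < i; j++) {
			r[j].data = r[i].data;
			r[j].bad = false;
		}
		return 0;
	}
	return -1;
}
```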
Signed-off-by: Kent Overstreet <[email protected]>
|
|
fsck passes read_only as a mount option, and it's required for
nochanges, which it also uses.
Usually read_only is handled by the VFS, but we need to be able to
handle it too; we just don't want to print it out twice, so mark it as a
hidden option.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Don't allocate the new bkey_cached until after we've done the btree
lookup; this means we can kill bkey_cached.valid.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Signed-off-by: Kent Overstreet <[email protected]>
|
|
This fixes an accounting mismatch for cached data.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
This adds a new helper, bch2_folio_reservation_get_partial(), which
reserves as many blocks as possible and may return partial success.
__bch2_buffered_write() is switched to the new helper - this fixes
fstests generic/275, the write until -ENOSPC test.
generic/230 now fails: this appears to be a test bug, where xfs_io isn't
looping after a partial write to get the error code.
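The difference between all-or-nothing and partial reservation can be
sketched as follows. This is a toy model: toy_fs, toy_reserve() and
toy_reserve_partial() are hypothetical and do not mirror the signature
of bch2_folio_reservation_get_partial().

```c
#include <assert.h>
#include <stdint.h>

struct toy_fs {
	uint64_t free_blocks;
};

/* All-or-nothing: reserve want blocks, or fail without reserving any. */
static int toy_reserve(struct toy_fs *fs, uint64_t want)
{
	assert(fs);
	if (want > fs->free_blocks)
		return -1;
	fs->free_blocks -= want;
	return 0;
}

/* Partial: reserve as many blocks as possible; returns the number reserved. */
static uint64_t toy_reserve_partial(struct toy_fs *fs, uint64_t want)
{
	uint64_t got;

	assert(fs);
	got = want < fs->free_blocks ? want : fs->free_blocks;
	fs->free_blocks -= got;
	return got;
}
```

With three free blocks and a request for eight, toy_reserve() fails
outright, while toy_reserve_partial() hands back three, so the caller
can complete a short write and only see an error on the next attempt.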
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Add support for STATX_DIOALIGN to bcachefs, so that direct I/O alignment
restrictions are exposed to userspace in a generic way.
[Before]
```
./statx_test /mnt/bcachefs/test
statx(/mnt/bcachefs/test) = 0
dio mem align:0
dio offset align:0
```
[After]
```
./statx_test /mnt/bcachefs/test
statx(/mnt/bcachefs/test) = 0
dio mem align:1
dio offset align:512
```
Signed-off-by: Hongbo Li <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Signed-off-by: Kent Overstreet <[email protected]>
|
|
As part of improving btree key cache coherency, the bkey_cached.valid
flag is going away.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Set the preferred folio order in the fgp_flags by calling
fgf_set_order(). Page cache will try to allocate large folio of the
preferred order whenever possible instead of allocating multiple 0 order
folios.
This improves the buffered write performance up to 1.25x with default
mount options and up to 1.57x when mounted with no_data_io option with
the following fio workload:
fio --name=bcachefs --filename=/mnt/test --size=100G \
--ioengine=io_uring --iodepth=16 --rw=write --bs=128k
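As a rough userspace illustration of how a preferred order follows from
the write size: the helpers below are hypothetical, assume 4 KiB pages,
and only compute the order. The real fgf_set_order() returns it encoded
into the fgp flags, which is not modelled here.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* floor(log2(n)) for n > 0 */
static unsigned toy_ilog2(size_t n)
{
	unsigned r = 0;

	assert(n > 0);
	while (n >>= 1)
		r++;
	return r;
}

/* Preferred folio order for a write of size bytes; 0 for one page or less. */
static unsigned toy_preferred_order(size_t size)
{
	unsigned shift = toy_ilog2(size);

	return shift <= TOY_PAGE_SHIFT ? 0 : shift - TOY_PAGE_SHIFT;
}
```

For the 128k block size used in the fio workload, this model gives
order 5, i.e. one folio of 32 pages instead of 32 order-0 folios.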
Signed-off-by: Pankaj Raghav <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Use FGP_WRITEBEGIN to avoid repeating the individual FGP flags before
starting a buffered write.
Signed-off-by: Pankaj Raghav <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
|
|
gc_lock is now only for synchronization between check_alloc_info and
interior btree updates - nothing else
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Signed-off-by: Brian Foster <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
|
|
this is an internal implementation detail - and we're improving key
cache coherency
Signed-off-by: Kent Overstreet <[email protected]>
|
|
new helper - small refactoring
Signed-off-by: Kent Overstreet <[email protected]>
|
|
we have a separate helper for releasing write locks
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Signed-off-by: Kent Overstreet <[email protected]>
|
|
gc_pos is now based on keys, not nodes, for invariance w.r.t. splits
and merges
Signed-off-by: Kent Overstreet <[email protected]>
|
|
The transaction commit path takes mark_lock, so we shouldn't be holding
it; use a bpos as an iterator so that we can drop and retake.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Add a new helper to free zeroed out accounting entries, and use it in
bch2_replicas_gc2(). bch2_replicas_gc2() was killing superblock replicas
entries if their corresponding accounting counters were zero, but
that's incorrect: the superblock replicas entry needs to exist if the
accounting entry exists, not if it's nonzero, because we check and
create the replicas entry when creating the new accounting entry - we
don't know when it's becoming nonzero.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Break up the percpu counter allocations into individual allocations for
each disk accounting counter; this fixes an issue on large systems where
we have too many replica entries for the percpu allocator's max
practical size.
Also, use just one eytzinger tree for the normal set of counters and the
gc counters; this simplifies accounting_gc_done() where we need the same
set of counters to be present in both tables.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
smatch warns that the copy of arg to userspace is a potential data
leak by virtue of arg.pad not being checked or zeroed. This was
introduced by the commit referenced below that switched arg from
being a zeroed runtime allocation to living on the stack. Fix by
simply zero initializing the structure.
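A minimal sketch of the pattern, using a hypothetical toy_usage_arg
structure rather than the real ioctl argument; memcpy() stands in for
the copy to userspace.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical ioctl argument with an explicit pad field. */
struct toy_usage_arg {
	uint32_t nr_entries;
	uint32_t pad;
	uint64_t capacity;
};

static void toy_fill_usage(struct toy_usage_arg *out, uint64_t capacity)
{
	/*
	 * Zero initialize: without this, arg.pad would hold whatever
	 * was on the stack and be copied out along with the rest.
	 */
	struct toy_usage_arg arg = { 0 };

	assert(out);
	arg.nr_entries = 1;
	arg.capacity = capacity;

	memcpy(out, &arg, sizeof(arg));	/* stands in for copy_to_user() */
}
```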
Fixes: cde738a61e65 ("bcachefs: Convert bch2_ioctl_fs_usage() to new accounting")
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Brian Foster <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
|
|
bch2_accounting_mem_insert() drops and retakes mark_lock; thus, we need
to check if the entry in question has already been inserted.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
The commit 65bd44239727 ("bcachefs: bch2_btree_insert_trans() no longer
specifies BTREE_ITER_cached") removes BTREE_ITER_cached from
bch2_btree_insert_trans, which causes the update_inode function from
bcachefs-tools to take a long time (~20s). Add an iter_flags parameter
to bch2_btree_insert, so the users can specify iter update trigger
flags, such as BTREE_ITER_cached.
Signed-off-by: Ariel Miculas <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Add a new ioctl that can return the new accounting counter types; it
takes as input a bitmask of accounting types to return.
This will be used for returning e.g. compression accounting and
rebalance_work accounting.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
By removing the early-exit when REMAP_FILE_DEDUP is set, we should be
able to support the FIDEDUPERANGE ioctl, albeit less efficiently than if
we handled some of the extent locking and comparison logic inside
bcachefs. Extent comparison logic already exists inside of
`__generic_remap_file_range_prep`.
Signed-off-by: Reed Riley <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Implement support for FS_IOC_SETFSLABEL ioctl to set filesystem
label.
Signed-off-by: Hongbo Li <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Implement support for FS_IOC_GETFSLABEL ioctl to read filesystem
label.
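From userspace, the label can be read roughly like this. The sketch
assumes a Linux system whose <linux/fs.h> defines FS_IOC_GETFSLABEL and
FSLABEL_MAX; get_fs_label() is a hypothetical wrapper.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Read the label of the filesystem containing path into label.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int get_fs_label(const char *path, char label[FSLABEL_MAX])
{
	int fd, ret, saved;

	assert(path && label);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, FS_IOC_GETFSLABEL, label);
	saved = errno;
	close(fd);	/* may clobber errno, so restore it below */
	errno = saved;

	return ret < 0 ? -1 : 0;
}
```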
Signed-off-by: Hongbo Li <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Add the FS_IOC_GETVERSION ioctl for getting i_generation from the
inode; with it, users can list a file's generation number by using
"lsattr".
Signed-off-by: Hongbo Li <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
|
|
We can't hold locks while waiting for user input, that's a deadlock.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Add trace_bch2_sync_fs() and trace_bch2_fsync() implementations.
The output in trace is as follows:
sync-29779 [000] ..... 193.700935: bch2_sync_fs: dev 254,16 wait 1
<...>-40027 [002] ..... 342.535227: bch2_fsync: dev 254,32 ino 4099 parent 4096 datasync 1
Signed-off-by: Youling Tang <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
|
|
We are already using mapping_set_error() in bch2_writepage_io_done(), so all
we need to do is to use file_check_and_advance_wb_err() when handling
fsync() requests in bch2_fsync().
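A simplified model of why this pairing is enough: writeback records the
error once on the mapping, and each open file keeps its own cursor, so
each file sees a given error exactly once. The toy_* names are
hypothetical and the kernel's errseq_t machinery is more involved than
this counter.

```c
#include <assert.h>

struct toy_mapping {
	int		err;	/* last writeback error, negative errno */
	unsigned	seq;	/* bumped every time an error is recorded */
};

struct toy_file {
	struct toy_mapping	*mapping;
	unsigned		seen;	/* value of seq at our last check */
};

/* Writeback side: record an error against the mapping. */
static void toy_mapping_set_error(struct toy_mapping *m, int err)
{
	assert(m && err < 0);
	m->err = err;
	m->seq++;
}

/* fsync side: report an error not yet seen by this file, then advance. */
static int toy_file_check_and_advance(struct toy_file *f)
{
	assert(f && f->mapping);
	if (f->seen == f->mapping->seq)
		return 0;
	f->seen = f->mapping->seq;
	return f->mapping->err;
}
```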
Signed-off-by: Youling Tang <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Commit 0c0cbfdb84725e9933a24ecf47c61bdeeda06ba2 dropped the ctx->pos
update before the call to dir_emit. This breaks the userspace
implementation, causing the directory reads to be stuck in an infinite
loop. This doesn't happen in the kernel because the VFS handles the
updates to ctx->pos, but in the FUSE implementation nobody updates
it.
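A toy model of the contract: the caller's buffer holds only a few
entries per call, and the only state carried between calls is ctx->pos.
toy_dir_ctx, toy_dir_emit() and toy_readdir() are hypothetical and
much simpler than the real dir_context interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_dir_ctx {
	size_t pos;	/* where the next call should resume */
	size_t room;	/* entries the caller's buffer can still take */
};

/* Hand one entry to the caller; false means the buffer is full. */
static bool toy_dir_emit(struct toy_dir_ctx *ctx, const char *name)
{
	(void) name;
	if (!ctx->room)
		return false;
	ctx->room--;
	return true;
}

/* Emit entries starting at ctx->pos; returns how many were emitted. */
static size_t toy_readdir(const char *const *names, size_t nr,
			  struct toy_dir_ctx *ctx)
{
	size_t emitted = 0;

	assert(names && ctx);
	for (size_t i = ctx->pos; i < nr; i++) {
		/* Record the position before emitting, so a full buffer resumes here. */
		ctx->pos = i;
		if (!toy_dir_emit(ctx, names[i]))
			return emitted;
		emitted++;
	}
	ctx->pos = nr;
	return emitted;
}
```

If the ctx->pos assignments are removed, every call restarts from the
same position and a caller looping until the end of the directory never
terminates.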
Signed-off-by: Ariel Miculas <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Signed-off-by: Kent Overstreet <[email protected]>
|
|
We now read the line from the buffer atomically, which means we have to
allow the buffer to grow past STDIO_REDIRECT_BUFSIZE if we're waiting
for a full line - this behaviour is necessary for
stdio_redirect_readline_timeout() in the next patch.
Signed-off-by: Kent Overstreet <[email protected]>
|