Age | Commit message | Author | Files | Lines
2024-07-14 | bcachefs: bch2_fs_get_tree() cleanup | Kent Overstreet | 1 | -30/+29
- improve error paths
- call bch2_fs_start() separately, after applying late-parsed options
Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Kill bch2_mount() | Kent Overstreet | 1 | -31/+15
Fold into bch2_fs_get_tree() Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Eytzinger accumulation for accounting keys | Kent Overstreet | 4 | -8/+107
The btree write buffer takes as input keys from the journal, sorts them, deduplicates them, and flushes them back to the btree in sorted order. The disk space accounting rewrite is moving accounting to normal btree keys, with update (in this case deltas) accumulated in the write buffer and then flushed to the btree; but this is going to increase the number of keys handled by the write buffer by perhaps as much as a factor of 3x-5x. The overhead from copying around and sorting this many keys would cause a significant performance regression, but: there is huge locality in updates to accounting keys that we can take advantage of. Instead of appending accounting keys to the list of keys to be sorted, this patch adds an eytzinger search tree of recently seen accounting keys. We look up the accounting key in the eytzinger search tree and apply the delta directly, adding it if it doesn't exist, and periodically prune the eytzinger tree of unused entries. Signed-off-by: Kent Overstreet <[email protected]>
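The lookup-and-accumulate idea described above can be sketched in userspace C. This is an illustration only, not the kernel code: `struct acct_entry`, `eytzinger_build()`, `eytzinger_find()` and `acct_accumulate()` are hypothetical names, keys are reduced to a single `uint64_t`, and both adding missing keys and pruning unused entries are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: a key position plus a running counter. */
struct acct_entry {
	uint64_t pos;
	int64_t  counter;
};

/*
 * Fill tree[1..n] in Eytzinger (breadth-first) order from sorted[0..n-1].
 * Children of node i live at 2i and 2i+1; an in-order walk of that implicit
 * tree visits the slots in ascending key order. Returns the number of
 * entries consumed from sorted[].
 */
static size_t eytzinger_build(const struct acct_entry *sorted,
			      struct acct_entry *tree,
			      size_t n, size_t next, size_t i)
{
	if (i <= n) {
		next = eytzinger_build(sorted, tree, n, next, 2 * i);
		tree[i] = sorted[next++];
		next = eytzinger_build(sorted, tree, n, next, 2 * i + 1);
	}
	return next;
}

/* Descend from the root; returns NULL if pos is not in the tree. */
static struct acct_entry *eytzinger_find(struct acct_entry *tree, size_t n,
					 uint64_t pos)
{
	size_t i = 1;

	while (i <= n) {
		if (tree[i].pos == pos)
			return &tree[i];
		i = 2 * i + (pos > tree[i].pos);
	}
	return NULL;
}

/* Apply a delta in place instead of queueing another key to be sorted. */
static int acct_accumulate(struct acct_entry *tree, size_t n,
			   uint64_t pos, int64_t delta)
{
	struct acct_entry *e = eytzinger_find(tree, n, pos);

	assert(tree != NULL);
	if (!e)
		return -1;	/* a real implementation would add the key here */
	e->counter += delta;
	return 0;
}
```

Because repeated updates to the same key land in the same slot, the number of entries that later need sorting and flushing stays proportional to the number of distinct keys rather than the number of updates.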
2024-07-14 | bcachefs: bch_acct_rebalance_work | Kent Overstreet | 2 | -1/+11
Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: bch_acct_btree | Kent Overstreet | 3 | -1/+21
Add counters for how much disk space we're using per btree. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: bch_acct_snapshot | Kent Overstreet | 3 | -1/+20
Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: bch2_fs_usage_base_to_text() | Kent Overstreet | 1 | -0/+19
Helper to show raw accounting in sysfs, mainly for debugging. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: bch2_fs_accounting_to_text() | Kent Overstreet | 3 | -0/+32
Helper to show raw accounting in sysfs, mainly for debugging. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Convert bch2_compression_stats_to_text() to new accounting | Kent Overstreet | 1 | -67/+19
We no longer have to walk the whole btree to calculate compression stats. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: bch_acct_compression | Kent Overstreet | 3 | -11/+58
This adds per-compression-type accounting of compressed and uncompressed size as well as number of extents - meaning we can now see compression ratio (without walking the whole filesystem). Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: bch2_verify_accounting_clean() | Kent Overstreet | 3 | -1/+91
Verify that the in-memory accounting matches the on-disk accounting after a clean shutdown. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Convert bch2_replicas_gc2() to new accounting | Kent Overstreet | 1 | -11/+15
bch2_replicas_gc2() is used for garbage collecting superblock replicas entries that are empty - this converts it to the new accounting scheme. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Convert gc to new accounting | Kent Overstreet | 17 | -576/+343
Rewrite fsck/gc for the new accounting scheme. This adds a second set of in-memory accounting counters for gc to use; like with other parts of gc we run all triggers in TRIGGER_GC mode, then compare what we calculated to existing in-memory accounting at the end. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Kill replicas_journal_res | Kent Overstreet | 3 | -57/+0
More dead code deletion Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Kill fs_usage_online | Kent Overstreet | 3 | -27/+0
More dead code deletion. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Kill bch2_fs_usage_to_text() | Kent Overstreet | 2 | -42/+0
Dead code. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Delete journal-buf-sharded old style accounting | Kent Overstreet | 9 | -147/+21
More deletion of dead code. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Kill writing old accounting to journal | Kent Overstreet | 1 | -45/+0
More ripping out of the old disk space accounting. Note that the new disk space accounting is incompatible with the old, and writing out old style disk space accounting with the new code is infeasible. This means upgrading and downgrading past this version requires regenerating accounting. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: kill bch2_fs_usage_read() | Kent Overstreet | 7 | -63/+13
With bch2_ioctl_fs_usage() converted to the new accounting, this is now dead code. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Convert bch2_ioctl_fs_usage() to new accounting | Kent Overstreet | 1 | -49/+19
This converts bch2_ioctl_fs_usage() to read from the new disk accounting, via bch2_fs_replicas_usage_read(). Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Kill bch2_fs_usage_initialize() | Kent Overstreet | 3 | -33/+0
Deleting code for the old disk accounting scheme. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: dev_usage updated by new accounting | Kent Overstreet | 8 | -70/+36
Reading disk accounting now requires an eytzinger lookup (see: bch2_accounting_mem_read()), but the per-device counters are used frequently enough that we'd like to still be able to read them with just a percpu sum, as in the old code. This patch special cases the device counters; when we update in-memory accounting we also update the old style percpu counters if it's a device counter update. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Coalesce accounting keys before journal replay | Kent Overstreet | 2 | -0/+20
This fixes a performance regression in journal replay; without coalescing accounting keys we have multiple keys at the same position, which means journal_keys_peek_upto() has to skip past many overwritten keys - turning journal replay into an O(n^2) algorithm. Signed-off-by: Kent Overstreet <[email protected]>
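Coalescing as described above amounts to a single pass over keys that are already sorted by position, summing deltas that share a position. The sketch below is a simplified stand-in, not the kernel function: `struct delta_key` and `coalesce_deltas()` are hypothetical, and a key is reduced to one position and one delta.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified journal key: one position, one delta. */
struct delta_key {
	uint64_t pos;
	int64_t  delta;
};

/*
 * keys[] must already be sorted by pos. Merges runs of equal positions in
 * place by summing their deltas and returns the new number of keys.
 */
static size_t coalesce_deltas(struct delta_key *keys, size_t nr)
{
	size_t dst = 0;

	assert(keys != NULL || nr == 0);
	if (!nr)
		return 0;

	for (size_t src = 1; src < nr; src++) {
		if (keys[src].pos == keys[dst].pos)
			keys[dst].delta += keys[src].delta;
		else
			keys[++dst] = keys[src];
	}
	return dst + 1;
}
```

After this pass each position appears once, so a later lookup never has to step over a run of stale keys at the same position.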
2024-07-14 | bcachefs: Disk space accounting rewrite | Kent Overstreet | 25 | -423/+795
Main part of the disk accounting rewrite. This is a wholesale rewrite of the existing disk space accounting, which relies on percpu counters that are sharded by journal buffer, and rolled up and added to each journal write. With the new scheme, every set of counters is a distinct key in the accounting btree; this fixes scaling limitations of the old scheme, where counters took up space in each journal entry and required multiple percpu counters. Now, in memory accounting requires a single set of percpu counters - not multiple for each in flight journal buffer - and in the future we'll probably also have counters that don't use in memory percpu counters; they're not strictly required. An accounting update is now a normal btree update, using the btree write buffer path. At transaction commit time, we apply accounting updates to the in memory counters, which are percpu counters indexed in an eytzinger tree by the accounting key. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: btree write buffer knows how to accumulate bch_accounting keys | Kent Overstreet | 3 | -9/+79
Teach the btree write buffer how to accumulate accounting keys - instead of having the newer key overwrite the older key as we do with other updates, we need to add them together. Also, add a flag so that write buffer flush knows when journal replay is finished flushing accounting, and teach it to hold accounting keys until that flag is set. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Accumulate accounting keys in journal replay | Kent Overstreet | 5 | -24/+126
Until accounting keys hit the btree, they are deltas, not new versions of the existing key; this means we have to teach journal replay to accumulate them. Additionally, the journal doesn't track precisely which entries have been flushed to the btree; it only tracks a range of entries that may possibly still need to be flushed. That means we need to compare accounting keys against the version in the btree and only flush updates that are newer. There's another wrinkle with the write buffer: if the write buffer starts flushing accounting keys before journal replay has finished flushing accounting keys, journal replay will see the version number from the new updates and updates from the journal will be lost. To avoid this, journal replay has to flush accounting keys first, and we'll be adding a flag so that write buffer flush knows to hold accounting keys until then. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: KEY_TYPE_accounting | Kent Overstreet | 8 | -49/+303
New key type for the disk space accounting rewrite.
- Holds a variable sized array of u64s (may be more than one for accounting e.g. compressed and uncompressed size, or buckets and sectors for a given data type)
- Updates are deltas, not new versions of the key: this means updates to accounting can happen via the btree write buffer, which we'll be teaching to accumulate deltas.
Signed-off-by: Kent Overstreet <[email protected]>
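A rough sketch of what "updates are deltas" means for a value holding several counters: two values are combined element by element rather than one replacing the other. `struct acct_val`, `ACCT_MAX_COUNTERS` and `acct_val_accumulate()` are hypothetical, and the fixed-size array is a simplification of the variable sized array mentioned above.

```c
#include <assert.h>
#include <stdint.h>

#define ACCT_MAX_COUNTERS 4	/* hypothetical upper bound for this sketch */

/* Hypothetical value: nr counters in use; unused slots must be zero. */
struct acct_val {
	unsigned nr;
	int64_t  d[ACCT_MAX_COUNTERS];
};

/* Add src's counters into dst, e.g. {sectors, buckets} deltas. */
static void acct_val_accumulate(struct acct_val *dst,
				const struct acct_val *src)
{
	assert(src->nr <= ACCT_MAX_COUNTERS);

	for (unsigned i = 0; i < src->nr; i++)
		dst->d[i] += src->d[i];
	if (src->nr > dst->nr)
		dst->nr = src->nr;
}
```

Because addition is commutative, deltas for the same key can be merged in any order before they reach the btree, which is what makes the write buffer path usable for them.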
2024-07-14 | bcachefs: use new mount API | Thomas Bertschinger | 2 | -22/+100
This updates bcachefs to use the new mount API:
- Update the file_system_type to use the new init_fs_context() function.
- Define the new fs_context_operations functions.
- No longer register bch2_mount() and bch2_remount(); these are now called via the new fs_context functions.
- Define a new helper type, bch2_opts_parse that includes a struct bch_opts and additionally a printbuf used to save options that can't be parsed until after the FS is opened. This enables us to parse as many options as possible prior to opening the filesystem while saving those options that need the open FS for later parsing.
Signed-off-by: Thomas Bertschinger <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Add error code to defer option parsing | Thomas Bertschinger | 3 | -2/+18
This introduces a new error code, option_needs_open_fs, which is used to indicate that an attempt was made to parse a mount option prior to opening a filesystem, when that mount option requires an open filesystem in order to be validated. Returning this error results in bch2_parse_one_mount_opt() saving that option for later parsing, after the filesystem is opened. Signed-off-by: Thomas Bertschinger <[email protected]> Signed-off-by: Kent Overstreet <[email protected]>
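The control flow described above (try to parse, and on a "needs open filesystem" error save the option for later) could look roughly like the following userspace sketch. Everything here is an assumption made for illustration: `ERR_OPTION_NEEDS_OPEN_FS`, `parse_one_opt()` and `parse_or_defer()` are hypothetical, a plain `char` buffer stands in for the printbuf, and only "metadata_target" is treated as needing an open filesystem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define ERR_OPTION_NEEDS_OPEN_FS 1	/* hypothetical error number */

/* Stand-in parser: device-name options cannot be validated before open. */
static int parse_one_opt(const char *name, bool fs_open)
{
	if (!strcmp(name, "metadata_target") && !fs_open)
		return -ERR_OPTION_NEEDS_OPEN_FS;
	return 0;
}

/*
 * Parse an option now if possible; otherwise append "name=val" to the
 * comma-separated deferred buffer so it can be parsed after the FS opens.
 */
static int parse_or_defer(const char *name, const char *val, bool fs_open,
			  char *deferred, size_t size)
{
	int ret = parse_one_opt(name, fs_open);
	size_t len;
	int n;

	if (ret != -ERR_OPTION_NEEDS_OPEN_FS)
		return ret;

	len = strlen(deferred);
	assert(len < size);
	n = snprintf(deferred + len, size - len, "%s%s=%s",
		     len ? "," : "", name, val);
	if (n < 0 || (size_t)n >= size - len) {
		deferred[len] = '\0';	/* undo a truncated append */
		return -1;
	}
	return 0;
}
```

Once the filesystem is open, the saved string can be fed through the same parser again with `fs_open` set, at which point the deferred options validate normally.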
2024-07-14 | bcachefs: add printbuf arg to bch2_parse_mount_opts() | Thomas Bertschinger | 4 | -49/+71
Mount options that take the name of a device that may be part of a filesystem, for example "metadata_target", cannot be validated until after the filesystem has been opened. However, an attempt to parse those options may be made prior to the filesystem being opened. This change adds a printbuf parameter to bch2_parse_mount_opts() which will be used to save those mount options, when they are supplied prior to the FS being opened, so that they can be parsed later. This functionality is not currently needed, but will be used after bcachefs starts using the new mount API to parse mount options. This is because using the new mount API, we will process mount options prior to opening the FS, but the new API doesn't provide a convenient way to "replay" mount option parsing. So we save these options ourselves to accomplish this. This change also splits out the code to parse a single option into bch2_parse_one_mount_opt(), which will be useful when using the new mount API which deals with a single mount option at a time. Signed-off-by: Thomas Bertschinger <[email protected]> Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: metadata version bucket_stripe_sectors | Kent Overstreet | 7 | -18/+96
New on disk format version for bch_alloc->stripe_sectors and BCH_DATA_unstriped - accounting for unstriped data in stripe buckets. Upgrade/downgrade requires regenerating alloc info - but only if erasure coding is in use. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: BCH_DATA_unstriped | Kent Overstreet | 4 | -5/+19
Add a new pseudo data type, to track buckets that are members of a stripe, but have unstriped data in them. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: bch_alloc->stripe_sectors | Kent Overstreet | 6 | -15/+41
Add a separate counter to bch_alloc_v4 for amount of striped data; this lets us separately track striped and unstriped data in a bucket, which lets us see when erasure coding has failed to update extents with stripe pointers, and also find buckets to continue updating if we crash mid way through creating a new stripe. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: check_key_has_inode() | Kent Overstreet | 2 | -94/+113
Consolidate duplicated checks for extents/dirents/xattrs - these keys should all have a corresponding inode of the correct type. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: allow passing full device path for target options | Thomas Bertschinger | 1 | -0/+3
The output of mount options such as "metadata_target" in `/proc/mounts` uses the full path to the device. mount(8) from util-linux uses the output from `/proc/mounts` to pass existing mount options when performing a remount, so bcachefs should accept as input the same form that it prints as output.
Without this change:
    $ mount -t bcachefs -o metadata_target=vdb /dev/vdb /mnt
    $ strace mount -o remount /mnt
    ...
    fsconfig(4, FSCONFIG_SET_STRING, "metadata_target", "/dev/vdb", 0) = -1 EINVAL (Invalid argument)
    ...
Signed-off-by: Thomas Bertschinger <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: bch2_printbuf_strip_trailing_newline() | Kent Overstreet | 3 | -0/+17
Add a new helper to fix inode_to_text() Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: don't expose "read_only" as a mount option | Thomas Bertschinger | 1 | -1/+1
When "read_only" is exposed as a mount option, it is redundant with the standard option "ro" and gives users multiple ways to specify that a bcachefs filesystem should be mounted read-only. This presents the risk of having inconsistent options specified. This can be seen when remounting a read-only filesystem in read-write mode, using mount(8) from util-linux. Because mount(8) parses the existing mount options from `/proc/mounts` and applies them when remounting, it can end up applying both "read_only" and "rw":
    $ mount img -o ro /mnt
    $ strace mount -o remount,rw /mnt
    ...
    fsconfig(4, FSCONFIG_SET_FLAG, "read_only", NULL, 0) = 0
    fsconfig(4, FSCONFIG_SET_FLAG, "rw", NULL, 0) = 0
    ...
Making "read_only" no longer a mount option means this edge case cannot occur.
Fixes: 62719cf33c3a ("bcachefs: Fix nochanges/read_only interaction")
Signed-off-by: Thomas Bertschinger <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: make offline fsck set read_only fs flag | Thomas Bertschinger | 1 | -0/+13
A subsequent change will remove "read_only" as a mount option in favor of the standard option "ro", meaning the userspace fsck command cannot pass it to the fsck ioctl. Instead, in offline fsck, set "read_only" kernel-side without trying to parse it as a mount option. For compatibility with versions of the "bcachefs fsck" command that try to pass the "read_only" mount opt, remove it from the mount options string prior to parsing when it is present. Signed-off-by: Thomas Bertschinger <[email protected]> Signed-off-by: Kent Overstreet <[email protected]>
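Removing one option from a comma-separated options string before parsing, as the compatibility path above describes, can be sketched as follows. `strip_opt()` is a hypothetical helper written for this illustration; it matches whole tokens only, so a token such as `read_only=1` would be left untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Remove every token exactly equal to name from the comma-separated list
 * in opts, compacting the string in place.
 */
static void strip_opt(char *opts, const char *name)
{
	size_t nlen = strlen(name);
	char *src = opts, *dst = opts;

	assert(opts != NULL);

	while (*src) {
		size_t tlen = strcspn(src, ",");
		bool match = tlen == nlen && !strncmp(src, name, nlen);

		if (!match) {
			/* Re-insert the separator between kept tokens. */
			if (dst != opts)
				*dst++ = ',';
			memmove(dst, src, tlen);
			dst += tlen;
		}
		src += tlen;
		if (*src == ',')
			src++;
	}
	*dst = '\0';
}
```

With the token removed up front, the remaining string can go through the normal option parser unchanged, and the read-only state is set separately.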
2024-07-14 | bcachefs: btree_ptr_sectors_written() now takes bkey_s_c | Kent Overstreet | 4 | -9/+9
this is for the userspace metadata dump tool Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Check for bsets past bch_btree_ptr_v2.sectors_written | Kent Overstreet | 1 | -2/+5
Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Use try_cmpxchg() family of functions instead of cmpxchg() | Uros Bizjak | 11 | -59/+58
Use try_cmpxchg() family of functions instead of cmpxchg (*ptr, old, new) == old. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, try_cmpxchg() implicitly assigns old *ptr value to "old" when cmpxchg fails. There is no need to re-read the value in the loop. No functional change intended. Signed-off-by: Uros Bizjak <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Brian Foster <[email protected]> Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: add might_sleep() annotations for fsck_err() | Kent Overstreet | 2 | -1/+6
Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: fix missing include | Kent Overstreet | 1 | -0/+2
fs-common.h needs dirent.h for enum bch_rename_mode Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Use filemap_read() to simplify the execution flow | Youling Tang | 1 | -2/+2
Using filemap_read() can reduce unnecessary code execution for non IOCB_DIRECT paths. Signed-off-by: Youling Tang <[email protected]> Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Align the display format of `btrees/inodes/keys` | Youling Tang | 1 | -1/+2
Before patch:
```
#cat btrees/inodes/keys
u64s 17 type inode_v3 0:4096:U32_MAX len 0 ver 0: mode=40755 flags= (16300000) bi_size=0
```
After patch:
```
#cat btrees/inodes/keys
u64s 17 type inode_v3 0:4096:U32_MAX len 0 ver 0: mode=40755 flags=(16300000) bi_size=0
```
Signed-off-by: Youling Tang <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: Fix missing spaces in journal_entry_dev_usage_to_text | Youling Tang | 1 | -0/+3
Fixed missing spaces displayed in journal_entry_dev_usage_to_text while adjusting the display format to improve readability.
before:
```
# bcachefs list_journal -a -t alloc:1:0 /dev/sdb
...
dev_usage: dev=0free: buckets=233180 sectors=0 fragmented=0sb: buckets=13 sectors=6152 fragmented=504journal: buckets=1847 sectors=945664 fragmented=0btree: buckets=20 sectors=10240 fragmented=0user: buckets=1419 sectors=726513 fragmented=15cached: buckets=0 sectors=0 fragmented=0parity: buckets=0 sectors=0 fragmented=0stripe: buckets=0 sectors=0 fragmented=0need_gc_gens: buckets=0 sectors=0 fragmented=0need_discard: buckets=1 sectors=0 fragmented=0
```
after:
```
# bcachefs list_journal -a -t alloc:1:0 /dev/sdb
...
dev_usage: dev=0 free: buckets=233180 sectors=0 fragmented=0 sb: buckets=13 sectors=6152 fragmented=504 journal: buckets=1847 sectors=945664 fragmented=0 btree: buckets=20 sectors=10240 fragmented=0 user: buckets=1419 sectors=726513 fragmented=15 cached: buckets=0 sectors=0 fragmented=0 parity: buckets=0 sectors=0 fragmented=0 stripe: buckets=0 sectors=0 fragmented=0 need_gc_gens: buckets=0 sectors=0 fragmented=0 need_discard: buckets=1 sectors=0 fragmented=0
```
Signed-off-by: Youling Tang <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: fix ei_update_lock lock ordering | Kent Overstreet | 2 | -7/+8
ei_update_lock is largely vestigial and will probably be removed, but we're not ready for that just yet. This fixes some lockdep splats with the new lockdep support for btree node locks; they're harmless, since we were taking ei_update_lock before actually locking any btree nodes, but "any btree nodes locked" are now tracked at the btree_trans level. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: bch2_btree_reserve_cache_to_text() | Kent Overstreet | 5 | -1/+32
Add a pretty printer so the btree reserve cache can be seen in sysfs; as it pins open_buckets we need it for tracking down open_buckets issues. Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: sysfs trigger_freelist_wakeup | Kent Overstreet | 1 | -0/+5
another debugging knob Signed-off-by: Kent Overstreet <[email protected]>
2024-07-14 | bcachefs: sysfs internal/trigger_journal_writes | Kent Overstreet | 1 | -0/+5
another debugging knob - trigger the journal to do ready journal writes Signed-off-by: Kent Overstreet <[email protected]>