aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2023-10-22bcachefs: Add an option for metadata_targetKent Overstreet4-3/+23
Also, make journal writes obey foreground_target and metadata_target. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Repair bad data pointersKent Overstreet1-36/+102
Now that we can repair metadata during GC, we can handle bad pointers that would trigger errors being marked, when they need to just be dropped. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Add (partial) support for fixing btree topologyKent Overstreet5-44/+156
When we walk the btrees during recovery, part of that is checking that btree topology is correct: for every interior btree node, its child nodes should exactly span the range the parent node covers. Previously, we had checks for this, but not repair code. Now that we have the ability to do btree updates during initial GC, this patch adds that repair code. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Add support for doing btree updates prior to journal replayKent Overstreet7-81/+176
Some errors may need to be fixed in order for GC to successfully run - walk and mark all metadata. But we can't start the allocators and do normal btree updates until after GC has completed, and allocation information is known to be consistent, so we need a different method of doing btree updates. Fortunately, we already have code for walking the btree while overlaying keys from the journal to be replayed. This patch adds an update path that adds keys to the list of keys to be replayed by journal replay, and also fixes up iterators. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Add BTREE_PTR_RANGE_UPDATEDKent Overstreet4-8/+11
This is so that when we discover btree topology issues, we can just update the pointer to a btree node and signal btree read path that the min/max keys in the node header should be updated from the node pointer. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Refactor checking of btree topologyKent Overstreet1-35/+48
Still a lot of work to be done here: we can't yet repair btree topology issues, but this patch refactors things so that we have better access to what we need in the topology checks. Next up will be figuring out a way to do btree updates during gc, before journal replay is done. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Improve diagnostics when journal entries are missingKent Overstreet3-28/+96
There's an outstanding bug with journal entries being missing in journal replay. This patch adds code to print out where the journal entries were physically located that were around the entry(ies) being missing, which should make debugging easier. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix BCH_REPLICAS_MAX checkKent Overstreet1-4/+4
Ideally, this limit will be going away in the future. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix build in userspaceKent Overstreet1-3/+2
The userspace bch_err() macro doesn't use the filesystem argument. Could also be fixed with a better macro. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix an assertionKent Overstreet1-1/+2
If we're invalidating a bucket that has cached data in it, data_type won't be 0 - oops. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Mark superblocks transactionallyKent Overstreet6-47/+211
More work towards getting rid of the in memory struct bucket: this path adds code for marking superblock and journal buckets via the btree, and uses it in the device add and journal resize paths. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Kill bch2_invalidate_bucket()Kent Overstreet3-58/+14
This patch is working towards eventually getting rid of the in memory struct bucket, and relying only on the btree representation. Since bch2_invalidate_bucket() was only used for incrementing gens, not invalidating cached data, no other counters were being changed as a side effect - meaning it's safe for the allocator code to increment the bucket gen directly. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Refactor dev usageKent Overstreet9-129/+91
This is to make it more amenable for serialization. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Kill metadata only gcKent Overstreet3-61/+27
This was useful before we had transactional updates to interior btree nodes - but now, it's just extra unneeded complexity. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Ensure __bch2_trans_commit() always calls bch2_trans_reset()Kent Overstreet1-3/+3
This was leading to a very strange bug in bch2_bucket_io_time_reset(), where we'd retry without clearing out the list of updates. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix a faulty assertionKent Overstreet1-0/+1
If journal replay hasn't finished, the journal can't be empty - oops. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Switch replicas.c allocations to GFP_KERNELKent Overstreet1-9/+9
We're transitioning to memalloc_nofs_save/restore instead of GFP flags with the rest of the kernel, and GFP_NOIO was excessively strict and causing unnnecessary allocation failures - these allocations are done with btree locks dropped. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix loopback in dio modeKent Overstreet1-4/+26
We had a deadlock on page_lock, because buffered reads signal completion by unlocking the page, but the dio read path normally dirties the pages it's reading to with set_page_dirty_lock. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Clean up bch2_extent_can_insertKent Overstreet1-10/+5
It was using an internal btree node iterator interface, when bch2_btree_iter_peek_slot() sufficed. We were hitting a null ptr deref that looked like it was from the iterator not being uptodate - this will also fix that. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix an assertion popKent Overstreet3-22/+1
There was a race: btree node writes drop their reference on journal pins before clearing the btree_node_write_in_flight flag. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Don't allocate stripes at POS_MINKent Overstreet2-2/+8
In the future, stripe index 0 will be a sentinal value. This patch doesn't disallow stripes at POS_MIN yet, leaving that for when we do the on disk format changes. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Rework allocating buckets for stripesKent Overstreet3-77/+92
Allocating buckets for existing stripes was busted, in part because the data structures were too contorted. This reworks new stripes so that we have an array of open buckets that matches blocks in the stripe, and it's sparse if we're reusing an existing stripe. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Verify transaction updates are sortedKent Overstreet1-4/+16
A user reported a bug that implies they might not be correctly sorted, this should help track that down. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Preserve stripe blockcounts on existing stripesKent Overstreet1-11/+48
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Kill stripe->dirtyKent Overstreet3-17/+22
This makes bch2_stripes_write() work more like bch2_alloc_write(). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix gc updating stripes infoKent Overstreet1-8/+5
The primary stripes radix tree can be sparse, which was causing an assertion to pop because the one use for gc isn't. Fix this by changing the algorithm to copy between the two radix trees. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix double counting of stripe block counts by GCKent Overstreet1-3/+9
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix integer overflow in bch2_disk_reservation_get()Kent Overstreet2-5/+4
The sectors argument shouldn't have been a u32 - it can be up to U32_MAX (i.e. fallocate creating persistent reservations), and if replication is enabled we'll overflow when we calculate the real number of sectors to reserve. Oops. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Correctly order flushes and journal writes on multi device filesystemsKent Overstreet5-44/+65
All writes prior to a journal write need to be flushed before the journal write itself happens. On single device filesystems, it suffices to mark the write with REQ_PREFLUSH|REQ_FUA, but on multi device filesystems we need to issue flushes to every device - and wait for them to complete - before issuing the journal writes. Previously, we were issuing flushes to every device, but we weren't waiting for them to complete before issuing the journal writes. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Run jset_validate in write path as wellKent Overstreet3-18/+43
This is because we had a bug where we were writing out journal entries with garbage last_seq, and not catching it. Also, completely ignore jset->last_seq when JSET_NO_FLUSH is true, because of aforementioned bug, but change the write path to set last_seq to 0 when JSET_NO_FLUSH is true. Minor other cleanups and comments. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Factor out bch2_ec_stripes_heap_start()Kent Overstreet4-15/+14
This fixes a bug where mark and sweep gc incorrectly was clearing out the stripes heap and causing assertions to fire later - simpler to just create the stripes heap after gc has finished. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Add btree node prefetching to bch2_btree_and_journal_walk()Kent Overstreet4-12/+38
bch2_btree_and_journal_walk() walks the btree overlaying keys from the journal; it was introduced so that we could read in the alloc btree prior to journal replay being done, when journalling of updates to interior btree nodes was introduced. But it didn't have btree node prefetching, which introduced a severe regression with mount times, particularly on spinning rust. This patch implements btree node prefetching for the btree + journal walk, hopefully fixing that. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Erasure coding fixes & refactoringKent Overstreet3-183/+194
- Originally bch_extent_stripe_ptr didn't contain the block index, instead we'd have to search through the stripe pointers to figure out which pointer matched. When the block field was added to bch_extent_stripe_ptr, not all of the code was updated to use it. This patch fixes that, and we also now verify that field where it makes sense. - The ec_stripe_buf_init/exit() functions have been improved, and are now used by the bch2_ec_read_extent() (recovery read) path. - get_stripe_key() is now used by bch2_ec_read_extent(). - We now have a getter and setter for checksums within a stripe, like we had previously for block sector counts, and ec_generate_checksums and ec_validate_checksums are now quite a bit smaller and cleaner. ec.c still needs a lot of work, but this patch is slowly moving things in the right direction. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Add cannibalize lock to btree_cache_to_text()Kent Overstreet1-2/+3
More debugging info is always a good thing. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix .splice_writeKent Overstreet1-2/+1
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix bch2_replicas_gc2Kent Overstreet1-1/+5
This fixes a regression introduced by "bcachefs: Refactor filesystem usage accounting". We have to include all the replicas entries that have any of the entries for different journal entries nonzero, we can't skip them if they sum to zero. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: bch2_alloc_write() should be writing for all devicesKent Overstreet4-15/+12
Alloc info isn't stored on a particular device, it makes no sense to only be writing it out for rw members - this was causing fsck to not fix alloc info errors, oops. Also, make sure we write out alloc info in other repair paths. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix btree node split after merge operationsKent Overstreet1-0/+3
A btree node merge operation deletes a key in the parent node; if when inserting into the parent node we split the parent node, we can end up with a whiteout in the parent node that we don't want. The existing code drops them before doing the split, because they can screw up picking the pivot, but we forgot about the unwritten writeouts area - that needs to be cleared out too. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Reserve some open buckets for btree allocationsKent Overstreet3-5/+9
This reverts part of the change from "bcachefs: Don't use BTREE_INSERT_USE_RESERVE so much" - it turns out we still should be reserving open buckets for btree node allocations, because otherwise data bucket allocations (especially with erasure coding enabled) can use up all our open buckets and we won't be able to do the metadata update that lets us release those open bucket references. Oops. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Work around a zstd bugKent Overstreet1-1/+12
The zstd compression code seems to have a bug where it will write just past the end of the destination buffer - probably only when the compressed output isn't going to fit in the destination buffer, which will never happen if you're always allocating a bigger buffer than the source buffer which would explain other users not hitting it. But, we size the buffer according to how much contiguous space on disk we have, so... generally, bugs like this don't write more than a word past the end of the buffer, so an easy workaround is to subtract a fudge factor from the buffer size. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Don't error out of recovery process on journal read errorKent Overstreet1-2/+9
We don't want to fail the recovery/mount because of a single error reading from the journal - the relevant journal entry may still be found on other devices, and missing or no journal entries found is already handled later in the recovery process. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix journal_buf_realloc()Kent Overstreet1-3/+7
It used to be safe to reallocate a buf that the write path owns without holding the journal lock, but now this can trigger an assertion in journal_seq_to_buf(). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Reduce/kill BKEY_PADDED useKent Overstreet24-200/+247
With various newer key types - stripe keys, inline data extents - the old approach of calculating the maximum size of the value is becoming more and more error prone. Better to switch to bkey_on_stack, which can dynamically allocate if necessary to handle any size bkey. In particular we also want to get rid of BKEY_EXTENT_VAL_U64s_MAX. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Use separate new stripes for copygc and non-copygcKent Overstreet4-12/+23
Allocations for copygc have to be kept separate from everything else, so that copygc doesn't get starved. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Change allocations for ec stripes to blockingKent Overstreet3-31/+38
We don't want writes to not get erasure coded just because the allocator temporarily wasn't keeping up. However, it's not guaranteed that these allocations will ever succeed, we can currently get stuck - especially if devices are different sizes - we still have work to do in this area. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Don't read existing stripes synchronously in write pathKent Overstreet3-71/+117
Previously, in the stripe creation path, when reusing an existing stripe we'd read the existing stripe synchronously - ouch. Now, we allocate two stripe bufs if we're using an existing stripe, so that we can do the read asynchronously - and, we read the full stripe so that we can run recovery, if necessary. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Change when we allow overwritesKent Overstreet6-32/+75
Originally, we'd check for -ENOSPC when getting a disk reservation whenever the new extent took up more space on disk than the old extent. Erasure coding screwed this up, because with erasure coding writes are initially replicated, and then in the background the extra replicas are dropped when the stripe is created. This means that with erasure coding enabled, writes will always take up more space on disk than the data they're overwriting - but, according to posix, overwrites aren't supposed to return ENOSPC. So, in this patch we fudge things: if the new extent has more replicas than the _effective_ replicas of the old extent, or if the old extent is compressed and the new one isn't, we check for ENOSPC when getting the disk reservation - otherwise, we don't. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Don't use BTREE_INSERT_USE_RESERVE so muchKent Overstreet15-62/+20
Previously, we were using BTREE_INSERT_RESERVE in a lot of places where it no longer makes sense. - we now have more open_buckets than we used to, and the reserves work better, so we shouldn't need to use BTREE_INSERT_RESERVE just because we're holding open_buckets pinned anymore. - We have the btree key cache for updates to the alloc btree, meaning we no longer need the btree reserve to ensure the allocator can make forward progress. This means that we should only need a reserve for btree updates to ensure that copygc can make forward progress. Since it's now just for copygc, we can also fold RESERVE_BTREE into RESERVE_MOVINGGC (the allocator's freelist reserve). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix iterator overflow in move pathKent Overstreet2-1/+11
The move path was calling bch2_bucket_io_time_reset() for cached pointers (which it shouldn't have been), and then not calling bch2_trans_reset() when it got -EINTR (indicating transaction restart). Oops. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: Fix btree lock being incorrectly droppedKent Overstreet2-7/+10
__btree_trans_get_iter() was using bch2_btree_iter_upgrade, but it shouldn't have been because on failure bch2_btree_iter_upgrade may drop locks in other iterators, expecting the transaction to be restarted. But __btree_trans_get_iter can't return an error to indicate that we need to restart thet transaction - oops. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>