aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2024-03-13bcachefs: kill kvpmalloc()Kent Overstreet14-115/+49
Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: bch2_lookup() gives better error message on inode not foundKent Overstreet1-9/+64
When a dirent points to a missing inode, we really should print out the dirent. This requires quite a bit of refactoring, but there's some other benefits: we now do the entire looup (dirent and inode) in a single btree transaction, and copy to the VFS inode with btree locks still held, like the create path. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: bch2_inode_insert()Kent Overstreet1-62/+76
Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: factor out check_inode_backpointer()Kent Overstreet1-9/+29
Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Factor out check_subvol_dirent()Kent Overstreet1-48/+57
Going to be adding more code here for checking subvol structure. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Kill some -EINVALsKent Overstreet2-5/+5
Repurposing standard error codes in bcachefs code is banned in new code, and we need to get rid of the remaining ones - private error codes give us much better error messages. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: bump max_active on btree_interior_update_workerKent Overstreet1-1/+1
WQ_UNBOUND with max_active 1 means ordered workqueue, but we don't actually need or want ordered semantics - and probably want a higher concurrency limit anyways. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: move fsck_write_inode() to inode.cKent Overstreet3-40/+44
Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Initialize super_block->s_uuidKent Overstreet1-0/+1
Need to fix this oversight for the new FS_IOC_(GET|SET)UUID ioctls. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Switch to uuid_to_fsid()Kent Overstreet1-5/+1
switch the statfs code from something horrible and open coded to the more standard uuid_to_fsid() Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Subvolumes may now be renamedKent Overstreet2-26/+55
Files within a subvolume cannot be renamed into another subvolume, but subvolumes themselves were intended to be. This implements subvolume renaming - we need to ensure that there's only a single dirent that points to a subvolume key (not multiple versions in different snapshots), and we need to ensure that dirent.d_parent_subol and inode.bi_parent_subvol are updated. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: btree node prefetching in check_topologyKent Overstreet4-3/+42
btree_and_journal_iter is old code that we want to get rid of, but we're not ready to yet. lack of btree node prefetching is, it turns out, a real performance issue for fsck on spinning rust, so - add it. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: btree_and_journal_iter.transKent Overstreet4-17/+21
we now always have a btree_trans when using a btree_and_journal_iter; prep work for adding prefetching to btree_and_journal_iter Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: better journal pipeliningKent Overstreet4-59/+98
Recently a severe performance regression was discovered, which bisected to a6548c8b5eb5 bcachefs: Avoid flushing the journal in the discard path It turns out the old behaviour, which issued excessive journal flushes, worked around a performance issue where queueing delays would cause the journal to not be able to write quickly enough and stall. The journal flushes masked the issue because they periodically flushed the device write cache, reducing write latency for non flushes. This patch reworks the journalling code to allow more than one (non-flush) write to be in flight at a time. With this patch, doing 4k random writes and an iodepth of 128, we are now able to hit 560k iops to a Samsung 970 EVO Plus - previously, we were stuck in the ~200k range. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: closure per journal bufKent Overstreet3-23/+41
Prep work for having multiple journal writes in flight. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: bio per journal bufKent Overstreet3-29/+34
Prep work for having multiple journal writes in flight. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: jset_entry_datetimeKent Overstreet4-17/+67
This gives us a way to record the date and time every journal entry was written - useful for debugging. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: improve journal entry read fsck error messagesKent Overstreet1-41/+55
Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: convert journal replay ptrs to darrayKent Overstreet3-58/+36
Eliminates some error paths - no longer have a hardcoded BCH_REPLICAS_MAX limit. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Cleanup bch2_dirent_lookup_trans()Kent Overstreet3-26/+14
Drop an unnecessary bch2_subvolume_get_snapshot() call, and drop the __ from the name - this is a normal interface. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: bch2_hash_set_snapshot() -> bch2_hash_set_in_snapshot()Kent Overstreet3-18/+12
Minor renaming for clarity, bit of refactoring. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Workqueues should be WQ_HIGHPRIKent Overstreet1-4/+4
Most bcachefs workqueues are used for completions, and should be WQ_HIGHPRI - this helps reduce queuing delays, we want to complete quickly once we can no longer signal backpressure by blocking. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Improve bch2_dirent_to_text()Kent Overstreet1-9/+11
For DT_SUBVOL, we now print both parent and child subvol IDs. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: fixup for building in userspaceKent Overstreet1-1/+1
Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Avoid taking journal lock unnecessarilyKent Overstreet2-53/+55
Previously, any time we failed to get a journal reservation we'd retry, with the journal lock held; but this isn't necessary given wait_event()/wake_up() ordering. This avoids performance cliffs when the journal starts to get backed up and lock contention shoots up. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Journal writes should be REQ_SYNC|REQ_METAKent Overstreet1-1/+1
Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Avoid setting j->write_work unnecessarilyKent Overstreet1-13/+11
Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Split out journal workqueueKent Overstreet3-16/+19
We don't want journal write completions to be blocked behind btree transactions - io_complete_wq is used for btree updates after data and metadata writes. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Kill unnecessary wakeups in journal reclaimKent Overstreet1-11/+9
Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: skip invisible entries in empty subvolume checkingGuoyu Ou3-5/+9
When we are checking whether a subvolume is empty in the specified snapshot, entries that do not belong to this subvolume should be skipped. This fixes the following case: $ bcachefs subvolume create ./sub $ cd sub $ bcachefs subvolume create ./sub2 $ bcachefs subvolume snapshot . ./snap $ ls -a snap . .. $ rmdir snap rmdir: failed to remove 'snap': Directory not empty As Kent suggested, we pass 0 in may_delete_deleted_inode() to ignore subvols in the subvol we are checking, because inode.bi_subvol is only set on subvolume roots, and we can't go through every inode in the subvolume and change bi_subvol when taking a snapshot. It makes the check less strict, but that's ok, the rest of fsck will still catch it. Signed-off-by: Guoyu Ou <[email protected]> Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: fix split brain messageKent Overstreet1-1/+1
Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Set path->uptodate when no node at levelKent Overstreet1-2/+2
We were failing to set path->uptodate when reaching the end of a btree node iterator, causing the new prefetch code for backpointers gc to go into an infinite loop. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Correctly validate k->u64s in btree node read pathKent Overstreet3-6/+27
validate_bset_keys() never properly validated k->u64s; it checked if it was 0, but not if it was smaller than keys for the given packed format; this fixes that small oversight. This patch was backported, so it's adding quite a few error enums so that they don't get renumbered and we don't have confusing gaps. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Fix degraded mode fsckKent Overstreet1-18/+18
We don't know where the superblock and journal lives on offline devices; that means if a device is offline fsck can't check those buckets. Previously, fsck would incorrectly clear bucket data types for those buckets on offline devices; now we just use the previous state. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Fix journal replay with unreadable btree rootsKent Overstreet4-6/+70
When a btree root is unreadable, we still might be able to get some data back by replaying what's in the journal. Previously though, we got confused when journal replay would attempt to replay a key for a level that didn't exist. This adds bch2_btree_increase_depth(), so that journal replay can handle this. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: fix check_inode_deleted_list()Kent Overstreet1-6/+3
check_inode_deleted_list() returns true if the inode is on the deleted list; check_inode() was checking the return code incorrectly. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: no_splitbrain_check optionKent Overstreet2-8/+22
This adds an option to disable kicking out devices when splitbrain is detected - it seems there's some issues with splitbrain detection and we're kicking out devices erronously. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: extent_entry_next_safe()Kent Overstreet1-3/+8
We need to be able to iterate over extent ptrs that may be corrupted in order to print them - this fixes a bug where we'd pop an assert in bch2_bkey_durability_safe(). Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: journal_seq_blacklist_add() now handles entries being added out of ↵Kent Overstreet2-46/+22
order bch2_journal_seq_blacklist_add() was bugged when the new entry overlapped with multiple existing entries, and it also assumed new entries are being added in increasing order. This is true on any sane filesystem, but when trying to recover from very badly mangled filesystems we might end up with the journal sequence number rewinding vs. what the blacklist list knows about - easiest to just handle that here. Signed-off-by: Kent Overstreet <[email protected]>
2024-03-10bcachefs: Fix null-ptr-deref in bch2_fs_alloc()Li Zetao1-3/+3
There is a null-ptr-deref issue reported by kasan: KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] Call Trace: <TASK> bch2_fs_alloc+0x1092/0x2170 [bcachefs] bch2_fs_open+0x683/0xe10 [bcachefs] ... When initializing the name of bch_fs, it needs to dynamically alloc memory to meet the length of the name. However, when name allocation failed, it will cause a null-ptr-deref access exception in subsequent string copy. Fix this issue by checking if name allocation is successful. Fixes: 401ec4db6308 ("bcachefs: Printbuf rework") Signed-off-by: Li Zetao <[email protected]> Signed-off-by: Kent Overstreet <[email protected]>
2024-02-25Merge tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefsLinus Torvalds7-22/+25
Pull bcachefs fixes from Kent Overstreet: "Some more mostly boring fixes, but some not User reported ones: - the BTREE_ITER_FILTER_SNAPSHOTS one fixes a really nasty performance bug; user reported an untar initially taking two seconds and then ~2 minutes - kill a __GFP_NOFAIL in the buffered read path; this was a leftover from the trickier fix to kill __GFP_NOFAIL in readahead, where we can't return errors (and have to silently truncate the read ourselves). bcachefs can't use GFP_NOFAIL for folio state unlike iomap based filesystems because our folio state is just barely too big, 2MB hugepages cause us to exceed the 2 page threshhold for GFP_NOFAIL. additionally, the flags argument was just buggy, we weren't supplying GFP_KERNEL previously (!)" * tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs: bcachefs: fix bch2_save_backtrace() bcachefs: Fix check_snapshot() memcpy bcachefs: Fix bch2_journal_flush_device_pins() bcachefs: fix iov_iter count underflow on sub-block dio read bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btree bcachefs: Kill __GFP_NOFAIL in buffered read path bcachefs: fix backpointer_to_text() when dev does not exist
2024-02-25bcachefs: fix bch2_save_backtrace()Kent Overstreet1-1/+1
Missed a call in the previous fix. Signed-off-by: Kent Overstreet <[email protected]>
2024-02-25Merge tag 'erofs-for-6.8-rc6-fixes' of ↵Linus Torvalds1-14/+14
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fix from Gao Xiang: - Fix page refcount leak when looking up specific inodes introduced by metabuf reworking * tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: fix refcount on the metabuf used for inode lookup
2024-02-25Merge tag 'pull-fixes.pathwalk-rcu-2' of ↵Linus Torvalds20-63/+85
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull RCU pathwalk fixes from Al Viro: "We still have some races in filesystem methods when exposed to RCU pathwalk. This series is a result of code audit (the second round of it) and it should deal with most of that stuff. Still pending: ntfs3 ->d_hash()/->d_compare() and ceph_d_revalidate(). Up to maintainers (a note for NTFS folks - when documentation says that a method may not block, it *does* imply that blocking allocations are to be avoided. Really)" [ More explanations for people who aren't familiar with the vagaries of RCU path walking: most of it is hidden from filesystems, but if a filesystem actively participates in the low-level path walking it needs to make sure the fields involved in that walk are RCU-safe. That "actively participate in low-level path walking" includes things like having its own ->d_hash()/->d_compare() routines, or by having its own directory permission function that doesn't just use the common helpers. Having a ->d_revalidate() function will also have this issue. Note that instead of making everything RCU safe you can also choose to abort the RCU pathwalk if your operation cannot be done safely under RCU, but that obviously comes with a performance penalty. One common pattern is to allow the simple cases under RCU, and abort only if you need to do something more complicated. So not everything needs to be RCU-safe, and things like the inode etc that the VFS itself maintains obviously already are. But these fixes tend to be about properly RCU-delaying things like ->s_fs_info that are maintained by the filesystem and that got potentially released too early. - Linus ] * tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ext4_get_link(): fix breakage in RCU mode cifs_get_link(): bail out in unsafe case fuse: fix UAF in rcu pathwalks procfs: make freeing proc_fs_info rcu-delayed procfs: move dropping pde and pid from ->evict_inode() to ->free_inode() nfs: fix UAF on pathwalk running into umount nfs: make nfs_set_verifier() safe for use in RCU pathwalk afs: fix __afs_break_callback() / afs_drop_open_mmap() race hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper affs: free affs_sb_info with kfree_rcu() rcu pathwalk: prevent bogus hard errors from may_lookup() fs/super.c: don't drop ->s_user_ns until we free struct super_block itself
2024-02-25Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2-4/+8
Pull vfs fixes from Al Viro: "A couple of fixes - revert of regression from this cycle and a fix for erofs failure exit breakage (had been there since way back)" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: erofs: fix handling kern_mount() failure Revert "get rid of DCACHE_GENOCIDE"
2024-02-25ext4_get_link(): fix breakage in RCU modeAl Viro1-3/+5
1) errors from ext4_getblk() should not be propagated to caller unless we are really sure that we would've gotten the same error in non-RCU pathwalk. 2) we leak buffer_heads if ext4_getblk() is successful, but bh is not uptodate. Signed-off-by: Al Viro <[email protected]>
2024-02-25cifs_get_link(): bail out in unsafe caseAl Viro1-0/+3
->d_revalidate() bails out there, anyway. It's not enough to prevent getting into ->get_link() in RCU mode, but that could happen only in a very contrieved setup. Not worth trying to do anything fancy here unless ->d_revalidate() stops kicking out of RCU mode at least in some cases. Reviewed-by: Christian Brauner <[email protected]> Acked-by: Miklos Szeredi <[email protected]> Signed-off-by: Al Viro <[email protected]>
2024-02-25fuse: fix UAF in rcu pathwalksAl Viro3-6/+13
->permission(), ->get_link() and ->inode_get_acl() might dereference ->s_fs_info (and, in case of ->permission(), ->s_fs_info->fc->user_ns as well) when called from rcu pathwalk. Freeing ->s_fs_info->fc is rcu-delayed; we need to make freeing ->s_fs_info and dropping ->user_ns rcu-delayed too. Signed-off-by: Al Viro <[email protected]>
2024-02-25procfs: make freeing proc_fs_info rcu-delayedAl Viro1-1/+1
makes proc_pid_ns() safe from rcu pathwalk (put_pid_ns() is still synchronous, but that's not a problem - it does rcu-delay everything that needs to be) Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
2024-02-25procfs: move dropping pde and pid from ->evict_inode() to ->free_inode()Al Viro2-13/+8
that keeps both around until struct inode is freed, making access to them safe from rcu-pathwalk Acked-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>