aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2022-08-03Merge tag 'for-5.20-tag' of ↵Linus Torvalds50-2818/+3660
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This brings some long awaited changes, the send protocol bump, otherwise lots of small improvements and fixes. The main core part is reworking bio handling, cleaning up the submission and endio and improving error handling. There are some changes outside of btrfs adding helpers or updating API, listed at the end of the changelog. Features: - sysfs: - export chunk size, in debug mode add tunable for setting its size - show zoned among features (was only in debug mode) - show commit stats (number, last/max/total duration) - send protocol updated to 2 - new commands: - ability write larger data chunks than 64K - send raw compressed extents (uses the encoded data ioctls), ie. no decompression on send side, no compression needed on receive side if supported - send 'otime' (inode creation time) among other timestamps - send file attributes (a.k.a file flags and xflags) - this is first version bump, backward compatibility on send and receive side is provided - there are still some known and wanted commands that will be implemented in the near future, another version bump will be needed, however we want to minimize that to avoid causing usability issues - print checksum type and implementation at mount time - don't print some messages at mount (mentioned as people asked about it), we want to print messages namely for new features so let's make some space for that - big metadata - this has been supported for a long time and is not a feature that's worth mentioning - skinny metadata - same reason, set by default by mkfs Performance improvements: - reduced amount of reserved metadata for delayed items - when inserted items can be batched into one leaf - when deleting batched directory index items - when deleting delayed items used for deletion - overall improved count of files/sec, decreased subvolume lock contention - metadata item access bounds checker micro-optimized, with a few percent of improved runtime for metadata-heavy operations - increase direct io limit for read to 256 sectors, improved throughput by 3x on sample workload Notable fixes: - raid56 - reduce parity writes, skip sectors of stripe when there are no data updates - restore reading from on-disk data instead of using stripe cache, this reduces chances to damage correct data due to RMW cycle - refuse to replay log with unknown incompat read-only feature bit set - zoned - fix page locking when COW fails in the middle of allocation - improved tracking of active zones, ZNS drives may limit the number and there are ENOSPC errors due to that limit and not actual lack of space - adjust maximum extent size for zone append so it does not cause late ENOSPC due to underreservation - mirror reading error messages show the mirror number - don't fallback to buffered IO for NOWAIT direct IO writes, we don't have the NOWAIT semantics for buffered io yet - send, fix sending link commands for existing file paths when there are deleted and created hardlinks for same files - repair all mirrors for profiles with more than 1 copy (raid1c34) - fix repair of compressed extents, unify where error detection and repair happen Core changes: - bio completion cleanups - don't double defer compression bios - simplify endio workqueues - add more data to btrfs_bio to avoid allocation for read requests - rework bio error handling so it's same what block layer does, the submission works and errors are consumed in endio - when asynchronous bio offload fails fall back to synchronous checksum calculation to avoid errors under writeback or memory pressure - new trace points - raid56 events - ordered extent operations - super block log_root_transid deprecated (never used) - mixed_backref and big_metadata sysfs feature files removed, they've been default for sufficiently long time, there are no known users and mixed_backref could be confused with mixed_groups Non-btrfs changes, API updates: - minor highmem API update to cover const arguments - switch all kmap/kmap_atomic to kmap_local - remove redundant flush_dcache_page() - address_space_operations::writepage callback removed - add bdev_max_segments() helper" * tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (163 commits) btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read btrfs: fix repair of compressed extents btrfs: remove the start argument to check_data_csum and export btrfs: pass a btrfs_bio to btrfs_repair_one_sector btrfs: simplify the pending I/O counting in struct compressed_bio btrfs: repair all known bad mirrors btrfs: merge btrfs_dev_stat_print_on_error with its only caller btrfs: join running log transaction when logging new name btrfs: simplify error handling in btrfs_lookup_dentry btrfs: send: always use the rbtree based inode ref management infrastructure btrfs: send: fix sending link commands for existing file paths btrfs: send: introduce recorded_ref_alloc and recorded_ref_free btrfs: zoned: wait until zone is finished when allocation didn't progress btrfs: zoned: write out partially allocated region btrfs: zoned: activate necessary block group btrfs: zoned: activate metadata block group on flush_space btrfs: zoned: disable metadata overcommit for zoned btrfs: zoned: introduce space_info->active_total_bytes btrfs: zoned: finish least available block group on data bg allocation btrfs: let can_allocate_chunk return error ...
2022-08-03Merge tag 'efi-efivars-removal-for-v5.20' of ↵Linus Torvalds3-1/+779
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull efivars sysfs interface removal from Ard Biesheuvel: "Remove the obsolete 'efivars' sysfs based interface to the EFI variable store, now that all users have moved to the efivarfs pseudo file system, which was created ~10 years ago to address some fundamental shortcomings in the sysfs based driver. Move the 'business logic' related to which EFI variables are important and may affect the boot flow from the efivars support layer into the efivarfs pseudo file system, so it is no longer exposed to other parts of the kernel" * tag 'efi-efivars-removal-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efi: vars: Move efivar caching layer into efivarfs efi: vars: Switch to new wrapper layer efi: vars: Remove deprecated 'efivars' sysfs interface
2022-08-03Merge tag 'efi-next-for-v5.20' of ↵Linus Torvalds3-10/+7
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI updates from Ard Biesheuvel: - Enable mirrored memory for arm64 - Fix up several abuses of the efivar API - Refactor the efivar API in preparation for moving the 'business logic' part of it into efivarfs - Enable ACPI PRM on arm64 * tag 'efi-next-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: (24 commits) ACPI: Move PRM config option under the main ACPI config ACPI: Enable Platform Runtime Mechanism(PRM) support on ARM64 ACPI: PRM: Change handler_addr type to void pointer efi: Simplify arch_efi_call_virt() macro drivers: fix typo in firmware/efi/memmap.c efi: vars: Drop __efivar_entry_iter() helper which is no longer used efi: vars: Use locking version to iterate over efivars linked lists efi: pstore: Omit efivars caching EFI varstore access layer efi: vars: Add thin wrapper around EFI get/set variable interface efi: vars: Don't drop lock in the middle of efivar_init() pstore: Add priv field to pstore_record for backend specific use Input: applespi - avoid efivars API and invoke EFI services directly selftests/kexec: remove broken EFI_VARS secure boot fallback check brcmfmac: Switch to appropriate helper to load EFI variable contents iwlwifi: Switch to proper EFI variable store interface media: atomisp_gmin_platform: stop abusing efivar API efi: efibc: avoid efivar API for setting variables efi: avoid efivars layer when loading SSDTs from variables efi: Correct comment on efi_memmap_alloc memblock: Disable mirror feature if kernelcore is not specified ...
2022-08-03Merge tag 'pull-work.iov_iter-base' of ↵Linus Torvalds10-40/+28
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs iov_iter updates from Al Viro: "Part 1 - isolated cleanups and optimizations. One of the goals is to reduce the overhead of using ->read_iter() and ->write_iter() instead of ->read()/->write(). new_sync_{read,write}() has a surprising amount of overhead, in particular inside iocb_flags(). That's the explanation for the beginning of the series is in this pile; it's not directly iov_iter-related, but it's a part of the same work..." * tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: first_iovec_segment(): just return address iov_iter: massage calling conventions for first_{iovec,bvec}_segment() iov_iter: first_{iovec,bvec}_segment() - simplify a bit iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment() iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT iov_iter_bvec_advance(): don't bother with bvec_iter copy_page_{to,from}_iter(): switch iovec variants to generic keep iocb_flags() result cached in struct file iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC struct file: use anonymous union member for rcuhead and llist btrfs: use IOMAP_DIO_NOSYNC teach iomap_dio_rw() to suppress dsync No need of likely/unlikely on calls of check_copy_size()
2022-08-03Merge tag 'pull-work.dcache' of ↵Linus Torvalds1-11/+43
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs dcache updates from Al Viro: "The main part here is making parallel lookups safe for RT - making sure preemption is disabled in start_dir_add()/ end_dir_add() sections (on non-RT it's automatic, on RT it needs to to be done explicitly) and moving wakeups from __d_lookup_done() inside of such to the end of those sections. Wakeups can be safely delayed for as long as ->d_lock on in-lookup dentry is held; proving that has caught a bug in d_add_ci() that allows memory corruption when sufficiently bogus ntfs (or case-insensitive xfs) image is mounted. Easily fixed, fortunately" * tag 'pull-work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs/dcache: Move wakeup out of i_seq_dir write held region. fs/dcache: Move the wakeup from __d_lookup_done() to the caller. fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT d_add_ci(): make sure we don't miss d_lookup_done()
2022-08-03Merge tag 'pull-work.lseek' of ↵Linus Torvalds6-24/+14
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs lseek updates from Al Viro: "Jason's lseek series. Saner handling of 'lseek should fail with ESPIPE' - this gets rid of the magical no_llseek thing and makes checks consistent. In particular, the ad-hoc "can we do splice via internal pipe" checks got saner (and somewhat more permissive, which is what Jason had been after, AFAICT)" * tag 'pull-work.lseek' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: remove no_llseek fs: check FMODE_LSEEK to control internal pipe splicing vfio: do not set FMODE_LSEEK flag dma-buf: remove useless FMODE_LSEEK flag fs: do not compare against ->llseek fs: clear or set FMODE_LSEEK based on llseek function
2022-08-03Merge tag 'pull-work.namei' of ↵Linus Torvalds3-108/+86
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs namei updates from Al Viro: "RCU pathwalk cleanups. Storing sampled ->d_seq of the next dentry in nameidata simplifies life considerably, especially if we delay fetching ->d_inode until step_into()" * tag 'pull-work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: step_into(): move fetching ->d_inode past handle_mounts() lookup_fast(): don't bother with inode follow_dotdot{,_rcu}(): don't bother with inode step_into(): lose inode argument namei: stash the sampled ->d_seq into nameidata namei: move clearing LOOKUP_RCU towards rcu_read_unlock() switch try_to_unlazy_next() to __legitimize_mnt() follow_dotdot{,_rcu}(): change calling conventions namei: get rid of pointless unlikely(read_seqcount_retry(...)) __follow_mount_rcu(): verify that mount_lock remains unchanged
2022-08-03Merge tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds54-985/+350
Pull folio updates from Matthew Wilcox: - Fix an accounting bug that made NR_FILE_DIRTY grow without limit when running xfstests - Convert more of mpage to use folios - Remove add_to_page_cache() and add_to_page_cache_locked() - Convert find_get_pages_range() to filemap_get_folios() - Improvements to the read_cache_page() family of functions - Remove a few unnecessary checks of PageError - Some straightforward filesystem conversions to use folios - Split PageMovable users out from address_space_operations into their own movable_operations - Convert aops->migratepage to aops->migrate_folio - Remove nobh support (Christoph Hellwig) * tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits) fs: remove the NULL get_block case in mpage_writepages fs: don't call ->writepage from __mpage_writepage fs: remove the nobh helpers jfs: stop using the nobh helper ext2: remove nobh support ntfs3: refactor ntfs_writepages mm/folio-compat: Remove migration compatibility functions fs: Remove aops->migratepage() secretmem: Convert to migrate_folio hugetlb: Convert to migrate_folio aio: Convert to migrate_folio f2fs: Convert to filemap_migrate_folio() ubifs: Convert to filemap_migrate_folio() btrfs: Convert btrfs_migratepage to migrate_folio mm/migrate: Add filemap_migrate_folio() mm/migrate: Convert migrate_page() to migrate_folio() nfs: Convert to migrate_folio btrfs: Convert btree_migratepage to migrate_folio mm/migrate: Convert expected_page_refs() to folio_expected_refs() mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio() ...
2022-08-03fs/ntfs3: Make ni_ins_new_attr return errorKonstantin Komarov1-3/+26
Function ni_ins_new_attr now returns ERR_PTR(err), so we check it now in other functions like ni_expand_mft_list Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: Create MFT zone only if length is large enoughKonstantin Komarov1-8/+3
Also removed uninformative print Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: Refactoring attr_insert_range to restore after errorsKonstantin Komarov1-31/+86
Added done and undo labels for restoring after errors Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: Refactoring attr_punch_hole to restore after errorsKonstantin Komarov3-38/+105
Added comments to code Added new function run_clone to make a copy of run Added done and undo labels for restoring after errors Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: Refactoring attr_set_size to restore after errorsKonstantin Komarov1-54/+126
Added comments to code Added two undo labels for restoring after errors Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: New function ntfs_bad_inodeKonstantin Komarov6-17/+25
There are repetitive steps in case of bad inode This commit wraps them in function Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: Make MFT zone less fragmentedKonstantin Komarov3-14/+33
Now we take free space after the MFT zone if the MFT zone shrinks. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: Check possible errors in run_pack in advanceKonstantin Komarov1-17/+22
Checking in advance speeds things up in some cases. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: Added comments to frecord functionsKonstantin Komarov4-8/+6
Added some comments in frecord.c for more context. Also changed run_lookup to static because it's an internal function. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: Fill duplicate info in ni_add_nameKonstantin Komarov2-16/+14
Work with names must be completed in ni_add_name Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: Make static function attr_load_runsKonstantin Komarov2-4/+2
attr_load_runs is an internal function Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: Add new argument is_mft to ntfs_mark_rec_freeKonstantin Komarov4-11/+14
This argument helps in avoiding double locking Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: Remove unused mi_mark_freeKonstantin Komarov4-25/+2
Cleaning up dead code Fix wrong comments Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: Fix very fragmented case in attr_punch_holeKonstantin Komarov1-1/+12
In some cases we need to ni_find_attr attr_b Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: Fix work with fragmented xattrKonstantin Komarov1-1/+6
In some cases xattr is too fragmented, so we need to load it before writing. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: Make ntfs_fallocate return -ENOSPC instead of -EFBIGKonstantin Komarov1-0/+13
In some cases we need to return ENOSPC Fixes xfstest generic/213 Fixes: 114346978cf6 ("fs/ntfs3: Check new size for limits") Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: extend ni_insert_nonresident to return inserted ATTR_LIST_ENTRYKonstantin Komarov4-18/+25
Fixes xfstest generic/300 Fixes: 4534a70b7056 ("fs/ntfs3: Add headers and misc files") Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: Check reserved size for maximum allowedKonstantin Komarov2-3/+9
Also don't mask EFBIG Fixes xfstest generic/485 Fixes: 4342306f0f0d ("fs/ntfs3: Add file operations and implementation") Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03fs/ntfs3: Do not change mode if ntfs_set_ea failedKonstantin Komarov1-10/+10
ntfs_set_ea can fail with NOSPC, so we don't need to change mode in this situation. Fixes xfstest generic/449 Fixes: be71b5cba2e6 ("fs/ntfs3: Add attrib operations") Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2022-08-03libceph: clean up ceph_osdc_start_request prototypeJeff Layton2-39/+26
This function always returns 0, and ignores the nofail boolean. Drop the nofail argument, make the function void return and fix up the callers. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-02ext4: add ioctls to get/set the ext4 superblock uuidJeremy Bongio2-0/+94
This fixes a race between changing the ext4 superblock uuid and operations like mounting, resizing, changing features, etc. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jeremy Bongio <bongiojp@gmail.com> Link: https://lore.kernel.org/r/20220721224422.438351-1-bongiojp@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02ext4: avoid resizing to a partial cluster sizeKiselev, Oleg1-0/+10
This patch avoids an attempt to resize the filesystem to an unaligned cluster boundary. An online resize to a size that is not integral to cluster size results in the last iteration attempting to grow the fs by a negative amount, which trips a BUG_ON and leaves the fs with a corrupted in-memory superblock. Signed-off-by: Oleg Kiselev <okiselev@amazon.com> Link: https://lore.kernel.org/r/0E92A0AB-4F16-4F1A-94B7-702CC6504FDE@amazon.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02ext4: reduce computation of overhead during resizeKiselev, Oleg1-2/+21
This patch avoids doing an O(n**2)-complexity walk through every flex group. Instead, it uses the already computed overhead information for the newly allocated space, and simply adds it to the previously calculated overhead stored in the superblock. This drastically reduces the time taken to resize very large bigalloc filesystems (from 3+ hours for a 64TB fs down to milliseconds). Signed-off-by: Oleg Kiselev <okiselev@amazon.com> Link: https://lore.kernel.org/r/CE4F359F-4779-45E6-B6A9-8D67FDFF5AE2@amazon.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02jbd2: fix assertion 'jh->b_frozen_data == NULL' failure when journal abortedZhihao Cheng1-2/+12
Following process will fail assertion 'jh->b_frozen_data == NULL' in jbd2_journal_dirty_metadata(): jbd2_journal_commit_transaction unlink(dir/a) jh->b_transaction = trans1 jh->b_jlist = BJ_Metadata journal->j_running_transaction = NULL trans1->t_state = T_COMMIT unlink(dir/b) handle->h_trans = trans2 do_get_write_access jh->b_modified = 0 jh->b_frozen_data = frozen_buffer jh->b_next_transaction = trans2 jbd2_journal_dirty_metadata is_handle_aborted is_journal_aborted // return false --> jbd2 abort <-- while (commit_transaction->t_buffers) if (is_journal_aborted) jbd2_journal_refile_buffer __jbd2_journal_refile_buffer WRITE_ONCE(jh->b_transaction, jh->b_next_transaction) WRITE_ONCE(jh->b_next_transaction, NULL) __jbd2_journal_file_buffer(jh, BJ_Reserved) J_ASSERT_JH(jh, jh->b_frozen_data == NULL) // assertion failure ! The reproducer (See detail in [Link]) reports: ------------[ cut here ]------------ kernel BUG at fs/jbd2/transaction.c:1629! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 2 PID: 584 Comm: unlink Tainted: G W 5.19.0-rc6-00115-g4a57a8400075-dirty #697 RIP: 0010:jbd2_journal_dirty_metadata+0x3c5/0x470 RSP: 0018:ffffc90000be7ce0 EFLAGS: 00010202 Call Trace: <TASK> __ext4_handle_dirty_metadata+0xa0/0x290 ext4_handle_dirty_dirblock+0x10c/0x1d0 ext4_delete_entry+0x104/0x200 __ext4_unlink+0x22b/0x360 ext4_unlink+0x275/0x390 vfs_unlink+0x20b/0x4c0 do_unlinkat+0x42f/0x4c0 __x64_sys_unlink+0x37/0x50 do_syscall_64+0x35/0x80 After journal aborting, __jbd2_journal_refile_buffer() is executed with holding @jh->b_state_lock, we can fix it by moving 'is_handle_aborted()' into the area protected by @jh->b_state_lock. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216251 Fixes: 470decc613ab20 ("[PATCH] jbd2: initial copy of files from jbd") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Link: https://lore.kernel.org/r/20220715125152.4022726-1-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02ext4: block range must be validated before use in ext4_mb_clear_bb()Lukas Czerner1-1/+20
Block range to free is validated in ext4_free_blocks() using ext4_inode_block_valid() and then it's passed to ext4_mb_clear_bb(). However in some situations on bigalloc file system the range might be adjusted after the validation in ext4_free_blocks() which can lead to troubles on corrupted file systems such as one found by syzkaller that resulted in the following BUG kernel BUG at fs/ext4/ext4.h:3319! PREEMPT SMP NOPTI CPU: 28 PID: 4243 Comm: repro Kdump: loaded Not tainted 5.19.0-rc6+ #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014 RIP: 0010:ext4_free_blocks+0x95e/0xa90 Call Trace: <TASK> ? lock_timer_base+0x61/0x80 ? __es_remove_extent+0x5a/0x760 ? __mod_timer+0x256/0x380 ? ext4_ind_truncate_ensure_credits+0x90/0x220 ext4_clear_blocks+0x107/0x1b0 ext4_free_data+0x15b/0x170 ext4_ind_truncate+0x214/0x2c0 ? _raw_spin_unlock+0x15/0x30 ? ext4_discard_preallocations+0x15a/0x410 ? ext4_journal_check_start+0xe/0x90 ? __ext4_journal_start_sb+0x2f/0x110 ext4_truncate+0x1b5/0x460 ? __ext4_journal_start_sb+0x2f/0x110 ext4_evict_inode+0x2b4/0x6f0 evict+0xd0/0x1d0 ext4_enable_quotas+0x11f/0x1f0 ext4_orphan_cleanup+0x3de/0x430 ? proc_create_seq_private+0x43/0x50 ext4_fill_super+0x295f/0x3ae0 ? snprintf+0x39/0x40 ? sget_fc+0x19c/0x330 ? ext4_reconfigure+0x850/0x850 get_tree_bdev+0x16d/0x260 vfs_get_tree+0x25/0xb0 path_mount+0x431/0xa70 __x64_sys_mount+0xe2/0x120 do_syscall_64+0x5b/0x80 ? do_user_addr_fault+0x1e2/0x670 ? exc_page_fault+0x70/0x170 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7fdf4e512ace Fix it by making sure that the block range is properly validated before used every time it changes in ext4_free_blocks() or ext4_mb_clear_bb(). Link: https://syzkaller.appspot.com/bug?id=5266d464285a03cee9dbfda7d2452a72c3c2ae7c Reported-by: syzbot+15cd994e273307bf5cfa@syzkaller.appspotmail.com Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Tadeusz Struk <tadeusz.struk@linaro.org> Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org> Link: https://lore.kernel.org/r/20220714165903.58260-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02mbcache: automatically delete entries from cache on freeingJan Kara1-69/+39
Use the fact that entries with elevated refcount are not removed from the hash and just move removal of the entry from the hash to the entry freeing time. When doing this we also change the generic code to hold one reference to the cache entry, not two of them, which makes code somewhat more obvious. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220712105436.32204-10-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02mbcache: Remove mb_cache_entry_delete()Jan Kara1-37/+0
Nobody uses mb_cache_entry_delete() anymore. Remove it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220712105436.32204-9-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02ext2: avoid deleting xattr block that is being reusedJan Kara1-29/+29
Currently when we decide to reuse xattr block we detect the case when the last reference to xattr block is being dropped at the same time and cancel the reuse attempt. Convert ext2 to a new scheme when as soon as matching mbcache entry is found, we wait with dropping the last xattr block reference until mbcache entry reference is dropped (meaning either the xattr block reference is increased or we decided not to reuse the block). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220712105436.32204-8-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02ext2: unindent codeblock in ext2_xattr_set()Jan Kara1-16/+16
Replace one else in ext2_xattr_set() with a goto. This makes following code changes simpler to follow. No functional changes. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220712105436.32204-7-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02ext2: factor our freeing of xattr block referenceJan Kara1-52/+38
Free of xattr block reference is opencode in two places. Factor it out into a separate function and use it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220712105436.32204-6-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02ext4: fix race when reusing xattr blocksJan Kara1-22/+45
When ext4_xattr_block_set() decides to remove xattr block the following race can happen: CPU1 CPU2 ext4_xattr_block_set() ext4_xattr_release_block() new_bh = ext4_xattr_block_cache_find() lock_buffer(bh); ref = le32_to_cpu(BHDR(bh)->h_refcount); if (ref == 1) { ... mb_cache_entry_delete(); unlock_buffer(bh); ext4_free_blocks(); ... ext4_forget(..., bh, ...); jbd2_journal_revoke(..., bh); ext4_journal_get_write_access(..., new_bh, ...) do_get_write_access() jbd2_journal_cancel_revoke(..., new_bh); Later the code in ext4_xattr_block_set() finds out the block got freed and cancels reusal of the block but the revoke stays canceled and so in case of block reuse and journal replay the filesystem can get corrupted. If the race works out slightly differently, we can also hit assertions in the jbd2 code. Fix the problem by making sure that once matching mbcache entry is found, code dropping the last xattr block reference (or trying to modify xattr block in place) waits until the mbcache entry reference is dropped. This way code trying to reuse xattr block is protected from someone trying to drop the last reference to xattr block. Reported-and-tested-by: Ritesh Harjani <ritesh.list@gmail.com> CC: stable@vger.kernel.org Fixes: 82939d7999df ("ext4: convert to mbcache2") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220712105436.32204-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02ext4: unindent codeblock in ext4_xattr_block_set()Jan Kara1-39/+38
Remove unnecessary else (and thus indentation level) from a code block in ext4_xattr_block_set(). It will also make following code changes easier. No functional changes. CC: stable@vger.kernel.org Fixes: 82939d7999df ("ext4: convert to mbcache2") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220712105436.32204-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02ext4: remove EA inode entry from mbcache on inode evictionJan Kara3-16/+11
Currently we remove EA inode from mbcache as soon as its xattr refcount drops to zero. However there can be pending attempts to reuse the inode and thus refcount handling code has to handle the situation when refcount increases from zero anyway. So save some work and just keep EA inode in mbcache until it is getting evicted. At that moment we are sure following iget() of EA inode will fail anyway (or wait for eviction to finish and load things from the disk again) and so removing mbcache entry at that moment is fine and simplifies the code a bit. CC: stable@vger.kernel.org Fixes: 82939d7999df ("ext4: convert to mbcache2") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220712105436.32204-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02mbcache: add functions to delete entry if unusedJan Kara1-2/+64
Add function mb_cache_entry_delete_or_get() to delete mbcache entry if it is unused and also add a function to wait for entry to become unused - mb_cache_entry_wait_unused(). We do not share code between the two deleting function as one of them will go away soon. CC: stable@vger.kernel.org Fixes: 82939d7999df ("ext4: convert to mbcache2") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220712105436.32204-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02mbcache: don't reclaim used entriesJan Kara1-1/+9
Do not reclaim entries that are currently used by somebody from a shrinker. Firstly, these entries are likely useful. Secondly, we will need to keep such entries to protect pending increment of xattr block refcount. CC: stable@vger.kernel.org Fixes: 82939d7999df ("ext4: convert to mbcache2") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220712105436.32204-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02ext4: make sure ext4_append() always allocates new blockLukas Czerner1-0/+16
ext4_append() must always allocate a new block, otherwise we run the risk of overwriting existing directory block corrupting the directory tree in the process resulting in all manner of problems later on. Add a sanity check to see if the logical block is already allocated and error out if it is. Cc: stable@kernel.org Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20220704142721.157985-2-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02ext4: check if directory block is within i_sizeLukas Czerner1-0/+7
Currently ext4 directory handling code implicitly assumes that the directory blocks are always within the i_size. In fact ext4_append() will attempt to allocate next directory block based solely on i_size and the i_size is then appropriately increased after a successful allocation. However, for this to work it requires i_size to be correct. If, for any reason, the directory inode i_size is corrupted in a way that the directory tree refers to a valid directory block past i_size, we could end up corrupting parts of the directory tree structure by overwriting already used directory blocks when modifying the directory. Fix it by catching the corruption early in __ext4_read_dirblock(). Addresses Red-Hat-Bugzilla: #2070205 CVE: CVE-2022-1184 Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: stable@vger.kernel.org Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20220704142721.157985-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02ext4: reflect mb_optimize_scan value in options fileOjaswin Mujoo1-0/+9
Add support to display the mb_optimize_scan value in /proc/fs/ext4/<dev>/options file. The option is only displayed when the value is non default. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20220704054603.21462-1-ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02ext4: avoid remove directory when directory is corruptedYe Bin1-5/+2
Now if check directoy entry is corrupted, ext4_empty_dir may return true then directory will be removed when file system mounted with "errors=continue". In order not to make things worse just return false when directory is corrupted. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220622090223.682234-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02ext4: aligned '*' in commentsJiang Jian1-1/+1
The '*' in the comment is not aligned. Signed-off-by: Jiang Jian <jiangjian@cdjrlc.com> Link: https://lore.kernel.org/r/20220621061531.19669-1-jiangjian@cdjrlc.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02ext4: recover csum seed of tmp_inode after migrating to extentsLi Lingfeng1-1/+3
When migrating to extents, the checksum seed of temporary inode need to be replaced by inode's, otherwise the inode checksums will be incorrect when swapping the inodes data. However, the temporary inode can not match it's checksum to itself since it has lost it's own checksum seed. mkfs.ext4 -F /dev/sdc mount /dev/sdc /mnt/sdc xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/sdc/testfile chattr -e /mnt/sdc/testfile chattr +e /mnt/sdc/testfile umount /dev/sdc fsck -fn /dev/sdc ======== ... Pass 1: Checking inodes, blocks, and sizes Inode 13 passes checks, but checksum does not match inode. Fix? no ... ======== The fix is simple, save the checksum seed of temporary inode, and recover it after migrating to extents. Fixes: e81c9302a6c3 ("ext4: set csum seed in tmp inode while migrating to extents") Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220617062515.2113438-1-lilingfeng3@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02ext4: fix warning in ext4_iomap_begin as race between bmap and writeYe Bin1-3/+9
We got issue as follows: ------------[ cut here ]------------ WARNING: CPU: 3 PID: 9310 at fs/ext4/inode.c:3441 ext4_iomap_begin+0x182/0x5d0 RIP: 0010:ext4_iomap_begin+0x182/0x5d0 RSP: 0018:ffff88812460fa08 EFLAGS: 00010293 RAX: ffff88811f168000 RBX: 0000000000000000 RCX: ffffffff97793c12 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 RBP: ffff88812c669160 R08: ffff88811f168000 R09: ffffed10258cd20f R10: ffff88812c669077 R11: ffffed10258cd20e R12: 0000000000000001 R13: 00000000000000a4 R14: 000000000000000c R15: ffff88812c6691ee FS: 00007fd0d6ff3740(0000) GS:ffff8883af180000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fd0d6dda290 CR3: 0000000104a62000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: iomap_apply+0x119/0x570 iomap_bmap+0x124/0x150 ext4_bmap+0x14f/0x250 bmap+0x55/0x80 do_vfs_ioctl+0x952/0xbd0 __x64_sys_ioctl+0xc6/0x170 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Above issue may happen as follows: bmap write bmap ext4_bmap iomap_bmap ext4_iomap_begin ext4_file_write_iter ext4_buffered_write_iter generic_perform_write ext4_da_write_begin ext4_da_write_inline_data_begin ext4_prepare_inline_data ext4_create_inline_data ext4_set_inode_flag(inode, EXT4_INODE_INLINE_DATA); if (WARN_ON_ONCE(ext4_has_inline_data(inode))) ->trigger bug_on To solved above issue hold inode lock in ext4_bamp. Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20220617013935.397596-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org