aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2022-05-25ceph: try to queue a writeback if revoking failsXiubo Li1-4/+24
If the pagecaches writeback just finished and the i_wrbuffer_ref reaches zero it will try to trigger ceph_check_caps(). But if just before ceph_check_caps() the i_wrbuffer_ref could be increased again by mmap/cache write, then the Fwb revoke will fail. We need to try to queue a writeback in this case instead of triggering the writeback by BDI's delayed work per 5 seconds. URL: https://tracker.ceph.com/issues/46904 URL: https://tracker.ceph.com/issues/55377 Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-05-25ceph: fix statfs for subdir mountsLuís Henriques3-13/+36
When doing a mount using as base a directory that has 'max_bytes' quotas statfs uses that value as the total; if a subdirectory is used instead, the same 'max_bytes' too in statfs, unless there is another quota set. Unfortunately, if this subdirectory only has the 'max_files' quota set, then statfs uses the filesystem total. Fix this by making sure we only lookup realms that contain the 'max_bytes' quota. Cc: Ryan Taylor <[email protected]> URL: https://tracker.ceph.com/issues/55090 Signed-off-by: Luís Henriques <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-05-25ceph: fix possible deadlock when holding Fwb to get inline_dataXiubo Li1-14/+19
1, mount with wsync. 2, create a file with O_RDWR, and the request was sent to mds.0: ceph_atomic_open()--> ceph_mdsc_do_request(openc) finish_open(file, dentry, ceph_open)--> ceph_open()--> ceph_init_file()--> ceph_init_file_info()--> ceph_uninline_data()--> { ... if (inline_version == 1 || /* initial version, no data */ inline_version == CEPH_INLINE_NONE) goto out_unlock; ... } The inline_version will be 1, which is the initial version for the new create file. And here the ci->i_inline_version will keep with 1, it's buggy. 3, buffer write to the file immediately: ceph_write_iter()--> ceph_get_caps(file, need=Fw, want=Fb, ...); generic_perform_write()--> a_ops->write_begin()--> ceph_write_begin()--> netfs_write_begin()--> netfs_begin_read()--> netfs_rreq_submit_slice()--> netfs_read_from_server()--> rreq->netfs_ops->issue_read()--> ceph_netfs_issue_read()--> { ... if (ci->i_inline_version != CEPH_INLINE_NONE && ceph_netfs_issue_op_inline(subreq)) return; ... } ceph_put_cap_refs(ci, Fwb); The ceph_netfs_issue_op_inline() will send a getattr(Fsr) request to mds.1. 4, then the mds.1 will request the rd lock for CInode::filelock from the auth mds.0, the mds.0 will do the CInode::filelock state transation from excl --> sync, but it need to revoke the Fxwb caps back from the clients. While the kernel client has aleady held the Fwb caps and waiting for the getattr(Fsr). It's deadlock! URL: https://tracker.ceph.com/issues/55377 Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-05-25ceph: redirty the page for writepage on failureXiubo Li1-1/+3
When run out of memories we should redirty the page before failing the writepage. Or we will hit BUG_ON(folio_get_private(folio)) in ceph_dirty_folio(). URL: https://tracker.ceph.com/issues/55421 Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-05-25ceph: try to choose the auth MDS if possible for getattrXiubo Li3-2/+29
If any 'x' caps is issued we can just choose the auth MDS instead of the random replica MDSes. Because only when the Locker is in LOCK_EXEC state will the loner client could get the 'x' caps. And if we send the getattr requests to any replica MDS it must auth pin and tries to rdlock from the auth MDS, and then the auth MDS need to do the Locker state transition to LOCK_SYNC. And after that the lock state will change back. This cost much when doing the Locker state transition and usually will need to revoke caps from clients. URL: https://tracker.ceph.com/issues/55240 Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-05-25ceph: disable updating the atime since cephfs won't maintain itXiubo Li2-1/+1
Since CephFS makes no attempt to maintain atime, we shouldn't try to update it in mmap and generic read cases and ignore updating it in direct and sync read cases. And even we update it in mmap and generic read cases we will drop it and won't sync it to MDS. And we are seeing the atime will be updated and then dropped to the floor again and again. URL: https://lists.ceph.io/hyperkitty/list/[email protected]/thread/VSJM7T4CS5TDRFF6XFPIYMHP75K73PZ6/ Signed-off-by: Xiubo Li <[email protected]> Acked-by: Ilya Dryomov <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-05-25ceph: flush the mdlog for filesystem syncXiubo Li1-6/+27
Before waiting for a request's safe reply, we will send the mdlog flush request to the relevant MDS. And this will also flush the mdlog for all the other unsafe requests in the same session, so we can record the last session and no need to flush mdlog again in the next loop. But there still have cases that it may send the mdlog flush requst twice or more, but that should be not often. Rename wait_unsafe_requests() to flush_mdlog_and_wait_mdsc_unsafe_requests() to make it more descriptive. [xiubli: fold in MDS request refcount leak fix from Jeff] URL: https://tracker.ceph.com/issues/55284 URL: https://tracker.ceph.com/issues/55411 Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-05-25ceph: rename unsafe_request_wait()Xiubo Li1-4/+4
Rename it to flush_mdlog_and_wait_inode_unsafe_requests() to make it more descriptive. Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-05-25ceph: fix statx AT_STATX_DONT_SYNC vs AT_STATX_FORCE_SYNC checkXiubo Li1-1/+1
From the posix and the initial statx supporting commit comments, the AT_STATX_DONT_SYNC is a lightweight stat and the AT_STATX_FORCE_SYNC is a heaverweight one. And also checked all the other current usage about these two flags they are all doing the same, that is only when the AT_STATX_FORCE_SYNC is not set and the AT_STATX_DONT_SYNC is set will they skip sync retriving the attributes from storage. Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: David Howells <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-05-25ceph: no need to invalidate the fscache twiceXiubo Li1-1/+0
Fixes: 400e1286c0ec3 ("ceph: conversion to new fscache API") Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-05-25ceph: replace usage of found with dedicated list iterator variableJakob Koschel1-17/+15
To move the list iterator variable into the list_for_each_entry_*() macro in the future it should be avoided to use the list iterator variable after the loop body. To *never* use the list iterator variable after the loop it was concluded to use a separate iterator variable instead of a found boolean. This removes the need to use a found variable and simply checking if the variable was set, can determine if the break/goto was hit. Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ Signed-off-by: Jakob Koschel <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-05-25ceph: use dedicated list iterator variableJakob Koschel1-3/+4
To move the list iterator variable into the list_for_each_entry_*() macro in the future it should be avoided to use the list iterator variable after the loop body. To *never* use the list iterator variable after the loop it was concluded to use a separate iterator variable. Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ Signed-off-by: Jakob Koschel <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-05-25ceph: update the dlease for the hashed dentry when removingXiubo Li1-1/+3
The MDS will always refresh the dentry lease when removing the files or directories. And if the dentry is still hashed, we can update the dentry lease and no need to do the lookup from the MDS later. Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-05-25ceph: stop retrying the request when exceeding 256 timesXiubo Li1-2/+23
The type of 'r_attempts' in kernel 'ceph_mds_request' is 'int', while in 'ceph_mds_request_head' the type of 'num_retry' is '__u8'. So in case the request retries exceeding 256 times, the MDS will receive a incorrect retry seq. In this case it's ususally a bug in MDS and continue retrying the request makes no sense. For now let's limit it to 256. In future this could be fixed in ceph code, so avoid using the hardcode here. Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-05-25ceph: stop forwarding the request when exceeding 256 timesXiubo Li1-5/+34
The type of 'num_fwd' in ceph 'MClientRequestForward' is 'int32_t', while in 'ceph_mds_request_head' the type is '__u8'. So in case the request bounces between MDSes exceeding 256 times, the client will get stuck. In this case it's ususally a bug in MDS and continue bouncing the request makes no sense. URL: https://tracker.ceph.com/issues/55130 Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Luís Henriques <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-05-25ceph: remove unused CEPH_MDS_LEASE_RELEASE related codeXiubo Li1-6/+0
The ceph_mdsc_lease_release() has been removed by commit 8aa152c77890 (ceph: remove ceph_mdsc_lease_release). ceph_mdsc_lease_send_msg will never be called with CEPH_MDS_LEASE_RELEASE. Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-05-25ceph: allow ceph.dir.rctime xattr to be updatableVenky Shankar1-1/+9
`rctime' has been a pain point in cephfs due to its buggy nature - inconsistent values reported and those sorts. Fixing rctime is non-trivial needing an overall redesign of the entire nested statistics infrastructure. As a workaround, PR http://github.com/ceph/ceph/pull/37938 allows this extended attribute to be manually set. This allows users to "fixup" inconsistent rctime values. While this sounds messy, its probably the wisest approach allowing users/scripts to workaround buggy rctime values. The above PR enables Ceph MDS to allow manually setting rctime extended attribute with the corresponding user-land changes. We may as well allow the same to be done via kclient for parity. Signed-off-by: Venky Shankar <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-05-25f2fs: add f2fs_init_write_merge_io functionYufen Yu3-24/+32
Almost all other initialization of variables in f2fs_fill_super are extraced to a single function. Also do it for write_io[], which can make code more clean. This patch just refactors the code, theres no functional change. Signed-off-by: Yufen Yu <[email protected]> Reviewed-by: Chao Yu <[email protected]> [Jaegeuk Kim: clean up] Signed-off-by: Jaegeuk Kim <[email protected]>
2022-05-25cifs: fix ntlmssp on old serversPaulo Alcantara6-49/+26
Some older servers seem to require the workstation name during ntlmssp to be at most 15 chars (RFC1001 name length), so truncate it before sending when using insecure dialects. Link: https://lore.kernel.org/r/[email protected] Reported-by: Byron Stanoszek <[email protected]> Tested-by: Byron Stanoszek <[email protected]> Fixes: 49bd49f983b5 ("cifs: send workstation name during ntlmssp session setup") Cc: [email protected] Signed-off-by: Paulo Alcantara (SUSE) <[email protected]> Signed-off-by: Steve French <[email protected]>
2022-05-25io_uring: make prep and issue side of req handlers named consistentlyJens Axboe1-6/+6
Almost all of them are, the odd ones out are the poll remove and the files update request. Name them like the others, which is: io_#cmdname_prep for request preparation io_#cmdname for request issue Reviewed-by: Kanchan Joshi <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2022-05-25io_uring: make timeout prep handlers consistent with other prep handlersJens Axboe1-4/+17
All other opcodes take a {req, sqe} set for prep handling, split out a timeout prep handler so that timeout and linked timeouts can use the same one. Reviewed-by: Kanchan Joshi <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2022-05-24f2fs: avoid unneeded error handling for revoke_entry_slab allocationChao Yu1-5/+0
In __f2fs_commit_atomic_write(), we will guarantee success of revoke_entry_slab allocation, so let's avoid unneeded error handling. Signed-off-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2022-05-24f2fs: allow compression for mmap files in compress_mode=userSungjong Seo1-10/+0
Since commit e3c548323d32 ("f2fs: let's allow compression for mmap files"), it has been allowed to compress mmap files. However, in compress_mode=user, it is not allowed yet. To keep the same concept in both compress_modes, f2fs_ioc_(de)compress_file() should also allow it. Let's remove checking mmap files in f2fs_ioc_(de)compress_file() so that the compression for mmap files is also allowed in compress_mode=user. Signed-off-by: Sungjong Seo <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2022-05-24Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds127-940/+950
Pull page cache updates from Matthew Wilcox: - Appoint myself page cache maintainer - Fix how scsicam uses the page cache - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS - Remove the AOP flags entirely - Remove pagecache_write_begin() and pagecache_write_end() - Documentation updates - Convert several address_space operations to use folios: - is_dirty_writeback - readpage becomes read_folio - releasepage becomes release_folio - freepage becomes free_folio - Change filler_t to require a struct file pointer be the first argument like ->read_folio * tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits) nilfs2: Fix some kernel-doc comments Appoint myself page cache maintainer fs: Remove aops->freepage secretmem: Convert to free_folio nfs: Convert to free_folio orangefs: Convert to free_folio fs: Add free_folio address space operation fs: Convert drop_buffers() to use a folio fs: Change try_to_free_buffers() to take a folio jbd2: Convert release_buffer_page() to use a folio jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio reiserfs: Convert release_buffer_page() to use a folio fs: Remove last vestiges of releasepage ubifs: Convert to release_folio reiserfs: Convert to release_folio orangefs: Convert to release_folio ocfs2: Convert to release_folio nilfs2: Remove comment about releasepage nfs: Convert to release_folio jfs: Convert to release_folio ...
2022-05-24Merge tag 'iomap-5.19-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2-6/+4
Pull iomap updates from Darrick Wong: "There's a couple of corrections sent in by Andreas for some accounting errors. The biggest change this time around is that writeback errors longer clear pageuptodate nor does XFS invalidate the page cache anymore. This brings XFS (and gfs2/zonefs) behavior in line with every other Linux filesystem driver, and fixes some UAF bugs that only cropped up after willy turned on multipage folios for XFS in 5.18-rc1. Regrettably, it took all the way to the end of the 5.18 cycle to find the source of these bugs and reach a consensus that XFS' writeback failure behavior from 20 years ago is no longer necessary. Summary: - Fix a couple of accounting errors in the buffered io code. - Discontinue the practice of marking folios !uptodate and invalidating them when writeback fails. This fixes some UAF bugs when multipage folios are enabled, and brings the behavior of XFS/gfs/zonefs into alignment with the behavior of all the other Linux filesystems" * tag 'iomap-5.19-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: don't invalidate folios after writeback errors iomap: iomap_write_end cleanup iomap: iomap_write_failed fix
2022-05-24Merge tag 'dlm-5.19' of ↵Linus Torvalds15-669/+633
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This includes several large patches to improve endian handling and remove sparse warnings. The code previously used in/out, in-place endianness conversion functions. Other code cleanup includes the list iterator changes. Finally, a long standing bug was found and fixed, caused by missed decrement on an lock struct ref count" * tag 'dlm-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: (28 commits) dlm: use kref_put_lock in __put_lkb dlm: use kref_put_lock in put_rsb dlm: remove unnecessary error assign dlm: fix missing lkb refcount handling fs: dlm: cast resource pointer to uintptr_t dlm: replace usage of found with dedicated list iterator variable dlm: remove usage of list iterator for list_add() after the loop body dlm: fix pending remove if msg allocation fails dlm: fix wake_up() calls for pending remove dlm: check required context while close dlm: cleanup lock handling in dlm_master_lookup dlm: remove found label in dlm_master_lookup dlm: remove __user conversion warnings dlm: move conversion to compile time dlm: use __le types for dlm messages dlm: use __le types for rcom messages dlm: use __le types for dlm header dlm: use __le types for options header dlm: add __CHECKER__ for false positives dlm: move global to static inits ...
2022-05-24Merge tag 'ext4_for_linus' of ↵Linus Torvalds14-429/+564
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Various bug fixes and cleanups for ext4. In particular, move the crypto related fucntions from fs/ext4/super.c into a new fs/ext4/crypto.c, and fix a number of bugs found by fuzzers and error injection tools" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits) ext4: only allow test_dummy_encryption when supported ext4: fix bug_on in __es_tree_search ext4: avoid cycles in directory h-tree ext4: verify dir block before splitting it ext4: filter out EXT4_FC_REPLAY from on-disk superblock field s_state ext4: fix bug_on in ext4_writepages ext4: refactor and move ext4_ioctl_get_encryption_pwsalt() ext4: cleanup function defs from ext4.h into crypto.c ext4: move ext4 crypto code to its own file crypto.c ext4: fix memory leak in parse_apply_sb_mount_options() ext4: reject the 'commit' option on ext2 filesystems ext4: remove duplicated #include of dax.h in inode.c ext4: fix race condition between ext4_write and ext4_convert_inline_data ext4: convert symlink external data block mapping to bdev ext4: add nowait mode for ext4_getblk() ext4: fix journal_ioprio mount option handling ext4: mark group as trimmed only if it was fully scanned ext4: fix use-after-free in ext4_rename_dir_prepare ext4: add unmount filesystem message ext4: remove unnecessary conditionals ...
2022-05-24Merge tag 'gfs2-v5.18-rc6-fixes' of ↵Linus Torvalds8-66/+91
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Andreas Gruenbacher: - Clean up the allocation of glocks that have an address space attached - Quota locking fix and quota iomap conversion - Fix the FITRIM error reporting - Some list iterator cleanups * tag 'gfs2-v5.18-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Convert function bh_get to use iomap gfs2: use i_lock spin_lock for inode qadata gfs2: Return more useful errors from gfs2_rgrp_send_discards() gfs2: Use container_of() for gfs2_glock(aspace) gfs2: Explain some direct I/O oddities gfs2: replace 'found' with dedicated list iterator variable
2022-05-24Merge tag 'for-5.19-tag' of ↵Linus Torvalds63-4178/+4441
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Features: - subpage: - support for PAGE_SIZE > 4K (previously only 64K) - make it work with raid56 - repair super block num_devices automatically if it does not match the number of device items - defrag can convert inline extents to regular extents, up to now inline files were skipped but the setting of mount option max_inline could affect the decision logic - zoned: - minimal accepted zone size is explicitly set to 4MiB - make zone reclaim less aggressive and don't reclaim if there are enough free zones - add per-profile sysfs tunable of the reclaim threshold - allow automatic block group reclaim for non-zoned filesystems, with sysfs tunables - tree-checker: new check, compare extent buffer owner against owner rootid Performance: - avoid blocking on space reservation when doing nowait direct io writes (+7% throughput for reads and writes) - NOCOW write throughput improvement due to refined locking (+3%) - send: reduce pressure to page cache by dropping extent pages right after they're processed Core: - convert all radix trees to xarray - add iterators for b-tree node items - support printk message index - user bulk page allocation for extent buffers - switch to bio_alloc API, use on-stack bios where convenient, other bio cleanups - use rw lock for block groups to favor concurrent reads - simplify workques, don't allocate high priority threads for all normal queues as we need only one - refactor scrub, process chunks based on their constraints and similarity - allocate direct io structures on stack and pass around only pointers, avoids allocation and reduces potential error handling Fixes: - fix count of reserved transaction items for various inode operations - fix deadlock between concurrent dio writes when low on free data space - fix a few cases when zones need to be finished VFS, iomap: - add helper to check if sb write has started (usable for assertions) - new helper iomap_dio_alloc_bio, export iomap_dio_bio_end_io" * tag 'for-5.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (173 commits) btrfs: zoned: introduce a minimal zone size 4M and reject mount btrfs: allow defrag to convert inline extents to regular extents btrfs: add "0x" prefix for unsupported optional features btrfs: do not account twice for inode ref when reserving metadata units btrfs: zoned: fix comparison of alloc_offset vs meta_write_pointer btrfs: send: avoid trashing the page cache btrfs: send: keep the current inode open while processing it btrfs: allocate the btrfs_dio_private as part of the iomap dio bio btrfs: move struct btrfs_dio_private to inode.c btrfs: remove the disk_bytenr in struct btrfs_dio_private btrfs: allocate dio_data on stack iomap: add per-iomap_iter private data iomap: allow the file system to provide a bio_set for direct I/O btrfs: add a btrfs_dio_rw wrapper btrfs: zoned: zone finish unused block group btrfs: zoned: properly finish block group on metadata write btrfs: zoned: finish block group when there are no more allocatable bytes left btrfs: zoned: consolidate zone finish functions btrfs: zoned: introduce btrfs_zoned_bg_is_full btrfs: improve error reporting in lookup_inline_extent_backref ...
2022-05-24Merge tag 'zonefs-5.19-rc1-fix' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs Pull zonefs fix from Damien Le Moal: "A single patch to fix zonefs_init_file_inode() return value" * tag 'zonefs-5.19-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs: zonefs: Fix zonefs_init_file_inode() return value
2022-05-24Merge tag 'erofs-for-5.19-rc1' of ↵Linus Torvalds19-164/+1573
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs (and fscache) updates from Gao Xiang: "After working on it on the mailing list for more than half a year, we finally form 'erofs over fscache' feature into shape. Hopefully it could bring more possibility to the communities. The story mainly started from a new project what we called "RAFS v6" [1] for Nydus image service almost a year ago, which enhances EROFS to be a new form of one bootstrap (which includes metadata representing the whole fs tree) + several data-deduplicated content addressable blobs (actually treated as multiple devices). Each blob can represent one container image layer but not quite exactly since all new data can be fully existed in the previous blobs so no need to introduce another new blob. It is actually not a new idea (at least on my side it's much like a simpilied casync [2] for now) and has many benefits over per-file blobs or some other exist ways since typically each RAFS v6 image only has dozens of device blobs instead of thousands of per-file blobs. It's easy to be signed with user keys as a golden image, transfered untouchedly with minimal overhead over the network, kept in some type of storage conveniently, and run with (optional) runtime verification but without involving too many irrelevant features crossing the system beyond EROFS itself. At least it's our final goal and we're keeping working on it. There was also a good summary of this approach from the casync author [3]. Regardless further optimizations, this work is almost done in the previous Linux release cycles. In this round, we'd like to introduce on-demand load for EROFS with the fscache/cachefiles infrastructure, considering the following advantages: - Introduce new file-based backend to EROFS. Although each image only contains dozens of blobs but in densely-deployed runC host for example, there could still be massive blobs on a machine, which is messy if each blob is treated as a device. In contrast, fscache and cachefiles are really great interfaces for us to make them work. - Introduce on-demand load to fscache and EROFS. Previously, fscache is mainly used to caching network-likewise filesystems, now it can support on-demand downloading for local fses too with the exact localfs on-disk format. It has many advantages which we're been described in the latest patchset cover letter [4]. In addition to that, most importantly, the cached data is still stored in the original local fs on-disk format so that it's still the one signed with private keys but only could be partially available. Users can fully trust it during running. Later, users can also back up cachefiles easily to another machine. - More reliable on-demand approach in principle. After data is all available locally, user daemon can be no longer online in some use cases, which helps daemon crash recovery (filesystems can still in service) and hot-upgrade (user daemon can be upgraded more frequently due to new features or protocols introduced.) - Other format can also be converted to EROFS filesystem format over the internet on the fly with the new on-demand load feature and mounted. That is entirely possible with on-demand load feature as long as such archive format metadata can be fetched in advance like stargz. In addition, although currently our target user is Nydus image service [5], but laterly, it can be used for other use cases like on-demand system booting, etc. As for the fscache on-demand load feature itself, strictly it can be used for other local fses too. Laterly we could promote most code to the iomap infrastructure and also enhance it in the read-write way if other local fses are interested. Thanks David Howells for taking so much time and patience on this these months, many thanks with great respect here again! Thanks Jeffle for working on this feature and Xin Yin from Bytedance for asynchronous I/O implementation as well as Zichen Tian, Jia Zhu, and Yan Song for testing, much appeciated. We're also exploring more possibly over fscache cache management over FSDAX for secure containers and working on more improvements and useful features for fscache, cachefiles, and on-demand load. In addition to "erofs over fscache", NFS export and idmapped mount are also completed in this cycle for container use cases as well. Summary: - Add erofs on-demand load support over fscache - Support NFS export for erofs - Support idmapped mounts for erofs - Don't prompt for risk any more when using big pcluster - Fix buffer copy overflow of ztailpacking feature - Several minor cleanups" [1] https://lore.kernel.org/r/[email protected] [2] https://github.com/systemd/casync [3] http://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html [4] https://lore.kernel.org/r/[email protected] [5] https://github.com/dragonflyoss/image-service * tag 'erofs-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: (29 commits) erofs: scan devices from device table erofs: change to use asynchronous io for fscache readpage/readahead erofs: add 'fsid' mount option erofs: implement fscache-based data readahead erofs: implement fscache-based data read for inline layout erofs: implement fscache-based data read for non-inline layout erofs: implement fscache-based metadata read erofs: register fscache context for extra data blobs erofs: register fscache context for primary data blob erofs: add erofs_fscache_read_folios() helper erofs: add anonymous inode caching metadata for data blobs erofs: add fscache context helper functions erofs: register fscache volume erofs: add fscache mode check helper erofs: make erofs_map_blocks() generally available cachefiles: document on-demand read mode cachefiles: add tracepoints for on-demand read mode cachefiles: enable on-demand read mode cachefiles: implement on-demand read cachefiles: notify the user daemon when withdrawing cookie ...
2022-05-24Merge tag 'exfat-for-5.19-rc1' of ↵Linus Torvalds6-61/+47
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat updates from Namjae Jeon: - fix referencing wrong parent directory information during rename - introduce a sys_tz mount option to use system timezone - improve performance while zeroing a cluster with dirsync mount option - fix slab-out-bounds in exat_clear_bitmap() reported from syzbot * tag 'exfat-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: check if cluster num is valid exfat: reduce block requests when zeroing a cluster block: add sync_blockdev_range() exfat: introduce mount option 'sys_tz' exfat: fix referencing wrong parent directory information after renaming
2022-05-24Merge tag 'fs.idmapped.v5.19' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull fs idmapping updates from Christian Brauner: "This contains two minor updates: - An update to the idmapping documentation by Rodrigo making it easier to understand that we first introduce several use-cases that fail without idmapped mounts simply to explain how they can be handled with idmapped mounts. - When changing a mount's idmapping we now hold writers to make it more robust. This is similar to turning a mount ro with the difference that in contrast to turning a mount ro changing the idmapping can only ever be done once while a mount can transition between ro and rw as much as it wants. The vfs layer itself takes care to retrieve the idmapping of a mount once ensuring that the idmapping used for vfs permission checking is identical to the idmapping passed down to the filesystem. All filesystems with FS_ALLOW_IDMAP raised take the same precautions as the vfs in code-paths that are outside of direct control of the vfs such as ioctl()s. However, holding writers makes this more robust and predictable for both the kernel and userspace. This is a minor user-visible change. But it is extremely unlikely to matter. The caller must've created a detached mount via OPEN_TREE_CLONE and then handed that O_PATH fd to another process or thread which then must've gotten a writable fd for that mount and started creating files in there while the caller is still changing mount properties. While not impossible it will be an extremely rare corner-case and should in general be considered a bug in the application. Consider making a mount MOUNT_ATTR_NOEXEC or MOUNT_ATTR_NODEV while allowing someone else to perform lookups or exec'ing in parallel by handing them a copy of the OPEN_TREE_CLONE fd or another fd beneath that mount. I've pinged all major users of idmapped mounts pointing out this change and none of them have active writers on a mount while still changing mount properties. It would've been strange if they did. The rest and majority of the work will be coming through the overlayfs tree this cycle. In addition to overlayfs this cycle should also see support for idmapped mounts on erofs as I've acked a patch to this effect a little while ago" * tag 'fs.idmapped.v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: fs: hold writers when changing mount's idmapping docs: Add small intro to idmap examples
2022-05-24Merge tag 'integrity-v5.19' of ↵Linus Torvalds3-7/+44
git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity Pull IMA updates from Mimi Zohar: "New is IMA support for including fs-verity file digests and signatures in the IMA measurement list as well as verifying the fs-verity file digest based signatures, both based on policy. In addition, are two bug fixes: - avoid reading UEFI variables, which cause a page fault, on Apple Macs with T2 chips. - remove the original "ima" template Kconfig option to address a boot command line ordering issue. The rest is a mixture of code/documentation cleanup" * tag 'integrity-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity: integrity: Fix sparse warnings in keyring_handler evm: Clean up some variables evm: Return INTEGRITY_PASS for enum integrity_status value '0' efi: Do not import certificates from UEFI Secure Boot for T2 Macs fsverity: update the documentation ima: support fs-verity file digest based version 3 signatures ima: permit fsverity's file digests in the IMA measurement list ima: define a new template field named 'd-ngv2' and templates fs-verity: define a function to return the integrity protected file digest ima: use IMA default hash algorithm for integrity violations ima: fix 'd-ng' comments and documentation ima: remove the IMA_TEMPLATE Kconfig option ima: remove redundant initialization of pointer 'file'.
2022-05-24Merge tag 'execve-v5.19-rc1' of ↵Linus Torvalds2-179/+66
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: - Fix binfmt_flat GOT handling for riscv (Niklas Cassel) - Remove unused/broken binfmt_flat shared library and coredump code (Eric W. Biederman) * tag 'execve-v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_flat: Remove shared library support binfmt_flat: Drop vestiges of coredump support binfmt_flat: do not stop relocating GOT entries prematurely on riscv
2022-05-24ext4: only allow test_dummy_encryption when supportedEric Biggers2-28/+38
Make the test_dummy_encryption mount option require that the encrypt feature flag be already enabled on the filesystem, rather than automatically enabling it. Practically, this means that "-O encrypt" will need to be included in MKFS_OPTIONS when running xfstests with the test_dummy_encryption mount option. (ext4/053 also needs an update.) Moreover, as long as the preconditions for test_dummy_encryption are being tightened anyway, take the opportunity to start rejecting it when !CONFIG_FS_ENCRYPTION rather than ignoring it. The motivation for requiring the encrypt feature flag is that: - Having the filesystem auto-enable feature flags is problematic, as it bypasses the usual sanity checks. The specific issue which came up recently is that in kernel versions where ext4 supports casefold but not encrypt+casefold (v5.1 through v5.10), the kernel will happily add the encrypt flag to a filesystem that has the casefold flag, making it unmountable -- but only for subsequent mounts, not the initial one. This confused the casefold support detection in xfstests, causing generic/556 to fail rather than be skipped. - The xfstests-bld test runners (kvm-xfstests et al.) already use the required mkfs flag, so they will not be affected by this change. Only users of test_dummy_encryption alone will be affected. But, this option has always been for testing only, so it should be fine to require that the few users of this option update their test scripts. - f2fs already requires it (for its equivalent feature flag). Signed-off-by: Eric Biggers <[email protected]> Reviewed-by: Gabriel Krisman Bertazi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2022-05-24ext4: fix bug_on in __es_tree_searchBaokun Li1-5/+5
Hulk Robot reported a BUG_ON: ================================================================== kernel BUG at fs/ext4/extents_status.c:199! [...] RIP: 0010:ext4_es_end fs/ext4/extents_status.c:199 [inline] RIP: 0010:__es_tree_search+0x1e0/0x260 fs/ext4/extents_status.c:217 [...] Call Trace: ext4_es_cache_extent+0x109/0x340 fs/ext4/extents_status.c:766 ext4_cache_extents+0x239/0x2e0 fs/ext4/extents.c:561 ext4_find_extent+0x6b7/0xa20 fs/ext4/extents.c:964 ext4_ext_map_blocks+0x16b/0x4b70 fs/ext4/extents.c:4384 ext4_map_blocks+0xe26/0x19f0 fs/ext4/inode.c:567 ext4_getblk+0x320/0x4c0 fs/ext4/inode.c:980 ext4_bread+0x2d/0x170 fs/ext4/inode.c:1031 ext4_quota_read+0x248/0x320 fs/ext4/super.c:6257 v2_read_header+0x78/0x110 fs/quota/quota_v2.c:63 v2_check_quota_file+0x76/0x230 fs/quota/quota_v2.c:82 vfs_load_quota_inode+0x5d1/0x1530 fs/quota/dquot.c:2368 dquot_enable+0x28a/0x330 fs/quota/dquot.c:2490 ext4_quota_enable fs/ext4/super.c:6137 [inline] ext4_enable_quotas+0x5d7/0x960 fs/ext4/super.c:6163 ext4_fill_super+0xa7c9/0xdc00 fs/ext4/super.c:4754 mount_bdev+0x2e9/0x3b0 fs/super.c:1158 mount_fs+0x4b/0x1e4 fs/super.c:1261 [...] ================================================================== Above issue may happen as follows: ------------------------------------- ext4_fill_super ext4_enable_quotas ext4_quota_enable ext4_iget __ext4_iget ext4_ext_check_inode ext4_ext_check __ext4_ext_check ext4_valid_extent_entries Check for overlapping extents does't take effect dquot_enable vfs_load_quota_inode v2_check_quota_file v2_read_header ext4_quota_read ext4_bread ext4_getblk ext4_map_blocks ext4_ext_map_blocks ext4_find_extent ext4_cache_extents ext4_es_cache_extent ext4_es_cache_extent __es_tree_search ext4_es_end BUG_ON(es->es_lblk + es->es_len < es->es_lblk) The error ext4 extents is as follows: 0af3 0300 0400 0000 00000000 extent_header 00000000 0100 0000 12000000 extent1 00000000 0100 0000 18000000 extent2 02000000 0400 0000 14000000 extent3 In the ext4_valid_extent_entries function, if prev is 0, no error is returned even if lblock<=prev. This was intended to skip the check on the first extent, but in the error image above, prev=0+1-1=0 when checking the second extent, so even though lblock<=prev, the function does not return an error. As a result, bug_ON occurs in __es_tree_search and the system panics. To solve this problem, we only need to check that: 1. The lblock of the first extent is not less than 0. 2. The lblock of the next extent is not less than the next block of the previous extent. The same applies to extent_idx. Cc: [email protected] Fixes: 5946d089379a ("ext4: check for overlapping extents in ext4_valid_extent_entries()") Reported-by: Hulk Robot <[email protected]> Signed-off-by: Baokun Li <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2022-05-24ext4: avoid cycles in directory h-treeJan Kara1-3/+19
A maliciously corrupted filesystem can contain cycles in the h-tree stored inside a directory. That can easily lead to the kernel corrupting tree nodes that were already verified under its hands while doing a node split and consequently accessing unallocated memory. Fix the problem by verifying traversed block numbers are unique. Cc: [email protected] Signed-off-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2022-05-24ext4: verify dir block before splitting itJan Kara1-11/+21
Before splitting a directory block verify its directory entries are sane so that the splitting code does not access memory it should not. Cc: [email protected] Signed-off-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2022-05-24ext4: filter out EXT4_FC_REPLAY from on-disk superblock field s_stateTheodore Ts'o1-2/+3
The EXT4_FC_REPLAY bit in sbi->s_mount_state is used to indicate that we are in the middle of replay the fast commit journal. This was actually a mistake, since the sbi->s_mount_info is initialized from es->s_state. Arguably s_mount_state is misleadingly named, but the name is historical --- s_mount_state and s_state dates back to ext2. What should have been used is the ext4_{set,clear,test}_mount_flag() inline functions, which sets EXT4_MF_* bits in sbi->s_mount_flags. The problem with using EXT4_FC_REPLAY is that a maliciously corrupted superblock could result in EXT4_FC_REPLAY getting set in s_mount_state. This bypasses some sanity checks, and this can trigger a BUG() in ext4_es_cache_extent(). As a easy-to-backport-fix, filter out the EXT4_FC_REPLAY bit for now. We should eventually transition away from EXT4_FC_REPLAY to something like EXT4_MF_REPLAY. Cc: [email protected] Signed-off-by: Theodore Ts'o <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Reported-by: [email protected]
2022-05-24cifs: cache the dirents for entries in a cached directoryRonnie Sahlberg4-8/+213
This adds caching of the directory entries for a cached directory while we keep a lease on the directory. Reviewed-by: Paulo Alcantara (SUSE) <[email protected]> Reviewed-by: Enzo Matsumiya <[email protected]> Signed-off-by: Ronnie Sahlberg <[email protected]> Signed-off-by: Steve French <[email protected]>
2022-05-24gfs2: Convert function bh_get to use iomapBob Peterson1-5/+12
Before this patch, function bh_get used block_map to figure out the block it needed to read in from the quota_change file. This patch changes it to use iomap directly to make it more efficient. Signed-off-by: Bob Peterson <[email protected]> Signed-off-by: Andreas Gruenbacher <[email protected]>
2022-05-24gfs2: use i_lock spin_lock for inode qadataBob Peterson1-12/+20
Before this patch, functions gfs2_qa_get and _put used the i_rw_mutex to prevent simultaneous access to its i_qadata. But i_rw_mutex is now used for many other things, including iomap_begin and end, which causes a conflict according to lockdep. We cannot just remove the lock since simultaneous opens (gfs2_open -> gfs2_open_common -> gfs2_qa_get) can then stomp on each others values for i_qadata. This patch solves the conflict by using the i_lock spin_lock in the inode to prevent simultaneous access. Signed-off-by: Bob Peterson <[email protected]> Signed-off-by: Andreas Gruenbacher <[email protected]>
2022-05-24gfs2: Return more useful errors from gfs2_rgrp_send_discards()Andrew Price1-2/+2
The bug that 27ca8273f ("gfs2: Make sure FITRIM minlen is rounded up to fs block size") fixes was a little confusing as the user saw "Input/output error" which masked the -EINVAL that sb_issue_discard() returned. sb_issue_discard() can fail for various reasons, so we should return its return value from gfs2_rgrp_send_discards() to avoid all errors being reported as IO errors. This improves error reporting for FITRIM and makes no difference to the -o discard code path because the return value from gfs2_rgrp_send_discards() gets thrown away in that case (and the option switches off). Presumably that's why it was ok to just return -EIO in the past, before FITRIM was implemented. Tested with xfstests. Signed-off-by: Andrew Price <[email protected]> Signed-off-by: Andreas Gruenbacher <[email protected]>
2022-05-24gfs2: Use container_of() for gfs2_glock(aspace)Kees Cook4-27/+38
Clang's structure layout randomization feature gets upset when it sees struct address_space (which is randomized) cast to struct gfs2_glock. This is due to seeing the mapping pointer as being treated as an array of gfs2_glock, rather than "something else, before struct address_space": In file included from fs/gfs2/acl.c:23: fs/gfs2/meta_io.h:44:12: error: casting from randomized structure pointer type 'struct address_space *' to 'struct gfs2_glock *' return (((struct gfs2_glock *)mapping) - 1)->gl_name.ln_sbd; ^ Replace the instances of open-coded pointer math with container_of() usage, and update the allocator to match. Some cleanups and conversion of gfs2_glock_get() and gfs2_glock_dealloc() by Andreas. Reported-by: kernel test robot <[email protected]> Link: https://lore.kernel.org/lkml/[email protected] Cc: Bob Peterson <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Bill Wendling <[email protected]> Cc: [email protected] Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Andreas Gruenbacher <[email protected]>
2022-05-24gfs2: Explain some direct I/O odditiesAndreas Gruenbacher1-0/+4
Add some comments explaining the oddities of partial direct I/O reads and writes. Signed-off-by: Andreas Gruenbacher <[email protected]>
2022-05-24Merge tag 'fsverity-for-linus' of ↵Linus Torvalds4-17/+10
git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fsverity updates from Eric Biggers: "A couple small cleanups for fs/verity/" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fs-verity: Use struct_size() helper in enable_verity() fs-verity: remove unused parameter desc_size in fsverity_create_info()
2022-05-24Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds8-96/+188
Pull fscrypt updates from Eric Biggers: "Some cleanups for fs/crypto/: - Split up the misleadingly-named FS_CRYPTO_BLOCK_SIZE constant. - Consistently report the encryption implementation that is being used. - Add helper functions for the test_dummy_encryption mount option that work properly with the new mount API. ext4 and f2fs will use these" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fscrypt: add new helper functions for test_dummy_encryption fscrypt: factor out fscrypt_policy_to_key_spec() fscrypt: log when starting to use inline encryption fscrypt: split up FS_CRYPTO_BLOCK_SIZE
2022-05-24cifs: avoid parallel session setups on same channelShyam Prasad N5-7/+65
After allowing channels to reconnect in parallel, it now becomes important to take care that multiple processes do not call negotiate/session setup in parallel on the same channel. This change avoids that by marking a channel as "in_reconnect". During session setup if the channel in question has this flag set, we return immediately. Signed-off-by: Shyam Prasad N <[email protected]> Signed-off-by: Steve French <[email protected]>
2022-05-24cifs: use new enum for ses_statusShyam Prasad N8-32/+39
ses->status today shares statusEnum with server->tcpStatus. This has been confusing, and tcon->status has deviated to use a new enum. Follow suit and use new enum for ses_status as well. Signed-off-by: Shyam Prasad N <[email protected]> Signed-off-by: Steve French <[email protected]>