aboutsummaryrefslogtreecommitdiff
path: root/fs/ceph
AgeCommit message (Collapse)AuthorFilesLines
2023-08-24ceph: allow encrypting a directory while not having Ax capsLuís Henriques1-1/+2
If a client doesn't have Fx caps on a directory, it will get errors while trying encrypt it: ceph: handle_cap_grant: cap grant attempt to change fscrypt_auth on non-I_NEW inode (old len 0 new len 48) fscrypt (ceph, inode 1099511627812): Error -105 getting encryption context A simple way to reproduce this is to use two clients: client1 # mkdir /mnt/mydir client2 # ls /mnt/mydir client1 # fscrypt encrypt /mnt/mydir client1 # echo hello > /mnt/mydir/world This happens because, in __ceph_setattr(), we only initialize ci->fscrypt_auth if we have Ax and ceph_fill_inode() won't use the fscrypt_auth received if the inode state isn't I_NEW. Fix it by allowing ceph_fill_inode() to also set ci->fscrypt_auth if the inode doesn't have it set already. Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24ceph: add some fscrypt guardrailsJeff Layton5-9/+55
Add the appropriate calls into fscrypt for various actions, including link, rename, setattr, and the open codepaths. Disable fallocate for encrypted inodes -- hopefully, just for now. If we have an encrypted inode, then the client will need to re-encrypt the contents of the new object. Disable copy offload to or from encrypted inodes. Set i_blkbits to crypto block size for encrypted inodes -- some of the underlying infrastructure for fscrypt relies on i_blkbits being aligned to crypto blocksize. Report STATX_ATTR_ENCRYPTED on encrypted inodes. [ lhenriques: forbid encryption with striped layouts ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24ceph: create symlinks with encrypted and base64-encoded targetsJeff Layton2-17/+150
When creating symlinks in encrypted directories, encrypt and base64-encode the target with the new inode's key before sending to the MDS. When filling a symlinked inode, base64-decode it into a buffer that we'll keep in ci->i_symlink. When get_link is called, decrypt the buffer into a new one that will hang off i_link. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24ceph: add support to readdir for encrypted namesXiubo Li6-26/+144
To make it simpler to decrypt names in a readdir reply (i.e. before we have a dentry), add a new ceph_encode_encrypted_fname()-like helper that takes a qstr pointer instead of a dentry pointer. Once we've decrypted the names in a readdir reply, we no longer need the crypttext, so overwrite them in ceph_mds_reply_dir_entry with the unencrypted names. Then in both ceph_readdir_prepopulate() and ceph_readdir() we will use the dencrypted name directly. [ jlayton: convert some BUG_ONs into error returns ] Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24ceph: pass the request to parse_reply_info_readdir()Xiubo Li1-10/+13
Instead of passing just the r_reply_info to the readdir reply parser, pass the request pointer directly instead. This will facilitate implementing readdir on fscrypted directories. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24ceph: make ceph_fill_trace and ceph_get_name decrypt namesJeff Layton2-14/+60
When we get a dentry in a trace, decrypt the name so we can properly instantiate the dentry or fill out ceph_get_name() buffer. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24ceph: add helpers for converting names for userland presentationJeff Layton2-0/+123
Define a new ceph_fname struct that we can use to carry information about encrypted dentry names. Add helpers for working with these objects, including ceph_fname_to_usr which formats an encrypted filename for userland presentation. [ xiubli: fix resulting name length check -- neither name_len nor ctext_len should exceed NAME_MAX ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24ceph: make d_revalidate call fscrypt revalidator for encrypted dentriesJeff Layton1-2/+7
If we have a dentry which represents a no-key name, then we need to test whether the parent directory's encryption key has since been added. Do that before we test anything else about the dentry. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24ceph: set DCACHE_NOKEY_NAME flag in ceph_lookup/atomic_open()Jeff Layton2-0/+18
This is required so that we know to invalidate these dentries when the directory is unlocked. Atomic open can act as a lookup if handed a dentry that is negative on the MDS. Ensure that we set DCACHE_NOKEY_NAME on the dentry in atomic_open, if we don't have the key for the parent. Otherwise, we can end up validating the dentry inappropriately if someone later adds a key. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24ceph: decode alternate_name in lease infoJeff Layton2-10/+38
Ceph is a bit different from local filesystems, in that we don't want to store filenames as raw binary data, since we may also be dealing with clients that don't support fscrypt. We could just base64-encode the encrypted filenames, but that could leave us with filenames longer than NAME_MAX. It turns out that the MDS doesn't care much about filename length, but the clients do. To manage this, we've added a new "alternate name" field that can be optionally added to any dentry that we'll use to store the binary crypttext of the filename if its base64-encoded value will be longer than NAME_MAX. When a dentry has one of these names attached, the MDS will send it along in the lease info, which we can then store for later usage. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24ceph: send alternate_name in MClientRequestJeff Layton2-5/+74
In the event that we have a filename longer than CEPH_NOHASH_NAME_MAX, we'll need to hash the tail of the filename. The client however will still need to know the full name of the file if it has a key. To support this, the MClientRequest field has grown a new alternate_name field that we populate with the full (binary) crypttext of the filename. This is then transmitted to the clients in readdir or traces as part of the dentry lease. Add support for populating this field when the filenames are very long. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24ceph: encode encrypted name in ceph_mdsc_build_path and dentry releaseJeff Layton5-25/+172
Allow ceph_mdsc_build_path to encrypt and base64 encode the filename when the parent is encrypted and we're sending the path to the MDS. In a similar fashion, encode encrypted dentry names if including a dentry release in a request. In most cases, we just encrypt the filenames and base64 encode them, but when the name is longer than CEPH_NOHASH_NAME_MAX, we use a similar scheme to fscrypt proper, and hash the remaning bits with sha256. When doing this, we then send along the full crypttext of the name in the new alternate_name field of the MClientRequest. The MDS can then send that along in readdir responses and traces. [ idryomov: drop duplicate include reported by Abaci Robot ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-22ceph: add base64 endcoding routines for encrypted namesLuís Henriques2-0/+92
The base64url encoding used by fscrypt includes the '_' character, which may cause problems in snapshot names (if the name starts with '_'). Thus, use the base64 encoding defined for IMAP mailbox names (RFC 3501), which uses '+' and ',' instead of '-' and '_'. Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-22ceph: make ioctl cmds more readable in debug logXiubo Li1-1/+38
ioctl file 0000000004e6b054 cmd 2148296211 arg 824635143532 The numerical cmd value in the ioctl debug log message is too hard to understand even when you look at it in the code. Make it more readable. [ idryomov: add missing _ in ceph_ioctl_cmd_name() ] Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-22ceph: add fscrypt ioctls and ceph.fscrypt.auth vxattrJeff Layton2-0/+111
We gate most of the ioctls on MDS feature support. The exception is the key removal and status functions that we still want to work if the MDS's were to (inexplicably) lose the feature. For the set_policy ioctl, we take Fs caps to ensure that nothing can create files in the directory while the ioctl is running. That should be enough to ensure that the "empty_dir" check is reliable. The vxattr is read-only, added mostly for future debugging purposes. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-22ceph: implement -o test_dummy_encryption mount optionJeff Layton6-6/+186
Add support for the test_dummy_encryption mount option. This allows us to test the encrypted codepaths in ceph without having to manually set keys, etc. [ lhenriques: fix potential fsc->fsc_dummy_enc_policy memory leak in ceph_real_mount() ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-22ceph: fscrypt_auth handling for cephJeff Layton10-27/+372
Most fscrypt-enabled filesystems store the crypto context in an xattr, but that's problematic for ceph as xatts are governed by the XATTR cap, but we really want the crypto context as part of the AUTH cap. Because of this, the MDS has added two new inode metadata fields: fscrypt_auth and fscrypt_file. The former is used to hold the crypto context, and the latter is used to track the real file size. Parse new fscrypt_auth and fscrypt_file fields in inode traces. For now, we don't use fscrypt_file, but fscrypt_auth is used to hold the fscrypt context. Allow the client to use a setattr request for setting the fscrypt_auth field. Since this is not a standard setattr request from the VFS, we add a new field to __ceph_setattr that carries ceph-specific inode attrs. Have the set_context op do a setattr that sets the fscrypt_auth value, and get_context just return the contents of that field (since it should always be available). Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-22ceph: use osd_req_op_extent_osd_iter for netfs readsJeff Layton1-18/+1
The netfs layer has already pinned the pages involved before calling issue_op, so we can just pass down the iter directly instead of calling iov_iter_get_pages_alloc. Instead of having to allocate a page array, use CEPH_MSG_DATA_ITER and pass it the iov_iter directly to clone. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-22ceph: make ceph_msdc_build_path use ref-walkJeff Layton1-16/+19
Encryption potentially requires allocation, at which point we'll need to be in a non-atomic context. Convert ceph_msdc_build_path to take dentry spinlocks and references instead of using rcu_read_lock to walk the path. This is slightly less efficient, and we may want to eventually allow using RCU when the leaf dentry isn't encrypted. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-22ceph: preallocate inode for ops that may create oneJeff Layton6-67/+179
When creating a new inode, we need to determine the crypto context before we can transmit the RPC. The fscrypt API has a routine for getting a crypto context before a create occurs, but it requires an inode. Change the ceph code to preallocate an inode in advance of a create of any sort (open(), mknod(), symlink(), etc). Move the existing code that generates the ACL and SELinux blobs into this routine since that's mostly common across all the different codepaths. In most cases, we just want to allow ceph_fill_trace to use that inode after the reply comes in, so add a new field to the MDS request for it (r_new_inode). The async create codepath is a bit different though. In that case, we want to hash the inode in advance of the RPC so that it can be used before the reply comes in. If the call subsequently fails with -EJUKEBOX, then just put the references and clean up the as_ctx. Note that with this change, we now need to regenerate the as_ctx when this occurs, but it's quite rare for it to happen. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-22ceph: add new mount option to enable sparse readsJeff Layton4-11/+72
Add a new mount option that has the client issue sparse reads instead of normal ones. The callers now preallocate an sparse extent buffer that the libceph receive code can populate and hand back after the operation completes. After a successful sparse read, we can't use the req->r_result value to determine the amount of data "read", so instead we set the received length to be from the end of the last extent in the buffer. Any interstitial holes will have been filled by the receive code. [ xiubli: fix a double free on req reported by Ilya ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-18mm, netfs, fscache: stop read optimisation when folio removed from pagecacheDavid Howells1-0/+2
Fscache has an optimisation by which reads from the cache are skipped until we know that (a) there's data there to be read and (b) that data isn't entirely covered by pages resident in the netfs pagecache. This is done with two flags manipulated by fscache_note_page_release(): if (... test_bit(FSCACHE_COOKIE_HAVE_DATA, &cookie->flags) && test_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags)) clear_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags); where the NO_DATA_TO_READ flag causes cachefiles_prepare_read() to indicate that netfslib should download from the server or clear the page instead. The fscache_note_page_release() function is intended to be called from ->releasepage() - but that only gets called if PG_private or PG_private_2 is set - and currently the former is at the discretion of the network filesystem and the latter is only set whilst a page is being written to the cache, so sometimes we miss clearing the optimisation. Fix this by following Willy's suggestion[1] and adding an address_space flag, AS_RELEASE_ALWAYS, that causes filemap_release_folio() to always call ->release_folio() if it's set, even if PG_private or PG_private_2 aren't set. Note that this would require folio_test_private() and page_has_private() to become more complicated. To avoid that, in the places[*] where these are used to conditionalise calls to filemap_release_folio() and try_to_release_page(), the tests are removed the those functions just jumped to unconditionally and the test is performed there. [*] There are some exceptions in vmscan.c where the check guards more than just a call to the releaser. I've added a function, folio_needs_release() to wrap all the checks for that. AS_RELEASE_ALWAYS should be set if a non-NULL cookie is obtained from fscache and cleared in ->evict_inode() before truncate_inode_pages_final() is called. Additionally, the FSCACHE_COOKIE_NO_DATA_TO_READ flag needs to be cleared and the optimisation cancelled if a cachefiles object already contains data when we open it. [dwysocha@redhat.com: call folio_mapping() inside folio_needs_release()] Link: https://github.com/DaveWysochanskiRH/kernel/commit/902c990e311120179fa5de99d68364b2947b79ec Link: https://lkml.kernel.org/r/20230628104852.3391651-3-dhowells@redhat.com Fixes: 1f67e6d0b188 ("fscache: Provide a function to note the release of a page") Fixes: 047487c947e8 ("cachefiles: Implement the I/O routines") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Reported-by: Rohith Surabattula <rohiths.msft@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Tested-by: SeongJae Park <sj@kernel.org> Cc: Daire Byrne <daire.byrne@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steve French <sfrench@samba.org> Cc: Shyam Prasad N <nspmangalore@gmail.com> Cc: Rohith Surabattula <rohiths.msft@gmail.com> Cc: Dave Wysochanski <dwysocha@redhat.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jingbo Xu <jefflexu@linux.alibaba.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-09fs: pass the request_mask to generic_fillattrJeff Layton1-1/+1
generic_fillattr just fills in the entire stat struct indiscriminately today, copying data from the inode. There is at least one attribute (STATX_CHANGE_COOKIE) that can have side effects when it is reported, and we're looking at adding more with the addition of multigrain timestamps. Add a request_mask argument to generic_fillattr and have most callers just pass in the value that is passed to getattr. Have other callers (e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of STATX_CHANGE_COOKIE into generic_fillattr. Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@kernel.org> Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-06vfs: get rid of old '->iterate' directory operationLinus Torvalds1-2/+3
All users now just use '->iterate_shared()', which only takes the directory inode lock for reading. Filesystems that never got convered to shared mode now instead use a wrapper that drops the lock, re-takes it in write mode, calls the old function, and then downgrades the lock back to read mode. This way the VFS layer and other callers no longer need to care about filesystems that never got converted to the modern era. The filesystems that use the new wrapper are ceph, coda, exfat, jfs, ntfs, ocfs2, overlayfs, and vboxsf. Honestly, several of them look like they really could just iterate their directories in shared mode and skip the wrapper entirely, but the point of this change is to not change semantics or fix filesystems that haven't been fixed in the last 7+ years, but to finally get rid of the dual iterators. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-02ceph: defer stopping mdsc delayed_workXiubo Li3-2/+17
Flushing the dirty buffer may take a long time if the cluster is overloaded or if there is network issue. So we should ping the MDSs periodically to keep alive, else the MDS will blocklist the kclient. Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/61843 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-07-24ceph: never send metrics if disable_send_metrics is setXiubo Li1-1/+1
Even the 'disable_send_metrics' is true so when the session is being opened it will always trigger to send the metric for the first time. Cc: stable@vger.kernel.org Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Venky Shankar <vshankar@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-07-13ceph: convert to ctime accessor functionsJeff Layton5-11/+13
In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230705190309.579783-28-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-07Merge tag 'ceph-for-6.5-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds8-40/+154
Pull ceph updates from Ilya Dryomov: "A bunch of CephFS fixups from Xiubo, mostly around dropping caps, along with a fix for a regression in the readahead handling code which sneaked in with the switch to netfs helpers" * tag 'ceph-for-6.5-rc1' of https://github.com/ceph/ceph-client: ceph: don't let check_caps skip sending responses for revoke msgs ceph: issue a cap release immediately if no cap exists ceph: trigger to flush the buffer when making snapshot ceph: fix blindly expanding the readahead windows ceph: add a dedicated private data for netfs rreq ceph: voluntarily drop Xx caps for requests those touch parent mtime ceph: try to dump the msgs when decoding fails ceph: only send metrics when the MDS rank is ready
2023-06-30ceph: don't let check_caps skip sending responses for revoke msgsXiubo Li1-0/+9
If a client sends out a cap update dropping caps with the prior 'seq' just before an incoming cap revoke request, then the client may drop the revoke because it believes it's already released the requested capabilities. This causes the MDS to wait indefinitely for the client to respond to the revoke. It's therefore always a good idea to ack the cap revoke request with the bumped up 'seq'. Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/61782 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Reviewed-by: Patrick Donnelly <pdonnell@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-06-30ceph: issue a cap release immediately if no cap existsXiubo Li1-11/+29
In case: mds client - Releases cap and put Inode - Increase cap->seq and sends revokes req to the client - Receives release req and - Receives & drops the revoke req skip removing the cap and then eval the CInode and issue or revoke caps again. - Receives & drops the caps update or revoke req - Health warning for client isn't responding to mclientcaps(revoke) All the IMPORT/REVOKE/GRANT cap ops will increase the session seq in MDS side and then the client need to issue a cap release to unblock MDS to remove the corresponding cap to unblock possible waiters. Link: https://tracker.ceph.com/issues/61332 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-06-30ceph: trigger to flush the buffer when making snapshotXiubo Li2-3/+12
The 'i_wr_ref' is used to track the 'Fb' caps, while whenever the 'Fb' caps is took the kclient will always take the 'Fw' caps at the same time. That means it will always be a false check in __ceph_finish_cap_snap(). When writing to buffer the kclient will take both 'Fb|Fw' caps and then write the contents to the buffer pages by increasing the 'i_wrbuffer_ref' and then just release both 'Fb|Fw'. This is different with the user space libcephfs, which will keep the 'Fb' being took and use 'i_wr_ref' instead of 'i_wrbuffer_ref' to track this until the buffer is flushed to Rados. We need to defer flushing the capsnap until the corresponding buffer pages are all flushed to Rados, and at the same time just trigger to flush the buffer pages immediately. Link: https://tracker.ceph.com/issues/48640 Link: https://tracker.ceph.com/issues/59343 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-06-30ceph: fix blindly expanding the readahead windowsXiubo Li1-7/+33
Blindly expanding the readahead windows will cause unneccessary pagecache thrashing and also will introduce the network workload. We should disable expanding the windows if the readahead is disabled and also shouldn't expand the windows too much. Expanding forward firstly instead of expanding backward for possible sequential reads. Bound `rreq->len` to the actual file size to restore the previous page cache usage. The posix_fadvise may change the maximum size of a file readahead. Cc: stable@vger.kernel.org Fixes: 49870056005c ("ceph: convert ceph_readpages to ceph_readahead") Link: https://lore.kernel.org/ceph-devel/20230504082510.247-1-sehuww@mail.scut.edu.cn Link: https://www.spinics.net/lists/ceph-users/msg76183.html Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Hu Weiwen <sehuww@mail.scut.edu.cn> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-06-30ceph: add a dedicated private data for netfs rreqXiubo Li2-11/+47
We need to save the 'f_ra.ra_pages' to expand the readahead window later. Cc: stable@vger.kernel.org Fixes: 49870056005c ("ceph: convert ceph_readpages to ceph_readahead") Link: https://lore.kernel.org/ceph-devel/20230504082510.247-1-sehuww@mail.scut.edu.cn Link: https://www.spinics.net/lists/ceph-users/msg76183.html Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Hu Weiwen <sehuww@mail.scut.edu.cn> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-06-30ceph: voluntarily drop Xx caps for requests those touch parent mtimeXiubo Li2-8/+12
For write requests the parent's mtime will be updated correspondingly. And if the 'Xx' caps is issued and when releasing other caps together with the write requests the MDS Locker will try to eval the xattr lock, which need to change the locker state excl --> sync and need to do Xx caps revocation. Just voluntarily dropping CEPH_CAP_XATTR_EXCL caps to avoid a cap revoke message, which could cause the mtime will be overwrote by stale one. [ idryomov: break unnecessarily long lines ] Link: https://tracker.ceph.com/issues/61584 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-06-30ceph: try to dump the msgs when decoding failsXiubo Li1-0/+4
When the msgs are corrupted we need to dump them and then it will be easier to dig what has happened and where the issue is. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-06-30ceph: only send metrics when the MDS rank is readyXiubo Li1-0/+8
When the MDS rank is in clientreplay state, the metrics requests will be discarded directly. Also, when there are a lot of known client requests to recover from, the metrics requests will slow down the MDS rank from getting to the active state sooner. With this patch, we will send the metrics requests only when the MDS rank is in active state. Link: https://tracker.ceph.com/issues/61524 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-06-28Merge tag 'mm-stable-2023-06-24-19-15' of ↵Linus Torvalds1-6/+0
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ...
2023-06-26Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linuxLinus Torvalds1-1/+64
Pull splice updates from Jens Axboe: "This kills off ITER_PIPE to avoid a race between truncate, iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio with unpinned/unref'ed pages from an O_DIRECT splice read. This causes memory corruption. Instead, we either use (a) filemap_splice_read(), which invokes the buffered file reading code and splices from the pagecache into the pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads into it and then pushes the filled pages into the pipe; or (c) handle it in filesystem-specific code. Summary: - Rename direct_splice_read() to copy_splice_read() - Simplify the calculations for the number of pages to be reclaimed in copy_splice_read() - Turn do_splice_to() into a helper, vfs_splice_read(), so that it can be used by overlayfs and coda to perform the checks on the lower fs - Make vfs_splice_read() jump to copy_splice_read() to handle direct-I/O and DAX - Provide shmem with its own splice_read to handle non-existent pages in the pagecache. We don't want a ->read_folio() as we don't want to populate holes, but filemap_get_pages() requires it - Provide overlayfs with its own splice_read to call down to a lower layer as overlayfs doesn't provide ->read_folio() - Provide coda with its own splice_read to call down to a lower layer as coda doesn't provide ->read_folio() - Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs and random files as they just copy to the output buffer and don't splice pages - Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3, ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation - Make cifs use filemap_splice_read() - Replace pointers to generic_file_splice_read() with pointers to filemap_splice_read() as DIO and DAX are handled in the caller; filesystems can still provide their own alternate ->splice_read() op - Remove generic_file_splice_read() - Remove ITER_PIPE and its paraphernalia as generic_file_splice_read was the only user" * tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits) splice: kdoc for filemap_splice_read() and copy_splice_read() iov_iter: Kill ITER_PIPE splice: Remove generic_file_splice_read() splice: Use filemap_splice_read() instead of generic_file_splice_read() cifs: Use filemap_splice_read() trace: Convert trace/seq to use copy_splice_read() zonefs: Provide a splice-read wrapper xfs: Provide a splice-read wrapper orangefs: Provide a splice-read wrapper ocfs2: Provide a splice-read wrapper ntfs3: Provide a splice-read wrapper nfs: Provide a splice-read wrapper f2fs: Provide a splice-read wrapper ext4: Provide a splice-read wrapper ecryptfs: Provide a splice-read wrapper ceph: Provide a splice-read wrapper afs: Provide a splice-read wrapper 9p: Add splice_read wrapper net: Make sock_splice_read() use copy_splice_read() by default tty, proc, kernfs, random: Use copy_splice_read() ...
2023-06-09filemap: update ki_pos in generic_perform_writeChristoph Hellwig1-2/+0
All callers of generic_perform_write need to updated ki_pos, move it into common code. Link: https://lkml.kernel.org/r/20230601145904.1385409-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Theodore Ts'o <tytso@mit.edu> Acked-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09backing_dev: remove current->backing_dev_infoChristoph Hellwig1-4/+0
Patch series "cleanup the filemap / direct I/O interaction", v4. This series cleans up some of the generic write helper calling conventions and the page cache writeback / invalidation for direct I/O. This is a spinoff from the no-bufferhead kernel project, for which we'll want to an use iomap based buffered write path in the block layer. This patch (of 12): The last user of current->backing_dev_info disappeared in commit b9b1335e6403 ("remove bdi_congested() and wb_congested() and related functions"). Remove the field and all assignments to it. Link: https://lkml.kernel.org/r/20230601145904.1385409-1-hch@lst.de Link: https://lkml.kernel.org/r/20230601145904.1385409-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Theodore Ts'o <tytso@mit.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-08ceph: fix use-after-free bug for inodes when flushing capsnapsXiubo Li2-1/+9
There is a race between capsnaps flush and removing the inode from 'mdsc->snap_flush_list' list: == Thread A == == Thread B == ceph_queue_cap_snap() -> allocate 'capsnapA' ->ihold('&ci->vfs_inode') ->add 'capsnapA' to 'ci->i_cap_snaps' ->add 'ci' to 'mdsc->snap_flush_list' ... == Thread C == ceph_flush_snaps() ->__ceph_flush_snaps() ->__send_flush_snap() handle_cap_flushsnap_ack() ->iput('&ci->vfs_inode') this also will release 'ci' ... == Thread D == ceph_handle_snap() ->flush_snaps() ->iterate 'mdsc->snap_flush_list' ->get the stale 'ci' ->remove 'ci' from ->ihold(&ci->vfs_inode) this 'mdsc->snap_flush_list' will WARNING To fix this we will increase the inode's i_count ref when adding 'ci' to the 'mdsc->snap_flush_list' list. [ idryomov: need_put int -> bool ] Cc: stable@vger.kernel.org Link: https://bugzilla.redhat.com/show_bug.cgi?id=2209299 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-05-24ceph: Provide a splice-read wrapperDavid Howells1-1/+64
Provide a splice_read wrapper for Ceph. This does the inode shutdown check before proceeding and jumps to copy_splice_read() if the file has inline data or is a synchronous file. We try and get FILE_RD and either FILE_CACHE and/or FILE_LAZYIO caps and hold them across filemap_splice_read(). If we fail to get FILE_CACHE or FILE_LAZYIO capabilities, we use copy_splice_read() instead. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> cc: Christoph Hellwig <hch@lst.de> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Ilya Dryomov <idryomov@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-block@vger.kernel.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20230522135018.2742245-17-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18ceph: force updating the msg pointer in non-split caseXiubo Li1-0/+13
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the request may still contain a list of 'split_realms', and we need to skip it anyway. Or it will be parsed as a corrupt snaptrace. Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/61200 Reported-by: Frank Schilder <frans@dtu.dk> Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-05-18ceph: silence smatch warning in reconnect_caps_cb()Xiubo Li1-1/+2
Smatch static checker warning: fs/ceph/mds_client.c:3968 reconnect_caps_cb() warn: missing error code here? '__get_cap_for_mds()' failed. 'err' = '0' [ idryomov: Dan says that Smatch considers it intentional only if the "ret = 0;" assignment is within 4 or 5 lines of the goto. ] Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-05-04Merge tag 'ceph-for-6.4-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds8-47/+102
Pull ceph updates from Ilya Dryomov: "A few filesystem improvements, with a rather nasty use-after-free fix from Xiubo intended for stable" * tag 'ceph-for-6.4-rc1' of https://github.com/ceph/ceph-client: ceph: reorder fields in 'struct ceph_snapid_map' ceph: pass ino# instead of old_dentry if it's disconnected ceph: fix potential use-after-free bug when trimming caps ceph: implement writeback livelock avoidance using page tagging ceph: do not print the whole xattr value if it's too long
2023-04-30ceph: reorder fields in 'struct ceph_snapid_map'Christophe JAILLET1-1/+1
Group some variables based on their sizes to reduce holes. On x86_64, this shrinks the size of 'struct ceph_snapid_map' from 72 to 64 bytes. When such a structure is allocated, because of the way memory allocation works, when 72 bytes were requested, 96 bytes were allocated. So, on x86_64, this change saves 32 bytes per allocation and has the structure fit in a single cacheline. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-04-30ceph: pass ino# instead of old_dentry if it's disconnectedXiubo Li2-3/+16
When exporting the kceph to NFS it may pass a DCACHE_DISCONNECTED dentry for the link operation. Then it will parse this dentry as a snapdir, and the mds will fail the link request as -EROFS. MDS allow clients to pass a ino# instead of a path. Link: https://tracker.ceph.com/issues/59515 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-04-30ceph: fix potential use-after-free bug when trimming capsXiubo Li5-35/+62
When trimming the caps and just after the 'session->s_cap_lock' is released in ceph_iterate_session_caps() the cap maybe removed by another thread, and when using the stale cap memory in the callbacks it will trigger use-after-free crash. We need to check the existence of the cap just after the 'ci->i_ceph_lock' being acquired. And do nothing if it's already removed. Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/43272 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Luís Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-04-30ceph: implement writeback livelock avoidance using page taggingXiubo Li1-1/+10
While the mapped IOs continue if we try to flush a file's buffer we can see that the fsync() won't complete until the IOs finish. This is analogous to Jan Kara's commit (f446daaea9d4 mm: implement writeback livelock avoidance using page tagging), we will try to avoid livelocks of writeback when some steadily creates dirty pages in a mapping we are writing out. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-04-30ceph: do not print the whole xattr value if it's too longXiubo Li1-7/+13
If the xattr's value size is long enough the kernel will warn and then will fail the xfstests test case. Just print part of the value string if it's too long. At the same time fix the function name issue in the debug logs. Link: https://tracker.ceph.com/issues/58404 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>