Age | Commit message (Collapse) | Author | Files | Lines |
|
Our user space filesystem relies on fuse to provide POSIX interface.
In our test, a known string is written into a file and the content
is read back later to verify correct data returned. We observed wrong
data returned in read buffer in rare cases although correct data are
stored in our filesystem.
Fuse kernel module calls iov_iter_get_pages2() to get the physical
pages of the user-space read buffer passed in read(). The pages are
not pinned to avoid page migration. When page migration occurs, the
consequence are two-folds.
1) Applications do not receive correct data in read buffer.
2) fuse kernel writes data into a wrong place.
Using iov_iter_extract_pages() to pin pages fixes the issue in our
test.
An auxiliary variable "struct page **pt_pages" is used in the patch
to prepare the 2nd parameter for iov_iter_extract_pages() since
iov_iter_get_pages2() uses a different type for the 2nd parameter.
[SzM] add iov_iter_extract_will_pin(ii) and unpin only if true.
Signed-off-by: Lei Huang <lei.huang@linux.intel.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
FUSE remote locking code paths never add any locking state to
inode->i_flctx, so the locks_remove_posix() function called on
file close will return without calling fuse_setlk().
Therefore, as the if statement to be removed in this commit will
always be false, remove it for clearness.
Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Due to the fact that fuse does not count the write IO of processes in the
direct and writethrough write modes, user processes cannot track
write_bytes through the “/proc/[pid]/io” path. For example, the system
tool iotop cannot count the write operations of the corresponding process.
Signed-off-by: Zhou Jifeng <zhoujifeng@kylinos.com.cn>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Some FUSE daemons want to know if the received request is a resend
request. The high bit of the fuse request ID is utilized for indicating
this, enabling the receiver to perform appropriate handling.
The init flag "FUSE_HAS_RESEND" is added to indicate this feature.
Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
When a FUSE daemon panics and failover, we aim to minimize the impact on
applications by reusing the existing FUSE connection. During this process,
another daemon is employed to preserve the FUSE connection's file
descriptor. The new started FUSE Daemon will takeover the fd and continue
to provide service.
However, it is possible for some inflight requests to be lost and never
returned. As a result, applications awaiting replies would become stuck
forever. To address this, we can resend these pending requests to the
new started FUSE daemon.
This patch introduces a new notification type "FUSE_NOTIFY_RESEND", which
can trigger resending of the pending requests, ensuring they are properly
processed again.
Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
open_by_handle_at(2) can fail with -ESTALE with a valid handle returned
by a previous name_to_handle_at(2) for evicted fuse inodes, which is
especially common when entry_valid_timeout is 0, e.g. when the fuse
daemon is in "cache=none" mode.
The time sequence is like:
name_to_handle_at(2) # succeed
evict fuse inode
open_by_handle_at(2) # fail
The root cause is that, with 0 entry_valid_timeout, the dput() called in
name_to_handle_at(2) will trigger iput -> evict(), which will send
FUSE_FORGET to the daemon. The following open_by_handle_at(2) will send
a new FUSE_LOOKUP request upon inode cache miss since the previous inode
eviction. Then the fuse daemon may fail the FUSE_LOOKUP request with
-ENOENT as the cached metadata of the requested inode has already been
cleaned up during the previous FUSE_FORGET. The returned -ENOENT is
treated as -ESTALE when open_by_handle_at(2) returns.
This confuses the application somehow, as open_by_handle_at(2) fails
when the previous name_to_handle_at(2) succeeds. The returned errno is
also confusing as the requested file is not deleted and already there.
It is reasonable to fail name_to_handle_at(2) early in this case, after
which the application can fallback to open(2) to access files.
Since this issue typically appears when entry_valid_timeout is 0 which
is configured by the fuse daemon, the fuse daemon is the right person to
explicitly disable the export when required.
Also considering FUSE_EXPORT_SUPPORT actually indicates the support for
lookups of "." and "..", and there are existing fuse daemons supporting
export without FUSE_EXPORT_SUPPORT set, for compatibility, we add a new
INIT flag for such purpose.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
For the sake of consistency, let's use these helpers to extract
{u,g}id_t values from k{u,g}id_t ones.
There are no functional changes, just to make code cleaner.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Found by chance while working on support for idmapped mounts in fuse.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The one remaining caller of fuse_writepage_locked() already has a folio,
so convert this function entirely. Saves a few calls to compound_head()
but no attempt is made to support large folios in this patch.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The writepage operation is deprecated as it leads to worse performance
under high memory pressure due to folios being written out in LRU order
rather than sequentially within a file. Use filemap_migrate_folio() to
support dirty folio migration instead of writepage.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
virtqueue_enable_cb() will call virtqueue_poll() which will check if
queue is broken at beginning, so remove the virtqueue_is_broken() call
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
...when calling fuse_iget().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The root inode is assumed to be always hashed. Do not unhash the root
inode even if it is marked BAD.
Fixes: 5d069dbe8aaf ("fuse: fix bad inode")
Cc: <stable@vger.kernel.org> # v5.11
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The root inode has a fixed nodeid and generation (1, 0).
Prior to the commit 15db16837a35 ("fuse: fix illegal access to inode with
reused nodeid") generation number on lookup was ignored. After this commit
lookup with the wrong generation number resulted in the inode being
unhashed. This is correct for non-root inodes, but replacing the root
inode is wrong and results in weird behavior.
Fix by reverting to the old behavior if ignoring the generation for the
root inode, but issuing a warning in dmesg.
Reported-by: Antonio SJ Musumeci <trapexit@spawn.link>
Closes: https://lore.kernel.org/all/CAOQ4uxhek5ytdN8Yz2tNEOg5ea4NkBb4nk0FGPjPk_9nz-VG3g@mail.gmail.com/
Fixes: 15db16837a35 ("fuse: fix illegal access to inode with reused nodeid")
Cc: <stable@vger.kernel.org> # v5.14
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
fuse_do_statx() was added with the wrong helper.
Fixes: d3045530bdd2 ("fuse: implement statx")
Cc: <stable@vger.kernel.org> # v6.6
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
virtio_fs_sysfs_exit() is called by:
- static int __init virtio_fs_init(void)
- static void __exit virtio_fs_exit(void)
Remove __exit from virtio_fs_sysfs_exit() since virtio_fs_init() is not
an __exit function.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202402270649.GYjNX0yw-lkp@intel.com/
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
An mmap request for a file open in passthrough mode, maps the memory
directly to the backing file.
An mmap of a file in direct io mode, usually uses cached mmap and puts
the inode in caching io mode, which denies new passthrough opens of that
inode, because caching io mode is conflicting with passthrough io mode.
For the same reason, trying to mmap a direct io file, while there is
a passthrough file open on the same inode will fail with -ENODEV.
An mmap of a file in direct io mode, also needs to wait for parallel
dio writes in-progress to complete.
If a passthrough file is opened, while an mmap of another direct io
file is waiting for parallel dio writes to complete, the wait is aborted
and mmap fails with -ENODEV.
A FUSE server that uses passthrough and direct io opens on the same inode
that may also be mmaped, is advised to provide a backing fd also for the
files that are open in direct io mode (i.e. use the flags combination
FOPEN_DIRECT_IO | FOPEN_PASSTHROUGH), so that mmap will always use the
backing file, even if read/write do not passthrough.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This allows passing fstests generic/249 and generic/591.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Use the backing file read/write helpers to implement read/write
passthrough to a backing file.
After read/write, we invalidate a/c/mtime/size attributes.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
After getting a backing file id with FUSE_DEV_IOC_BACKING_OPEN ioctl,
a FUSE server can reply to an OPEN request with flag FOPEN_PASSTHROUGH
and the backing file id.
The FUSE server should reuse the same backing file id for all the open
replies of the same FUSE inode and open will fail (with -EIO) if a the
server attempts to open the same inode with conflicting io modes or to
setup passthrough to two different backing files for the same FUSE inode.
Using the same backing file id for several different inodes is allowed.
Opening a new file with FOPEN_DIRECT_IO for an inode that is already
open for passthrough is allowed, but only if the FOPEN_PASSTHROUGH flag
and correct backing file id are specified as well.
The read/write IO of such files will not use passthrough operations to
the backing file, but mmap, which does not support direct_io, will use
the backing file insead of using the page cache as it always did.
Even though all FUSE passthrough files of the same inode use the same
backing file as a backing inode reference, each FUSE file opens a unique
instance of a backing_file object to store the FUSE path that was used
to open the inode and the open flags of the specific open file.
The per-file, backing_file object is released along with the FUSE file.
The inode associated fuse_backing object is released when the last FUSE
passthrough file of that inode is released AND when the backing file id
is closed by the server using the FUSE_DEV_IOC_BACKING_CLOSE ioctl.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
In preparation for opening file in passthrough mode, store the
fuse_open_out argument in ff->args to be passed into fuse_file_io_open()
with the optional backing_id member.
This will be used for setting up passthrough to backing file on open
reply with FOPEN_PASSTHROUGH flag and a valid backing_id.
Opening a file in passthrough mode may fail for several reasons, such as
missing capability, conflicting open flags or inode in caching mode.
Return EIO from fuse_file_io_open() in those cases.
The combination of FOPEN_PASSTHROUGH and FOPEN_DIRECT_IO is allowed -
it mean that read/write operations will go directly to the server,
but mmap will be done to the backing file.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
FUSE server calls the FUSE_DEV_IOC_BACKING_OPEN ioctl with a backing file
descriptor. If the call succeeds, a backing file identifier is returned.
A later change will be using this backing file id in a reply to OPEN
request with the flag FOPEN_PASSTHROUGH to setup passthrough of file
operations on the open FUSE file to the backing file.
The FUSE server should call FUSE_DEV_IOC_BACKING_CLOSE ioctl to close the
backing file by its id.
This can be done at any time, but if an open reply with FOPEN_PASSTHROUGH
flag is still in progress, the open may fail if the backing file is
closed before the fuse file was opened.
Setting up backing files requires a server with CAP_SYS_ADMIN privileges.
For the backing file to be successfully setup, the backing file must
implement both read_iter and write_iter file operations.
The limitation on the level of filesystem stacking allowed for the
backing file is enforced before setting up the backing file.
Signed-off-by: Alessio Balsini <balsini@android.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
FUSE_PASSTHROUGH capability to passthrough FUSE operations to backing
files will be made available with kernel config CONFIG_FUSE_PASSTHROUGH.
When requesting FUSE_PASSTHROUGH, userspace needs to specify the
max_stack_depth that is allowed for FUSE on top of backing files.
Introduce the flag FOPEN_PASSTHROUGH and backing_id to fuse_open_out
argument that can be used when replying to OPEN request, to setup
passthrough of io operations on the fuse inode to a backing file.
Introduce a refcounted fuse_backing object that will be used to
associate an open backing file with a fuse inode.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
In preparation to adding more fuse dev ioctls.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Instead of denying caching mode on parallel dio open, deny caching
open only while parallel dio are in-progress and wait for in-progress
parallel dio writes before entering inode caching io mode.
This allows executing parallel dio when inode is not in caching mode
even if shared mmap is allowed, but no mmaps have been performed on
the inode in question.
An mmap on direct_io file now waits for all in-progress parallel dio
writes to complete, so parallel dio writes together with
FUSE_DIRECT_IO_ALLOW_MMAP is enabled by this commit.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The fuse inode io mode is determined by the mode of its open files/mmaps
and parallel dio opens and expressed in the value of fi->iocachectr:
> 0 - caching io: files open in caching mode or mmap on direct_io file
< 0 - parallel dio: direct io mode with parallel dio writes enabled
== 0 - direct io: no files open in caching mode and no files mmaped
Note that iocachectr value of 0 might become positive or negative,
while non-parallel dio is getting processed.
direct_io mmap uses page cache, so first mmap will mark the file as
ff->io_opened and increment fi->iocachectr to enter the caching io mode.
If the server opens the file in caching mode while it is already open
for parallel dio or vice versa the open fails.
This allows executing parallel dio when inode is not in caching mode
and no mmaps have been performed on the inode in question.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
In preparation for inode io modes, a server open response could fail due to
conflicting inode io modes.
Allow returning an error from fuse_finish_open() and handle the error in
the callers.
fuse_finish_open() is used as the callback of finish_open(), so that
FMODE_OPENED will not be set if fuse_finish_open() fails.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
fuse_open_common() has a lot of code relevant only for regular files and
O_TRUNC in particular.
Copy the little bit of remaining code into fuse_dir_open() and stop using
this common helper for directory open.
Also split out fuse_dir_finish_open() from fuse_finish_open() before we add
inode io modes to fuse_finish_open().
Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This removed the need to pass isdir argument to fuse_put_file().
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
fuse_finish_open() is called from fuse_open_common() and from
fuse_create_open(). In the latter case, the O_TRUNC flag is always
cleared in finish_open()m before calling into fuse_finish_open().
Move the bits that update attribute cache post O_TRUNC open into a
helper and call this helper from fuse_open_common() directly.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
So far this is just a helper to remove complex locking logic out of
fuse_direct_write_iter. Especially needed by the next patch in the series
to that adds the fuse inode cache IO mode and adds in even more locking
complexity.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This makes the code a bit easier to read and allows to more easily add more
conditions when an exclusive lock is needed.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
There were multiple issues with direct_io_allow_mmap:
- fuse_link_write_file() was missing, resulting in warnings in
fuse_write_file_get() and EIO from msync()
- "vma->vm_ops = &fuse_file_vm_ops" was not set, but especially
fuse_page_mkwrite is needed.
The semantics of invalidate_inode_pages2() is so far not clearly defined in
fuse_file_mmap. It dates back to commit 3121bfe76311 ("fuse: fix
"direct_io" private mmap") Though, as direct_io_allow_mmap is a new
feature, that was for MAP_PRIVATE only. As invalidate_inode_pages2() is
calling into fuse_launder_folio() and writes out dirty pages, it should be
safe to call invalidate_inode_pages2 for MAP_PRIVATE and MAP_SHARED as
well.
Cc: Hao Xu <howeyxu@tencent.com>
Cc: stable@vger.kernel.org
Fixes: e78662e818f9 ("fuse: add a new fuse init flag to relax restrictions in no cache mode")
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Alyssa Ross <hi@alyssa.is> requested that virtiofs notifies userspace
when filesytems become available. This can be used to detect when a
filesystem with a given tag is hotplugged, for example. uevents allow
userspace to detect changes without resorting to polling.
The tag is included as a uevent property so it's easy for userspace to
identify the filesystem in question even when the sysfs directory goes
away during removal.
Here are example uevents:
# udevadm monitor -k -p
KERNEL[111.113221] add /fs/virtiofs/2 (virtiofs)
ACTION=add
DEVPATH=/fs/virtiofs/2
SUBSYSTEM=virtiofs
TAG=test
KERNEL[165.527167] remove /fs/virtiofs/2 (virtiofs)
ACTION=remove
DEVPATH=/fs/virtiofs/2
SUBSYSTEM=virtiofs
TAG=test
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The virtiofs filesystem is mounted using a "tag" which is exported by
the virtiofs device:
# mount -t virtiofs <tag> /mnt
The virtiofs driver knows about all the available tags but these are
currently not exported to user space.
People have asked for these tags to be exported to user space. Most
recently Lennart Poettering has asked for it as he wants to scan the
tags and mount virtiofs automatically in certain cases.
https://gitlab.com/virtio-fs/virtiofsd/-/issues/128
This patch exports tags at /sys/fs/virtiofs/<N>/tag where N is the id of
the virtiofs device. The filesystem tag can be obtained by reading this
"tag" file.
There is also a symlink at /sys/fs/virtiofs/<N>/device that points to
the virtiofs device that exports this tag.
This patch converts the existing struct virtio_fs into a full kobject.
It already had a refcount so it's an easy change. The virtio_fs objects
can then be exposed in a kset at /sys/fs/virtiofs/. Note that virtio_fs
objects may live slightly longer than we wish for them to be exposed to
userspace, so kobject_del() is called explicitly when the underlying
virtio_device is removed. The virtio_fs object is freed when all
references are dropped (e.g. active mounts) but disappears as soon as
the virtiofs device is gone.
Originally-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Newlines in virtiofs tags are awkward for users and potential vectors
for string injection attacks.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Pull bcachefs fixes from Kent Overstreet:
"Mostly pretty trivial, the user visible ones are:
- don't barf when replicas_required > replicas
- fix check_version_upgrade() so it doesn't do something nonsensical
when we're downgrading"
* tag 'bcachefs-2024-02-17' of https://evilpiepirate.org/git/bcachefs:
bcachefs: Fix missing va_end()
bcachefs: Fix check_version_upgrade()
bcachefs: Clamp replicas_required to replicas
bcachefs: fix missing endiannes conversion in sb_members
bcachefs: fix kmemleak in __bch2_read_super error handling path
bcachefs: Fix missing bch2_err_class() calls
|
|
Pull smb client fixes from Steve French:
"Five smb3 client fixes, most also for stable:
- Two multichannel fixes (one to fix potential handle leak on retry)
- Work around possible serious data corruption (due to change in
folios in 6.3, for cases when non standard maximum write size
negotiated)
- Symlink creation fix
- Multiuser automount fix"
* tag '6.8-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: Fix regression in writes when non-standard maximum write size negotiated
smb: client: handle path separator of created SMB symlinks
smb: client: set correct id, uid and cruid for multiuser automounts
cifs: update the same create_guid on replay
cifs: fix underflow in parse_server_interfaces()
|
|
Pull ceph fixes from Ilya Dryomov:
"Additional cap handling fixes from Xiubo to avoid "client isn't
responding to mclientcaps(revoke)" stalls on the MDS side"
* tag 'ceph-for-6.8-rc5' of https://github.com/ceph/ceph-client:
ceph: add ceph_cap_unlink_work to fire check_caps() immediately
ceph: always queue a writeback when revoking the Fb caps
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs fix from Damien Le Moal:
- Fix direct write error handling to avoid a race between failed IO
completion and the submission path itself which can result in an
invalid file size exposed to the user after the failed IO.
* tag 'zonefs-6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: Improve error handling
|
|
The conversion to netfs in the 6.3 kernel caused a regression when
maximum write size is set by the server to an unexpected value which is
not a multiple of 4096 (similarly if the user overrides the maximum
write size by setting mount parm "wsize", but sets it to a value that
is not a multiple of 4096). When negotiated write size is not a
multiple of 4096 the netfs code can skip the end of the final
page when doing large sequential writes, causing data corruption.
This section of code is being rewritten/removed due to a large
netfs change, but until that point (ie for the 6.3 kernel until now)
we can not support non-standard maximum write sizes.
Add a warning if a user specifies a wsize on mount that is not
a multiple of 4096 (and round down), also add a change where we
round down the maximum write size if the server negotiates a value
that is not a multiple of 4096 (we also have to check to make sure that
we do not round it down to zero).
Reported-by: R. Diez" <rdiez-2006@rd10.de>
Fixes: d08089f649a0 ("cifs: Change the I/O paths to use an iterator rather than a page list")
Suggested-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Acked-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Tested-by: Matthew Ruffell <matthew.ruffell@canonical.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: stable@vger.kernel.org # v6.3+
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Write error handling is racy and can sometime lead to the error recovery
path wrongly changing the inode size of a sequential zone file to an
incorrect value which results in garbage data being readable at the end
of a file. There are 2 problems:
1) zonefs_file_dio_write() updates a zone file write pointer offset
after issuing a direct IO with iomap_dio_rw(). This update is done
only if the IO succeed for synchronous direct writes. However, for
asynchronous direct writes, the update is done without waiting for
the IO completion so that the next asynchronous IO can be
immediately issued. However, if an asynchronous IO completes with a
failure right before the i_truncate_mutex lock protecting the update,
the update may change the value of the inode write pointer offset
that was corrected by the error path (zonefs_io_error() function).
2) zonefs_io_error() is called when a read or write error occurs. This
function executes a report zone operation using the callback function
zonefs_io_error_cb(), which does all the error recovery handling
based on the current zone condition, write pointer position and
according to the mount options being used. However, depending on the
zoned device being used, a report zone callback may be executed in a
context that is different from the context of __zonefs_io_error(). As
a result, zonefs_io_error_cb() may be executed without the inode
truncate mutex lock held, which can lead to invalid error processing.
Fix both problems as follows:
- Problem 1: Perform the inode write pointer offset update before a
direct write is issued with iomap_dio_rw(). This is safe to do as
partial direct writes are not supported (IOMAP_DIO_PARTIAL is not
set) and any failed IO will trigger the execution of zonefs_io_error()
which will correct the inode write pointer offset to reflect the
current state of the one on the device.
- Problem 2: Change zonefs_io_error_cb() into zonefs_handle_io_error()
and call this function directly from __zonefs_io_error() after
obtaining the zone information using blkdev_report_zones() with a
simple callback function that copies to a local stack variable the
struct blk_zone obtained from the device. This ensures that error
handling is performed holding the inode truncate mutex.
This change also simplifies error handling for conventional zone files
by bypassing the execution of report zones entirely. This is safe to
do because the condition of conventional zones cannot be read-only or
offline and conventional zone files are always fully mapped with a
constant file size.
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few regular fixes and one fix for space reservation regression since
6.7 that users have been reporting:
- fix over-reservation of metadata chunks due to not keeping proper
balance between global block reserve and delayed refs reserve; in
practice this leaves behind empty metadata block groups, the
workaround is to reclaim them by using the '-musage=1' balance
filter
- other space reservation fixes:
- do not delete unused block group if it may be used soon
- do not reserve space for checksums for NOCOW files
- fix extent map assertion failure when writing out free space inode
- reject encoded write if inode has nodatasum flag set
- fix chunk map leak when loading block group zone info"
* tag 'for-6.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't refill whole delayed refs block reserve when starting transaction
btrfs: zoned: fix chunk map leak when loading block group zone info
btrfs: reject encoded write if inode has nodatasum flag set
btrfs: don't reserve space for checksums when writing to nocow files
btrfs: add new unused block groups to the list of unused block groups
btrfs: do not delete unused block group if it may be used soon
btrfs: add and use helper to check if block group is used
btrfs: don't drop extent_map for free space inode on write error
|
|
Fixes: https://lore.kernel.org/linux-bcachefs/202402131603.E953E2CF@keescook/T/#u
Reported-by: coverity scan
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When also downgrading, check_version_upgrade() could pick a new version
greater than the latest supported version.
Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This prevents going emergency read only when the user has specified
replicas_required > replicas.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Since commit 28270e25c69a ("btrfs: always reserve space for delayed refs
when starting transaction") we started not only to reserve metadata space
for the delayed refs a caller of btrfs_start_transaction() might generate
but also to try to fully refill the delayed refs block reserve, because
there are several case where we generate delayed refs and haven't reserved
space for them, relying on the global block reserve. Relying too much on
the global block reserve is not always safe, and can result in hitting
-ENOSPC during transaction commits or worst, in rare cases, being unable
to mount a filesystem that needs to do orphan cleanup or anything that
requires modifying the filesystem during mount, and has no more
unallocated space and the metadata space is nearly full. This was
explained in detail in that commit's change log.
However the gap between the reserved amount and the size of the delayed
refs block reserve can be huge, so attempting to reserve space for such
a gap can result in allocating many metadata block groups that end up
not being used. After a recent patch, with the subject:
"btrfs: add new unused block groups to the list of unused block groups"
We started to add new block groups that are unused to the list of unused
block groups, to avoid having them around for a very long time in case
they are never used, because a block group is only added to the list of
unused block groups when we deallocate the last extent or when mounting
the filesystem and the block group has 0 bytes used. This is not a problem
introduced by the commit mentioned earlier, it always existed as our
metadata space reservations are, most of the time, pessimistic and end up
not using all the space they reserved, so we can occasionally end up with
one or two unused metadata block groups for a long period. However after
that commit mentioned earlier, we are just more pessimistic in the
metadata space reservations when starting a transaction and therefore the
issue is more likely to happen.
This however is not always enough because we might create unused metadata
block groups when reserving metadata space at a high rate if there's
always a gap in the delayed refs block reserve and the cleaner kthread
isn't triggered often enough or is busy with other work (running delayed
iputs, cleaning deleted roots, etc), not to mention the block group's
allocated space is only usable for a new block group after the transaction
used to remove it is committed.
A user reported that he's getting a lot of allocated metadata block groups
but the usage percentage of metadata space was very low compared to the
total allocated space, specially after running a series of block group
relocations.
So for now stop trying to refill the gap in the delayed refs block reserve
and reserve space only for the delayed refs we are expected to generate
when starting a transaction.
CC: stable@vger.kernel.org # 6.7+
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H6802ayLHUJFztzZAVzBLJAGdFx=6FHNNy87+obZXXZpQ@mail.gmail.com/
Tested-by: Ivan Shapovalov <intelfx@intelfx.name>
Reported-by: Heddxh <g311571057@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAE93xANEby6RezOD=zcofENYZOT-wpYygJyauyUAZkLv6XVFOA@mail.gmail.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At btrfs_load_block_group_zone_info() we never drop a reference on the
chunk map we have looked up, therefore leaking a reference on it. So
add the missing btrfs_free_chunk_map() at the end of the function.
Fixes: 7dc66abb5a47 ("btrfs: use a dedicated data structure for chunk maps")
Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we allow an encoded write against inodes that have the NODATASUM
flag set, either because they are NOCOW files or they were created while
the filesystem was mounted with "-o nodatasum". This results in having
compressed extents without corresponding checksums, which is a filesystem
inconsistency reported by 'btrfs check'.
For example, running btrfs/281 with MOUNT_OPTIONS="-o nodatacow" triggers
this and 'btrfs check' errors out with:
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
root 256 inode 257 errors 1040, bad file extent, some csum missing
root 256 inode 258 errors 1040, bad file extent, some csum missing
ERROR: errors found in fs roots
(...)
So reject encoded writes if the target inode has NODATASUM set.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently when doing a write to a file we always reserve metadata space
for inserting data checksums. However we don't need to do it if we have
a nodatacow file (-o nodatacow mount option or chattr +C) or if checksums
are disabled (-o nodatasum mount option), as in that case we are only
adding unnecessary pressure to metadata reservations.
For example on x86_64, with the default node size of 16K, a 4K buffered
write into a nodatacow file is reserving 655360 bytes of metadata space,
as it's accounting for checksums. After this change, which stops reserving
space for checksums if we have a nodatacow file or checksums are disabled,
we only need to reserve 393216 bytes of metadata.
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|