Age | Commit message (Collapse) | Author | Files | Lines |
|
If all free_nat_bitmap are available, we can rebuild nat_bits from
free_nat_bitmap entirely during umount, let's make another chance
to reenable nat_bits for image.
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Whenever we notice some sluggish issues on our machines, we are always
curious about how well all types of I/O in the f2fs filesystem are
handled. But, it's hard to get this kind of real data. First of all,
we need to reproduce the issue while turning on the profiling tool like
blktrace, but the issue doesn't happen again easily. Second, with the
intervention of any tools, the overall timing of the issue will be
slightly changed and it sometimes makes us hard to figure it out.
So, I added the feature printing out IO latency statistics tracepoint
events, which are minimal things to understand filesystem's I/O related
behaviors, into F2FS_IOSTAT kernel config. With "iostat_enable" sysfs
node on, we can get this statistics info in a periodic way and it
would cause the least overhead.
[samples]
f2fs_ckpt-254:1-507 [003] .... 2842.439683: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4],
wr_sync_data [164/16/3331], wr_sync_node [152/3/648],
wr_sync_meta [160/2/4243], wr_async_data [24/13/15],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
f2fs_ckpt-254:1-507 [002] .... 2845.450514: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [60/3/456], rd_node [60/3/1258], rd_meta [0/0/1],
wr_sync_data [120/12/2285], wr_sync_node [88/5/428],
wr_sync_meta [52/6/2990], wr_async_data [4/1/3],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
Signed-off-by: Daeho Jeong <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Added F2FS_IOSTAT config option to support getting IO statistics through
sysfs and printing out periodic IO statistics tracepoint events and
moved I/O statistics related codes into separate files for better
maintenance.
Signed-off-by: Daeho Jeong <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
[Jaegeuk Kim: set default=y]
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Update comment to reflect that we *do* allow reexport, whether it's a
good idea or not....
Signed-off-by: J. Bruce Fields <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
|
|
Make this lookup slightly more concise, and prepare for changing how we
look this up in a following patch.
Signed-off-by: J. Bruce Fields <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
|
|
It'll come in handy to get the whole nlm_lock.
Signed-off-by: J. Bruce Fields <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
|
|
|
|
This crash was observed with a failed assertion on device close:
BTRFS: Transaction aborted (error -28)
WARNING: CPU: 1 PID: 3902 at fs/btrfs/extent-tree.c:2150 btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
CPU: 1 PID: 3902 Comm: kworker/u8:4 Not tainted 5.14.0-rc5-default+ #1532
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
RIP: 0010:btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
RSP: 0018:ffffb7a5452d7d80 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
RBP: ffff97834176a378 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: ffff97835195d388
R13: 0000000005b08000 R14: ffff978385484000 R15: 000000000000016c
FS: 0000000000000000(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056190d003fe8 CR3: 000000002a81e005 CR4: 0000000000170ea0
Call Trace:
flush_space+0x197/0x2f0 [btrfs]
btrfs_async_reclaim_metadata_space+0x139/0x300 [btrfs]
process_one_work+0x262/0x5e0
worker_thread+0x4c/0x320
? process_one_work+0x5e0/0x5e0
kthread+0x144/0x170
? set_kthread_struct+0x40/0x40
ret_from_fork+0x1f/0x30
irq event stamp: 19334989
hardirqs last enabled at (19334997): [<ffffffffab0e0c87>] console_unlock+0x2b7/0x400
hardirqs last disabled at (19335006): [<ffffffffab0e0d0d>] console_unlock+0x33d/0x400
softirqs last enabled at (19334900): [<ffffffffaba0030d>] __do_softirq+0x30d/0x574
softirqs last disabled at (19334893): [<ffffffffab0721ec>] irq_exit_rcu+0x12c/0x140
---[ end trace 45939e308e0dd3c7 ]---
BTRFS: error (device vdd) in btrfs_run_delayed_refs:2150: errno=-28 No space left
BTRFS info (device vdd): forced readonly
BTRFS warning (device vdd): failed setting block group ro: -30
BTRFS info (device vdd): suspending dev_replace for unmount
assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3431!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 3982 Comm: umount Tainted: G W 5.14.0-rc5-default+ #1532
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
RSP: 0018:ffffb7a5454c7db8 EFLAGS: 00010246
RAX: 0000000000000068 RBX: ffff978364b91c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
RBP: ffff9783523a4c00 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: ffff9783523a4d18
R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000003
FS: 00007f61c8f42800(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056190cffa810 CR3: 0000000030b96002 CR4: 0000000000170ea0
Call Trace:
btrfs_close_one_device.cold+0x11/0x55 [btrfs]
close_fs_devices+0x44/0xb0 [btrfs]
btrfs_close_devices+0x48/0x160 [btrfs]
generic_shutdown_super+0x69/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x2c/0xa0
cleanup_mnt+0x144/0x1b0
task_work_run+0x59/0xa0
exit_to_user_mode_loop+0xe7/0xf0
exit_to_user_mode_prepare+0xaf/0xf0
syscall_exit_to_user_mode+0x19/0x50
do_syscall_64+0x4a/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
This happens when close_ctree is called while a dev_replace hasn't
completed. In close_ctree, we suspend the dev_replace, but keep the
replace target around so that we can resume the dev_replace procedure
when we mount the root again. This is the call trace:
close_ctree():
btrfs_dev_replace_suspend_for_unmount();
btrfs_close_devices():
btrfs_close_fs_devices():
btrfs_close_one_device():
ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
&device->dev_state));
However, since the replace target sticks around, there is a device
with BTRFS_DEV_STATE_REPLACE_TGT set on close, and we fail the
assertion in btrfs_close_one_device.
To fix this, if we come across the replace target device when
closing, we should properly reset it back to allocation state. This
fix also ensures that if a non-target device has a corrupted state and
has the BTRFS_DEV_STATE_REPLACE_TGT bit set, the assertion will still
catch the error.
Reported-by: David Sterba <[email protected]>
Fixes: b2a616676839 ("btrfs: fix rw device counting in __btrfs_free_extra_devids")
CC: [email protected] # 4.19+
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Desmond Cheong Zhi Xi <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
When parsing the ExtendedAttr data, malicous or corrupt attribute length
could cause kernel hangs and buffer overruns in some special cases.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Stian Skjelstad <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
|
|
btrfs_lookup_ordered_extent() is supposed to query the offset in a file
instead of the logical address. Pass the file offset from
submit_extent_page() to calc_bio_boundaries().
Also, calc_bio_boundaries() relies on the bio's operation flag, so move
the call site after setting it.
Fixes: 390ed29b817e ("btrfs: refactor submit_extent_page() to make bio and its flag tracing easier")
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Naohiro Aota <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
A common characteristic of the bug report where preemptive flushing was
going full tilt was the fact that the vast majority of the free metadata
space was used up by the global reserve. The hard 90% threshold would
cover the majority of these cases, but to be even smarter we should take
into account how much of the outstanding reservations are covered by the
global block reserve. If the global block reserve accounts for the vast
majority of outstanding reservations, skip preemptive flushing, as it
will likely just cause churn and pain.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
The preemptive flushing code was added in order to avoid needing to
synchronously wait for ENOSPC flushing to recover space. Once we're
almost full however we can essentially flush constantly. We were using
98% as a threshold to determine if we were simply full, however in
practice this is a really high bar to hit. For example reports of
systems running into this problem had around 94% usage and thus
continued to flush. Fix this by lowering the threshold to 90%, which is
a more sane value, especially for smaller file systems.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185
CC: [email protected] # 5.12+
Fixes: 576fa34830af ("btrfs: improve preemptive background space flushing")
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Function btrfs_lookup_data_extent calls btrfs_search_slot to verify if
the EXTENT_ITEM exists in the extent tree. btrfs_search_slot can return
values bellow zero if an error happened.
Function replay_one_extent currently checks if the search found
something (0 returned) and increments the reference, and if not, it
seems to evaluate as 'not found'.
Fix the condition by checking if the value was bellow zero and return
early.
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Marcos Paulo de Souza <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
There are several cases where when logging an inode we need to log its
parent directories or logging subdirectories when logging a directory.
There are cases however where we end up logging a directory even if it was
not changed in the current transaction, no dentries added or removed since
the last transaction. While this is harmless from a functional point of
view, it is a waste time as it brings no advantage.
One example where this is triggered is the following:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/A
$ mkdir /mnt/B
$ mkdir /mnt/C
$ touch /mnt/A/foo
$ ln /mnt/A/foo /mnt/B/bar
$ ln /mnt/A/foo /mnt/C/baz
$ sync
$ rm -f /mnt/A/foo
$ xfs_io -c "fsync" /mnt/B/bar
This last fsync ends up logging directories A, B and C, however we only
need to log directory A, as B and C were not changed since the last
transaction commit.
So fix this by changing need_log_inode(), to return false in case the
given inode is a directory and has a ->last_trans value smaller than the
current transaction's ID.
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Now that we converted btrfs internally to account for idmapped mounts
allow the creation of idmapped mounts on by setting the FS_ALLOW_IDMAP
flag. We only need to raise this flag on the btrfs_root_fs_type
filesystem since btrfs_mount_root() is ultimately responsible for
allocating the superblock and is called into from btrfs_mount()
associated with btrfs_fs_type.
The conversion of the btrfs inode operations was straightforward.
Regarding btrfs specific ioctls that perform checks based on inode
permissions only those have been allowed that are not filesystem wide
operations and hence can be reasonably charged against a specific mount.
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Make the ACL code idmapped mount aware. The POSIX default and POSIX
access ACLs are the only ACLs other than some specific xattrs that take
DAC permissions into account. On an idmapped mount they need to be
translated according to the mount's userns. The main change is done to
__btrfs_set_acl() which is responsible for translating POSIX ACLs to
their final on-disk representation.
The btrfs_init_acl() helper does not need to take the idmapped mount
into account since it is called in the context of file creation
operations (mknod, create, mkdir, symlink, tmpfile) and is used for
btrfs_init_inode_security() to copy POSIX default and POSIX access
permissions from the parent directory. These ACLs need to be inherited
unmodified from the parent directory. This is identical to what we do
for ext4 and xfs.
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
The INO_LOOKUP_USER is an unprivileged version of the INO_LOOKUP ioctl
and has the following restrictions. The main difference between the two
is that INO_LOOKUP is filesystem wide operation wheres INO_LOOKUP_USER
is scoped beneath the file descriptor passed with the ioctl.
Specifically, INO_LOOKUP_USER must adhere to the following restrictions:
- The caller must be privileged over each inode of each path component
for the path they are trying to lookup.
- The path for the subvolume the caller is trying to lookup must be reachable
from the inode associated with the file descriptor passed with the ioctl.
The second condition makes it possible to scope the lookup of the path
to the mount identified by the file descriptor passed with the ioctl.
This allows us to enable this ioctl on idmapped mounts.
Specifically, this is possible because all child subvolumes of a parent
subvolume are reachable when the parent subvolume is mounted. So if the
user had access to open the parent subvolume or has been given the fd
then they can lookup the path if they had access to it provided they
were privileged over each path component.
Note, the INO_LOOKUP_USER ioctl allows a user to learn the path and name
of a subvolume even though they would otherwise be restricted from doing
so via regular VFS-based lookup.
So think about a parent subvolume with multiple child subvolumes.
Someone could mount he parent subvolume and restrict access to the child
subvolumes by overmounting them with empty directories. At this point
the user can't traverse the child subvolumes and they can't open files
in the child subvolumes. However, they can still learn the path of
child subvolumes as long as they have access to the parent subvolume by
using the INO_LOOKUP_USER ioctl.
The underlying assumption here is that it's ok that the lookup ioctls
can't really take mounts into account other than the original mount the
fd belongs to during lookup. Since this assumption is baked into the
original INO_LOOKUP_USER ioctl we can extend it to idmapped mounts.
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Setting flags on subvolumes or snapshots are core features of btrfs. The
SUBVOL_SETFLAGS ioctl is especially important as it allows to make
subvolumes and snapshots read-only or read-write. Allow setting flags on
btrfs subvolumes and snapshots on idmapped mounts. This is a fairly
straightforward operation since all the permission checking helpers are
already capable of handling idmapped mounts. So we just need to pass
down the mount's userns.
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
The SET_RECEIVED_SUBVOL ioctls are used to set information about
a received subvolume. Make it possible to set information about a
received subvolume on idmapped mounts. This is a fairly straightforward
operation since all the permission checking helpers are already capable
of handling idmapped mounts. So we just need to pass down the mount's
userns.
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
So far we prevented the deletion of subvolumes and snapshots using
subvolume ids possible with the BTRFS_SUBVOL_SPEC_BY_ID flag.
This restriction is necessary on idmapped mounts as this allows
filesystem wide subvolume and snapshot deletions and thus can escape the
scope of what's exposed under the mount identified by the fd passed with
the ioctl.
Deletion by subvolume id works by looking for an alias of the parent of
the subvolume or snapshot to be deleted. The parent alias can be
anywhere in the filesystem. However, as long as the alias of the parent
that is found is the same as the one identified by the file descriptor
passed through the ioctl we can allow the deletion.
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Destroying subvolumes and snapshots are important features of btrfs.
Both operations are available to unprivileged users if the filesystem
has been mounted with the "user_subvol_rm_allowed" mount option. Allow
subvolume and snapshot deletion on idmapped mounts. This is a fairly
straightforward operation since all the permission checking helpers are
already capable of handling idmapped mounts. So we just need to pass
down the mount's userns.
Subvolumes and snapshots can either be deleted by specifying their name
or - if BTRFS_IOC_SNAP_DESTROY_V2 is used - by their subvolume or
snapshot id if the BTRFS_SUBVOL_SPEC_BY_ID is set.
This feature is blocked on idmapped mounts as this allows filesystem
wide subvolume deletions and thus can escape the scope of what's exposed
under the mount identified by the fd passed with the ioctl.
This means that even the root or CAP_SYS_ADMIN capable user can't delete
a subvolume via BTRFS_SUBVOL_SPEC_BY_ID. This is intentional.
The root user is currently already subject to permission checks in
btrfs_may_delete() including whether the inode's i_uid/i_gid of the
directory the subvolume is located in have a mapping in the caller's
idmapping. For this to fail isn't currently possible since a btrfs
filesystem can't be mounted with a non-initial idmapping but it shows
that even the root user would fail to delete a subvolume if the relevant
inode isn't mapped in their idmapping. The idmapped mount case is the
same in principle.
This isn't a huge problem a root user wanting to delete arbitrary
subvolumes can just always create another (even detached) mount without
an idmapping attached.
In addition, we will allow BTRFS_SUBVOL_SPEC_BY_ID for cases where the
subvolume to delete is directly located under inode referenced by the fd
passed for the ioctl() in a follow-up commit.
Here is an example where a btrfs subvolume is deleted through a
subvolume mount that does not expose the subvolume to be delete but it
can still be deleted by using the subvolume id:
/* Compile the following program as "delete_by_spec". */
#define _GNU_SOURCE
#include <fcntl.h>
#include <inttypes.h>
#include <linux/btrfs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
static int rm_subvolume_by_id(int fd, uint64_t subvolid)
{
struct btrfs_ioctl_vol_args_v2 args = {};
int ret;
args.flags = BTRFS_SUBVOL_SPEC_BY_ID;
args.subvolid = subvolid;
ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY_V2, &args);
if (ret < 0)
return -1;
return 0;
}
int main(int argc, char *argv[])
{
int subvolid = 0;
if (argc < 3)
exit(1);
fprintf(stderr, "Opening %s\n", argv[1]);
int fd = open(argv[1], O_CLOEXEC | O_DIRECTORY);
if (fd < 0)
exit(2);
subvolid = atoi(argv[2]);
fprintf(stderr, "Deleting subvolume with subvolid %d\n", subvolid);
int ret = rm_subvolume_by_id(fd, subvolid);
if (ret < 0)
exit(3);
exit(0);
}
#include <stdio.h>"
#include <stdlib.h>"
#include <linux/btrfs.h"
truncate -s 10G btrfs.img
mkfs.btrfs btrfs.img
export LOOPDEV=$(sudo losetup -f --show btrfs.img)
mount ${LOOPDEV} /mnt
sudo chown $(id -u):$(id -g) /mnt
btrfs subvolume create /mnt/A
btrfs subvolume create /mnt/B/C
# Get subvolume id via:
sudo btrfs subvolume show /mnt/A
# Save subvolid
SUBVOLID=<nr>
sudo umount /mnt
sudo mount ${LOOPDEV} -o subvol=B/C,user_subvol_rm_allowed /mnt
./delete_by_spec /mnt ${SUBVOLID}
With idmapped mounts this can potentially be used by users to delete
subvolumes/snapshots they would otherwise not have access to as the
idmapping would be applied to an inode that is not exposed in the mount
of the subvolume.
The fact that this is a filesystem wide operation suggests it might be a
good idea to expose this under a separate ioctl that clearly indicates
this. In essence, the file descriptor passed with the ioctl is merely
used to identify the filesystem on which to operate when
BTRFS_SUBVOL_SPEC_BY_ID is used.
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Creating subvolumes and snapshots is one of the core features of btrfs
and is even available to unprivileged users. Make it possible to use
subvolume and snapshot creation on idmapped mounts. This is a fairly
straightforward operation since all the permission checking helpers are
already capable of handling idmapped mounts. So we just need to pass
down the mount's userns.
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
When a new subvolume is created btrfs currently doesn't check whether
the fsgid/fsuid of the caller actually have a mapping in the user
namespace attached to the filesystem. The VFS always checks this to make
sure that the caller's fsgid/fsuid can be represented on-disk. This is
most relevant for filesystems that can be mounted inside user namespaces
but it is in general a good hardening measure to prevent unrepresentable
gid/uid from being written to disk.
Since we want to support idmapped mounts for btrfs ioctls to create
subvolumes in follow-up patches this becomes important since we want to
make sure the fsgid/fsuid of the caller as mapped according to the
idmapped mount can be represented on-disk. Simply add the missing
fsuidgid_has_mapping() line from the VFS may_create() version to
btrfs_may_create().
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Enable btrfs_permission() to handle idmapped mounts. This is just a
matter of passing down the mount's userns.
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Enable btrfs_setattr() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Enable btrfs_tmpfile() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Enable btrfs_symlink() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Enable btrfs_mkdir() to handle idmapped mounts. This is just a matter of
passing down the mount's userns.
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Enable btrfs_create() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Enable btrfs_mknod() to handle idmapped mounts. This is just a matter of
passing down the mount's userns.
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Enable btrfs_getattr() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Enable btrfs_rename() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Extend btrfs_new_inode() to take the idmapped mount into account when
initializing a new inode. This is just a matter of passing down the
mount's userns. The rest is taken care of in inode_init_owner(). This is
a preliminary patch to make the individual btrfs inode operations
idmapped mount aware.
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Various filesystems rely on the lookup_one_len() helper to lookup a
single path component relative to a well-known starting point. Allow
such filesystems to support idmapped mounts by adding a version of this
helper to take the idmap into account when calling inode_permission().
This change is a required to let btrfs (and other filesystems) support
idmapped mounts.
Cc: Christoph Hellwig <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: [email protected]
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Sysfs file has grown big. It takes some time to locate the correct
struct attribute to add new files. Create a table and map the struct
attribute to its sysfs path.
Also, fix the comment about the debug sysfs path. And add the comments
to the attributes instead of attribute group, where sysfs file names are
defined.
Signed-off-by: Anand Jain <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
[BUG]
It's easy to trigger NULL pointer dereference, just by removing a
non-existing device id:
# mkfs.btrfs -f -m single -d single /dev/test/scratch1 \
/dev/test/scratch2
# mount /dev/test/scratch1 /mnt/btrfs
# btrfs device remove 3 /mnt/btrfs
Then we have the following kernel NULL pointer dereference:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 9 PID: 649 Comm: btrfs Not tainted 5.14.0-rc3-custom+ #35
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:btrfs_rm_device+0x4de/0x6b0 [btrfs]
btrfs_ioctl+0x18bb/0x3190 [btrfs]
? lock_is_held_type+0xa5/0x120
? find_held_lock.constprop.0+0x2b/0x80
? do_user_addr_fault+0x201/0x6a0
? lock_release+0xd2/0x2d0
? __x64_sys_ioctl+0x83/0xb0
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
[CAUSE]
Commit a27a94c2b0c7 ("btrfs: Make btrfs_find_device_by_devspec return
btrfs_device directly") moves the "missing" device path check into
btrfs_rm_device().
But btrfs_rm_device() itself can have case where it only receives
@devid, with NULL as @device_path.
In that case, calling strcmp() on NULL will trigger the NULL pointer
dereference.
Before that commit, we handle the "missing" case inside
btrfs_find_device_by_devspec(), which will not check @device_path at all
if @devid is provided, thus no way to trigger the bug.
[FIX]
Before calling strcmp(), also make sure @device_path is not NULL.
Fixes: a27a94c2b0c7 ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly")
CC: [email protected] # 5.4+
Reported-by: butt3rflyh4ck <[email protected]>
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
We call split_zoned_em() on an extent_map on submitting a bio for it. Thus,
we can assume the extent_map is PINNED, not LOGGING, and in the modified
list. Add ASSERT()s to ensure the extent_maps after the split also has the
proper flags set and are in the modified list.
Suggested-by: Filipe Manana <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Naohiro Aota <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
alloc_offset is offset from the start of a block group and @offset is
actually an address in logical space. Thus, we need to consider
block_group->start when calculating them.
Fixes: 011b41bffa3d ("btrfs: zoned: advance allocation pointer after tree log node")
CC: [email protected] # 5.12+
Signed-off-by: Naohiro Aota <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
btrfs_relocate_chunk() can fail with -EAGAIN when e.g. send operations are
running. The message can fail btrfs/187 and it's unnecessary because we
anyway add it back to the reclaim list.
btrfs_reclaim_bgs_work()
`-> btrfs_relocate_chunk()
`-> btrfs_relocate_block_group()
`-> reloc_chunk_start()
`-> if (fs_info->send_in_progress)
`-> return -EAGAIN
CC: [email protected] # 5.13+
Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Naohiro Aota <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Automatically reclaiming dirty zones might not always be desired for all
workloads, especially as there are currently still some rough edges with
the relocation code on zoned filesystems.
Allow disabling zone auto reclaim on a per filesystem basis by writing 0
as the threshold value.
Reviewed-by: Naohiro Aota <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
A comment at log_conflicting_inodes() mentions that we check the inode's
logged_trans field instead of using btrfs_inode_in_log() because the field
last_log_commit is not updated when we log that an inode exists and the
inode has the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) set. The part
about the full sync flag is not true anymore since commit 9acc8103ab594f
("btrfs: fix unpersisted i_size on fsync after expanding truncate"), so
update the comment to not mention that part anymore.
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Now that we are checking if the inode's logged_trans is 0 to detect the
possibility of the inode having been evicted and reloaded, the test for
the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) is no longer needed at
tree-log.c:inode_logged(). Its purpose was to detect the possibility
of a previous eviction as well, since when an inode is loaded the full
sync flag is always set on it (and only cleared after the inode is
logged).
So just remove the check and update the comment. The check for the inode's
logged_trans being 0 was added recently by the patch with the subject
"btrfs: eliminate some false positives when checking if inode was logged".
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
At the very end of btrfs_rename_exchange(), in case an error happened, we
are checking if 'new_inode' is NULL, but that is not needed since during a
rename exchange, unlike regular renames, 'new_inode' can never be NULL,
and if it were, we would have a crashed much earlier when we dereference it
multiple times.
So remove the check because it is not necessary and because it is causing
static checkers to emit a warning. I probably introduced the check by
copy-pasting similar code from btrfs_rename(), where 'new_inode' can be
NULL, in commit 86e8aa0e772cab ("Btrfs: unpin logs if rename exchange
operation fails").
Reported-by: kernel test robot <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Instead of using kmalloc() to allocate backref_ctx, allocate backref_ctx
on stack. The size is reasonably small.
sizeof(backref_ctx) = 48
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Goldwyn Rodrigues <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Instead of using kmalloc() to allocate btrfs_ioctl_defrag_range_args,
allocate btrfs_ioctl_defrag_range_args on stack, the size is reasonably
small and ioctls are called in process context.
sizeof(btrfs_ioctl_defrag_range_args) = 48
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Goldwyn Rodrigues <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Instead of using kmalloc() to allocate btrfs_ioctl_quota_rescan_args,
allocate btrfs_ioctl_quota_rescan_args on stack, the size is reasonably
small and ioctls are called in process context.
sizeof(btrfs_ioctl_quota_rescan_args) = 64
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Goldwyn Rodrigues <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Instead of allocating file_ra_state using kmalloc, allocate on stack.
sizeof(struct readahead) = 32 bytes.
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Goldwyn Rodrigues <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
It's a common practice to start a search using offset (u64)-1, which is
the u64 maximum value, meaning that we want the search_slot function to
be set in the last item with the same objectid and type.
Once we are in this position, it's a matter to start a search backwards
by calling btrfs_previous_item, which will check if we'll need to go to
a previous leaf and other necessary checks, only to be sure that we are
in last offset of the same object and type.
The new btrfs_search_backwards function does the all these steps when
necessary, and can be used to avoid code duplication.
Signed-off-by: Marcos Paulo de Souza <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
As fsverity support depends on a config option, print that at module
load time like we do for similar features.
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Writing out the verity data is too large of an operation to do in a
single transaction. If we are interrupted before we finish creating
fsverity metadata for a file, or fail to clean up already created
metadata after a failure, we could leak the verity items that we already
committed.
To address this issue, we use the orphan mechanism. When we start
enabling verity on a file, we also add an orphan item for that inode.
When we are finished, we delete the orphan. However, if we are
interrupted midway, the orphan will be present at mount and we can
cleanup the half-formed verity state.
There is a possible race with a normal unlink operation: if unlink and
verity run on the same file in parallel, it is possible for verity to
succeed and delete the still legitimate orphan added by unlink. Then, if
we are interrupted and mount in that state, we will never clean up the
inode properly. This is also possible for a file created with O_TMPFILE.
Check nlink==0 before deleting to avoid this race.
A final thing to note is that this is a resurrection of using orphans to
signal an operation besides "delete this inode". The old case was to
signal the need to do a truncate. That case still technically applies
for mounting very old file systems, so we need to take some care to not
clobber it. To that end, we just have to be careful that verity orphan
cleanup is a no-op for non-verity files.
Signed-off-by: Boris Burkov <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|