blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2021-08-23	f2fs: rebuild nat_bits during umount	Chao Yu	3	-59/+95
	If all free_nat_bitmap are available, we can rebuild nat_bits from free_nat_bitmap entirely during umount, let's make another chance to reenable nat_bits for image. Signed-off-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2021-08-23	f2fs: introduce periodic iostat io latency traces	Daeho Jeong	5	-6/+220
	Whenever we notice some sluggish issues on our machines, we are always curious about how well all types of I/O in the f2fs filesystem are handled. But, it's hard to get this kind of real data. First of all, we need to reproduce the issue while turning on the profiling tool like blktrace, but the issue doesn't happen again easily. Second, with the intervention of any tools, the overall timing of the issue will be slightly changed and it sometimes makes us hard to figure it out. So, I added the feature printing out IO latency statistics tracepoint events, which are minimal things to understand filesystem's I/O related behaviors, into F2FS_IOSTAT kernel config. With "iostat_enable" sysfs node on, we can get this statistics info in a periodic way and it would cause the least overhead. [samples] f2fs_ckpt-254:1-507 [003] .... 2842.439683: f2fs_iostat_latency: dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count], rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4], wr_sync_data [164/16/3331], wr_sync_node [152/3/648], wr_sync_meta [160/2/4243], wr_async_data [24/13/15], wr_async_node [0/0/0], wr_async_meta [0/0/0] f2fs_ckpt-254:1-507 [002] .... 2845.450514: f2fs_iostat_latency: dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count], rd_data [60/3/456], rd_node [60/3/1258], rd_meta [0/0/1], wr_sync_data [120/12/2285], wr_sync_node [88/5/428], wr_sync_meta [52/6/2990], wr_async_data [4/1/3], wr_async_node [0/0/0], wr_async_meta [0/0/0] Signed-off-by: Daeho Jeong <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2021-08-23	f2fs: separate out iostat feature	Daeho Jeong	13	-149/+223
	Added F2FS_IOSTAT config option to support getting IO statistics through sysfs and printing out periodic IO statistics tracepoint events and moved I/O statistics related codes into separate files for better maintenance. Signed-off-by: Daeho Jeong <[email protected]> Reviewed-by: Chao Yu <[email protected]> [Jaegeuk Kim: set default=y] Signed-off-by: Jaegeuk Kim <[email protected]>
2021-08-23	lockd: update nlm_lookup_file reexport comment	J. Bruce Fields	1	-2/+3
	Update comment to reflect that we do allow reexport, whether it's a good idea or not.... Signed-off-by: J. Bruce Fields <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2021-08-23	nlm: minor refactoring	J. Bruce Fields	2	-10/+10
	Make this lookup slightly more concise, and prepare for changing how we look this up in a following patch. Signed-off-by: J. Bruce Fields <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2021-08-23	nlm: minor nlm_lookup_file argument change	J. Bruce Fields	3	-9/+11
	It'll come in handy to get the whole nlm_lock. Signed-off-by: J. Bruce Fields <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2021-08-23	Merge branch 'torvalds:master' into master	aalexandrovich	23	-157/+321

2021-08-23	btrfs: reset replace target device to allocation state on close	Desmond Cheong Zhi Xi	1	-0/+3
	This crash was observed with a failed assertion on device close: BTRFS: Transaction aborted (error -28) WARNING: CPU: 1 PID: 3902 at fs/btrfs/extent-tree.c:2150 btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs] Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop CPU: 1 PID: 3902 Comm: kworker/u8:4 Not tainted 5.14.0-rc5-default+ #1532 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] RIP: 0010:btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs] RSP: 0018:ffffb7a5452d7d80 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffffabee13c4 RDI: 00000000ffffffff RBP: ffff97834176a378 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: ffff97835195d388 R13: 0000000005b08000 R14: ffff978385484000 R15: 000000000000016c FS: 0000000000000000(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056190d003fe8 CR3: 000000002a81e005 CR4: 0000000000170ea0 Call Trace: flush_space+0x197/0x2f0 [btrfs] btrfs_async_reclaim_metadata_space+0x139/0x300 [btrfs] process_one_work+0x262/0x5e0 worker_thread+0x4c/0x320 ? process_one_work+0x5e0/0x5e0 kthread+0x144/0x170 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x1f/0x30 irq event stamp: 19334989 hardirqs last enabled at (19334997): [<ffffffffab0e0c87>] console_unlock+0x2b7/0x400 hardirqs last disabled at (19335006): [<ffffffffab0e0d0d>] console_unlock+0x33d/0x400 softirqs last enabled at (19334900): [<ffffffffaba0030d>] __do_softirq+0x30d/0x574 softirqs last disabled at (19334893): [<ffffffffab0721ec>] irq_exit_rcu+0x12c/0x140 ---[ end trace 45939e308e0dd3c7 ]--- BTRFS: error (device vdd) in btrfs_run_delayed_refs:2150: errno=-28 No space left BTRFS info (device vdd): forced readonly BTRFS warning (device vdd): failed setting block group ro: -30 BTRFS info (device vdd): suspending dev_replace for unmount assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150 ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3431! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 1 PID: 3982 Comm: umount Tainted: G W 5.14.0-rc5-default+ #1532 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs] RSP: 0018:ffffb7a5454c7db8 EFLAGS: 00010246 RAX: 0000000000000068 RBX: ffff978364b91c00 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffffabee13c4 RDI: 00000000ffffffff RBP: ffff9783523a4c00 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: ffff9783523a4d18 R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000003 FS: 00007f61c8f42800(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056190cffa810 CR3: 0000000030b96002 CR4: 0000000000170ea0 Call Trace: btrfs_close_one_device.cold+0x11/0x55 [btrfs] close_fs_devices+0x44/0xb0 [btrfs] btrfs_close_devices+0x48/0x160 [btrfs] generic_shutdown_super+0x69/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x2c/0xa0 cleanup_mnt+0x144/0x1b0 task_work_run+0x59/0xa0 exit_to_user_mode_loop+0xe7/0xf0 exit_to_user_mode_prepare+0xaf/0xf0 syscall_exit_to_user_mode+0x19/0x50 do_syscall_64+0x4a/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae This happens when close_ctree is called while a dev_replace hasn't completed. In close_ctree, we suspend the dev_replace, but keep the replace target around so that we can resume the dev_replace procedure when we mount the root again. This is the call trace: close_ctree(): btrfs_dev_replace_suspend_for_unmount(); btrfs_close_devices(): btrfs_close_fs_devices(): btrfs_close_one_device(): ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)); However, since the replace target sticks around, there is a device with BTRFS_DEV_STATE_REPLACE_TGT set on close, and we fail the assertion in btrfs_close_one_device. To fix this, if we come across the replace target device when closing, we should properly reset it back to allocation state. This fix also ensures that if a non-target device has a corrupted state and has the BTRFS_DEV_STATE_REPLACE_TGT bit set, the assertion will still catch the error. Reported-by: David Sterba <[email protected]> Fixes: b2a616676839 ("btrfs: fix rw device counting in __btrfs_free_extra_devids") CC: [email protected] # 4.19+ Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Desmond Cheong Zhi Xi <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	udf_get_extendedattr() had no boundary checks.	Stian Skjelstad	1	-2/+11
	When parsing the ExtendedAttr data, malicous or corrupt attribute length could cause kernel hangs and buffer overruns in some special cases. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Stian Skjelstad <[email protected]> Signed-off-by: Jan Kara <[email protected]>
2021-08-23	btrfs: zoned: fix ordered extent boundary calculation	Naohiro Aota	1	-6/+7
	btrfs_lookup_ordered_extent() is supposed to query the offset in a file instead of the logical address. Pass the file offset from submit_extent_page() to calc_bio_boundaries(). Also, calc_bio_boundaries() relies on the bio's operation flag, so move the call site after setting it. Fixes: 390ed29b817e ("btrfs: refactor submit_extent_page() to make bio and its flag tracing easier") Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: do not do preemptive flushing if the majority is global rsv	Josef Bacik	1	-0/+14
	A common characteristic of the bug report where preemptive flushing was going full tilt was the fact that the vast majority of the free metadata space was used up by the global reserve. The hard 90% threshold would cover the majority of these cases, but to be even smarter we should take into account how much of the outstanding reservations are covered by the global block reserve. If the global block reserve accounts for the vast majority of outstanding reservations, skip preemptive flushing, as it will likely just cause churn and pain. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185 Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: reduce the preemptive flushing threshold to 90%	Josef Bacik	1	-1/+1
	The preemptive flushing code was added in order to avoid needing to synchronously wait for ENOSPC flushing to recover space. Once we're almost full however we can essentially flush constantly. We were using 98% as a threshold to determine if we were simply full, however in practice this is a really high bar to hit. For example reports of systems running into this problem had around 94% usage and thus continued to flush. Fix this by lowering the threshold to 90%, which is a more sane value, especially for smaller file systems. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185 CC: [email protected] # 5.12+ Fixes: 576fa34830af ("btrfs: improve preemptive background space flushing") Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: tree-log: check btrfs_lookup_data_extent return value	Marcos Paulo de Souza	1	-1/+3
	Function btrfs_lookup_data_extent calls btrfs_search_slot to verify if the EXTENT_ITEM exists in the extent tree. btrfs_search_slot can return values bellow zero if an error happened. Function replay_one_extent currently checks if the search found something (0 returned) and increments the reference, and if not, it seems to evaluate as 'not found'. Fix the condition by checking if the value was bellow zero and return early. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Marcos Paulo de Souza <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: avoid unnecessarily logging directories that had no changes	Filipe Manana	1	-0/+7
	There are several cases where when logging an inode we need to log its parent directories or logging subdirectories when logging a directory. There are cases however where we end up logging a directory even if it was not changed in the current transaction, no dentries added or removed since the last transaction. While this is harmless from a functional point of view, it is a waste time as it brings no advantage. One example where this is triggered is the following: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/A $ mkdir /mnt/B $ mkdir /mnt/C $ touch /mnt/A/foo $ ln /mnt/A/foo /mnt/B/bar $ ln /mnt/A/foo /mnt/C/baz $ sync $ rm -f /mnt/A/foo $ xfs_io -c "fsync" /mnt/B/bar This last fsync ends up logging directories A, B and C, however we only need to log directory A, as B and C were not changed since the last transaction commit. So fix this by changing need_log_inode(), to return false in case the given inode is a directory and has a ->last_trans value smaller than the current transaction's ID. Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allow idmapped mount	Christian Brauner	1	-1/+1
	Now that we converted btrfs internally to account for idmapped mounts allow the creation of idmapped mounts on by setting the FS_ALLOW_IDMAP flag. We only need to raise this flag on the btrfs_root_fs_type filesystem since btrfs_mount_root() is ultimately responsible for allocating the superblock and is called into from btrfs_mount() associated with btrfs_fs_type. The conversion of the btrfs inode operations was straightforward. Regarding btrfs specific ioctls that perform checks based on inode permissions only those have been allowed that are not filesystem wide operations and hence can be reasonably charged against a specific mount. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: handle ACLs on idmapped mounts	Christian Brauner	1	-5/+6
	Make the ACL code idmapped mount aware. The POSIX default and POSIX access ACLs are the only ACLs other than some specific xattrs that take DAC permissions into account. On an idmapped mount they need to be translated according to the mount's userns. The main change is done to __btrfs_set_acl() which is responsible for translating POSIX ACLs to their final on-disk representation. The btrfs_init_acl() helper does not need to take the idmapped mount into account since it is called in the context of file creation operations (mknod, create, mkdir, symlink, tmpfile) and is used for btrfs_init_inode_security() to copy POSIX default and POSIX access permissions from the parent directory. These ACLs need to be inherited unmodified from the parent directory. This is identical to what we do for ext4 and xfs. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allow idmapped INO_LOOKUP_USER ioctl	Christian Brauner	1	-3/+4
	The INO_LOOKUP_USER is an unprivileged version of the INO_LOOKUP ioctl and has the following restrictions. The main difference between the two is that INO_LOOKUP is filesystem wide operation wheres INO_LOOKUP_USER is scoped beneath the file descriptor passed with the ioctl. Specifically, INO_LOOKUP_USER must adhere to the following restrictions: - The caller must be privileged over each inode of each path component for the path they are trying to lookup. - The path for the subvolume the caller is trying to lookup must be reachable from the inode associated with the file descriptor passed with the ioctl. The second condition makes it possible to scope the lookup of the path to the mount identified by the file descriptor passed with the ioctl. This allows us to enable this ioctl on idmapped mounts. Specifically, this is possible because all child subvolumes of a parent subvolume are reachable when the parent subvolume is mounted. So if the user had access to open the parent subvolume or has been given the fd then they can lookup the path if they had access to it provided they were privileged over each path component. Note, the INO_LOOKUP_USER ioctl allows a user to learn the path and name of a subvolume even though they would otherwise be restricted from doing so via regular VFS-based lookup. So think about a parent subvolume with multiple child subvolumes. Someone could mount he parent subvolume and restrict access to the child subvolumes by overmounting them with empty directories. At this point the user can't traverse the child subvolumes and they can't open files in the child subvolumes. However, they can still learn the path of child subvolumes as long as they have access to the parent subvolume by using the INO_LOOKUP_USER ioctl. The underlying assumption here is that it's ok that the lookup ioctls can't really take mounts into account other than the original mount the fd belongs to during lookup. Since this assumption is baked into the original INO_LOOKUP_USER ioctl we can extend it to idmapped mounts. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allow idmapped SUBVOL_SETFLAGS ioctl	Christian Brauner	1	-1/+1
	Setting flags on subvolumes or snapshots are core features of btrfs. The SUBVOL_SETFLAGS ioctl is especially important as it allows to make subvolumes and snapshots read-only or read-write. Allow setting flags on btrfs subvolumes and snapshots on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls	Christian Brauner	1	-3/+4
	The SET_RECEIVED_SUBVOL ioctls are used to set information about a received subvolume. Make it possible to set information about a received subvolume on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids	Christian Brauner	1	-11/+16
	So far we prevented the deletion of subvolumes and snapshots using subvolume ids possible with the BTRFS_SUBVOL_SPEC_BY_ID flag. This restriction is necessary on idmapped mounts as this allows filesystem wide subvolume and snapshot deletions and thus can escape the scope of what's exposed under the mount identified by the fd passed with the ioctl. Deletion by subvolume id works by looking for an alias of the parent of the subvolume or snapshot to be deleted. The parent alias can be anywhere in the filesystem. However, as long as the alias of the parent that is found is the same as the one identified by the file descriptor passed through the ioctl we can allow the deletion. Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allow idmapped SNAP_DESTROY ioctls	Christian Brauner	1	-7/+20
	Destroying subvolumes and snapshots are important features of btrfs. Both operations are available to unprivileged users if the filesystem has been mounted with the "user_subvol_rm_allowed" mount option. Allow subvolume and snapshot deletion on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Subvolumes and snapshots can either be deleted by specifying their name or - if BTRFS_IOC_SNAP_DESTROY_V2 is used - by their subvolume or snapshot id if the BTRFS_SUBVOL_SPEC_BY_ID is set. This feature is blocked on idmapped mounts as this allows filesystem wide subvolume deletions and thus can escape the scope of what's exposed under the mount identified by the fd passed with the ioctl. This means that even the root or CAP_SYS_ADMIN capable user can't delete a subvolume via BTRFS_SUBVOL_SPEC_BY_ID. This is intentional. The root user is currently already subject to permission checks in btrfs_may_delete() including whether the inode's i_uid/i_gid of the directory the subvolume is located in have a mapping in the caller's idmapping. For this to fail isn't currently possible since a btrfs filesystem can't be mounted with a non-initial idmapping but it shows that even the root user would fail to delete a subvolume if the relevant inode isn't mapped in their idmapping. The idmapped mount case is the same in principle. This isn't a huge problem a root user wanting to delete arbitrary subvolumes can just always create another (even detached) mount without an idmapping attached. In addition, we will allow BTRFS_SUBVOL_SPEC_BY_ID for cases where the subvolume to delete is directly located under inode referenced by the fd passed for the ioctl() in a follow-up commit. Here is an example where a btrfs subvolume is deleted through a subvolume mount that does not expose the subvolume to be delete but it can still be deleted by using the subvolume id: /* Compile the following program as "delete_by_spec". / #define _GNU_SOURCE #include <fcntl.h> #include <inttypes.h> #include <linux/btrfs.h> #include <stdio.h> #include <stdlib.h> #include <sys/ioctl.h> #include <sys/stat.h> #include <sys/types.h> #include <unistd.h> static int rm_subvolume_by_id(int fd, uint64_t subvolid) { struct btrfs_ioctl_vol_args_v2 args = {}; int ret; args.flags = BTRFS_SUBVOL_SPEC_BY_ID; args.subvolid = subvolid; ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY_V2, &args); if (ret < 0) return -1; return 0; } int main(int argc, char argv[]) { int subvolid = 0; if (argc < 3) exit(1); fprintf(stderr, "Opening %s\n", argv[1]); int fd = open(argv[1], O_CLOEXEC \| O_DIRECTORY); if (fd < 0) exit(2); subvolid = atoi(argv[2]); fprintf(stderr, "Deleting subvolume with subvolid %d\n", subvolid); int ret = rm_subvolume_by_id(fd, subvolid); if (ret < 0) exit(3); exit(0); } #include <stdio.h>" #include <stdlib.h>" #include <linux/btrfs.h" truncate -s 10G btrfs.img mkfs.btrfs btrfs.img export LOOPDEV=$(sudo losetup -f --show btrfs.img) mount ${LOOPDEV} /mnt sudo chown $(id -u):$(id -g) /mnt btrfs subvolume create /mnt/A btrfs subvolume create /mnt/B/C # Get subvolume id via: sudo btrfs subvolume show /mnt/A # Save subvolid SUBVOLID=<nr> sudo umount /mnt sudo mount ${LOOPDEV} -o subvol=B/C,user_subvol_rm_allowed /mnt ./delete_by_spec /mnt ${SUBVOLID} With idmapped mounts this can potentially be used by users to delete subvolumes/snapshots they would otherwise not have access to as the idmapping would be applied to an inode that is not exposed in the mount of the subvolume. The fact that this is a filesystem wide operation suggests it might be a good idea to expose this under a separate ioctl that clearly indicates this. In essence, the file descriptor passed with the ioctl is merely used to identify the filesystem on which to operate when BTRFS_SUBVOL_SPEC_BY_ID is used. Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls	Christian Brauner	3	-23/+32
	Creating subvolumes and snapshots is one of the core features of btrfs and is even available to unprivileged users. Make it possible to use subvolume and snapshot creation on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: check whether fsgid/fsuid are mapped during subvolume creation	Christian Brauner	1	-0/+2
	When a new subvolume is created btrfs currently doesn't check whether the fsgid/fsuid of the caller actually have a mapping in the user namespace attached to the filesystem. The VFS always checks this to make sure that the caller's fsgid/fsuid can be represented on-disk. This is most relevant for filesystems that can be mounted inside user namespaces but it is in general a good hardening measure to prevent unrepresentable gid/uid from being written to disk. Since we want to support idmapped mounts for btrfs ioctls to create subvolumes in follow-up patches this becomes important since we want to make sure the fsgid/fsuid of the caller as mapped according to the idmapped mount can be represented on-disk. Simply add the missing fsuidgid_has_mapping() line from the VFS may_create() version to btrfs_may_create(). Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allow idmapped permission inode op	Christian Brauner	1	-1/+1
	Enable btrfs_permission() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allow idmapped setattr inode op	Christian Brauner	1	-4/+3
	Enable btrfs_setattr() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allow idmapped tmpfile inode op	Christian Brauner	1	-1/+1
	Enable btrfs_tmpfile() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allow idmapped symlink inode op	Christian Brauner	1	-1/+1
	Enable btrfs_symlink() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allow idmapped mkdir inode op	Christian Brauner	1	-1/+1
	Enable btrfs_mkdir() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allow idmapped create inode op	Christian Brauner	1	-1/+1
	Enable btrfs_create() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allow idmapped mknod inode op	Christian Brauner	1	-1/+1
	Enable btrfs_mknod() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allow idmapped getattr inode op	Christian Brauner	1	-1/+1
	Enable btrfs_getattr() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allow idmapped rename inode op	Christian Brauner	1	-7/+10
	Enable btrfs_rename() to handle idmapped mounts. This is just a matter of passing down the mount's userns. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: handle idmaps in btrfs_new_inode()	Christian Brauner	1	-15/+19
	Extend btrfs_new_inode() to take the idmapped mount into account when initializing a new inode. This is just a matter of passing down the mount's userns. The rest is taken care of in inode_init_owner(). This is a preliminary patch to make the individual btrfs inode operations idmapped mount aware. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	namei: add mapping aware lookup helper	Christian Brauner	1	-6/+37
	Various filesystems rely on the lookup_one_len() helper to lookup a single path component relative to a well-known starting point. Allow such filesystems to support idmapped mounts by adding a version of this helper to take the idmap into account when calling inode_permission(). This change is a required to let btrfs (and other filesystems) support idmapped mounts. Cc: Christoph Hellwig <[email protected]> Cc: Al Viro <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: [email protected] Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: sysfs: document structures and their associated files	Anand Jain	1	-16/+75
	Sysfs file has grown big. It takes some time to locate the correct struct attribute to add new files. Create a table and map the struct attribute to its sysfs path. Also, fix the comment about the debug sysfs path. And add the comments to the attributes instead of attribute group, where sysfs file names are defined. Signed-off-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: fix NULL pointer dereference when deleting device by invalid id	Qu Wenruo	1	-1/+1
	[BUG] It's easy to trigger NULL pointer dereference, just by removing a non-existing device id: # mkfs.btrfs -f -m single -d single /dev/test/scratch1 \ /dev/test/scratch2 # mount /dev/test/scratch1 /mnt/btrfs # btrfs device remove 3 /mnt/btrfs Then we have the following kernel NULL pointer dereference: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 9 PID: 649 Comm: btrfs Not tainted 5.14.0-rc3-custom+ #35 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:btrfs_rm_device+0x4de/0x6b0 [btrfs] btrfs_ioctl+0x18bb/0x3190 [btrfs] ? lock_is_held_type+0xa5/0x120 ? find_held_lock.constprop.0+0x2b/0x80 ? do_user_addr_fault+0x201/0x6a0 ? lock_release+0xd2/0x2d0 ? __x64_sys_ioctl+0x83/0xb0 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae [CAUSE] Commit a27a94c2b0c7 ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly") moves the "missing" device path check into btrfs_rm_device(). But btrfs_rm_device() itself can have case where it only receives @devid, with NULL as @device_path. In that case, calling strcmp() on NULL will trigger the NULL pointer dereference. Before that commit, we handle the "missing" case inside btrfs_find_device_by_devspec(), which will not check @device_path at all if @devid is provided, thus no way to trigger the bug. [FIX] Before calling strcmp(), also make sure @device_path is not NULL. Fixes: a27a94c2b0c7 ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly") CC: [email protected] # 5.4+ Reported-by: butt3rflyh4ck <[email protected]> Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: zoned: add asserts on splitting extent_map	Naohiro Aota	1	-6/+6
	We call split_zoned_em() on an extent_map on submitting a bio for it. Thus, we can assume the extent_map is PINNED, not LOGGING, and in the modified list. Add ASSERT()s to ensure the extent_maps after the split also has the proper flags set and are in the modified list. Suggested-by: Filipe Manana <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: zoned: fix block group alloc_offset calculation	Naohiro Aota	1	-2/+5
	alloc_offset is offset from the start of a block group and @offset is actually an address in logical space. Thus, we need to consider block_group->start when calculating them. Fixes: 011b41bffa3d ("btrfs: zoned: advance allocation pointer after tree log node") CC: [email protected] # 5.12+ Signed-off-by: Naohiro Aota <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: zoned: suppress reclaim error message on EAGAIN	Naohiro Aota	1	-1/+1
	btrfs_relocate_chunk() can fail with -EAGAIN when e.g. send operations are running. The message can fail btrfs/187 and it's unnecessary because we anyway add it back to the reclaim list. btrfs_reclaim_bgs_work() `-> btrfs_relocate_chunk() `-> btrfs_relocate_block_group() `-> reloc_chunk_start() `-> if (fs_info->send_in_progress) `-> return -EAGAIN CC: [email protected] # 5.13+ Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones") Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: zoned: allow disabling of zone auto reclaim	Johannes Thumshirn	2	-6/+8
	Automatically reclaiming dirty zones might not always be desired for all workloads, especially as there are currently still some rough edges with the relocation code on zoned filesystems. Allow disabling zone auto reclaim on a per filesystem basis by writing 0 as the threshold value. Reviewed-by: Naohiro Aota <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: update comment at log_conflicting_inodes()	Filipe Manana	1	-2/+2
	A comment at log_conflicting_inodes() mentions that we check the inode's logged_trans field instead of using btrfs_inode_in_log() because the field last_log_commit is not updated when we log that an inode exists and the inode has the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) set. The part about the full sync flag is not true anymore since commit 9acc8103ab594f ("btrfs: fix unpersisted i_size on fsync after expanding truncate"), so update the comment to not mention that part anymore. Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: remove no longer needed full sync flag check at inode_logged()	Filipe Manana	1	-7/+5
	Now that we are checking if the inode's logged_trans is 0 to detect the possibility of the inode having been evicted and reloaded, the test for the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) is no longer needed at tree-log.c:inode_logged(). Its purpose was to detect the possibility of a previous eviction as well, since when an inode is loaded the full sync flag is always set on it (and only cleared after the inode is logged). So just remove the check and update the comment. The check for the inode's logged_trans being 0 was added recently by the patch with the subject "btrfs: eliminate some false positives when checking if inode was logged". Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: remove unnecessary NULL check for the new inode during rename exchange	Filipe Manana	1	-2/+1
	At the very end of btrfs_rename_exchange(), in case an error happened, we are checking if 'new_inode' is NULL, but that is not needed since during a rename exchange, unlike regular renames, 'new_inode' can never be NULL, and if it were, we would have a crashed much earlier when we dereference it multiple times. So remove the check because it is not necessary and because it is causing static checkers to emit a warning. I probably introduced the check by copy-pasting similar code from btrfs_rename(), where 'new_inode' can be NULL, in commit 86e8aa0e772cab ("Btrfs: unpin logs if rename exchange operation fails"). Reported-by: kernel test robot <[email protected]> Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allocate backref_ctx on stack in find_extent_clone	Goldwyn Rodrigues	1	-18/+11
	Instead of using kmalloc() to allocate backref_ctx, allocate backref_ctx on stack. The size is reasonably small. sizeof(backref_ctx) = 48 Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Goldwyn Rodrigues <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allocate btrfs_ioctl_defrag_range_args on stack	Goldwyn Rodrigues	1	-16/+7
	Instead of using kmalloc() to allocate btrfs_ioctl_defrag_range_args, allocate btrfs_ioctl_defrag_range_args on stack, the size is reasonably small and ioctls are called in process context. sizeof(btrfs_ioctl_defrag_range_args) = 48 Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Goldwyn Rodrigues <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allocate btrfs_ioctl_quota_rescan_args on stack	Goldwyn Rodrigues	1	-9/+4
	Instead of using kmalloc() to allocate btrfs_ioctl_quota_rescan_args, allocate btrfs_ioctl_quota_rescan_args on stack, the size is reasonably small and ioctls are called in process context. sizeof(btrfs_ioctl_quota_rescan_args) = 64 Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Goldwyn Rodrigues <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: allocate file_ra_state on stack in readahead_cache	Goldwyn Rodrigues	1	-9/+3
	Instead of allocating file_ra_state using kmalloc, allocate on stack. sizeof(struct readahead) = 32 bytes. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Goldwyn Rodrigues <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: introduce btrfs_search_backwards function	Marcos Paulo de Souza	5	-48/+40
	It's a common practice to start a search using offset (u64)-1, which is the u64 maximum value, meaning that we want the search_slot function to be set in the last item with the same objectid and type. Once we are in this position, it's a matter to start a search backwards by calling btrfs_previous_item, which will check if we'll need to go to a previous leaf and other necessary checks, only to be sure that we are in last offset of the same object and type. The new btrfs_search_backwards function does the all these steps when necessary, and can be used to avoid code duplication. Signed-off-by: Marcos Paulo de Souza <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: print if fsverity support is built in when loading module	David Sterba	1	-0/+5
	As fsverity support depends on a config option, print that at module load time like we do for similar features. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23	btrfs: verity metadata orphan items	Boris Burkov	2	-9/+93
	Writing out the verity data is too large of an operation to do in a single transaction. If we are interrupted before we finish creating fsverity metadata for a file, or fail to clean up already created metadata after a failure, we could leak the verity items that we already committed. To address this issue, we use the orphan mechanism. When we start enabling verity on a file, we also add an orphan item for that inode. When we are finished, we delete the orphan. However, if we are interrupted midway, the orphan will be present at mount and we can cleanup the half-formed verity state. There is a possible race with a normal unlink operation: if unlink and verity run on the same file in parallel, it is possible for verity to succeed and delete the still legitimate orphan added by unlink. Then, if we are interrupted and mount in that state, we will never clean up the inode properly. This is also possible for a file created with O_TMPFILE. Check nlink==0 before deleting to avoid this race. A final thing to note is that this is a resurrection of using orphans to signal an operation besides "delete this inode". The old case was to signal the need to do a truncate. That case still technically applies for mounting very old file systems, so we need to take some care to not clobber it. To that end, we just have to be careful that verity orphan cleanup is a no-op for non-verity files. Signed-off-by: Boris Burkov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>