aboutsummaryrefslogtreecommitdiff
path: root/fs/orangefs
AgeCommit message (Collapse)AuthorFilesLines
2022-12-07orangefs: remove variable iColin Ian King1-2/+0
Variable i is just being incremented and it's never used anywhere else. The variable and the increment are redundant so remove it. Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Mike Marshall <[email protected]>
2022-11-25use less confusing names for iov_iter direction initializersAl Viro1-4/+4
READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way. Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret... Signed-off-by: Al Viro <[email protected]>
2022-11-13orangefs: fix mode handlingChristian Brauner3-10/+12
In 4053d2500beb ("orangefs: rework posix acl handling when creating new filesystem objects") we tried to precalculate the correct mode when creating a new inode. However, this leads to regressions when creating new filesystem objects. Even if we precalculate the mode we still need to call __orangefs_setattr() to perform additional checks and we also need to update the mode of ACL_TYPE_ACCESS acls set on the inode. The patch referenced above regressed that. Restore that part of the old behavior and remove the mode precalculation as it doesn't get us anything anymore. Fixes: 4053d2500beb ("orangefs: rework posix acl handling when creating new filesystem objects") Reported-by: Mike Marshall <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
2022-10-20fs: rename current get acl methodChristian Brauner2-2/+2
The current way of setting and getting posix acls through the generic xattr interface is error prone and type unsafe. The vfs needs to interpret and fixup posix acls before storing or reporting it to userspace. Various hacks exist to make this work. The code is hard to understand and difficult to maintain in it's current form. Instead of making this work by hacking posix acls through xattr handlers we are building a dedicated posix acl api around the get and set inode operations. This removes a lot of hackiness and makes the codepaths easier to maintain. A lot of background can be found in [1]. The current inode operation for getting posix acls takes an inode argument but various filesystems (e.g., 9p, cifs, overlayfs) need access to the dentry. In contrast to the ->set_acl() inode operation we cannot simply extend ->get_acl() to take a dentry argument. The ->get_acl() inode operation is called from: acl_permission_check() -> check_acl() -> get_acl() which is part of generic_permission() which in turn is part of inode_permission(). Both generic_permission() and inode_permission() are called in the ->permission() handler of various filesystems (e.g., overlayfs). So simply passing a dentry argument to ->get_acl() would amount to also having to pass a dentry argument to ->permission(). We should avoid this unnecessary change. So instead of extending the existing inode operation rename it from ->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that passes a dentry argument and which filesystems that need access to the dentry can implement instead of ->get_inode_acl(). Filesystems like cifs which allow setting and getting posix acls but not using them for permission checking during lookup can simply not implement ->get_inode_acl(). This is intended to be a non-functional change. Link: https://lore.kernel.org/all/[email protected] [1] Suggested-by/Inspired-by: Christoph Hellwig <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
2022-10-19fs: pass dentry to set acl methodChristian Brauner3-7/+9
The current way of setting and getting posix acls through the generic xattr interface is error prone and type unsafe. The vfs needs to interpret and fixup posix acls before storing or reporting it to userspace. Various hacks exist to make this work. The code is hard to understand and difficult to maintain in it's current form. Instead of making this work by hacking posix acls through xattr handlers we are building a dedicated posix acl api around the get and set inode operations. This removes a lot of hackiness and makes the codepaths easier to maintain. A lot of background can be found in [1]. Since some filesystem rely on the dentry being available to them when setting posix acls (e.g., 9p and cifs) they cannot rely on set acl inode operation. But since ->set_acl() is required in order to use the generic posix acl xattr handlers filesystems that do not implement this inode operation cannot use the handler and need to implement their own dedicated posix acl handlers. Update the ->set_acl() inode method to take a dentry argument. This allows all filesystems to rely on ->set_acl(). As far as I can tell all codepaths can be switched to rely on the dentry instead of just the inode. Note that the original motivation for passing the dentry separate from the inode instead of just the dentry in the xattr handlers was because of security modules that call security_d_instantiate(). This hook is called during d_instantiate_new(), d_add(), __d_instantiate_anon(), and d_splice_alias() to initialize the inode's security context and possibly to set security.* xattrs. Since this only affects security.* xattrs this is completely irrelevant for posix acls. Link: https://lore.kernel.org/all/[email protected] [1] Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
2022-10-19orangefs: rework posix acl handling when creating new filesystem objectsChristian Brauner4-54/+50
When creating new filesytem objects orangefs used to create posix acls after it had created and inserted a new inode. This made it necessary to all posix_acl_chmod() on the newly created inode in case the mode of the inode would be changed by the posix acls. Instead of doing it this way calculate the correct mode directly before actually creating the inode. So we first create posix acls, then pass the mode that posix acls mandate into the orangefs getattr helper and calculate the correct mode. This is needed so we can simply change posix_acl_chmod() to take a dentry instead of an inode argument in the next patch. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
2022-10-13Merge tag 'for-linus-6.1-ofs1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs update from Mike Marshall: "Change iterate to iterate_shared" * tag 'for-linus-6.1-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: Orangefs: change iterate to iterate_shared
2022-10-03Orangefs: change iterate to iterate_sharedMike Marshall1-1/+1
Changed .iterate to .iterate_shared in orangefs_dir_operations. I didn't change anything else, there were no xfstests regressions and no problem with any of my other tests... Signed-off-by: Mike Marshall <[email protected]>
2022-09-01orangefs: use ->f_mappingAl Viro1-3/+1
... and don't check for impossible conditions - file_inode() is never NULL in anything seen by ->release() and neither is its ->i_mapping. Reviewed-by: Christian Brauner (Microsoft) <[email protected]> Signed-off-by: Al Viro <[email protected]>
2022-06-29orangefs: Remove test for folio errorMatthew Wilcox (Oracle)1-3/+1
The page cache clears the error bit before calling ->read_folio(), so this condition could never have been true. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
2022-05-09orangefs: Convert to free_folioMatthew Wilcox (Oracle)1-3/+3
I suspect this isn't actually needed and that releasepage will have done the job, but convert it for now and we can delete it later. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
2022-05-09orangefs: Convert to release_folioMatthew Wilcox (Oracle)1-3/+3
Use folios throughout the release_folio path. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Jeff Layton <[email protected]>
2022-05-09orangefs: Convert orangefs to read_folioMatthew Wilcox (Oracle)1-17/+16
This is a full conversion which should be large folio ready, although I have not tested it. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
2022-05-08fs: Remove flags parameter from aops->write_beginMatthew Wilcox (Oracle)1-3/+2
There are no more aop flags left, so remove the parameter. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2022-05-08fs: Remove aop flags parameter from grab_cache_page_write_begin()Matthew Wilcox (Oracle)1-1/+1
There are no more aop flags left, so remove the parameter. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2022-03-22Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds1-60/+61
Pull filesystem folio updates from Matthew Wilcox: "Primarily this series converts some of the address_space operations to take a folio instead of a page. Notably: - a_ops->is_partially_uptodate() takes a folio instead of a page and changes the type of the 'from' and 'count' arguments to make it obvious they're bytes. - a_ops->invalidatepage() becomes ->invalidate_folio() and has a similar type change. - a_ops->launder_page() becomes ->launder_folio() - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the address_space as an argument. There are a couple of other misc changes up front that weren't worth separating into their own pull request" * tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits) fs: Remove aops ->set_page_dirty fb_defio: Use noop_dirty_folio() fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio fs: Convert __set_page_dirty_buffers to block_dirty_folio nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio() mm: Convert swap_set_page_dirty() to swap_dirty_folio() ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio() btrfs: Convert extent_range_redirty_for_io() to use folios fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio btrfs: Convert from set_page_dirty to dirty_folio fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio() fs: Add aops->dirty_folio fs: Remove aops->launder_page orangefs: Convert launder_page to launder_folio nfs: Convert from launder_page to launder_folio fuse: Convert from launder_page to launder_folio ...
2022-03-22fs: allocate inode by using alloc_inode_sb()Muchun Song1-1/+1
The inode allocation is supposed to use alloc_inode_sb(), so convert kmem_cache_alloc() of all filesystems to alloc_inode_sb(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Acked-by: Theodore Ts'o <[email protected]> [ext4] Acked-by: Roman Gushchin <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-15fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folioMatthew Wilcox (Oracle)1-1/+1
These filesystems use __set_page_dirty_nobuffers() either directly or with a very thin wrapper; convert them en masse. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Tested-by: Damien Le Moal <[email protected]> Acked-by: Damien Le Moal <[email protected]> Tested-by: Mike Marshall <[email protected]> # orangefs Tested-by: David Howells <[email protected]> # afs
2022-03-15orangefs: Convert launder_page to launder_folioMatthew Wilcox (Oracle)1-33/+36
OrangeFS launders its pages from a number of locations, so add a small amount of folio usage to its callers where it makes sense. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Tested-by: Damien Le Moal <[email protected]> Acked-by: Damien Le Moal <[email protected]> Tested-by: Mike Marshall <[email protected]> # orangefs Tested-by: David Howells <[email protected]> # afs
2022-03-15orangefs: Convert from invalidatepage to invalidate_folioMatthew Wilcox (Oracle)1-27/+25
This is a straightforward conversion. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Tested-by: Damien Le Moal <[email protected]> Acked-by: Damien Le Moal <[email protected]> Tested-by: Mike Marshall <[email protected]> # orangefs Tested-by: David Howells <[email protected]> # afs
2021-12-31orangefs: Fix the size of a memory allocation in orangefs_bufmap_alloc()Christophe JAILLET1-4/+3
'buffer_index_array' really looks like a bitmap. So it should be allocated as such. When kzalloc is called, a number of bytes is expected, but a number of longs is passed instead. In get(), if not enough memory is allocated, un-allocated memory may be read or written. So use bitmap_zalloc() to safely allocate the correct memory size and avoid un-expected behavior. While at it, change the corresponding kfree() into bitmap_free() to keep the semantic. Fixes: ea2c9c9f6574 ("orangefs: bufmap rewrite") Signed-off-by: Christophe JAILLET <[email protected]> Signed-off-by: Mike Marshall <[email protected]>
2021-12-31orangefs: use default_groups in kobj_typeGreg Kroah-Hartman1-7/+14
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the orangfs code to use default_groups field which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Cc: Mike Marshall <[email protected]> Cc: Martin Brandenburg <[email protected]> Cc: [email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Mike Marshall <[email protected]>
2021-11-09Merge tag 'for-linus-5.16-ofs1' of ↵Linus Torvalds2-3/+5
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs fixes from Mike Marshall: - fix sb refcount leak when allocate sb info failed (Chenyuan Mi) - fix error return code of orangefs_revalidate_lookup() (Jia-Ju Bai) - remove redundant initialization of variable ret (Colin Ian King) * tag 'for-linus-5.16-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: orangefs: Fix sb refcount leak when allocate sb info failed. fs: orangefs: fix error return code of orangefs_revalidate_lookup() orangefs: Remove redundant initialization of variable ret
2021-10-18mm: don't include <linux/blkdev.h> in <linux/backing-dev.h>Christoph Hellwig1-1/+1
Move inode_to_bdi out of line to avoid having to include blkdev.h. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18mm: don't include <linux/blk-cgroup.h> in <linux/backing-dev.h>Christoph Hellwig1-0/+1
There is no need to pull blk-cgroup.h and thus blkdev.h in here, so break the include chain. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-11orangefs: Fix sb refcount leak when allocate sb info failed.Chenyuan Mi1-1/+1
The reference counting issue happens in one exception handling path of orangefs_mount(). When failing to allocate sb info, the function forgets to decrease the refcount of sb increased by sget(), causing a refcount leak. Fix this issue by jumping to the label "free_sb_and_op" instead of "free_op" Signed-off-by: Chenyuan Mi <[email protected]> Signed-off-by: Xiyu Yang <[email protected]> Signed-off-by: Xin Tan <[email protected]> Signed-off-by: Mike Marshall <[email protected]>
2021-10-11fs: orangefs: fix error return code of orangefs_revalidate_lookup()Jia-Ju Bai1-1/+3
When op_alloc() returns NULL to new_op, no error return code of orangefs_revalidate_lookup() is assigned. To fix this bug, ret is assigned with -ENOMEM in this case. Fixes: 8bb8aefd5afb ("OrangeFS: Change almost all instances of the string PVFS2 to OrangeFS.") Reported-by: TOTE Robot <[email protected]> Signed-off-by: Jia-Ju Bai <[email protected]> Signed-off-by: Mike Marshall <[email protected]>
2021-10-11orangefs: Remove redundant initialization of variable retColin Ian King1-1/+1
The variable ret is being initialized with a value that is never read, it is being updated later on. The assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Mike Marshall <[email protected]>
2021-08-18vfs: add rcu argument to ->get_acl() callbackMiklos Szeredi2-2/+5
Add a rcu argument to the ->get_acl() callback to allow get_cached_acl_rcu() to call the ->get_acl() method in the next patch. Signed-off-by: Miklos Szeredi <[email protected]>
2021-08-17fs: add generic helper for filling statx attribute flagsAmir Goldstein1-6/+1
The immutable and append-only properties on an inode are published on the inode's i_flags and enforced by the VFS. Create a helper to fill the corresponding STATX_ATTR_ flags in the kstat structure from the inode's i_flags. Only orange was converted to use this helper. Other filesystems could use it in the future. Suggested-by: Miklos Szeredi <[email protected]> Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]>
2021-06-28orangefs: fix orangefs df output.Mike Marshall1-1/+1
Orangefs df output is whacky. Walt Ligon suggested this might fix it. It seems way more in line with reality now... Signed-off-by: Mike Marshall <[email protected]>
2021-06-28orangefs: readahead adjustmentMike Marshall1-4/+3
Matthew Wilcox suggested that perhaps it could be "possible for rac->file to be NULL if the caller doesn't have a struct file"... Signed-off-by: Mike Marshall <[email protected]>
2021-04-29orangefs: leave files in the page cache for a few micro seconds at leastMike Marshall1-1/+1
Signed-off-by: Mike Marshall <[email protected]>
2021-04-28Orangef: implement orangefs_readahead.Mike Marshall2-103/+53
Also remove open-coded readahead logic from orangefs_readpage. Signed-off-by: Mike Marshall <[email protected]>
2021-04-27Merge branch 'miklos.fileattr' of ↵Linus Torvalds2-79/+50
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fileattr conversion updates from Miklos Szeredi via Al Viro: "This splits the handling of FS_IOC_[GS]ETFLAGS from ->ioctl() into a separate method. The interface is reasonably uniform across the filesystems that support it and gives nice boilerplate removal" * 'miklos.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (23 commits) ovl: remove unneeded ioctls fuse: convert to fileattr fuse: add internal open/release helpers fuse: unsigned open flags fuse: move ioctl to separate source file vfs: remove unused ioctl helpers ubifs: convert to fileattr reiserfs: convert to fileattr ocfs2: convert to fileattr nilfs2: convert to fileattr jfs: convert to fileattr hfsplus: convert to fileattr efivars: convert to fileattr xfs: convert to fileattr orangefs: convert to fileattr gfs2: convert to fileattr f2fs: convert to fileattr ext4: convert to fileattr ext2: convert to fileattr btrfs: convert to fileattr ...
2021-04-12orangefs: convert to fileattrMiklos Szeredi2-79/+50
Use the fileattr API to let the VFS handle locking, permission checking and conversion. Signed-off-by: Miklos Szeredi <[email protected]> Cc: Mike Marshall <[email protected]>
2021-03-12orangefs_inode_is_stale(): i_mode type bits do *not* form a bitmap...Al Viro1-1/+1
Signed-off-by: Al Viro <[email protected]>
2021-02-23Merge tag 'idmapped-mounts-v5.12' of ↵Linus Torvalds5-20/+32
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull idmapped mounts from Christian Brauner: "This introduces idmapped mounts which has been in the making for some time. Simply put, different mounts can expose the same file or directory with different ownership. This initial implementation comes with ports for fat, ext4 and with Christoph's port for xfs with more filesystems being actively worked on by independent people and maintainers. Idmapping mounts handle a wide range of long standing use-cases. Here are just a few: - Idmapped mounts make it possible to easily share files between multiple users or multiple machines especially in complex scenarios. For example, idmapped mounts will be used in the implementation of portable home directories in systemd-homed.service(8) where they allow users to move their home directory to an external storage device and use it on multiple computers where they are assigned different uids and gids. This effectively makes it possible to assign random uids and gids at login time. - It is possible to share files from the host with unprivileged containers without having to change ownership permanently through chown(2). - It is possible to idmap a container's rootfs and without having to mangle every file. For example, Chromebooks use it to share the user's Download folder with their unprivileged containers in their Linux subsystem. - It is possible to share files between containers with non-overlapping idmappings. - Filesystem that lack a proper concept of ownership such as fat can use idmapped mounts to implement discretionary access (DAC) permission checking. - They allow users to efficiently changing ownership on a per-mount basis without having to (recursively) chown(2) all files. In contrast to chown (2) changing ownership of large sets of files is instantenous with idmapped mounts. This is especially useful when ownership of a whole root filesystem of a virtual machine or container is changed. With idmapped mounts a single syscall mount_setattr syscall will be sufficient to change the ownership of all files. - Idmapped mounts always take the current ownership into account as idmappings specify what a given uid or gid is supposed to be mapped to. This contrasts with the chown(2) syscall which cannot by itself take the current ownership of the files it changes into account. It simply changes the ownership to the specified uid and gid. This is especially problematic when recursively chown(2)ing a large set of files which is commong with the aforementioned portable home directory and container and vm scenario. - Idmapped mounts allow to change ownership locally, restricting it to specific mounts, and temporarily as the ownership changes only apply as long as the mount exists. Several userspace projects have either already put up patches and pull-requests for this feature or will do so should you decide to pull this: - systemd: In a wide variety of scenarios but especially right away in their implementation of portable home directories. https://systemd.io/HOME_DIRECTORY/ - container runtimes: containerd, runC, LXD:To share data between host and unprivileged containers, unprivileged and privileged containers, etc. The pull request for idmapped mounts support in containerd, the default Kubernetes runtime is already up for quite a while now: https://github.com/containerd/containerd/pull/4734 - The virtio-fs developers and several users have expressed interest in using this feature with virtual machines once virtio-fs is ported. - ChromeOS: Sharing host-directories with unprivileged containers. I've tightly synced with all those projects and all of those listed here have also expressed their need/desire for this feature on the mailing list. For more info on how people use this there's a bunch of talks about this too. Here's just two recent ones: https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf https://fosdem.org/2021/schedule/event/containers_idmap/ This comes with an extensive xfstests suite covering both ext4 and xfs: https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts It covers truncation, creation, opening, xattrs, vfscaps, setid execution, setgid inheritance and more both with idmapped and non-idmapped mounts. It already helped to discover an unrelated xfs setgid inheritance bug which has since been fixed in mainline. It will be sent for inclusion with the xfstests project should you decide to merge this. In order to support per-mount idmappings vfsmounts are marked with user namespaces. The idmapping of the user namespace will be used to map the ids of vfs objects when they are accessed through that mount. By default all vfsmounts are marked with the initial user namespace. The initial user namespace is used to indicate that a mount is not idmapped. All operations behave as before and this is verified in the testsuite. Based on prior discussions we want to attach the whole user namespace and not just a dedicated idmapping struct. This allows us to reuse all the helpers that already exist for dealing with idmappings instead of introducing a whole new range of helpers. In addition, if we decide in the future that we are confident enough to enable unprivileged users to setup idmapped mounts the permission checking can take into account whether the caller is privileged in the user namespace the mount is currently marked with. The user namespace the mount will be marked with can be specified by passing a file descriptor refering to the user namespace as an argument to the new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern of extensibility. The following conditions must be met in order to create an idmapped mount: - The caller must currently have the CAP_SYS_ADMIN capability in the user namespace the underlying filesystem has been mounted in. - The underlying filesystem must support idmapped mounts. - The mount must not already be idmapped. This also implies that the idmapping of a mount cannot be altered once it has been idmapped. - The mount must be a detached/anonymous mount, i.e. it must have been created by calling open_tree() with the OPEN_TREE_CLONE flag and it must not already have been visible in the filesystem. The last two points guarantee easier semantics for userspace and the kernel and make the implementation significantly simpler. By default vfsmounts are marked with the initial user namespace and no behavioral or performance changes are observed. The manpage with a detailed description can be found here: https://git.kernel.org/brauner/man-pages/c/1d7b902e2875a1ff342e036a9f866a995640aea8 In order to support idmapped mounts, filesystems need to be changed and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The patches to convert individual filesystem are not very large or complicated overall as can be seen from the included fat, ext4, and xfs ports. Patches for other filesystems are actively worked on and will be sent out separately. The xfstestsuite can be used to verify that port has been done correctly. The mount_setattr() syscall is motivated independent of the idmapped mounts patches and it's been around since July 2019. One of the most valuable features of the new mount api is the ability to perform mounts based on file descriptors only. Together with the lookup restrictions available in the openat2() RESOLVE_* flag namespace which we added in v5.6 this is the first time we are close to hardened and race-free (e.g. symlinks) mounting and path resolution. While userspace has started porting to the new mount api to mount proper filesystems and create new bind-mounts it is currently not possible to change mount options of an already existing bind mount in the new mount api since the mount_setattr() syscall is missing. With the addition of the mount_setattr() syscall we remove this last restriction and userspace can now fully port to the new mount api, covering every use-case the old mount api could. We also add the crucial ability to recursively change mount options for a whole mount tree, both removing and adding mount options at the same time. This syscall has been requested multiple times by various people and projects. There is a simple tool available at https://github.com/brauner/mount-idmapped that allows to create idmapped mounts so people can play with this patch series. I'll add support for the regular mount binary should you decide to pull this in the following weeks: Here's an example to a simple idmapped mount of another user's home directory: u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt u1001@f2-vm:/$ ls -al /home/ubuntu/ total 28 drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 . drwxr-xr-x 4 root root 4096 Oct 28 04:00 .. -rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ ls -al /mnt/ total 28 drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 . drwxr-xr-x 29 root root 4096 Oct 28 22:01 .. -rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile -rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ touch /mnt/my-file u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file u1001@f2-vm:/$ ls -al /mnt/my-file -rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file u1001@f2-vm:/$ ls -al /home/ubuntu/my-file -rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file u1001@f2-vm:/$ getfacl /mnt/my-file getfacl: Removing leading '/' from absolute path names # file: mnt/my-file # owner: u1001 # group: u1001 user::rw- user:u1001:rwx group::rw- mask::rwx other::r-- u1001@f2-vm:/$ getfacl /home/ubuntu/my-file getfacl: Removing leading '/' from absolute path names # file: home/ubuntu/my-file # owner: ubuntu # group: ubuntu user::rw- user:ubuntu:rwx group::rw- mask::rwx other::r--" * tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits) xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl xfs: support idmapped mounts ext4: support idmapped mounts fat: handle idmapped mounts tests: add mount_setattr() selftests fs: introduce MOUNT_ATTR_IDMAP fs: add mount_setattr() fs: add attr_flags_to_mnt_flags helper fs: split out functions to hold writers namespace: only take read lock in do_reconfigure_mnt() mount: make {lock,unlock}_mount_hash() static namespace: take lock_mount_hash() directly when changing flags nfs: do not export idmapped mounts overlayfs: do not mount on top of idmapped mounts ecryptfs: do not mount on top of idmapped mounts ima: handle idmapped mounts apparmor: handle idmapped mounts fs: make helpers idmap mount aware exec: handle idmapped mounts would_dump: handle idmapped mounts ...
2021-01-24fs: make helpers idmap mount awareChristian Brauner4-14/+24
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/[email protected] Cc: Christoph Hellwig <[email protected]> Cc: David Howells <[email protected]> Cc: Al Viro <[email protected]> Cc: [email protected] Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2021-01-24stat: handle idmapped mountsChristian Brauner1-1/+1
The generic_fillattr() helper fills in the basic attributes associated with an inode. Enable it to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace before we store the uid and gid. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/[email protected] Cc: Christoph Hellwig <[email protected]> Cc: David Howells <[email protected]> Cc: Al Viro <[email protected]> Cc: [email protected] Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: James Morris <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2021-01-24acl: handle idmapped mountsChristian Brauner3-2/+4
The posix acl permission checking helpers determine whether a caller is privileged over an inode according to the acls associated with the inode. Add helpers that make it possible to handle acls on idmapped mounts. The vfs and the filesystems targeted by this first iteration make use of posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to translate basic posix access and default permissions such as the ACL_USER and ACL_GROUP type according to the initial user namespace (or the superblock's user namespace) to and from the caller's current user namespace. Adapt these two helpers to handle idmapped mounts whereby we either map from or into the mount's user namespace depending on in which direction we're translating. Similarly, cap_convert_nscap() is used by the vfs to translate user namespace and non-user namespace aware filesystem capabilities from the superblock's user namespace to the caller's user namespace. Enable it to handle idmapped mounts by accounting for the mount's user namespace. In addition the fileystems targeted in the first iteration of this patch series make use of the posix_acl_chmod() and, posix_acl_update_mode() helpers. Both helpers perform permission checks on the target inode. Let them handle idmapped mounts. These two helpers are called when posix acls are set by the respective filesystems to handle this case we extend the ->set() method to take an additional user namespace argument to pass the mount's user namespace down. Link: https://lore.kernel.org/r/[email protected] Cc: Christoph Hellwig <[email protected]> Cc: David Howells <[email protected]> Cc: Al Viro <[email protected]> Cc: [email protected] Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2021-01-24attr: handle idmapped mountsChristian Brauner1-2/+2
When file attributes are changed most filesystems rely on the setattr_prepare(), setattr_copy(), and notify_change() helpers for initialization and permission checking. Let them handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Helpers that perform checks on the ia_uid and ia_gid fields in struct iattr assume that ia_uid and ia_gid are intended values and have already been mapped correctly at the userspace-kernelspace boundary as we already do today. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/[email protected] Cc: Christoph Hellwig <[email protected]> Cc: David Howells <[email protected]> Cc: Al Viro <[email protected]> Cc: [email protected] Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2021-01-24namei: make permission helpers idmapped mount awareChristian Brauner1-1/+1
The two helpers inode_permission() and generic_permission() are used by the vfs to perform basic permission checking by verifying that the caller is privileged over an inode. In order to handle idmapped mounts we extend the two helpers with an additional user namespace argument. On idmapped mounts the two helpers will make sure to map the inode according to the mount's user namespace and then peform identical permission checks to inode_permission() and generic_permission(). If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/[email protected] Cc: Christoph Hellwig <[email protected]> Cc: David Howells <[email protected]> Cc: Al Viro <[email protected]> Cc: [email protected] Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: James Morris <[email protected]> Acked-by: Serge Hallyn <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2021-01-06orangefs_file_mmap(): use %pDAl Viro1-4/+1
... and no, file can't be NULL there - it's not called that way *and* it would've oopsed a few lines prior on such call anyway. Signed-off-by: Al Viro <[email protected]>
2020-12-16orangefs: add splice file operationsMike Marshall1-0/+2
Fix some xfstests regressions that started after 36e2c7421f02, "don't allow splice read/write without explicit ops". Thanks for help from Dave Chinner and Matthew Wilcox. Signed-off-by: Mike Marshall <[email protected]>
2020-08-04orangefs: remove unnecessary assignment to variable retJing Xiangfeng1-1/+0
The variable ret is guaranteed to be 0 in this if (). So we can remove this assignement. Signed-off-by: Jing Xiangfeng <[email protected]> Signed-off-by: Mike Marshall <[email protected]>
2020-07-28orangefs: posix acl fix...Mike Marshall1-9/+10
Al Viro pointed out that I broke some acl functionality... * ACLs could not be fully removed * posix_acl_chmod would be called while the old ACL was still cached * new mode propagated to orangefs server before ACL. ... when I tried to make sure that modes that got changed as a result of ACL-sets would be sent back to the orangefs server. Not wanting to try and change the code without having some cases to test it with, I began to hunt for setfacl examples that were expressible in pure mode. Along the way I found examples like the following which confused me: user A had a file (/home/A/asdf) with mode 740 user B was in user A's group user C was not in user A's group setfacl -m u:C:rwx /home/A/asdf The above setfacl caused ls -l /home/A/asdf to show a mode of 770, making it appear that all users in user A's group now had full access to /home/A/asdf, however, user B still only had read acces. Madness. Anywho, I finally found that the above (whacky as it is) appears to be "posixly on purpose" and explained in acl(5): If the ACL has an ACL_MASK entry, the group permissions correspond to the permissions of the ACL_MASK entry. Signed-off-by: Mike Marshall <[email protected]>
2020-06-05Merge tag 'for-linus-5.8-ofs1' of ↵Linus Torvalds2-7/+4
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs updates from Mike Marshall: - John Hubbard's conversion from get_user_pages() to pin_user_pages() - Colin Ian King's removal of an unneeded variable initialization * tag 'for-linus-5.8-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: orangefs: convert get_user_pages() --> pin_user_pages() orangefs: remove redundant assignment to variable ret
2020-06-02orangefs: use attach/detach_page_privateGuoqing Jiang1-26/+6
Since the new pair function is introduced, we can call them to clean the code in orangefs. Signed-off-by: Guoqing Jiang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Mike Marshall <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Martin Brandenburg <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-05-29orangefs: convert get_user_pages() --> pin_user_pages()John Hubbard1-6/+3
This code was using get_user_pages*(), in a "Case 1" scenario (Direct IO), using the categorization from [1]. That means that it's time to convert the get_user_pages*() + put_page() calls to pin_user_pages*() + unpin_user_pages() calls. There is some helpful background in [2]: basically, this is a small part of fixing a long-standing disconnect between pinning pages, and file systems' use of those pages. [1] Documentation/core-api/pin_user_pages.rst [2] "Explicit pinning of user-space pages": https://lwn.net/Articles/807108/ Cc: Mike Marshall <[email protected]> Cc: Martin Brandenburg <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Mike Marshall <[email protected]>