aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-08-30inode: port __I_SYNC to var eventChristian Brauner1-16/+29
Port the __I_SYNC mechanism to use the new var event mechanism. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30fs: reorder i_state bitsChristian Brauner1-17/+21
so that we can use the first bits to derive unique addresses from i_state. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30fs: add i_state helpersChristian Brauner2-0/+26
The i_state member is an unsigned long so that it can be used with the wait bit infrastructure which expects unsigned long. This wastes 4 bytes which we're unlikely to ever use. Switch to using the var event wait mechanism using the address of the bit. Thanks to Linus for the address idea. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30MAINTAINERS: add the VFS git treeEric Biggers1-0/+1
The VFS git tree is missing from MAINTAINERS. Add it. Signed-off-by: Eric Biggers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2024-08-30fs: s/__u32/u32/ for s_fsnotify_maskChristian Brauner1-1/+1
The underscore variants are for uapi whereas the non-underscore variants are for in-kernel consumers. Link: https://lore.kernel.org/r/20240822-anwerben-nutzung-1cd6c82a565f@brauner Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30fs: remove unused path_put_init()Christian Brauner1-6/+0
This helper has been unused for a while now. Link: https://lore.kernel.org/r/20240822-bewuchs-werktag-46672b3c0606@brauner Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30vfs: drop one lock trip in evict()Mateusz Guzik2-16/+6
Most commonly neither I_LRU_ISOLATING nor I_SYNC are set, but the stock kernel takes a back-to-back relock trip to check for them. It probably can be avoided altogether, but for now massage things back to just one lock acquire. Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Zhihao Cheng <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30debugfs show actual source in /proc/mountsMarc Aurèle La France1-0/+8
After its conversion to the new mount API, debugfs displays "none" in /proc/mounts instead of the actual source. Fix this by recognising its "source" mount option. Signed-off-by: Marc Aurèle La France <[email protected]> Link: https://lore.kernel.org/r/[email protected] Fixes: a20971c18752 ("vfs: Convert debugfs to use the new mount API") Cc: [email protected] # 6.10.x: 49abee5991e1: debugfs: Convert to new uid/gid option parsing helpers Signed-off-by: Christian Brauner <[email protected]>
2024-08-30inode: remove __I_DIO_WAKEUPChristian Brauner3-34/+19
Afaict, we can just rely on inode->i_dio_count for waiting instead of this awkward indirection through __I_DIO_WAKEUP. This survives LTP dio and xfstests dio tests. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30doc: correcting the idmapping mount exampleHongbo Li1-4/+4
In step 2, we obtain the kernel id `k1000`. So in next step (step 3), we should translate the `k1000` not `k21000`. Signed-off-by: Hongbo Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2024-08-30fs: Use in_group_or_capable() helper to simplify the codeHongbo Li1-2/+2
Since in_group_or_capable has been exported, we can use it to simplify the code when check group and capable. Signed-off-by: Hongbo Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30vfs: elide smp_mb in iversion handling in the common caseMateusz Guzik1-10/+18
According to bpftrace on these routines most calls result in cmpxchg, which already provides the same guarantee. In inode_maybe_inc_iversion elision is possible because even if the wrong value was read due to now missing smp_mb fence, the issue is going to correct itself after cmpxchg. If it appears cmpxchg wont be issued, the fence + reload are there bringing back previous behavior. Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30autofs: add per dentry expire timeoutIan Kent5-8/+104
Add ability to set per-dentry mount expire timeout to autofs. There are two fairly well known automounter map formats, the autofs format and the amd format (more or less System V and Berkley). Some time ago Linux autofs added an amd map format parser that implemented a fair amount of the amd functionality. This was done within the autofs infrastructure and some functionality wasn't implemented because it either didn't make sense or required extra kernel changes. The idea was to restrict changes to be within the existing autofs functionality as much as possible and leave changes with a wider scope to be considered later. One of these changes is implementing the amd options: 1) "unmount", expire this mount according to a timeout (same as the current autofs default). 2) "nounmount", don't expire this mount (same as setting the autofs timeout to 0 except only for this specific mount) . 3) "utimeout=<seconds>", expire this mount using the specified timeout (again same as setting the autofs timeout but only for this mount). To implement these options per-dentry expire timeouts need to be implemented for autofs indirect mounts. This is because all map keys (mounts) for autofs indirect mounts use an expire timeout stored in the autofs mount super block info. structure and all indirect mounts use the same expire timeout. Now I have a request to add the "nounmount" option so I need to add the per-dentry expire handling to the kernel implementation to do this. The implementation uses the trailing path component to identify the mount (and is also used as the autofs map key) which is passed in the autofs_dev_ioctl structure path field. The expire timeout is passed in autofs_dev_ioctl timeout field (well, of the timeout union). If the passed in timeout is equal to -1 the per-dentry timeout and flag are cleared providing for the "unmount" option. If the timeout is greater than or equal to 0 the timeout is set to the value and the flag is also set. If the dentry timeout is 0 the dentry will not expire by timeout which enables the implementation of the "nounmount" option for the specific mount. When the dentry timeout is greater than zero it allows for the implementation of the "utimeout=<seconds>" option. Signed-off-by: Ian Kent <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2024-08-30vfs: use RCU in ilookupMateusz Guzik1-3/+1
A soft lockup in ilookup was reported when stress-testing a 512-way system [1] (see [2] for full context) and it was verified that not taking the lock shifts issues back to mm. [1] https://lore.kernel.org/linux-mm/[email protected]/ [2] https://lore.kernel.org/linux-mm/[email protected]/ Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30fs: move FMODE_UNSIGNED_OFFSET to fop_flagsChristian Brauner18-21/+27
This is another flag that is statically set and doesn't need to use up an FMODE_* bit. Move it to ->fop_flags and free up another FMODE_* bit. (1) mem_open() used from proc_mem_operations (2) adi_open() used from adi_fops (3) drm_open_helper(): (3.1) accel_open() used from DRM_ACCEL_FOPS (3.2) drm_open() used from (3.2.1) amdgpu_driver_kms_fops (3.2.2) psb_gem_fops (3.2.3) i915_driver_fops (3.2.4) nouveau_driver_fops (3.2.5) panthor_drm_driver_fops (3.2.6) radeon_driver_kms_fops (3.2.7) tegra_drm_fops (3.2.8) vmwgfx_driver_fops (3.2.9) xe_driver_fops (3.2.10) DRM_GEM_FOPS (3.2.11) DEFINE_DRM_GEM_DMA_FOPS (4) struct memdev sets fmode flags based on type of device opened. For devices using struct mem_fops unsigned offset is used. Mark all these file operations as FOP_UNSIGNED_OFFSET and add asserts into the open helper to ensure that the flag is always set. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30vfs: only read fops once in fops_get/putMateusz Guzik1-4/+11
In do_dentry_open() the usage is: f->f_op = fops_get(inode->i_fop); In generated asm the compiler emits 2 reads from inode->i_fop instead of just one. This popped up due to false-sharing where loads from that offset end up bouncing a cacheline during parallel open. While this is going to be fixed, the spurious load does not need to be there. This makes do_dentry_open() go down from 1177 to 1154 bytes. fops_put() is patched to maintain some consistency. No functional changes. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2024-08-30fs/select: Annotate struct poll_list with __counted_by()Thorsten Blum1-1/+1
Add the __counted_by compiler attribute to the flexible array member entries to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE. Signed-off-by: Thorsten Blum <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Gustavo A. R. Silva <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30fs: rearrange general fastpath check now that O_CREAT uses itChristian Brauner1-13/+4
If we find a positive dentry we can now simply try and open it. All prelimiary checks are already done with or without O_CREAT. Signed-off-by: Christian Brauner <[email protected]>
2024-08-30fs: remove audit dummy context checkChristian Brauner1-11/+1
Now that we audit later during lookup_open() we can remove the audit dummy context check. This simplifies things a lot. Signed-off-by: Christian Brauner <[email protected]>
2024-08-30fs: pull up trailing slashes check for O_CREATChristian Brauner1-8/+4
Perform the check for trailing slashes right in the fastpath check and don't bother with any additional work. Signed-off-by: Christian Brauner <[email protected]>
2024-08-30fs: move audit parent inodeChristian Brauner1-1/+3
During O_CREAT we unconditionally audit the parent inode. This makes it difficult to support a fastpath for O_CREAT when the file already exists because we have to drop out of RCU lookup needlessly. We worked around this by checking whether audit was actually active but that's also suboptimal. Instead, move the audit of the parent inode down into lookup_open() at a point where it's mostly certain that the file needs to be created. This also reduced the inconsistency that currently exists: while audit on the parent is done independent of whether or no the file already existed an audit on the file is only performed if it has been created. By moving the audit down a bit we emit the audit a little later but it will allow us to simplify the fastpath for O_CREAT significantly. Signed-off-by: Christian Brauner <[email protected]>
2024-08-30fs: try an opportunistic lookup for O_CREAT opens tooJeff Layton1-10/+64
Today, when opening a file we'll typically do a fast lookup, but if O_CREAT is set, the kernel always takes the exclusive inode lock. I assume this was done with the expectation that O_CREAT means that we always expect to do the create, but that's often not the case. Many programs set O_CREAT even in scenarios where the file already exists. This patch rearranges the pathwalk-for-open code to also attempt a fast_lookup in certain O_CREAT cases. If a positive dentry is found, the inode_lock can be avoided altogether, and if auditing isn't enabled, it can stay in rcuwalk mode for the last step_into. One notable exception that is hopefully temporary: if we're doing an rcuwalk and auditing is enabled, skip the lookup_fast. Legitimizing the dentry in that case is more expensive than taking the i_rwsem for now. Signed-off-by: Jeff Layton <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30eventpoll: Annotate data-race of busy_poll_usecsMartin Karsten1-1/+1
A struct eventpoll's busy_poll_usecs field can be modified via a user ioctl at any time. All reads of this field should be annotated with READ_ONCE. Fixes: 85455c795c07 ("eventpoll: support busy poll per epoll instance") Cc: [email protected] Signed-off-by: Martin Karsten <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Joe Damato <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30eventpoll: Don't re-zero eventpoll fieldsJoe Damato1-5/+0
Remove redundant and unnecessary code. ep_alloc uses kzalloc to create struct eventpoll, so there is no need to set fields to defaults of 0. This was accidentally introduced in commit 85455c795c07 ("eventpoll: support busy poll per epoll instance") and expanded on in follow-up commits. Signed-off-by: Joe Damato <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Martin Karsten <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30Fix spelling and gramatical errorsXiaxi Shen1-3/+3
Fixed 3 typos in design.rst Signed-off-by: Xiaxi Shen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Carlos Maiolino <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30vfs: dodge smp_mb in break_lease and break_deleg in the common caseMateusz Guzik1-2/+12
These inlines show up in the fast path (e.g., in do_dentry_open()) and induce said full barrier regarding i_flctx access when in most cases the pointer is NULL. The pointer can be safely checked before issuing the barrier, dodging it in most cases as a result. It is plausible the consume fence would be sufficient, but I don't want to go audit all callers regarding what they before calling here. Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30file: remove outdated comment after close_fd()Joel Savitz1-1/+1
Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Jan Kara <[email protected]> Cc: [email protected] The comment on EXPORT_SYMBOL(close_fd) was added in commit 2ca2a09d6215 ("fs: add ksys_close() wrapper; remove in-kernel calls to sys_close()"), before commit 8760c909f54a ("file: Rename __close_fd to close_fd and remove the files parameter") gave the function its current name, however commit 1572bfdf21d4 ("file: Replace ksys_close with close_fd") removes the referenced caller entirely, obsoleting this comment. Signed-off-by: Joel Savitz <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30fs/namespace.c: Fix typo in commentYuesong Li1-2/+2
replace 'permanetly' with 'permanently' in the comment & replace 'propogated' with 'propagated' in the comment Signed-off-by: Yuesong Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2024-08-30exec: don't WARN for racy path_noexec checkMateusz Guzik1-19/+12
Both i_mode and noexec checks wrapped in WARN_ON stem from an artifact of the previous implementation. They used to legitimately check for the condition, but that got moved up in two commits: 633fb6ac3980 ("exec: move S_ISREG() check earlier") 0fd338b2d2cd ("exec: move path_noexec() check earlier") Instead of being removed said checks are WARN_ON'ed instead, which has some debug value. However, the spurious path_noexec check is racy, resulting in unwarranted warnings should someone race with setting the noexec flag. One can note there is more to perm-checking whether execve is allowed and none of the conditions are guaranteed to still hold after they were tested for. Additionally this does not validate whether the code path did any perm checking to begin with -- it will pass if the inode happens to be regular. Keep the redundant path_noexec() check even though it's mindless nonsense checking for guarantee that isn't given so drop the WARN. Reword the commentary and do small tidy ups while here. Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] [brauner: keep redundant path_noexec() check] Signed-off-by: Christian Brauner <[email protected]>
2024-08-30fs: add a kerneldoc header over lookup_fastJeff Layton1-0/+14
The lookup_fast helper in fs/namei.c has some subtlety in how dentries are returned. Document them. Signed-off-by: Jeff Layton <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2024-08-30fs: remove comment about d_rcu_to_refcountJeff Layton1-3/+0
This function no longer exists. Signed-off-by: Jeff Layton <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2024-08-30fs: mounts: Remove unused declaration mnt_cursor_del()Yue Haibing1-1/+0
Commit 2eea9ce4310d ("mounts: keep list of mounts in an rbtree") removed the implementation but leave declaration. Signed-off-by: Yue Haibing <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2024-08-30Merge patch series "proc: restrict overmounting of ephemeral entities"Christian Brauner3-10/+23
Christian Brauner <[email protected]> says: It is currently possible to mount on top of various ephemeral entities in procfs. This specifically includes magic links. To recap, magic links are links of the form /proc/<pid>/fd/<nr>. They serve as references to a target file and during path lookup they cause a jump to the target path. Such magic links disappear if the corresponding file descriptor is closed. Currently it is possible to overmount such magic links: int fd = open("/mnt/foo", O_RDONLY); sprintf(path, "/proc/%d/fd/%d", getpid(), fd); int fd2 = openat(AT_FDCWD, path, O_PATH | O_NOFOLLOW); mount("/mnt/bar", path, "", MS_BIND, 0); Arguably, this is nonsensical and is mostly interesting for an attacker that wants to somehow trick a process into e.g., reopening something that they didn't intend to reopen or to hide a malicious file descriptor. But also it risks leaking mounts for long-running processes. When overmounting a magic link like above, the mount will not be detached when the file descriptor is closed. Only the target mountpoint will disappear. Which has the consequence of making it impossible to unmount that mount afterwards. So the mount will stick around until the process exits and the /proc/<pid>/ directory is cleaned up during proc_flush_pid() when the dentries are pruned and invalidated. That in turn means it's possible for a program to accidentally leak mounts and it's also possible to make a task leak mounts without it's knowledge if the attacker just keeps overmounting things under /proc/<pid>/fd/<nr>. I think it's wrong to try and fix this by us starting to play games with close() or somewhere else to undo these mounts when the file descriptor is closed. The fact that we allow overmounting of such magic links is simply a bug and one that we need to fix. Similar things can be said about entries under fdinfo/ and map_files/ so those are restricted as well. I have a further more aggressive patch that gets out the big hammer and makes everything under /proc/<pid>/*, as well as immediate symlinks such as /proc/self, /proc/thread-self, /proc/mounts, /proc/net that point into /proc/<pid>/ not overmountable. Imho, all of this should be blocked if we can get away with it. It's only useful to hide exploits such as in [1]. And again, overmounting of any global procfs files remains unaffected and is an existing and supported use-case. Link: https://righteousit.com/2024/07/24/hiding-linux-processes-with-bind-mounts [1] // Note that repro uses the traditional way of just mounting over // /proc/<pid>/fd/<nr>. This could also all be achieved just based on // file descriptors using move_mount(). So /proc/<pid>/fd/<nr> isn't the // only entry vector here. It's also possible to e.g., mount directly // onto /proc/<pid>/map_files/* without going over /proc/<pid>/fd/<nr>. int main(int argc, char *argv[]) { char path[PATH_MAX]; creat("/mnt/foo", 0777); creat("/mnt/bar", 0777); /* * For illustration use a bunch of file descriptors in the upper * range that are unused. */ for (int i = 10000; i >= 256; i--) { printf("I'm: /proc/%d/\n", getpid()); int fd2 = open("/mnt/foo", O_RDONLY); if (fd2 < 0) { printf("%m - Failed to open\n"); _exit(1); } int newfd = dup2(fd2, i); if (newfd < 0) { printf("%m - Failed to dup\n"); _exit(1); } close(fd2); sprintf(path, "/proc/%d/fd/%d", getpid(), newfd); int fd = openat(AT_FDCWD, path, O_PATH | O_NOFOLLOW); if (fd < 0) { printf("%m - Failed to open\n"); _exit(3); } sprintf(path, "/proc/%d/fd/%d", getpid(), fd); printf("Mounting on top of %s\n", path); if (mount("/mnt/bar", path, "", MS_BIND, 0)) { printf("%m - Failed to mount\n"); _exit(4); } close(newfd); close(fd2); } /* * Give some time to look at things. The mounts now linger until * the process exits. */ sleep(10000); _exit(0); } * patches from https://lore.kernel.org/r/[email protected]: proc: block mounting on top of /proc/<pid>/fdinfo/* proc: block mounting on top of /proc/<pid>/fd/* proc: block mounting on top of /proc/<pid>/map_files/* proc: add proc_splice_unmountable() proc: proc_readfdinfo() -> proc_fdinfo_iterate() proc: proc_readfd() -> proc_fd_iterate() Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2024-08-30proc: block mounting on top of /proc/<pid>/fdinfo/*Christian Brauner1-2/+2
Entries under /proc/<pid>/fdinfo/* are ephemeral and may go away before the process dies. As such allowing them to be used as mount points creates the ability to leak mounts that linger until the process dies with no ability to unmount them until then. Don't allow using them as mountpoints. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30proc: block mounting on top of /proc/<pid>/fd/*Christian Brauner1-2/+2
Entries under /proc/<pid>/fd/* are ephemeral and may go away before the process dies. As such allowing them to be used as mount points creates the ability to leak mounts that linger until the process dies with no ability to unmount them until then. Don't allow using them as mountpoints. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30proc: block mounting on top of /proc/<pid>/map_files/*Christian Brauner1-2/+2
Entries under /proc/<pid>/map_files/* are ephemeral and may go away before the process dies. As such allowing them to be used as mount points creates the ability to leak mounts that linger until the process dies with no ability to unmount them until then. Don't allow using them as mountpoints. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30proc: add proc_splice_unmountable()Christian Brauner1-0/+13
Add a tiny procfs helper to splice a dentry that cannot be mounted upon. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30proc: proc_readfdinfo() -> proc_fdinfo_iterate()Christian Brauner1-2/+2
Give the method to iterate through the fdinfo directory a better name. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30proc: proc_readfd() -> proc_fd_iterate()Christian Brauner1-2/+2
Give the method to iterate through the fd directory a better name. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-30proc: add config & param to block forcing mem writesAdrian Ratiu3-1/+102
This adds a Kconfig option and boot param to allow removing the FOLL_FORCE flag from /proc/pid/mem write calls because it can be abused. The traditional forcing behavior is kept as default because it can break GDB and some other use cases. Previously we tried a more sophisticated approach allowing distributions to fine-tune /proc/pid/mem behavior, however that got NAK-ed by Linus [1], who prefers this simpler approach with semantics also easier to understand for users. Link: https://lore.kernel.org/lkml/CAHk-=wiGWLChxYmUA5HrT5aopZrB7_2VTa0NLZcxORgkUe5tEQ@mail.gmail.com/ [1] Cc: Doug Anderson <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kees Cook <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Christian Brauner <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> Signed-off-by: Adrian Ratiu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2024-08-30Merge tag 'drm-intel-next-2024-08-29' of ↵Dave Airlie80-775/+1515
https://gitlab.freedesktop.org/drm/i915/kernel into drm-next Cross-driver (xe-core) Changes: - Require BMG scanout buffers to be 64k physically aligned (Maarten) Core (drm) Changes: - Introducing Xe2 ccs modifiers for integrated and discrete graphics (Juha-Pekka) Driver Changes: - General cleanup and more work moving towards intel_display isolation (Jani) - New display workaround (Suraj) - Use correct cp_irq_count on HDCP (Suraj) - eDP PSR fix when CRC is enabled (Jouni) - Fix DP MST state after a sink reset (Imre) - Fix Arrow Lake GSC firmware version (John) - Use chained DSBs for LUT programming (Ville) Signed-off-by: Dave Airlie <[email protected]> From: Rodrigo Vivi <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2024-08-30Merge tag 'drm-xe-next-2024-08-28' of ↵Dave Airlie87-835/+2271
https://gitlab.freedesktop.org/drm/xe/kernel into drm-next UAPI Changes: - Fix OA format masks which were breaking build with gcc-5 Cross-subsystem Changes: Driver Changes: - Use dma_fence_chain_free in chain fence unused as a sync (Matthew Brost) - Refactor hw engine lookup and mmio access to be used in more places (Dominik, Matt Auld, Mika Kuoppala) - Enable priority mem read for Xe2 and later (Pallavi Mishra) - Fix PL1 disable flow in xe_hwmon_power_max_write (Karthik) - Fix refcount and speedup devcoredump (Matthew Brost) - Add performance tuning changes to Xe2 (Akshata, Shekhar) - Fix OA sysfs entry (Ashutosh) - Add first GuC firmware support for BMG (Julia) - Bump minimum GuC firmware for platforms under force_probe to match LNL and BMG (Julia) - Fix access check on user fence creation (Nirmoy) - Add/document workarounds for Xe2 (Julia, Daniele, John, Tejas) - Document workaround and use proper WA infra (Matt Roper) - Fix VF configuration on media GT (Michal Wajdeczko) - Fix VM dma-resv lock (Matthew Brost) - Allow suspend/resume exec queue backend op to be called multiple times (Matthew Brost) - Add GT stats to debugfs (Nirmoy) - Add hwconfig to debugfs (Matt Roper) - Compile out all debugfs code with ONFIG_DEUBG_FS=n (Lucas) - Remove dead kunit code (Jani Nikula) - Refactor drvdata storing to help display (Jani Nikula) - Cleanup unsused xe parameter in pte handling (Himal) - Rename s/enable_display/probe_display/ for clarity (Lucas) - Fix missing MCR annotation in couple of registers (Tejas) - Fix DGFX display suspend/resume (Maarten) - Prepare exec_queue_kill for PXP handling (Daniele) - Fix devm/drmm issues (Daniele, Matthew Brost) - Fix tile and ggtt fini sequences (Matthew Brost) - Fix crashes when probing without firmware in place (Daniele, Matthew Brost) - Use xe_managed for kernel BOs (Daniele, Matthew Brost) - Future-proof dss_per_group calculation by using hwconfig (Matt Roper) - Use reserved copy engine for user binds on faulting devices (Matthew Brost) - Allow mixing dma-fence jobs and long-running faulting jobs (Francois) - Cleanup redundant arg when creating use BO (Nirmoy) - Prevent UAF around preempt fence (Auld) - Fix display suspend/resume (Maarten) - Use vma_pages() helper (Thorsten) - Calculate pagefault queue size (Stuart, Matthew Auld) - Fix missing pagefault wq destroy (Stuart) - Fix lifetime handling of HW fence ctx (Matthew Brost) - Fix order destroy order for jobs (Matthew Brost) - Fix TLB invalidation for media GT (Matthew Brost) - Document GGTT (Rodrigo Vivi) - Refactor GGTT layering and fix runtime outer protection (Rodrigo Vivi) - Handle HPD polling on display pm runtime suspend/resume (Imre, Vinod) - Drop unrequired NULL checks (Apoorva, Himal) - Use separate rpm lockdep map for non-d3cold-capable devices (Thomas Hellström) - Support "nomodeset" kernel command-line option (Thomas Zimmermann) - Drop force_probe requirement for LNL and BMG (Lucas, Balasubramani) Signed-off-by: Dave Airlie <[email protected]> From: Lucas De Marchi <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/wd42jsh4i3q5zlrmi2cljejohdsrqc6hvtxf76lbxsp3ibrgmz@y54fa7wwxgsd
2024-08-30Merge tag 'drm-misc-next-2024-08-29' of ↵Dave Airlie59-876/+2368
https://gitlab.freedesktop.org/drm/misc/kernel into drm-next drm-misc-next for v6.12: UAPI Changes: devfs: - support device numbers up to MINORBITS limit Core Changes: ci: - increase job timeout devfs: - use XArray for minor ids displayport: - mst: GUID improvements docs: - add fixes and cleanups panic: - optionally display QR code Driver Changes: amdgpu: - faster vblank disabling - GUID improvements gm12u320 - convert to struct drm_edid host1x: - fix syncpoint IRQ during resume - use iommu_paging_domain_alloc() imx: - ipuv3: convert to struct drm_edid omapdrm: - improve error handling panel: - add support for BOE TV101WUM-LL2 plus DT bindings - novatek-nt35950: improve error handling - nv3051d: improve error handling - panel-edp: add support for BOE NE140WUM-N6G; revert support for SDC ATNA45AF01 - visionox-vtdr6130: improve error handling; use devm_regulator_bulk_get_const() renesas: - rz-du: add support for RZ/G2UL plus DT bindings sti: - convert to struct drm_edid tegra: - gr3d: improve PM domain handling - convert to struct drm_edid Signed-off-by: Dave Airlie <[email protected]> From: Thomas Zimmermann <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2024-08-29KVM: x86: Add fastpath handling of HLT VM-ExitsSean Christopherson4-3/+36
Add a fastpath for HLT VM-Exits by immediately re-entering the guest if it has a pending wake event. When virtual interrupt delivery is enabled, i.e. when KVM doesn't need to manually inject interrupts, this allows KVM to stay in the fastpath run loop when a vIRQ arrives between the guest doing CLI and STI;HLT. Without AMD's Idle HLT-intercept support, the CPU generates a HLT VM-Exit even though KVM will immediately resume the guest. Note, on bare metal, it's relatively uncommon for a modern guest kernel to actually trigger this scenario, as the window between the guest checking for a wake event and committing to HLT is quite small. But in a nested environment, the timings change significantly, e.g. rudimentary testing showed that ~50% of HLT exits where HLT-polling was successful would be serviced by this fastpath, i.e. ~50% of the time that a nested vCPU gets a wake event before KVM schedules out the vCPU, the wake event was pending even before the VM-Exit. Link: https://lore.kernel.org/all/[email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-29KVM: x86: Reorganize code in x86.c to co-locate vCPU blocking/running helpersSean Christopherson1-132/+132
Shuffle code around in x86.c so that the various helpers related to vCPU blocking/running logic are (a) located near each other and (b) ordered so that HLT emulation can use kvm_vcpu_has_events() in a future path. No functional change intended. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-29KVM: x86: Exit to userspace if fastpath triggers one on instruction skipSean Christopherson2-2/+8
Exit to userspace if a fastpath handler triggers such an exit, which can happen when skipping the instruction, e.g. due to userspace single-stepping the guest via KVM_GUESTDBG_SINGLESTEP or because of an emulation failure. Fixes: 404d5d7bff0d ("KVM: X86: Introduce more exit_fastpath_completion enum values") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-29KVM: x86: Dedup fastpath MSR post-handling logicSean Christopherson1-10/+11
Now that the WRMSR fastpath for x2APIC_ICR and TSC_DEADLINE are identical, ignoring the backend MSR handling, consolidate the common bits of skipping the instruction and setting the return value. No functional change intended. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-29KVM: x86: Re-enter guest if WRMSR(X2APIC_ICR) fastpath is successfulSean Christopherson1-1/+1
Re-enter the guest in the fastpath if WRMSR emulation for x2APIC's ICR is successful, as no additional work is needed, i.e. there is no code unique for WRMSR exits between the fastpath and the "!= EXIT_FASTPATH_NONE" check in __vmx_handle_exit(). Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-29KVM: selftests: Explicitly include committed one-off assets in .gitignoreSean Christopherson1-0/+4
Add KVM selftests' one-off assets, e.g. the Makefile, to the .gitignore so that they are explicitly included. The justification for omitting the one-offs was that including them wouldn't help prevent mistakes: Deliberately do not include the one-off assets, e.g. config, settings, .gitignore itself, etc as Git doesn't ignore files that are already in the repository. Adding the one-off assets won't prevent mistakes where developers forget to --force add files that don't match the "allowed". Turns out that's not the case, as W=1 will generate warnings, and the amazing-as-always kernel test bot reports new warnings: tools/testing/selftests/kvm/.gitignore: warning: ignored by one of the .gitignore files tools/testing/selftests/kvm/Makefile: warning: ignored by one of the .gitignore files >> tools/testing/selftests/kvm/Makefile.kvm: warning: ignored by one of the .gitignore files tools/testing/selftests/kvm/config: warning: ignored by one of the .gitignore files tools/testing/selftests/kvm/settings: warning: ignored by one of the .gitignore files Fixes: 43e96957e8b8 ("KVM: selftests: Use pattern matching in .gitignore") Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-29KVM: Clean up coalesced MMIO ring full checkSean Christopherson1-21/+8
Fold coalesced_mmio_has_room() into its sole caller, coalesced_mmio_write(), as it's really just a single line of code, has a goofy return value, and is unnecessarily brittle. E.g. if coalesced_mmio_has_room() were to check ring->last directly, or the caller failed to use READ_ONCE(), KVM would be susceptible to TOCTOU attacks from userspace. Opportunistically add a comment explaining why on earth KVM leaves one entry free, which may not be obvious to readers that aren't familiar with ring buffers. No functional change intended. Reviewed-by: Ilias Stamatis <[email protected]> Cc: Paul Durrant <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>