|
Pass a folio instead of the head page to save a few instructions. Update
the documentation, at least in English.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Sidhartha Kumar <[email protected]>
Cc: Yanteng Si <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Indirect calls are expensive, thanks to Spectre. Call free_huge_page()
directly if the folio belongs to hugetlb.
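The dispatch then looks roughly like the sketch below (simplified, not the verbatim diff):

	void destroy_large_folio(struct folio *folio)
	{
		if (folio_test_hugetlb(folio)) {
			/* direct call, no Spectre-hardened indirect branch */
			free_huge_page(&folio->page);
			return;
		}
		/* ... other compound dtors still go through the table ... */
	}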
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Yanteng Si <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Especially the "For PROT_NONE VMAs, the PTEs are not marked
_PAGE_PROTNONE" part is wrong: doing an mprotect(PROT_NONE) will end up
marking all PTEs on x86_64 as _PAGE_PROTNONE, making pte_protnone()
indicate "yes".
So let's improve the comment, so it's easier to grasp which semantics
pte_protnone() actually has.
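As a sketch of the clarified semantics (vma_is_accessible() is a real helper; the function below is illustrative only):

	/*
	 * pte_protnone() only says the PTE is not accessible through the
	 * page table. Whether that is due to NUMA hinting or an
	 * mprotect(PROT_NONE) VMA can only be decided by looking at the VMA.
	 */
	static bool pte_is_numa_hint(struct vm_area_struct *vma, pte_t pte)
	{
		return pte_protnone(pte) && vma_is_accessible(vma);
	}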
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: liubo <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Eric has pointed out that we still have 3 users of do_each_thread().
Change them to use for_each_process_thread() and kill this helper.
There is a subtle change: after do_each_thread/while_each_thread, g == t ==
&init_task, while after for_each_process_thread() they both point to
nowhere; but this doesn't matter.

> Why is for_each_process_thread() better than do_each_thread()?

Say, for_each_process_thread() is RCU-safe; do_each_thread() is not.
And certainly

	for_each_process_thread(p, t) {
		do_something(p, t);
	}

looks better than

	do_each_thread(p, t) {
		do_something(p, t);
	} while_each_thread(p, t);
And again, there are only 3 users of this awkward helper left. It should
have been killed years ago and in fact I thought it had already been
killed. It uses while_each_thread() which needs some changes.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: "Christian Brauner (Microsoft)" <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Jiri Slaby <[email protected]> # tty/serial
Signed-off-by: Andrew Morton <[email protected]>
|
|
Sparse is not happy to see a non-static variable without a declaration:

lib/vsprintf.c:61:6: warning: symbol 'no_hash_pointers' was not declared. Should it be static?

Declare the respective variable in sprintf.h. With this, add a comment to
discourage its use if there is no real need.
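The result is roughly the following (comment paraphrased):

	/* Only use this when you really have to, e.g. in crash/debug code. */
	extern bool no_hash_pointers;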
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Acked-by: Marco Elver <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Steven Rostedt (Google) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "lib/vsprintf: Rework header inclusions", v3.
Some patches that reduce the mess with the header inclusions related to the
vsprintf.c module. Each patch has its own description; they have no
dependencies on each other, except for collisions over modifications of
the same places. Hence the series.
This patch (of 2):
kernel.h has been used as a dump for all kinds of stuff for a long time.
sprintf() and friends are used in many drivers that have no need for the
full kernel.h dependency train that comes with it.
Here is an attempt at cleaning this up by splitting out sprintf() and
friends.
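A driver that only needs the string formatters can then do, as a minimal sketch (the helper function here is hypothetical):

	#include <linux/sprintf.h>	/* instead of the full <linux/kernel.h> */

	/* hypothetical example, just to show the new include in use */
	static void fill_label(char *buf, size_t len, int id)
	{
		snprintf(buf, len, "dev%d", id);
	}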
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Steven Rostedt (Google) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
There are no users of wait_on_page_locked_killable(), so remove it.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
It seems it was introduced by commit d3f77dfdc718 ("blkcg: implement
REQ_CGROUP_PUNT") unintentionally, but the definition does not exist,
remove it.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Xiu Jianfeng <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Stefan Roesch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This sysctl has the very unusual behaviour of not allowing any user (even
CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were
to set this sysctl to a more restrictive option in the host pidns you
would need to reboot your machine in order to reset it.
The justification given in [1] is that this is a security feature and thus
it should not be possible to disable. Aside from the fact that we have
plenty of security-related sysctls that can be disabled after being
enabled (fs.protected_symlinks for instance), the protection provided by
the sysctl is to stop users from being able to create a binary and then
execute it. A user with CAP_SYS_ADMIN can trivially do this without
memfd_create(2):
% cat mount-memfd.c
#define _GNU_SOURCE /* for fsopen(), fsconfig(), fsmount(), asprintf() */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mount.h> /* the fsopen() family needs glibc >= 2.36 */

#define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"

int main(void)
{
	int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
	assert(fsfd >= 0);
	assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));

	int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
	assert(dfd >= 0);

	int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0755);
	assert(execfd >= 0);
	assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
	assert(!close(execfd));

	char *execpath = NULL;
	char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
	execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
	assert(execfd >= 0);
	assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
	assert(!execve(execpath, argv, envp));
}
% ./mount-memfd
this file was executed from this totally private tmpfs: /proc/self/fd/5
%
Given that it is possible for CAP_SYS_ADMIN users to create executable
binaries without memfd_create(2) and without touching the host filesystem
(not to mention the many other things a CAP_SYS_ADMIN process would be
able to do that would be equivalent or worse), it seems strange to cause a
fair amount of headache to admins when there doesn't appear to be an
actual security benefit to blocking this. There appear to be concerns
about confused-deputy-esque attacks[2] but a confused deputy that can
write to arbitrary sysctls is a bigger security issue than executable
memfds.
/* New API */
The primary requirement from the original author appears to be more based
on the need to be able to restrict an entire system in a hierarchical
manner[3], such that child namespaces cannot re-enable executable memfds.
So, implement that behaviour explicitly -- the vm.memfd_noexec scope is
evaluated up the pidns tree to &init_pid_ns, and the most restrictive
value applies to you. The new lower limit you can set vm.memfd_noexec to
is whatever limit applies to your parent.
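A minimal sketch of that evaluation (the helper follows the patch; the body is simplified):

	static int pidns_memfd_noexec_scope(struct pid_namespace *ns)
	{
		int scope = MEMFD_NOEXEC_SCOPE_EXEC;

		/* the most restrictive scope up the pidns tree wins */
		for (; ns; ns = ns->parent)
			scope = max(scope, READ_ONCE(ns->memfd_noexec_scope));

		return scope;
	}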
Note that a pidns will inherit a copy of the parent pidns's effective
vm.memfd_noexec setting at unshare() time. This matches the existing
behaviour, and it also ensures that a pidns will never have its
vm.memfd_noexec setting *lowered* behind its back (but it will be raised
if the parent raises theirs).
/* Backwards Compatibility */
As the previous version of the sysctl didn't allow you to lower the
setting at all, there are no backwards compatibility issues with this
aspect of the change.
However, it should be noted that the setting is now completely
hierarchical. Previously, a cloned pidns would just copy the current
pidns setting, meaning that if the parent's vm.memfd_noexec was changed it
wouldn't propagate to existing pid namespaces. Now, the restriction
applies recursively. This is a uAPI change, however:
* The sysctl is very new, having been merged in 6.3.
* Several aspects of the sysctl were broken up until this patchset and
the other patchset by Jeff Xu last month.
And thus it seems incredibly unlikely that any real users would run into
this issue. In the worst case, if this causes userspace issues, we could
make it so that modifying the setting follows the hierarchical rules but
the restriction checking uses the cached copy.
[1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/
[2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg@mail.gmail.com/
[3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw@mail.gmail.com/
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <[email protected]>
Cc: Dominique Martinet <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Daniel Verkamp <[email protected]>
Cc: Jeff Xu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Given the difficulty of auditing all of userspace to figure out whether
every memfd_create() user has switched to passing the MFD_EXEC and
MFD_NOEXEC_SEAL flags, it seems far less disruptive to make it possible
for older programs that don't make use of executable memfds to run under
vm.memfd_noexec=2. Otherwise, a small dependency change can result in
spurious errors. For programs that don't use executable memfds, passing
MFD_NOEXEC_SEAL is functionally a no-op.
In addition, every failure under vm.memfd_noexec=2 needs to print to the
kernel log so that userspace can figure out where the error came from.
The concerns about pr_warn_ratelimited() spam that caused the switch to
pr_warn_once()[1,2] do not apply to the vm.memfd_noexec=2 case.
This is a user-visible API change, but as it allows programs to do
something that would be blocked before, and the sysctl itself was broken
and recently released, it seems unlikely this will cause any issues.
[1]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/
[2]: https://lore.kernel.org/202212161233.85C9783FB@keescook/
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <[email protected]>
Cc: Dominique Martinet <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Daniel Verkamp <[email protected]>
Cc: Jeff Xu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
These functions are no longer necessary. Remove them and cleanup
Documentation referencing them.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The page table members are now split out into their own ptdesc struct.
Remove them from struct page.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Create pagetable_pte_ctor(), pagetable_pmd_ctor(), pagetable_pte_dtor(),
and pagetable_pmd_dtor() and make the original pgtable
constructor/destructors wrappers.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Converts internal pmd_pgtable_page() callers to use pmd_ptdesc(). This
removes some direct accesses to struct page, working towards splitting out
struct ptdesc from struct page.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Introduce utility functions setting the foundation for ptdescs. These
will also assist in the splitting out of ptdesc from struct page.
Functions that focus on the descriptor are prefixed with ptdesc_* while
functions that focus on the pagetable are prefixed with pagetable_*.
pagetable_alloc() is defined to allocate new ptdesc pages as compound
pages. This standardizes ptdescs by allowing for one allocation and
one free function, in contrast to the current two allocation and two free
functions.
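A sketch of the allocator (matching the shape described above; details may differ):

	static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
	{
		struct page *page = alloc_pages(gfp | __GFP_COMP, order);

		return page_ptdesc(page);
	}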
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Currently, page table information is stored within struct page. As part
of simplifying struct page, create struct ptdesc for page table
information.
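An abridged sketch of the new descriptor (its fields overlay struct page; the exact unions vary by version):

	struct ptdesc {
		unsigned long __page_flags;	/* overlays page->flags */
		union {
			struct rcu_head pt_rcu_head;
			struct list_head pt_list;
			pgtable_t pmd_huge_pte;
		};
		struct mm_struct *pt_mm;
	#if ALLOC_SPLIT_PTLOCKS
		spinlock_t *ptl;
	#else
		spinlock_t ptl;
	#endif
		atomic_t _refcount;	/* overlays page->_refcount */
	};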
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Split ptdesc from struct page", v9.
The MM subsystem is trying to shrink struct page. This patchset
introduces a memory descriptor for page table tracking - struct ptdesc.
This patchset introduces ptdesc, splits ptdesc from struct page, and
converts many callers of page table constructor/destructors to use
ptdescs.
Ptdesc is a foundation to further standardize page tables, and eventually
allow for dynamic allocation of page tables independent of struct page.
However, the use of pages for page table tracking is quite deeply
ingrained and varied across architectures, so there is still a lot of
work to be done before that can happen.
This patch (of 31):
No folio equivalents for page type operations have been defined, so define
them for later folio conversions.
This also changes the Page##uname macros to take a const struct page *,
since we only read the memory here.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
With memmap on memory, some architectures need more details w.r.t. the
altmap, such as base_pfn, end_pfn, etc., to unmap vmemmap memory. Instead
of computing them again when we remove a memory block, embed the
vmem_altmap details in struct memory_block if we are using the
memmap-on-memory feature.
[[email protected]: fix error return code in add_memory_resource()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Yang Yingliang <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
If not supported, fall back to not using memmap on memory. This avoids
the need for callers to do the fallback.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
It is better not to expose too many internal variables of memtest;
add a helper, memtest_report_meminfo(), to show the memtest results.
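The interface could look roughly like this (a sketch; the stub keeps callers unconditional):

	#ifdef CONFIG_MEMTEST
	void memtest_report_meminfo(struct seq_file *m);
	#else
	static inline void memtest_report_meminfo(struct seq_file *m)
	{
	}
	#endif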
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Tomas Mudrunka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Otherwise the kernel ends up with multiple copies:
$ nm vmlinux | grep dummy_vm_ops
ffffffff81e4ea00 d dummy_vm_ops.2
ffffffff81e11760 d dummy_vm_ops.254
ffffffff81e406e0 d dummy_vm_ops.4
ffffffff81e3c780 d dummy_vm_ops.7
While here, prefix it with vma_.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mateusz Guzik <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Implicit vma locking inside vm_flags_reset() and vm_flags_reset_once() is
not obvious and makes it hard to understand where vma locking is happening.
Also, in some cases (like in dup_userfaultfd()) the vma should be locked
earlier than the vm_flags modification. To make locking more visible,
change these functions to assert that the vma write lock is taken and
explicitly lock the vma beforehand. Fix the userfaultfd functions, which
should lock the vma earlier.
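A simplified sketch of the resulting pattern:

	static inline void vm_flags_reset(struct vm_area_struct *vma,
					  vm_flags_t flags)
	{
		/* callers must write-lock the vma themselves now */
		vma_assert_write_locked(vma);
		vm_flags_init(vma, flags);
	}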
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When CONFIG_PER_VMA_LOCK=n, vma_assert_write_locked() should be equivalent
to mmap_assert_write_locked().
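That is, roughly (a sketch of the fallback):

	#else /* !CONFIG_PER_VMA_LOCK */
	static inline void vma_assert_write_locked(struct vm_area_struct *vma)
	{
		mmap_assert_write_locked(vma->vm_mm);
	}
	#endif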
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Jann Horn <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
One DAMON context can have multiple monitoring targets, and DAMOS schemes
are applied to all targets. In some cases, users need to apply different
schemes to different targets. Retrieving monitoring results via the DAMON
sysfs interface's 'tried_regions' directory could be one good example.
Also, there could be cases where the cgroup DAMOS filter is not enough.
All such use cases can be worked around by having multiple DAMON contexts
with only a single target each, but that is inefficient in terms of
resource usage, though the overhead is not estimated to be huge.
Implement a DAMON monitoring target based DAMOS filter for this case.
Like the address range DAMOS filter, handle these filters in the DAMON
core layer, since it is more efficient than doing so in the operations set
layer. This also means that regions that are filtered out by monitoring
target type DAMOS filters are counted as not tried by the scheme. Hence,
target granularity monitoring results retrieval via the DAMON sysfs
interface becomes available.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Brendan Higgins <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Extend DAMOS filters for address ranges and DAMON monitoring
targets"
There are use cases that need to apply DAMOS schemes to specific address
ranges or DAMON monitoring targets. NUMA nodes in the physical address
space, special memory objects in the virtual address space, and monitoring
target specific efficient monitoring results snapshot retrieval could be
examples of such use cases. This patchset extends the DAMOS filters
feature for such cases by implementing two more filter types, namely
address ranges and DAMON monitoring targets.
Patches sequence
----------------
The first seven patches are for the address ranges based DAMOS filter.
The first patch implements the filter feature and exposes it via the
DAMON kernel API. The second patch further exposes the feature to users
via the DAMON sysfs interface. The third and fourth patches implement
unit tests and selftests for the feature. Three patches (fifth to
seventh) updating the documents follow.
The following six patches are for the DAMON monitoring target based DAMOS
filter. The eighth patch implements the feature in the core layer and
exposes it via DAMON's kernel API. The ninth patch further exposes it to
users via the DAMON sysfs interface. The tenth patch adds a selftest, and
two patches (eleventh and twelfth) update the documents.
[1] https://lore.kernel.org/damon/[email protected]/
This patch (of 13):
Users can know the special characteristics of specific address ranges.
NUMA nodes, or special objects or buffers in the virtual address space,
could be such examples. For such cases, DAMOS schemes could be required
to be applied to only specific address ranges. Implement yet another type
of DAMOS filter for the purpose.
Note that the existing filter types, namely the anon pages and memcg DAMOS
filters, need page level type checks. Because such checks can be done
efficiently in the operations set layer, those filters are handled in the
operations set layer. Specifically, only the paddr operations set
implementation supports these filters. Also, because statistics counting
is done in the DAMON core layer, the regions that are filtered out by
these filters are counted as tried but failed in the statistics.
Unlike those, address range based filters can be efficiently handled in
the core layer. Hence, do the handling in that layer, and count the
regions that are filtered out by them as not tried by the scheme. This
difference should be clearly documented.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Brendan Higgins <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Since commit bef8620cd8e0 ("mm: memcg: deprecate the non-hierarchical
mode"), use_hierarchy is already deprecated. And it's further removed via
commit 9d9d341df4d5 ("cgroup: remove obsoleted broken_hierarchy and
warned_broken_hierarchy"). Update corresponding comment.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm: convert to vma_is_initial_heap/stack()", v3.
Add vma_is_initial_stack() and vma_is_initial_heap() helpers and use them
to simplify code.
This patch (of 4):
Factor out VMA stack and heap checks and name them vma_is_initial_stack()
and vma_is_initial_heap() for general use.
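A sketch of the two helpers (matching the checks being factored out; details may differ):

	static inline bool vma_is_initial_heap(const struct vm_area_struct *vma)
	{
		return vma->vm_start <= vma->vm_mm->brk &&
		       vma->vm_end >= vma->vm_mm->start_brk;
	}

	static inline bool vma_is_initial_stack(const struct vm_area_struct *vma)
	{
		/* only the initial process stack, not per-thread stacks */
		return vma->vm_start <= vma->vm_mm->start_stack &&
		       vma->vm_end >= vma->vm_mm->start_stack;
	}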
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Christian Göttsche <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Christian Göttsche <[email protected]>
Cc: Christian König <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Eric Paris <[email protected]>
Cc: Felix Kuehling <[email protected]>
Cc: "Pan, Xinhui" <[email protected]>
Cc: Paul Moore <[email protected]>
Cc: Stephen Smalley <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
page_ext_operations should only be defined when CONFIG_PAGE_EXTENSION is
enabled.
Besides, this may detect, at compile time, future Page Extension clients
that are missing a dependency on CONFIG_PAGE_EXTENSION.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kemeng Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Commit 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from
protection checks") changed the function name but not the corresponding
comment.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "add page_ext_data to get client data in page_ext".
Current clients get data from page_ext by adding offset which is auto
generated in page_ext core and exposes the data layout design inside
page_ext core. This series adds a page_ext_data() to hide this from
clients.
Benefits include:
1. Future clients can call page_ext_data directly instead of defining
a new function like get_page_owner to get the data.
2. There is no change to clients if the layout of page_ext data changes.
This patch (of 3):
Add a common page_ext_data() function to get client data. This hides the
offset, which is auto-generated in the page_ext core, and avoids exposing
the design of the page_ext data layout.
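A sketch of the helper (following the description above):

	static inline void *page_ext_data(struct page_ext *page_ext,
					  struct page_ext_operations *ops)
	{
		return (void *)(page_ext) + ops->offset;
	}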
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kemeng Shi <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Only convert a few easy parts of this function to use the folio passed in;
convert back to struct page for the majority of it. Removes three hidden
calls to compound_head().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
As the one caller now has a folio, pass it in and use it. Removes three
calls to compound_head().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Followup folio conversions for zswap".
With frontswap killed, it's worth converting the zswap_load() and
zswap_store() functions to take a folio instead of a page pointer. They
aren't converted to support large folios, but there are a lot of
unnecessary calls to compound_head() that are removed by these patches.
This patch (of 4):
Only convert a few easy parts of this function to use the folio passed in;
convert back to struct page for the majority of it. This does remove a
few hidden calls to compound_head().
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The only user of frontswap is zswap, and has been for a long time. Have
swap call into zswap directly and remove the indirection.
[[email protected]: remove obsolete comment, per Yosry]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: don't warn if none swapcache folio is passed to zswap_load]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Yin Fengwei <[email protected]>
Acked-by: Konrad Rzeszutek Wilk <[email protected]>
Acked-by: Nhat Pham <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Acked-by: Christoph Hellwig <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
walk_page_range() and friends often operate under a write-locked mmap_lock.
With the introduction of vma locks, the vmas have to be locked as well
during such walks to prevent concurrent page faults in these areas. Add an
additional member to mm_walk_ops to indicate the locking requirements for
the walk.
The change ensures that page walks which prevent concurrent page faults
by write-locking mmap_lock operate correctly after the introduction of
per-vma locks. With per-vma locks, page faults can be handled under the
vma lock without taking mmap_lock at all, so write-locking mmap_lock would
not stop them. The change ensures vmas are properly locked during such
walks.
A sample issue this solves is do_mbind() performing queue_pages_range()
to queue pages for migration. Without this change, a page can be
concurrently faulted into the area and left out of the migration.
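A sketch of the new member and its options (names follow the patch; comments abridged):

	enum page_walk_lock {
		/* mmap_lock should be locked for read to stabilize the vma tree */
		PGWALK_RDLOCK = 0,
		/* vma will be write-locked during the walk */
		PGWALK_WRLOCK = 1,
		/* vma is expected to be already write-locked during the walk */
		PGWALK_WRLOCK_VERIFY = 2,
	};

	struct mm_walk_ops {
		/* ... pmd_entry(), pte_entry(), and friends ... */
		enum page_walk_lock walk_lock;
	};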
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Suggested-by: Jann Horn <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Laurent Dufour <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
We shouldn't be using a GUP-internal helper if it can be avoided.
Similar to smaps_pte_entry() that uses vm_normal_page(), let's use
vm_normal_page_pmd() that similarly refuses to return the huge zeropage.
In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd():
(1) Will always return the head page, not a tail page of a THP.
If we'd ever call smaps_account with a tail page while setting "compound
= true", we could be in trouble, because smaps_account() would look at
the memmap of unrelated pages.
If we're unlucky, that memmap does not exist at all. Before we removed
PG_doublemap, we could have triggered something similar as in
commit 24d7275ce279 ("fs/proc: task_mmu.c: don't read mapcount for
migration entry").
This can theoretically happen ever since commit ff9f47f6f00c ("mm: proc:
smaps_rollup: do not stall write attempts on mmap_lock"):
(a) We're in show_smaps_rollup() and processed a VMA
(b) We release the mmap lock in show_smaps_rollup() because it is
contended
(c) We merged that VMA with another VMA
(d) We collapsed a THP in that merged VMA at that position
If the end address of the original VMA falls into the middle of a THP
area, we would call smap_gather_stats() with a start address that falls
into a PMD-mapped THP. It's probably very rare to trigger when not
really forced.
(2) Will succeed on an is_pci_p2pdma_page(), like vm_normal_page()
Treat such PMDs here just like smaps_pte_entry() would treat such PTEs.
If such pages would be anonymous, we most certainly would want to
account them.
(3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap()
As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE
pages. So just like smaps_pte_entry(), we'll now also ignore such PMD
entries.
Especially, follow_pmd_mask() never ends up calling
follow_trans_huge_pmd() on pmd_devmap(). Instead it calls
follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN
is set.
So skipping pmd_devmap() pages seems to be the right thing to do.
(4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page()
We won't be returning a memmap that should be ignored by core-mm, or
worse, a memmap that does not even exist. Note that while
walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't.
Most probably this case doesn't currently really happen on the PMD level,
otherwise we'd already be able to trigger kernel crashes when reading
smaps / smaps_rollup.
So most probably only (1) is relevant in practice as of now, and even then
it could only cause trouble in extreme corner cases.
Let's move follow_trans_huge_pmd() to mm/internal.h to discourage future
reuse in wrong context.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: liubo <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Unfortunately commit 474098edac26 ("mm/gup: replace FOLL_NUMA by
gup_can_follow_protnone()") missed that follow_page() and
follow_trans_huge_pmd() never implicitly set FOLL_NUMA because they really
don't want to fail on PROT_NONE-mapped pages -- either due to NUMA hinting
or due to inaccessible (PROT_NONE) VMAs.
As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
page faults from gup/gup_fast"): "Other follow_page callers like KSM
should not use FOLL_NUMA, or they would fail to get the pages if they use
follow_page instead of get_user_pages."
liubo reported [1] that smaps_rollup results are imprecise, because they
miss accounting of pages that are mapped PROT_NONE. Further, it's easy to
reproduce that KSM no longer works on inaccessible VMAs on x86-64, because
pte_protnone()/pmd_protnone() also indicate "true" in inaccessible VMAs,
and follow_page() refuses to return such pages right now.
As KVM really depends on these NUMA hinting faults, removing the
pte_protnone()/pmd_protnone() handling in GUP code completely is not
really an option.
To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
to restore the original behavior for now and add better comments.
Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE in
is_valid_gup_args(), to add that flag for all external GUP users.
Note that there are three GUP-internal __get_user_pages() users that don't
end up calling is_valid_gup_args() and consequently won't get
FOLL_HONOR_NUMA_FAULT set.
1) get_dump_page(): we really don't want to handle NUMA hinting
faults. It specifies FOLL_FORCE and wouldn't have honored NUMA
hinting faults already.
2) populate_vma_page_range(): we really don't want to handle NUMA hinting
faults. It specifies FOLL_FORCE on accessible VMAs, so it wouldn't have
honored NUMA hinting faults already.
3) faultin_vma_page_range(): we similarly don't want to handle NUMA
hinting faults.
To make the combination of FOLL_FORCE and FOLL_HONOR_NUMA_FAULT work in
inaccessible VMAs properly, we have to perform VMA accessibility checks in
gup_can_follow_protnone().
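A sketch of the resulting check (following the patch; comments abridged):

	static inline bool
	gup_can_follow_protnone(struct vm_area_struct *vma, unsigned int flags)
	{
		/* callers not honoring NUMA faults may follow protnone PTEs */
		if (!(flags & FOLL_HONOR_NUMA_FAULT))
			return true;

		/*
		 * NUMA hinting faults only apply in accessible VMAs; in
		 * inaccessible (PROT_NONE) VMAs, protnone is not a NUMA hint.
		 */
		return !vma_is_accessible(vma);
	}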
As GUP-fast should reject such pages either way in
pte_access_permitted()/pmd_access_permitted() -- for example on x86-64 and
arm64 that both implement pte_protnone() -- let's just always fallback to
ordinary GUP when stumbling over pte_protnone()/pmd_protnone().
As Linus notes [2], honoring NUMA faults might only make sense for
selected GUP users.
So we should really see if we can instead let relevant GUP callers specify
it manually, and not trigger NUMA hinting faults from GUP as default.
Prepare for that by making FOLL_HONOR_NUMA_FAULT an external GUP flag and
adding appropriate documentation.
While at it, remove a stale comment from follow_trans_huge_pmd(): That
comment for pmd_protnone() was added in commit 2b4847e73004 ("mm: numa:
serialise parallel get_user_page against THP migration"), which noted:
THP does not unmap pages due to a lack of support for migration
entries at a PMD level. This allows races with get_user_pages
Nowadays, we do have PMD migration entries, so the comment no longer
applies. Let's drop it.
[1] https://lore.kernel.org/r/[email protected]
[2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
Signed-off-by: David Hildenbrand <[email protected]>
Reported-by: liubo <[email protected]>
Closes: https://lore.kernel.org/r/[email protected]
Reported-by: Peter Xu <[email protected]>
Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/
Acked-by: Mel Gorman <[email protected]>
Acked-by: Peter Xu <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
atomic_t variables are currently used to implement reference counters
with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows and
underflows. This is important since overflows and underflows can lead
to use-after-free situation and be exploitable.
The variable nsproxy.count is used as pure reference counter. Convert it
to refcount_t and fix up the operations.
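As an illustrative excerpt (not the full patch):

	struct nsproxy {
		refcount_t count;	/* was: atomic_t count */
		/* ... namespace pointers ... */
	};

	static inline void get_nsproxy(struct nsproxy *ns)
	{
		refcount_inc(&ns->count);	/* was: atomic_inc() */
	}

	static inline void put_nsproxy(struct nsproxy *ns)
	{
		/* was: atomic_dec_and_test() */
		if (refcount_dec_and_test(&ns->count))
			free_nsproxy(ns);
	}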
**Important note for maintainers:
Some functions from refcount_t API defined in refcount.h have different
memory ordering guarantees than their atomic counterparts. Please check
Documentation/core-api/refcount-vs-atomic.rst for more information.
Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in some
rare cases it might matter. Please double check that you don't have
some undocumented memory guarantees for this variable usage.
For the nsproxy.count it might make a difference in following places:
- put_nsproxy() and switch_task_namespaces(): decrement in
refcount_dec_and_test() only provides RELEASE ordering and ACQUIRE
ordering on success vs. fully ordered atomic counterpart
Suggested-by: Kees Cook <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
|
|
Other parts of the kernel may use LPS0 constraints information to make
decisions on what power state to put a device into.
Signed-off-by: Mario Limonciello <[email protected]>
[ rjw: Rewrite kerneldoc, rearrange if () statement, edit subject and
changelog ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
The `#ifdef` block for acpi_register_lps0_dev() is currently guarded by
`CONFIG_X86`, but the functions contained in the block are specifically
sleep-related functions.
Adjust the guard to also check for `CONFIG_SUSPEND`.
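That is, roughly (prototypes shown for illustration):

	#if defined(CONFIG_SUSPEND) && defined(CONFIG_X86)
	int acpi_register_lps0_dev(struct acpi_s2idle_dev_ops *arg);
	void acpi_unregister_lps0_dev(struct acpi_s2idle_dev_ops *arg);
	#endif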
Signed-off-by: Mario Limonciello <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
Recent rework moved block device closing out of sb->put_super() and into
sb->kill_sb() to avoid deadlocks as s_umount is held in put_super() and
blkdev_put() can end up taking s_umount again.
That means we need to move the removal of the superblock from @fs_supers
out of generic_shutdown_super() and into deactivate_locked_super() to
ensure that concurrent mounters don't fail to open block devices that
are still in use because blkdev_put() in sb->kill_sb() hasn't been
called yet.
We can now do this as we can make iterators through @fs_supers and
@super_blocks wait without holding s_umount. Concurrent mounts will wait
until a dying superblock is fully dead, i.e. until sb->kill_sb() has been
called and SB_DEAD has been set. Concurrent iterators can already discard
any SB_DYING superblock.
Reviewed-by: Jan Kara <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
Recent patches experiment with making it possible to allocate a new
superblock before opening the relevant block device. Naturally this has
intricate side-effects that we get to learn about while developing this.
Superblock allocators such as sget{_fc}() return with s_umount of the
new superblock held and lock ordering currently requires that block
level locks such as bdev_lock and open_mutex rank above s_umount.
Before aca740cecbe5 ("fs: open block device after superblock creation")
ordering was guaranteed to be correct as block devices were opened prior
to superblock allocation and thus s_umount wasn't held. But now s_umount
must be dropped before opening block devices to avoid locking
violations.
This has consequences. The main one is that iterators over
@super_blocks and @fs_supers that grab a temporary reference to the
superblock can now also grab s_umount before the caller has managed to
open block devices and call fill_super(). So whereas before, such
iterators or concurrent mounts would have simply slept on s_umount until
SB_BORN was set or the superblock was discarded due to initialization
failure, they can now needlessly spin through sget{_fc}().
If the caller is sleeping on bdev_lock or open_mutex, one caller waiting
on SB_BORN will always spin somewhere, and potentially this can go on for
quite a while.
It should be possible to drop s_umount while allowing iterators to wait
on a nascent superblock to either be born or discarded. This patch
implements a wait_var_event() mechanism allowing iterators to sleep
until they are woken when the superblock is born or discarded.
This also allows us to avoid relooping through @fs_supers and
@super_blocks if a superblock isn't yet born or dying.
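A sketch of the wait (helper name hypothetical; the patch builds on wait_var_event() in a similar way):

	/* sleep until the superblock is either born or dying */
	static void wait_for_sb_outcome(struct super_block *sb)
	{
		wait_var_event(&sb->s_flags,
			       smp_load_acquire(&sb->s_flags) &
			       (SB_BORN | SB_DYING));
	}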
Link: aca740cecbe5 ("fs: open block device after superblock creation")
Reviewed-by: Jan Kara <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
The current code that marshalls the EFI runtime call arguments to hand
them off to an async helper does so in a type-unsafe and slightly messy
manner - everything is cast to void * except for some integral types that
are passed by reference and dereferenced on the receiver end.
Let's clean this up a bit, and record the arguments of each runtime
service invocation exactly as they are issued, in a manner that permits
the compiler to check the types of the arguments at both ends.
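An illustrative sketch of the typed recording, with one member per runtime service (the GET_VARIABLE member mirrors that service's prototype):

	union efi_rts_args {
		struct {
			efi_char16_t	*name;
			efi_guid_t	*vendor;
			u32		*attr;
			unsigned long	*data_size;
			void		*data;
		} GET_VARIABLE;
		/* ... one struct per runtime service ... */
	};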
Signed-off-by: Ard Biesheuvel <[email protected]>
|