|
Patch series "page table check", v3.
Ensure that some memory corruptions are prevented by checking at the
time of insertion of entries into user page tables that there is no
illegal sharing.
We have recently found a problem [1] that has existed in the kernel since 4.14.
The problem was caused by a broken page ref count and led to memory
leaking from one process into another. The problem was accidentally
detected by studying a dump of one process and noticing that one page
contained memory that should not belong to this process.
There are some other page->_refcount related problems that were recently
fixed ([2], [3]) which could potentially also lead to illegal sharing.
In addition to hardening refcount [4] itself, this work is an attempt to
prevent this class of memory corruption issues.
It uses a simple state machine that is independent of the regular MM
logic to check for illegal sharing at the time pages are inserted into
and removed from page tables.
[1] https://lore.kernel.org/all/[email protected]
[2] https://lore.kernel.org/all/[email protected]
[3] https://lore.kernel.org/all/[email protected]
[4] https://lore.kernel.org/all/[email protected]
This patch (of 4):
There are a few places where we first update the entry in the user page
table, and only later change the struct page to indicate that this is an
anonymous or file page.
In most places, however, we first configure the page metadata and then
insert entries into the page table. Page table check will use the
information from struct page to verify the type of entry being inserted.
Change the order in all places to first update struct page and only then
update the page table.
This means that we first do calls that may change the type of page (anon
or file):
page_move_anon_rmap
page_add_anon_rmap
do_page_add_anon_rmap
page_add_new_anon_rmap
page_add_file_rmap
hugepage_add_anon_rmap
hugepage_add_new_anon_rmap
And after that do calls that add entries to the page table:
set_huge_pte_at
set_pte_at
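For illustration only, a minimal sketch of the corrected ordering inside a
fault-handling path (the helpers are the kernel APIs listed above, but the
surrounding code and exact arguments are simplified assumptions, not the
literal patch):

	/* 1) First update struct page: mark the page as anonymous. */
	page_add_new_anon_rmap(page, vma, addr, false);

	/*
	 * 2) Only then publish the entry in the user page table, so that
	 *    page table check sees a struct page whose type (anon vs file)
	 *    is already consistent with the entry being inserted.
	 */
	set_pte_at(vma->vm_mm, addr, ptep, entry);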
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pasha Tatashin <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a new document to explain Virtually Mapped Kernel Stack Support.
This is a compilation of information from the code and original patch
series that introduced the Virtually Mapped Kernel Stacks feature.
This document summarizes the feature and provides details on allocation,
freeing, and stack overflow handling. It also provides references to the
available tests.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shuah Khan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
With exit_mmap holding mmap_write_lock during free_pgtables call,
process_mrelease does not need to elevate mm->mm_users in order to
prevent exit_mmap from destroying pagetables while __oom_reap_task_mm is
walking the VMA tree. The change prevents process_mrelease from calling
the last mmput, which can lead to waiting for IO completion in exit_aio.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: Jan Engelhardt <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tim Murray <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add comments for vm_operations_struct::close documenting locking
requirements for this callback and its callers.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: Jan Engelhardt <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tim Murray <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
oom-reaper and process_mrelease system call should protect against races
with exit_mmap which can destroy page tables while they walk the VMA
tree. oom-reaper protects from that race by setting MMF_OOM_VICTIM and
by relying on exit_mmap to set MMF_OOM_SKIP before taking and releasing
mmap_write_lock. process_mrelease has to elevate mm->mm_users to
prevent such race.
Both oom-reaper and process_mrelease hold mmap_read_lock when walking
the VMA tree. The locking rules and mechanisms could be simpler if
exit_mmap takes mmap_write_lock while executing destructive operations
such as free_pgtables.
Change exit_mmap to hold the mmap_write_lock when calling unlock_range,
free_pgtables and remove_vma. Note also that because the oom-reaper checks
the VM_LOCKED flag, unlock_range() should not be allowed to race with it.
Before this patch, remove_vma used to be called with no locks held;
however, with fput being executed asynchronously and vm_ops->close not
being allowed to hold mmap_lock (it is called from __split_vma with
mmap_sem held for write), changing that should be fine.
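A hedged sketch of the reordering described above (not the literal diff;
the helper names mirror those mentioned in this message and the details
differ between kernel versions):

	/* exit_mmap(): destructive tail now runs under the write lock */
	mmap_write_lock(mm);
	unlock_range(mm->mmap, ULONG_MAX);	/* munlock VM_LOCKED vmas */
	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	/* ... remove_vma() loop over the vma list ... */
	mmap_write_unlock(mm);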
In most cases this lock should be uncontended. Previously, Kirill
reported ~4% regression caused by a similar change [1]. We reran the
same test and although the individual results are quite noisy, the
percentiles show lower regression with 1.6% being the worst case [2].
The change allows oom-reaper and process_mrelease to execute safely
under mmap_read_lock without worries that exit_mmap might destroy page
tables from under them.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: Jan Engelhardt <[email protected]>
Cc: Tim Murray <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
linux/mm_types.h should only contain structure definitions, to make it
cheap to include elsewhere. The atomic_t helper function definitions are
particularly large, so it's better to move the helpers that use them into
the existing linux/mm_inline.h and only include that where needed.
As a follow-up, we may want to go through all the indirect includes in
mm_types.h and reduce them as much as possible.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Eric Biederman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The patch to add anonymous vma names causes a build failure in some
configurations:
include/linux/mm_types.h: In function 'is_same_vma_anon_name':
include/linux/mm_types.h:924:37: error: implicit declaration of function 'strcmp' [-Werror=implicit-function-declaration]
924 | return name && vma_name && !strcmp(name, vma_name);
| ^~~~~~
include/linux/mm_types.h:22:1: note: 'strcmp' is defined in header '<string.h>'; did you forget to '#include <string.h>'?
This should not really be part of linux/mm_types.h in the first place,
as that header is meant to only contain structure definitions and needs a
minimal set of indirect includes itself.
While the header clearly includes more than it should at this point,
let's not make it worse by including string.h as well, which would pull
in the expensive (compile-speed wise) fortify-string logic.
Move the new functions into a separate header that only needs to be
included in a couple of locations.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: "mm: add a field to store names for private anonymous memory"
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
While forking a process with a high number (64K) of named anonymous vmas,
the overhead caused by strdup() is noticeable. Experiments with an ARM64
Android device show up to a 40% performance regression when forking a
process with 64k unpopulated anonymous vmas using the maximum name length
vs the same process with the same number of anonymous vmas having no
name.
Introduce anon_vma_name refcounted structure to avoid the overhead of
copying vma names during fork() and when splitting named anonymous vmas.
When a vma is duplicated, instead of copying the name we increment the
refcount of this structure. Multiple vmas can point to the same
anon_vma_name as long as they increment the refcount. The name member
of anon_vma_name structure is assigned at structure allocation time and
is never changed. If vma name changes then the refcount of the original
structure is dropped, a new anon_vma_name structure is allocated to hold
the new name and the vma pointer is updated to point to the new
structure.
With this approach the fork() performance regression is reduced 3-4x, and
with use cases using a more reasonable number of VMAs (a few thousand)
the regression is not measurable.
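A hedged sketch of the refcounted object described above (the exact field
layout and helper names in the merged code may differ slightly):

	struct anon_vma_name {
		struct kref kref;
		/*
		 * The name is allocated together with the structure and is
		 * never changed; renaming a vma allocates a new object and
		 * drops a reference on the old one.
		 */
		char name[];
	};

Duplicating a vma then only takes a kref_get() on the shared object
instead of a strdup() of the name.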
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jan Glauber <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Cc: Shaohua Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In many userspace applications, and especially in VM based applications
like Android uses heavily, there are multiple different allocators in
use. At a minimum there is libc malloc and the stack, and in many cases
there are libc malloc, the stack, direct syscalls to mmap anonymous
memory, and multiple VM heaps (one for small objects, one for big
objects, etc.). Each of these layers usually has its own tools to
inspect its usage; malloc by compiling a debug version, the VM through
heap inspection tools, and for direct syscalls there is usually no way
to track them.
On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages
mapped in userspace and slice their usage by process, shared (COW) vs.
unique mappings, backing, etc. This can account for real physical
memory usage even in cases like fork without exec (which Android uses
heavily to share as many private COW pages as possible between
processes), Kernel SamePage Merging, and clean zero pages. It produces
a measurement of the pages that only exist in that process (USS, for
unique), and a measurement of the physical memory usage of that process
with the cost of shared pages being evenly split between processes that
share them (PSS).
If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap
walking tool that can understand the heap debugging of every layer, or
for every layer's heap debugging tools to implement the pagemap walking
logic, in which case it is hard to get a consistent view of memory
across the whole system.
Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that somebody
needs to clean up on crashes. It needs to be readable while the process
is still running, so it has to have some sort of synchronization with
every layer of userspace. Efficiently tracking the ranges requires
reimplementing something like the kernel vma trees, and linking to it
from every layer of userspace. It requires more memory, more syscalls,
more runtime cost, and more complexity to separately track regions that
the kernel is already tracking.
This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas. The names of named
anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
[anon:<name>].
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
Setting the name to NULL clears it. The name length limit is 80 bytes
including the NUL terminator, and the name is checked to contain only
printable ASCII characters (including space), except '[', ']', '\', '$'
and '`'.
ASCII strings are used to give descriptive identifiers to vmas, which can
be understood by users reading /proc/pid/maps or /proc/pid/smaps. Names
can be standardized for a given system and they can include some variable
parts such as the name of the allocator or a library, the tid of the
thread using it, etc.
The name is stored in a pointer in the shared union in vm_area_struct
that points to a null-terminated string. Anonymous vmas with the same
name (equivalent strings) that are otherwise mergeable will be merged.
The name pointers are not shared between vmas even if they contain the
same name. The name pointer is stored in a union with fields that are
only used on file-backed mappings, so it does not increase memory usage.
A CONFIG_ANON_VMA_NAME kernel configuration option is introduced to
enable this feature. It keeps the feature disabled by default to prevent
any additional memory overhead and to avoid confusing procfs parsers on
systems which are not ready to support named anonymous vmas.
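A minimal userspace sketch of the interface described above (assumes the
prctl constants are available from <linux/prctl.h> and the kernel is
built with CONFIG_ANON_VMA_NAME):

	#include <sys/mman.h>
	#include <sys/prctl.h>
	#include <linux/prctl.h>

	void *p = mmap(NULL, 16 * 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
	      (unsigned long)p, 16 * 4096, (unsigned long)"test heap");
	/* /proc/self/maps now shows the range as [anon:test heap] */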
The patch is based on the original patch developed by Colin Cross, more
specifically on its latest version [1] posted upstream by Sumit Semwal.
It used a userspace pointer to store vma names. In that design, name
pointers could be shared between vmas. However during the last
upstreaming attempt, Kees Cook raised concerns [2] about this approach
and suggested to copy the name into kernel memory space, perform
validity checks [3] and store as a string referenced from
vm_area_struct.
One big concern is about fork() performance which would need to strdup
anonymous vma names. Dave Hansen suggested experimenting with
worst-case scenario of forking a process with 64k vmas having longest
possible names [4]. I ran this experiment on an ARM64 Android device
and recorded a worst-case regression of almost 40% when forking such a
process.
This regression is addressed in the followup patch which replaces the
pointer to a name with a refcounted structure that allows sharing the
name pointer between vmas of the same name. Instead of duplicating the
string during fork() or when splitting a vma it increments the refcount.
[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
[3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
[4] https://lore.kernel.org/linux-mm/[email protected]/
Changes for prctl(2) manual page (in the options section):
PR_SET_VMA
Sets an attribute specified in arg2 for virtual memory areas
starting from the address specified in arg3 and spanning the
size specified in arg4. arg5 specifies the value of the attribute
to be set. Note that assigning an attribute to a virtual memory
area might prevent it from being merged with adjacent virtual
memory areas due to the difference in that attribute's value.
Currently, arg2 must be one of:
PR_SET_VMA_ANON_NAME
Set a name for anonymous virtual memory areas. arg5 should
be a pointer to a null-terminated string containing the
name. The name length including null byte cannot exceed
80 bytes. If arg5 is NULL, the name of the appropriate
anonymous virtual memory areas will be reset. The name
can contain only printable ascii characters (including
space), except '[',']','\','$' and '`'.
This feature is available only if the kernel is built with
the CONFIG_ANON_VMA_NAME option enabled.
[[email protected]: docs: proc.rst: /proc/PID/maps: fix malformed table]
Link: https://lkml.kernel.org/r/[email protected]
[surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
work here was done by Colin Cross, therefore, with his permission, keeping
him as the author]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Colin Cross <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jan Glauber <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Cc: Shaohua Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: rearrange madvise code to allow for reuse", v11.
Avoid performance regression of the new anon vma name field refcounting it.
I checked the image sizes with allnoconfig builds:
unpatched Linus' ToT
text data bss dec hex filename
1324759 32 73928 1398719 1557bf vmlinux
After the first patch is applied (madvise refactoring)
text data bss dec hex filename
1322346 32 73928 1396306 154e52 vmlinux
>>> 2413 bytes decrease vs ToT <<<
After all patches applied with CONFIG_ANON_VMA_NAME=n
text data bss dec hex filename
1322337 32 73928 1396297 154e49 vmlinux
>>> 2422 bytes decrease vs ToT <<<
After all patches applied with CONFIG_ANON_VMA_NAME=y
text data bss dec hex filename
1325228 32 73928 1399188 155994 vmlinux
>>> 469 bytes increase vs ToT <<<
This patch (of 3):
Refactor the madvise syscall to allow for parts of it to be reused by a
prctl syscall that affects vmas.
Move the code that walks vmas in a virtual address range into a function
that takes a function pointer as a parameter. The only caller for now
is sys_madvise, which uses it to call madvise_vma_behavior on each vma,
but the next patch will add an additional caller.
Move handling all vma behaviors inside madvise_behavior, and rename it
to madvise_vma_behavior.
Move the code that updates the flags on a vma, including splitting or
merging the vma as necessary, into a new function called
madvise_update_vma. The next patch will add support for updating a new
anon_name field as well.
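A hedged sketch of the walker shape described above (the exact signature
in mm/madvise.c may differ):

	static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
				     unsigned long end, unsigned long arg,
				     int (*visit)(struct vm_area_struct *vma,
						  struct vm_area_struct **prev,
						  unsigned long start,
						  unsigned long end,
						  unsigned long arg));

	/* sys_madvise() then boils down to roughly: */
	return madvise_walk_vmas(mm, start, end, behavior,
				 madvise_vma_behavior);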
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Colin Cross <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Jan Glauber <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since commit 4064b9827063 ("mm: allow VM_FAULT_RETRY for multiple
times"), the FAULT_FLAG_ALLOW_RETRY bit of fault_flag will not be changed
in the page fault path, so the following check is no longer needed:
flags & FAULT_FLAG_ALLOW_RETRY
So just remove it.
[[email protected]: coding style fixes]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Qi Zheng <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kirill Shutemov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Chengming Zhou <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix the following coccicheck REVIEW:
tools/testing/selftests/vm/userfaultfd.c:1531:21-22: use swap() to make code cleaner
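For illustration, the pattern the script flags and the suggested cleanup
(variable names are hypothetical):

	/* before: open-coded swap via a temporary */
	tmp = area_src;
	area_src = area_dst;
	area_dst = tmp;

	/* after */
	swap(area_src, area_dst);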
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: chiminghao <[email protected]>
Reported-by: Zeal Robot <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The kvmalloc* allocation functions can fall back to vmalloc allocations,
and they do so more often on long-running machines. In addition, the
kernel does have __GFP_ACCOUNT kvmalloc* calls. So, often on long-running
machines, memory.stat does not give the complete picture of which type of
memory is charged to the memcg. So add a per-memcg vmalloc stat.
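Illustrative read of the new counter (values and cgroup path are made up):

	$ grep vmalloc /sys/fs/cgroup/<group>/memory.stat
	vmalloc 36864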
[[email protected]: page_memcg() within rcu lock, per Muchun]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: remove cast, per Muchun]
[[email protected]: remove area->page[0] checks and move to page by page accounting per Michal]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Make use of the struct_size() helper instead of an open-coded version,
in order to avoid any potential type mistakes or integer overflows that,
in the worst scenario, could lead to heap overflows.
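Illustrative example of the conversion (structure and field names are
hypothetical):

	/* before: open-coded size calculation */
	size = sizeof(*table) + count * sizeof(table->entries[0]);

	/* after: overflow-checked helper from <linux/overflow.h> */
	size = struct_size(table, entries, count);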
Link: https://github.com/KSPP/linux/issues/160
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wang Weiyang <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit 11192d9c124d ("memcg: flush stats only if updated") added
tracking of memcg stats updates which is used by the readers to flush
only if the updates are over a certain threshold. However each
individual update can correspond to a large value change for a given
stat. For example adding or removing a hugepage to an LRU changes the
stat by thp_nr_pages (512 on x86_64).
Treating a THP-related update as a single event can, in theory, leave the
stat off by (thp_nr_pages * nr_cpus * CHARGE_BATCH) before a flush.
To handle such scenarios, this patch takes the stat update value into
account as well, instead of just the update event. In addition, let the
async flusher unconditionally flush the stats to put a time limit on the
stats skew, so hopefully far fewer readers would need to flush.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Michal Koutný" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Our container agent wants to know when a container exits if it was OOM
killed or not to report to the user. We use memory.oom.group = 1 to
ensure that OOM kills within the container's cgroup kill everything.
Existing memory.events are insufficient for knowing if this triggered:
1) Our current approach reads memory.events oom_kill and reports the
container was killed if the value is non-zero. This is erroneous in
some cases where containers create their child cgroups with
memory.oom.group=1, as such OOM kills will get counted against the
container cgroup's oom_kill counter despite not actually OOM killing
the entire container.
2) Reading memory.events.local will fail to identify OOM kills in leaf
cgroups (that don't set memory.oom.group) within the container
cgroup.
This patch adds a new oom_group_kill event when memory.oom.group
triggers to allow userspace to cleanly identify when an entire cgroup is
oom killed.
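Illustrative read of the new event (values and cgroup path are made up):

	$ cat /sys/fs/cgroup/<container>/memory.events
	...
	oom_kill 3
	oom_group_kill 1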
[[email protected]: changes from Johannes and Chris]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dan Schatzberg <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Chris Down <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
propagate_protected_usage() is called to propagate the usage change in
the page_counter structure. But there is a call to this function from
page_counter_try_charge() when there is actually no usage change. Hence
this call should be removed.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Donghai Qiao <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit 494c1dfe855e ("mm: memcg/slab: create a new set of kmalloc-cg-<n>
caches") makes cgroup_memory_nokmem global, however, it is unnecessary
because there is already a function mem_cgroup_kmem_disabled() which
exports it.
Just make it static and replace it with mem_cgroup_kmem_disabled() in
mm/slab_common.c.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Chris Down <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The 'a' and 'b' bitmaps are local to this function, so no concurrent
access can occur. So the non-atomic '__set_bit()' can be used to save a
few cycles.
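Illustrative conversion (bit and bitmap names are hypothetical):

	/* before: atomic RMW, unnecessary for a function-local bitmap */
	set_bit(i, a);

	/* after: non-atomic variant */
	__set_bit(i, a);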
Link: https://lkml.kernel.org/r/e52476da5cee57151745c5c3c934a69798dc6fa4.1638132190.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix a data race in commit 779750d20b93 ("shmem: split huge pages beyond
i_size under memory pressure").
Here are the call traces causing the race:
Call Trace 1:
shmem_unused_huge_shrink+0x3ae/0x410
? __list_lru_walk_one.isra.5+0x33/0x160
super_cache_scan+0x17c/0x190
shrink_slab.part.55+0x1ef/0x3f0
shrink_node+0x10e/0x330
kswapd+0x380/0x740
kthread+0xfc/0x130
? mem_cgroup_shrink_node+0x170/0x170
? kthread_create_on_node+0x70/0x70
ret_from_fork+0x1f/0x30
Call Trace 2:
shmem_evict_inode+0xd8/0x190
evict+0xbe/0x1c0
do_unlinkat+0x137/0x330
do_syscall_64+0x76/0x120
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
A simple explanation:
Imagine there are 3 items in the local list (@list). In the first
traversal, A is not deleted from @list.
1) A->B->C
^
|
pos (leave)
In the second traversal, B is deleted from @list. Concurrently, A is
deleted from @list through shmem_evict_inode() since the last reference
count of the inode is dropped by another thread. Then @list is corrupted.
2) A->B->C
^ ^
| |
evict pos (drop)
We should make sure the inode is either on the global list or deleted
from any local list before iput().
Fix this by moving inodes back to the global list before we put them.
[[email protected]: coding style fixes]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
Signed-off-by: Gang Li <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The current behavior of memory failure is to truncate the page cache
regardless of whether the page is dirty or clean. If the page is dirty,
later accesses will get obsolete data from disk without any notification
to the users. This may cause silent data loss. It is even worse for
shmem, since shmem is an in-memory filesystem: truncating the page cache
means discarding data blocks, and a later read would return all zeros.
The right approach is to keep the corrupted page in the page cache; any
later access would return an error for syscalls or SIGBUS for page
faults, until the file is truncated, hole punched or removed. Regular
storage-backed filesystems would be more complicated, so this patch is
focused on shmem. This also unblocks support for soft offlining shmem
THPs.
[[email protected]: coding style fixes]
[[email protected]: fix uninitialized variable use in me_pagecache_clean()]
Link: https://lkml.kernel.org/r/[email protected]
[Fix invalid pointer dereference in shmem_read_mapping_page_gfp() with a
slight different implementation from what Ajay Garg <[email protected]>
and Muchun Song <[email protected]> proposed and reworked the
error handling of shmem_write_begin() suggested by Linus]
Link: https://lore.kernel.org/linux-mm/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ajay Garg <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Andy Lavr <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When the BUG_ON checks for a THP migration entry, the existing code only
covers the thp_migration_supported case, but not the
!thp_migration_supported case. If !thp_migration_supported() and
!pmd_present(), the original code may loop forever in theory. To make the
BUG_ON check consistent, we need to catch both cases.
Move the BUG_ON check one step earlier, because if the bug happens we
should know about it instead of depending on FOLL_MIGRATION being used by
the caller.
Because pmdval instead of *pmd is read by the is_pmd_migration_entry()
check, the existing code doesn't help to avoid useless locking within
pmd_migration_entry_wait(), so remove that check.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Li Xinhai <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
fault_in_readable() and fault_in_writeable() perform __get_user() and
__put_user() in a loop, implying multiple user access locking/unlocking.
To avoid that, use user access blocks.
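A hedged sketch of the pattern (simplified; the real helpers also handle
sub-page sizes and report how much was not faulted in):

	if (!user_access_begin(uaddr, size))
		return size;
	while (uaddr <= end) {
		/* touch one byte per page inside a single access block */
		unsafe_get_user(c, uaddr, out);
		uaddr += PAGE_SIZE;
	}
out:
	user_access_end();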
Link: https://lkml.kernel.org/r/720dcf79314acca1a78fae56d478cc851952149d.1637084492.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <[email protected]>
Reviewed-by: Andreas Gruenbacher <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Return the value directly instead of storing it in another redundant
variable.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: chiminghao <[email protected]>
Reported-by: Zeal Robot <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Pankaj Gupta <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit 4dd845b5a3e5 ("mm/swapops: rework swap entry manipulation code")
changed the migration entry related helpers. Just update the
documentation kept in sync with debug_vm_pgtable() to reflect those
changes.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
dump_mapping() is a big chunk of dump_page(), and it'd be handy to be
able to call it when we don't have a struct page. Split it out and move
it to fs/inode.c. Take the opportunity to simplify some of the debug
messages a little.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
KASAN's quarantine might save its metadata inside freed objects. As
this happens after the memory is zeroed by the slab allocator when
init_on_free is enabled, the memory coming out of quarantine is not
properly zeroed.
This causes lib/test_meminit.c tests to fail with Generic KASAN.
Zero the metadata when the object is removed from quarantine.
Link: https://lkml.kernel.org/r/2805da5df4b57138fdacd671f5d227d58950ba54.1640037083.git.andreyknvl@google.com
Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a test case for double-kmem_cache_destroy() detection.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Because mm/slab_common.c is not instrumented with software KASAN modes,
it is not possible to detect use-after-free of the kmem_cache passed
into kmem_cache_destroy(). In particular, because of the s->refcount--
and subsequent early return if non-zero, KASAN would never be able to
see the double-free via kmem_cache_free(kmem_cache, s). To be able to
detect a double-kmem_cache_destroy(), check accessibility of the
kmem_cache, and in case of failure return early.
While KASAN_HW_TAGS is able to detect such bugs, by checking
accessibility and returning early we fail more gracefully and also avoid
corrupting reused objects (where tags mismatch).
A recent case of a double-kmem_cache_destroy() was detected by KFENCE:
https://lkml.kernel.org/r/[email protected], which
was not detectable by software KASAN modes.
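A hedged sketch of the kind of check described above (the exact placement
and ordering in mm/slab_common.c may differ):

	void kmem_cache_destroy(struct kmem_cache *s)
	{
		if (unlikely(!s))
			return;
		/*
		 * Probe one byte of the kmem_cache: on a double-destroy the
		 * object is already freed, so KASAN reports the
		 * use-after-free here and we return early instead of
		 * corrupting a reused object.
		 */
		if (!kasan_check_byte(s))
			return;
		...
	}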
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a test checking that KASAN generic can also detect out-of-bounds
accesses to the left of globals.
Unfortunately it seems that GCC doesn't catch this (tested with GCC 10
and 11). The main difference between GCC's globals redzoning and Clang's
is that GCC relies on using increased alignment to produce padding,
whereas Clang's redzoning implementation actually adds real data after
the global and doesn't rely on alignment to produce padding. I believe
this is the main reason why GCC can't reliably catch globals
out-of-bounds in this case.
Given this is now a known issue, to avoid failing the whole test suite,
skip this test case with GCC.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Reported-by: Kaiwan N Billimoria <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Kaiwan N Billimoria <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use the newly added compound devmap facility which maps the assigned dax
ranges as compound pages at a page size of @align.
dax devices are created with a fixed @align (huge page size), which is
also enforced at mmap() of the device. Faults consequently happen at the
@align specified at creation time, and that does not change throughout
the dax device's lifetime. MCEs unmap a whole dax huge page, and splits
also occur at the configured page size.
Performance measured by gup_test improves considerably for
unpin_user_pages() and altmap with NVDIMMs:
$ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
(pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
[altmap]
(pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get: ~127ms put:~71ms
$ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
(pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
[altmap with -m 127004]
(pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms
.. as well as unpin_user_page_range_dirty_lock() being just as effective
as THP/hugetlb[0] pages.
[0] https://lore.kernel.org/linux-mm/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joao Martins <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
After moving the page mapping to be set prior to pte insertion, the pfn
in dev_dax_huge_fault() is no longer necessary. Remove it, as well as the
@pfn argument passed to the internal fault handler helpers.
[[email protected]: fix CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=n build]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joao Martins <[email protected]>
Suggested-by: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Normally, the @page mapping is set prior to inserting the page into a
page table entry. Make device-dax adhere to the same ordering, rather
than setting mapping after the PTE is inserted.
The address_space never changes and it is always associated with the
same inode and underlying pages. So, the page mapping is set once but
cleared when the struct pages are removed/freed (i.e. after
{devm_}memunmap_pages()).
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move initialization of page->mapping into a separate helper.
This is in preparation for moving the mapping to be set prior to
inserting the page table entry, and also for tidying up compound page
handling into one helper.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joao Martins <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Right now, only static dax regions have a valid @pgmap pointer in their
struct dev_dax. The dynamic dax case, however, does not.
In preparation for device-dax compound devmap support, make sure that the
dev_dax pgmap field is set after it has been allocated and initialized.
A dynamic dax device has its @pgmap allocated at probe() and managed by
devm (in contrast to a static dax region, where a pgmap is provided and
the dax core kfrees it). So, in addition to ensuring a valid @pgmap,
clear the pgmap when the dynamic dax device is released to avoid the same
pgmap ranges being re-requested across multiple region device reconfigs.
Add a static_dev_dax() helper and use it in dev_dax_probe() to make the
initialization differences between dynamic and static regions more
explicit. While at it, consolidate the ranges initialization when we
allocate the @pgmap for the dynamic dax region case. Also take the
opportunity to document the differences between static and dynamic dax
regions.
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Dan Williams <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use the struct_size() helper for the size of a struct with variable
array member at the end, rather than manually calculating it.
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Dan Williams <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Rather than calculating @pgoff manually, switch to ALIGN() instead.
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Dan Williams <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a new @vmemmap_shift property for struct dev_pagemap which specifies
that a devmap is composed of a set of compound pages of order
@vmemmap_shift, instead of base pages. When a compound page devmap is
requested, all but the first page are initialised as tail pages instead
of order-0 pages.
For certain ZONE_DEVICE users like device-dax which have a fixed page
size, this creates an opportunity to optimize GUP and GUP-fast walkers,
treating it the same way as THP or hugetlb pages.
Additionally, commit 7118fc2906e2 ("hugetlb: address ref count racing in
prep_compound_gigantic_page") removed set_page_count() because setting
the page ref count to zero was redundant. devmap pages don't come from
the page allocator, though, and only the head page refcount is used for
compound pages, hence initialize the tail page count to zero.
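Illustrative use of the new property by a ZONE_DEVICE driver (a sketch;
device-dax wires this up in a later patch of the series):

	/* request a devmap composed of order-9 (2M) compound pages */
	pgmap->vmemmap_shift = PMD_SHIFT - PAGE_SHIFT;

	/* number of base pages covered by each compound page */
	nr = 1UL << pgmap->vmemmap_shift;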
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joao Martins <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move struct page init to a helper function __init_zone_device_page().
This is in preparation for sharing the storage for compound page
metadata.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joao Martins <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm, device-dax: Introduce compound pages in devmap", v7.
This series converts device-dax to use compound pages, and moves away
from the 'struct page per basepage on PMD/PUD' that is done today.
Doing so:
1) unlocks a few noticeable improvements on unpin_user_pages() and makes
the device-dax+altmap case 4x faster in pinning (numbers below and in the
last patch)
2) as mentioned in various other threads, it's one important step towards
cleaning up ZONE_DEVICE refcounting.
I've split the compound pages on devmap part from the rest based on
recent discussions on devmap pending and future work planned[5][6].
There is consensus that device-dax should be using compound pages to
represent its PMD/PUDs just like HugeTLB and THP, and that leads to less
specialization of the dax parts. I will pursue the rest of the work in
parallel once this part is merged, particular the GUP-{slow,fast}
improvements [7] and the tail struct page deduplication memory savings
part[8].
To summarize what the series does:
Patch 1: Prepare hwpoisoning to work with dax compound pages.
Patches 2-3: Split the current utility function of prep_compound_page()
into head and tail and use those two helpers where appropriate to take
advantage of caches being warm after __init_single_page(). This is used
when initializing zone device when we bring up device-dax namespaces.
Patches 4-10: Add devmap support for compound pages in device-dax.
memmap_init_zone_device() initializes its metadata as compound pages, and
a new devmap property known as vmemmap_shift is introduced which outlines
how the vmemmap is structured (defaults to base pages as done today).
Essentially, the property describes the page order of the metadata.
While at it, do a few cleanups in device-dax in patches 5-9. Finally,
enable device-dax usage of devmap @vmemmap_shift with a value based on
its own @align property. @vmemmap_shift returns 0 by default (which is
today's case of base pages in devmap, like fsdax or the others) and the
usage of a compound devmap is optional. Starting with device-dax (*not*
fsdax) we enable it by default. There are a few pinning improvements,
particularly in the unpinning case and with altmap, as well as
unpin_user_page_range_dirty_lock() being just as effective as on
THP/hugetlb[0] pages.
$ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
(pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
[altmap]
(pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get: ~127ms put:~71ms
$ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
(pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
[altmap with -m 127004]
(pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms
Tested on x86 with 1Tb+ of pmem (alongside registering it with RDMA with
and without altmap), alongside gup_test selftests with dynamic dax
regions and static dax regions. Coupled with ndctl unit tests for
dynamic dax devices that exercise all of this. Note, for dynamic dax
regions I had to revert commit 8aa83e6395 ("x86/setup: Call
early_reserve_memory() earlier"), it is a known issue that this commit
broke efi_fake_mem=.
This patch (of 11):
Split the utility function prep_compound_page() into head and tail
counterparts, and use them accordingly.
This is in preparation for sharing the storage for compound page
metadata.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joao Martins <[email protected]>
Acked-by: Mike Kravetz <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Yongqiang reports a kmemleak panic on module insmod/rmmod with KASAN
enabled (without KASAN_VMALLOC) on x86 [1].
When the module area allocates memory, its kmemleak_object is created
successfully, but the KASAN shadow memory of the module allocation is not
ready yet, so when kmemleak scans the module's pointer it will panic
because the KASAN check finds no shadow memory.
module_alloc
  __vmalloc_node_range
    kmemleak_vmalloc
                              kmemleak_scan
                                update_checksum
  kasan_module_alloc
    kmemleak_ignore
Note that there is no problem if KASAN_VMALLOC is enabled, because the
entire shadow memory of the module area is preallocated. Thus, the bug
exists only on architectures that allocate the KASAN shadow for the module
area dynamically at each module load; for now only x86/arm64/s390 are
involved.
Fix this by adding a VM_DEFER_KMEMLEAK flag and deferring the kmemleak
registration of the vmalloc'ed object in module_alloc().
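A rough sketch of the deferral, under these assumptions: VM_DEFER_KMEMLEAK
makes __vmalloc_node_range() skip its kmemleak_vmalloc() registration, and
example_kasan_shadow_ready() below is a made-up stand-in for the arch/KASAN
hook that maps the module's shadow (the real fix threads this through each
architecture's module_alloc()):

#include <linux/vmalloc.h>
#include <linux/kmemleak.h>
#include <linux/moduleloader.h>
#include <linux/mm.h>

/* Stand-in for the arch/KASAN hook that maps the module shadow. */
bool example_kasan_shadow_ready(void *addr, unsigned long size);

void *example_module_alloc(unsigned long size)
{
        void *p;

        /*
         * VM_DEFER_KMEMLEAK makes __vmalloc_node_range() skip the usual
         * kmemleak_vmalloc() registration, so kmemleak cannot scan the
         * region before its KASAN shadow exists.
         */
        p = __vmalloc_node_range(size, MODULE_ALIGN, MODULES_VADDR,
                                 MODULES_END, GFP_KERNEL, PAGE_KERNEL,
                                 VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
                                 __builtin_return_address(0));
        if (!p)
                return NULL;

        if (!example_kasan_shadow_ready(p, size)) {
                vfree(p);
                return NULL;
        }

        /* Shadow is in place; registering with kmemleak is now safe. */
        kmemleak_vmalloc(find_vm_area(p), size, GFP_KERNEL);
        return p;
}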
[1] https://lore.kernel.org/all/[email protected]/
[[email protected]: fix build]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: simplify ifdefs, per Andrey]
Link: https://lkml.kernel.org/r/CA+fCnZcnwJHUQq34VuRxpdoY6_XbJCDJ-jopksS5Eia4PijPzw@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 793213a82de4 ("s390/kasan: dynamic shadow mem allocation for modules")
Fixes: 39d114ddc682 ("arm64: add KASAN support")
Fixes: bebf56a1b176 ("kasan: enable instrumentation of global variables")
Signed-off-by: Kefeng Wang <[email protected]>
Reported-by: Yongqiang Liu <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Kefeng Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Reserved regions with a direct mapping may contain references to other
regions. A CMA region with a fixed location is reserved without a
kmemleak_object being created for it.
So register such regions as gray kmemleak objects.
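For illustration, a sketch of what "gray" means here, assuming the
four-argument kmemleak_alloc_phys() of this kernel generation and an
invented example_reserve_region() wrapper: a min_count of 0 registers an
object that kmemleak scans for references but never reports as a leak.

#include <linux/kmemleak.h>
#include <linux/memblock.h>
#include <linux/types.h>

static int example_reserve_region(phys_addr_t base, phys_addr_t size,
                                  bool nomap)
{
        if (nomap)
                return memblock_mark_nomap(base, size);

        /* Direct-mapped reservation: let kmemleak scan it as gray. */
        kmemleak_alloc_phys(base, size, 0, 0);
        return memblock_reserve(base, size);
}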
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Calvin Zhang <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Frank Rowand <[email protected]>
Cc: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
With HW tag-based KASAN enabled, we get a warning when we free an object
whose address starts with 0xFF.
This happens because the kmemleak rbtree stores the tagged object, and the
tag of the object being freed does not match the one recorded in the
rbtree.
In the example below, the kmemleak rbtree stores the object tagged at
kmalloc() time, while kfree() is handed a pointer carrying the 0xFF tag.
Call sequence:
ptr = kmalloc(size, GFP_KERNEL);
page = virt_to_page(ptr);
offset = offset_in_page(ptr);
kfree(page_address(page) + offset);
ptr = kmalloc(size, GFP_KERNEL);
A sequence like that may cause the warnings below:
1) Freeing unknown object:
   In kfree(), we get a "Freeing unknown object" warning from
   kmemleak_free(), because the object (tag 0xFx) in the kmemleak rbtree
   and the pointer (tag 0xFF) passed to kfree() carry different tags.
2) Overlap existing:
   When we allocate that object again with the same hardware tag, kmemleak
   finds an overlap in its rbtree and the kmemleak thread is killed.
kmemleak: Freeing unknown object at 0xffff000003f88000
CPU: 5 PID: 177 Comm: cat Not tainted 5.16.0-rc1-dirty #21
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x1ac
show_stack+0x1c/0x30
dump_stack_lvl+0x68/0x84
dump_stack+0x1c/0x38
kmemleak_free+0x6c/0x70
slab_free_freelist_hook+0x104/0x200
kmem_cache_free+0xa8/0x3d4
test_version_show+0x270/0x3a0
module_attr_show+0x28/0x40
sysfs_kf_seq_show+0xb0/0x130
kernfs_seq_show+0x30/0x40
seq_read_iter+0x1bc/0x4b0
seq_read_iter+0x1bc/0x4b0
kernfs_fop_read_iter+0x144/0x1c0
generic_file_splice_read+0xd0/0x184
do_splice_to+0x90/0xe0
splice_direct_to_actor+0xb8/0x250
do_splice_direct+0x88/0xd4
do_sendfile+0x2b0/0x344
__arm64_sys_sendfile64+0x164/0x16c
invoke_syscall+0x48/0x114
el0_svc_common.constprop.0+0x44/0xec
do_el0_svc+0x74/0x90
el0_svc+0x20/0x80
el0t_64_sync_handler+0x1a8/0x1b0
el0t_64_sync+0x1ac/0x1b0
...
kmemleak: Cannot insert 0xf2ff000003f88000 into the object search tree (overlaps existing)
CPU: 5 PID: 178 Comm: cat Not tainted 5.16.0-rc1-dirty #21
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x1ac
show_stack+0x1c/0x30
dump_stack_lvl+0x68/0x84
dump_stack+0x1c/0x38
create_object.isra.0+0x2d8/0x2fc
kmemleak_alloc+0x34/0x40
kmem_cache_alloc+0x23c/0x2f0
test_version_show+0x1fc/0x3a0
module_attr_show+0x28/0x40
sysfs_kf_seq_show+0xb0/0x130
kernfs_seq_show+0x30/0x40
seq_read_iter+0x1bc/0x4b0
kernfs_fop_read_iter+0x144/0x1c0
generic_file_splice_read+0xd0/0x184
do_splice_to+0x90/0xe0
splice_direct_to_actor+0xb8/0x250
do_splice_direct+0x88/0xd4
do_sendfile+0x2b0/0x344
__arm64_sys_sendfile64+0x164/0x16c
invoke_syscall+0x48/0x114
el0_svc_common.constprop.0+0x44/0xec
do_el0_svc+0x74/0x90
el0_svc+0x20/0x80
el0t_64_sync_handler+0x1a8/0x1b0
el0t_64_sync+0x1ac/0x1b0
kmemleak: Kernel memory leak detector disabled
kmemleak: Object 0xf2ff000003f88000 (size 128):
kmemleak: comm "cat", pid 177, jiffies 4294921177
kmemleak: min_count = 1
kmemleak: count = 0
kmemleak: flags = 0x1
kmemleak: checksum = 0
kmemleak: backtrace:
kmem_cache_alloc+0x23c/0x2f0
test_version_show+0x1fc/0x3a0
module_attr_show+0x28/0x40
sysfs_kf_seq_show+0xb0/0x130
kernfs_seq_show+0x30/0x40
seq_read_iter+0x1bc/0x4b0
kernfs_fop_read_iter+0x144/0x1c0
generic_file_splice_read+0xd0/0x184
do_splice_to+0x90/0xe0
splice_direct_to_actor+0xb8/0x250
do_splice_direct+0x88/0xd4
do_sendfile+0x2b0/0x344
__arm64_sys_sendfile64+0x164/0x16c
invoke_syscall+0x48/0x114
el0_svc_common.constprop.0+0x44/0xec
do_el0_svc+0x74/0x90
kmemleak: Automatic memory scanning thread ended
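The kind of fix this points at, sketched under the assumption that kmemleak
canonicalizes pointers with the existing kasan_reset_tag() helper before
any rbtree insert or lookup (the kmemleak internals are reduced to a single
illustrative helper; where exactly it is applied is simplified here):

#include <linux/kasan.h>

/*
 * Sketch only: with HW tag-based KASAN the same allocation can be seen
 * through pointers carrying different tags (e.g. the 0xFx-tagged pointer
 * returned by kmalloc() versus the 0xFF match-all pointer obtained via
 * page_address()).  Stripping the tag before the pointer is used as the
 * kmemleak rbtree key makes both views name the same object.
 */
static inline unsigned long example_kmemleak_key(const void *ptr)
{
        return (unsigned long)kasan_reset_tag(ptr);
}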
[[email protected]: whitespace tweak]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kuan-Ying Lee <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Doug Berger <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There are no external users of slab_start/next/stop(), so make them
static. Also, memory.kmem.slabinfo is deprecated and now outputs nothing,
so move memcg_slab_show() into mm/memcontrol.c and rename it to
mem_cgroup_slab_show() to be consistent with other function names.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Calling kmem_cache_destroy() while the cache still has objects allocated
is a kernel bug, and will usually result in the entire cache being
leaked. While the message in kmem_cache_destroy() resembles a warning,
it is currently not implemented using a real WARN().
This is problematic for infrastructure that tests the kernel, which relies
on the specific format of WARN()s to pick up on bugs.
Some 13 years ago this used to be a simple WARN_ON() in slub, but commit
d629d8195793 ("slub: improve kmem_cache_destroy() error message")
changed it into an open-coded warning to avoid confusion with a bug in
slub itself.
Instead, turn the open-coded warning into a real WARN() with the message
preserved, so that test systems can actually identify these issues, and
we get all the other benefits of using a normal WARN(). The warning
message is extended with "when called from <caller-ip>" to make it even
clearer where the fault lies.
For most configurations this is only a cosmetic change; note, however,
that WARN() here will now also respect panic_on_warn.
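A sketch of the shape of the change, with the slab internals reduced to an
invented example_report_busy_cache() helper and a format string that only
approximates the preserved message; the point is just the open-coded
message becoming a real WARN() with the caller appended:

#include <linux/bug.h>
#include <linux/slab.h>

static void example_report_busy_cache(struct kmem_cache *s, int err,
                                      unsigned long caller_ip)
{
        /*
         * Before: an open-coded pr_err()-style message that test
         * infrastructure could not pick up.  After: WARN() with the
         * message preserved and the caller appended, which also honours
         * panic_on_warn.
         */
        WARN(err, "%s %s: Slab cache still has objects when called from %pS",
             __func__, s->name, (void *)caller_ip);
}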
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
__user annotations are used by checkers (e.g. sparse) to mark user
pointers. Here, however, __user is applied to a struct directly, without a
pointer being involved.
Although the presence of __user does not cause sparse to emit a warning,
it should be removed for consistency with other uses of offsetof().
Note: no functional change intended.
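A hypothetical before/after (the struct and macro names below are invented;
the point is simply that offsetof() takes the struct type itself, so the
__user qualifier contributes nothing):

#include <linux/stddef.h>
#include <linux/compiler.h>

struct example_frame {
        unsigned long magic;
        char payload[64];
};

/* Before: __user applied to the type inside offsetof(). */
#define EXAMPLE_PAYLOAD_OFF_OLD offsetof(struct example_frame __user, payload)

/* After: the qualifier is dropped; the computed offset is identical. */
#define EXAMPLE_PAYLOAD_OFF_NEW offsetof(struct example_frame, payload)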
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Amit Daniel Kachhap <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The variable 'free_space' is initialized with a value that is never read;
it is re-assigned later on both paths of an if statement. The early
initialization is redundant and can be removed.
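The pattern, shown with invented identifiers that do not match the ocfs2
source:

#include <linux/types.h>

static int free_space_before(bool use_inline, int inline_cap,
                             int blocksize, int used)
{
        int free_space = blocksize;     /* redundant: overwritten below */

        if (use_inline)
                free_space = inline_cap - used;
        else
                free_space = blocksize - used;
        return free_space;
}

static int free_space_after(bool use_inline, int inline_cap,
                            int blocksize, int used)
{
        int free_space;                 /* assigned on both if/else paths */

        if (use_inline)
                free_space = inline_cap - used;
        else
                free_space = blocksize - used;
        return free_space;
}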
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Colin Ian King <[email protected]>
Acked-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There are currently two ways to create a set of sysfs files for a
kobj_type: through the default_attrs field and through the default_groups
field.
Move the ocfs2 cluster sysfs code to use the default_groups field, which
has been the preferred way since aa30f47cf666 ("kobject: Add support for
default attribute groups to kobj_type"), so that we can soon get rid of
the obsolete default_attrs field.
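A generic sketch of the conversion (the kobj_type and attribute names are
invented, not ocfs2's; ATTRIBUTE_GROUPS() is the usual helper that
generates the *_groups array):

#include <linux/kobject.h>
#include <linux/sysfs.h>

static struct attribute *example_attrs[] = {
        /* ... &example_attr_foo.attr, ... */
        NULL,
};
ATTRIBUTE_GROUPS(example);

static struct kobj_type example_ktype = {
        .sysfs_ops      = &kobj_sysfs_ops,
        /* Before: .default_attrs = example_attrs, (obsolete field) */
        .default_groups = example_groups,
};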
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Tested-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The variable 'root_bh' is initialized with a value that is never read; it
is re-assigned later, closer to its use. The early initialization is
redundant and can be removed.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Colin Ian King <[email protected]>
Acked-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There are currently two ways to create a set of sysfs files for a
kobj_type: through the default_attrs field and through the default_groups
field.
Move the ocfs2 code to use the default_groups field, which has been the
preferred way since aa30f47cf666 ("kobject: Add support for default
attribute groups to kobj_type"), so that we can soon get rid of the
obsolete default_attrs field.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Acked-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|