The /proc/<pid>/maps file is extremely useful in practice for various tasks
involving figuring out process memory layout, what files are backing any
given memory range, etc. One important class of applications that
absolutely relies on this is profilers/stack symbolizers (the perf tool
being one of them). Patterns of use differ, but they generally fall into
two categories.
In the on-demand pattern, a profiler/symbolizer captures a stack trace
containing absolute memory addresses of some functions, and then uses the
/proc/<pid>/maps file to find the corresponding backing ELF files
(normally, only executable VMAs are of interest) and the file offsets
within them, and continues from there to get yet more information (ELF
symbols, DWARF information) to produce human-readable symbolic information.
This pattern is used by Meta's fleet-wide profiler, as one example.
In the preprocessing pattern, the application doesn't know the set of
addresses of interest in advance, so it has to fetch all relevant VMAs
(again, probably only executable ones), store or cache them, and then
proceed with profiling and stack trace capture. Once done, it performs
symbolization based on the stored VMA information; this can happen at a
much later point in time. This pattern is used by the perf tool, as an
example.
In either case, there are both performance and correctness requirements
involved. This address-to-VMA translation has to be done as efficiently
as possible, but it also must not miss any VMA (especially in the case of
loading/unloading shared libraries). In practice, correctness can't be
guaranteed (due to the process dying before VMA data can be captured, a
shared library being unloaded, etc.), but any effort to maximize the
chance of finding the VMA is appreciated.
Unfortunately, for all its universality and usefulness, the
/proc/<pid>/maps file doesn't fit the above use cases 100%.
First, its main purpose is to emit all VMAs sequentially, but in practice
captured addresses would fall into only a smaller subset of all of a
process' VMAs, mainly containing executable text. Yet, a library would
need to parse most or all of the contents to find the needed VMAs, as
there is no way to skip VMAs that are of no use. An efficient library can
make this linear pass relatively cheap, but it's definitely an overhead
that could be avoided if there were a way to do more targeted querying of
the relevant VMA information.
Second, it's a text-based interface, which makes its programmatic use from
applications and libraries more cumbersome and inefficient due to the need
to handle text parsing to get the necessary pieces of information. The
overhead is actually paid both by the kernel, formatting originally binary
VMA data into text, and then by the user space application, parsing it
back into binary data for further use.
For the on-demand pattern of usage described above, another problem when
writing a generic stack trace symbolization library is an unfortunate
performance-vs-correctness tradeoff that needs to be made. The library
has to decide whether to cache the parsed contents of /proc/<pid>/maps
(after initial processing) to service future requests (if the application
asks to symbolize another set of addresses for the same process, captured
at some later time, which is typical for periodic/continuous profiling
cases), or to re-open and re-parse the file for each new request. In the
former case, more memory is used for the cache and there is a risk of
getting stale data if the application loads or unloads shared libraries,
or otherwise changes its set of VMAs somehow, e.g., through additional
mmap() calls. In the latter case, it's the performance hit that comes
from re-opening the file and re-parsing its contents all over again.
This patch aims to solve this problem by providing a new API built on top
of /proc/<pid>/maps. It's meant to address both the non-selectiveness and
the text nature of /proc/<pid>/maps by giving the user more control over
which VMA(s) need to be queried and, being a binary interface, by
eliminating the overhead of text formatting (on the kernel side) and
parsing (on the user space side).
It's also designed to be extensible and forward/backward compatible by
including a required struct size field, which the user has to provide. We
use the established copy_struct_from_user() approach to handle
extensibility.
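For illustration, the extensibility pattern looks roughly like the
following sketch (hedged: do_procmap_query() and the exact validation
bounds are illustrative, not the verbatim fs/proc implementation;
copy_struct_from_user() itself is the real kernel helper):

	static int do_procmap_query(void __user *uarg)
	{
		struct procmap_query karg;
		u64 usize;
		int err;

		/* the user-provided struct size is the first field */
		if (copy_from_user(&usize, uarg, sizeof(usize)))
			return -EFAULT;
		if (usize > PAGE_SIZE ||
		    usize < offsetofend(struct procmap_query, query_addr))
			return -E2BIG;

		/* zero-fills karg's tail; rejects unknown non-zero tail bytes */
		err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
		if (err)
			return err;

		/* ... perform the VMA lookup and copy results back ... */
		return 0;
	}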
The user has a choice to either get the VMA that covers the provided
address, or -ENOENT if none is found (the exact, least surprising case).
Or, with an extra query flag (PROCMAP_QUERY_COVERING_OR_NEXT_VMA), they
can get either the VMA that covers the address (if there is one), or the
closest next VMA (i.e., the VMA with the smallest vm_start > addr). The
latter allows more efficient use but, given it could be surprising
behavior, requires an explicit opt-in.
There is another query flag that is useful for some use cases.
PROCMAP_QUERY_FILE_BACKED_VMA instructs this API to only return
file-backed VMAs. Combining this with PROCMAP_QUERY_COVERING_OR_NEXT_VMA
makes it possible to efficiently iterate only file-backed VMAs of the
process, which is what profilers/symbolizers are normally interested in.
All the above querying flags can be combined with an (also optional) set
of desired VMA permission flags. This allows, for example, iterating only
the executable subset of VMAs, which is what the preprocessing pattern
used by the perf tool would benefit from, as the assumption is that
captured stack traces would contain addresses of executable code. This
saves time by efficiently skipping non-executable VMAs altogether.
All these querying flags (modifiers) are orthogonal and can be combined in
a semantically meaningful and natural way.
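As an illustration of combining these modifiers (a user-space sketch,
assuming the struct procmap_query layout and flag names from the uapi
<linux/fs.h> added by this patch; error handling is trimmed), iterating
all executable, file-backed VMAs of a process could look like this:

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	static void dump_exec_vmas(int pid)
	{
		struct procmap_query q;
		char path[64], name[256];
		int fd;

		snprintf(path, sizeof(path), "/proc/%d/maps", pid);
		fd = open(path, O_RDONLY);
		if (fd < 0)
			return;

		memset(&q, 0, sizeof(q));
		q.size = sizeof(q); /* required, enables fwd/bwd compat */
		q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA |
				PROCMAP_QUERY_FILE_BACKED_VMA |
				PROCMAP_QUERY_VMA_EXECUTABLE;
		q.query_addr = 0;
		while (1) {
			q.vma_name_addr = (__u64)(unsigned long)name;
			q.vma_name_size = sizeof(name); /* 0 skips name fetch */
			if (ioctl(fd, PROCMAP_QUERY, &q) < 0)
				break; /* -ENOENT: no more matching VMAs */
			printf("%llx-%llx %s\n",
			       (unsigned long long)q.vma_start,
			       (unsigned long long)q.vma_end,
			       q.vma_name_size ? name : "");
			q.query_addr = q.vma_end; /* continue past this VMA */
		}
		close(fd);
	}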
Basing this ioctl()-based API on top of the /proc/<pid>/maps FD makes
sense given it's querying the same set of VMA data. It's also beneficial
because permission checks for /proc/<pid>/maps are performed once at open
time, and the actual reads of its text contents are done without further
permission checks. We piggyback on this pattern with the ioctl()-based
API as well, as that's a desired property, both for performance reasons
and for security and flexibility reasons.
Allowing an application to open an FD for /proc/self/maps without any
extra capabilities, and then pass it to some sort of profiling agent
through a Unix-domain socket, allows such a profiling agent to not require
some of the capabilities that are otherwise expected when opening the
/proc/<pid>/maps file for *another* process. This is a desirable property
for some more restricted setups.
This new ioctl-based implementation doesn't interfere with seq_file-based
implementation of /proc/<pid>/maps textual interface, and so could be used
together or independently without paying any price for that.
Note also that fetching the VMA name (e.g., the backing file path, or a
special hard-coded or user-provided name) is optional, just like the build
ID. If the user sets vma_name_size to zero, kernel code won't attempt to
retrieve it, saving resources.
Earlier versions of this patch set added per-VMA locking, which is why the
code structure is ready for abstracting the mmap_lock vs vm_lock
differences (query_vma_setup(), query_vma_teardown(), and
query_vma_find_by_addr()). But given that anon_vma_name() is not yet
compatible with per-VMA locking, the initial implementation sticks to
using only mmap_lock for now. We keep the existing structure with
setup/teardown/query helper functions so that it will be easy to add
per-VMA locking back once all the pieces are ready.
[[email protected]: improve PROCMAP_QUERY's compat mode handling]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Liam R. Howlett <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "ioctl()-based API to query VMAs from /proc/<pid>/maps", v6.
Implement a binary ioctl()-based interface to the /proc/<pid>/maps file to
allow applications to query VMA information more efficiently than by
reading *all* VMAs nonselectively through the text-based /proc/<pid>/maps
interface.
Patch #2 goes into a lot of details and background on some common patterns
of using /proc/<pid>/maps in the area of performance profiling and
subsequent symbolization of captured stack traces. As mentioned in that
patch, patterns of VMA querying can differ depending on specific use case,
but can generally be grouped into two main categories: the need to query a
small subset of VMAs covering a given batch of addresses, or
reading/storing/caching all (typically, executable) VMAs upfront for later
processing.
The new PROCMAP_QUERY ioctl() API added in this patch set was motivated by
the former pattern of usage. Earlier revisions had a patch adding a tool
that faithfully reproduces an efficient VMA matching pass of a symbolizer,
collecting a subset of covering VMAs for a given set of addresses as
efficiently as possible. This tool served both as a testing ground, as
well as a benchmarking tool. It implements everything both for currently
existing text-based /proc/<pid>/maps interface, as well as for newly-added
PROCMAP_QUERY ioctl(). This revision dropped the tool from the patch set
and, once the API lands upstream, this tool might be added separately on
Github as an example.
Based on discussion on earlier revisions of this patch set, it turned out
that this ioctl() API is competitive with the highly-optimized text-based
pre-processing pattern that the perf tool is using. Based on the perf
discussion,
this revision adds more flexibility in specifying a subset of VMAs that
are of interest. Now it's possible to specify desired permissions of VMAs
(e.g., request only executable ones) and/or restrict to only a subset of
VMAs that have file backing. This further improves the efficiency when
using this new API thanks to more selective (executable VMAs only)
querying.
In addition to a custom benchmarking tool, and experimental perf
integration (available at [0]), Daniel Mueller has since also implemented
an experimental integration into blazesym (see [1]), a library used for
stack trace symbolization by our server fleet-wide profiler and another
on-device profiler agent that runs on weaker ARM devices. The latter
ARM-based device profiler is especially sensitive to performance, and so
we benchmarked and compared text-based /proc/<pid>/maps solution to the
equivalent one using PROCMAP_QUERY ioctl().
The results are very encouraging, giving us a 5x improvement for the
end-to-end so-called "address normalization" pass, which is the part of
the symbolization process that happens locally on the ARM device before
being sent out for further heavier-weight processing on a more powerful
remote server. Note that this is not an artificial microbenchmark. It's
a full end-to-end API call being measured with real-world data on a
real-world device.
TEXT-BASED
==========
Benchmarking main/normalize_process_no_build_ids_uncached_maps
main/normalize_process_no_build_ids_uncached_maps
time: [49.777 µs 49.982 µs 50.250 µs]
IOCTL-BASED
===========
Benchmarking main/normalize_process_no_build_ids_uncached_maps
main/normalize_process_no_build_ids_uncached_maps
time: [10.328 µs 10.391 µs 10.457 µs]
change: [−79.453% −79.304% −79.166%] (p = 0.00 < 0.02)
Performance has improved.
You can see above the drop from 50µs down to 10µs for exactly the same
amount of work, with the same data and target process. With the
aforementioned custom tool, we see about a ~40x improvement (it might vary
a bit, depending on the specific captured set of addresses). And even for
the perf-based benchmark, it's on par or slightly ahead when using
permission-based filtering (fetching only executable VMAs).
Earlier revisions attempted to use per-VMA locking, if kernel was compiled
with CONFIG_PER_VMA_LOCK=y, but it turned out that anon_vma_name() is not
yet compatible with per-VMA locking and assumes mmap_lock to be taken,
which makes the use of per-VMA locking for this API premature. It was
agreed ([2]) to continue for now with just mmap_lock, but the code
structure is such that it should be easy to add per-VMA locking support
once all the pieces are ready.
One thing that did not change was basing this new API, as an ioctl()
command, on the /proc/<pid>/maps file. An ioctl-based API on top of pidfd
was considered, but has its own downsides. Implementing ioctl() directly
on pidfd would cause access permission checks on every single ioctl(),
which leads to performance concerns and potential spam of capable() audit
messages. It also prevents a nice pattern, possible with
/proc/<pid>/maps, in which an application opens a /proc/self/maps FD
(requiring no additional capabilities) and passes this FD to a profiling
agent for querying. To achieve a similar pattern, a new file would have
to be created from the pidfd just for VMA querying, which is considered
inferior to just querying the /proc/<pid>/maps FD as proposed in the
current approach. These aspects were discussed in the hallway track at
the recent LSF/MM/BPF 2024, and sticking to the procfs ioctl() was the
final agreement we arrived at.
[0] https://github.com/anakryiko/linux/commits/procfs-proc-maps-ioctl-v2/
[1] https://github.com/libbpf/blazesym/pull/675
[2] https://lore.kernel.org/bpf/7rm3izyq2vjp5evdjc7c6z4crdd3oerpiknumdnmmemwyiwx7t@hleldw7iozi3/
This patch (of 6):
Extract the generic logic that fetches the relevant pieces of data
describing a VMA name. This could be just some string (either a special
constant or user-provided), a string with some formatted wrapping text
(e.g., "[anon_shmem:<something>]"), or, commonly, a file path. The
seq_file-based logic has different methods to handle all three cases, but
they are currently mixed in with extracting the underlying sources of
data.
This patch splits this into data fetching and data formatting, so that
data fetching can be reused later on.
There should be no functional changes.
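The shape of the resulting split could be sketched as follows
(illustrative only; the exact helper name and the set of special cases in
fs/proc/task_mmu.c may differ):

	/* Fetch the raw sources of a VMA's name in one reusable place;
	 * the seq_file code then only decides how to print them. */
	static void get_vma_name(struct vm_area_struct *vma,
				 const struct path **path,
				 const char **name,
				 const char **name_fmt)
	{
		struct anon_vma_name *anon_name =
			vma->vm_mm ? anon_vma_name(vma) : NULL;

		*name = NULL;
		*path = NULL;
		*name_fmt = NULL;

		if (vma->vm_file) {
			if (anon_name) {
				/* shmem mapping with user-supplied name */
				*name_fmt = "[anon_shmem:%s]";
				*name = anon_name->name;
			} else {
				*path = file_user_path(vma->vm_file);
			}
			return;
		}
		if (vma->vm_ops && vma->vm_ops->name) {
			*name = vma->vm_ops->name(vma);
			if (*name)
				return;
		}
		/* ... arch names, [heap]/[stack], "[anon:%s]" cases ... */
	}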
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Liam R. Howlett <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Since the memfd pages associated with a udmabuf may be migrated as part of
udmabuf create, we need to verify the data coherency after successful
migration. The new tests added in this patch try to do just that using 4k
sized pages and also 2 MB sized huge pages for the memfd.
Successful completion of the tests would mean that there is no disconnect
between the memfd pages and the ones associated with a udmabuf. And,
these tests can also be augmented in the future to test newer udmabuf
features (such as handling memfd hole punch).
The idea for these tests comes from a patch by Mike Kravetz here:
https://lists.freedesktop.org/archives/dri-devel/2023-June/410623.html
v1->v2: (suggestions from Shuah)
- Use ksft_* functions to print and capture results of tests
- Use appropriate KSFT_* status codes for exit()
- Add Mike Kravetz's suggested-by tag
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vivek Kasireddy <[email protected]>
Suggested-by: Mike Kravetz <[email protected]>
Acked-by: Dave Airlie <[email protected]>
Acked-by: Gerd Hoffmann <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Using memfd_pin_folios() will ensure that the pages are pinned
correctly using FOLL_PIN. And, this also ensures that we don't
accidentally break features such as memory hotunplug as it would
not allow pinning pages in the movable zone.
Using this new API also simplifies the code as we no longer have
to deal with extracting individual pages from their mappings or
handle shmem and hugetlb cases separately.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vivek Kasireddy <[email protected]>
Acked-by: Dave Airlie <[email protected]>
Acked-by: Gerd Hoffmann <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This is mainly a preparatory patch to use memfd_pin_folios() API for
pinning folios. Using folios instead of pages makes sense as the udmabuf
driver needs to handle both shmem and hugetlb cases. And, using the
memfd_pin_folios() API makes this easier as we no longer need to
separately handle shmem vs hugetlb cases in the udmabuf driver.
Note that the function vmap_udmabuf() still needs a list of pages, so we
collect all the head pages into a local array in this case.
Other changes in this patch include the addition of helpers for checking
the memfd seals and exporting dmabuf. Moving code from udmabuf_create()
into these helpers improves readability given that udmabuf_create() is a
bit long.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vivek Kasireddy <[email protected]>
Acked-by: Dave Airlie <[email protected]>
Acked-by: Gerd Hoffmann <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
A user or admin can configure a VMM (Qemu) Guest's memory to be backed by
hugetlb pages for various reasons. However, a Guest OS would still
allocate (and pin) buffers that are backed by regular 4k sized pages. In
order to map these buffers and create dma-bufs for them on the Host, we
first need to find the hugetlb pages where the buffer allocations are
located and then determine the offsets of individual chunks (within those
pages) and use this information to eventually populate a scatterlist.
Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options
were passed to the Host kernel and Qemu was launched with these
relevant options: qemu-system-x86_64 -m 4096m....
-device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
-display gtk,gl=on
-object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
-machine memory-backend=mem1
Replacing -display gtk,gl=on with -display gtk,gl=off above would
exercise the mmap handler.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vivek Kasireddy <[email protected]>
Acked-by: Mike Kravetz <[email protected]> (v2)
Acked-by: Dave Airlie <[email protected]>
Acked-by: Gerd Hoffmann <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add VM_PFNMAP to vm_flags in the mmap handler to ensure that the mappings
are managed without using struct page.
And, in the vm_fault handler, use vmf_insert_pfn to share the page's pfn
with userspace instead of directly sharing the page (via struct page *).
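A sketch of the described handlers (hedged: udmabuf's internal field names
are approximated here, not quoted verbatim from the driver):

	static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
	{
		struct vm_area_struct *vma = vmf->vma;
		struct udmabuf *ubuf = vma->vm_private_data;
		pgoff_t pgoff = vmf->pgoff;

		if (pgoff >= ubuf->pagecount)
			return VM_FAULT_SIGBUS;

		/* insert a raw pfn; no struct page refcounting here */
		return vmf_insert_pfn(vma, vmf->address,
				      page_to_pfn(ubuf->pages[pgoff]));
	}

	static const struct vm_operations_struct udmabuf_vm_ops = {
		.fault = udmabuf_vm_fault,
	};

	static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
	{
		struct udmabuf *ubuf = buf->priv;

		if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
			return -EINVAL;

		vma->vm_ops = &udmabuf_vm_ops;
		vma->vm_private_data = ubuf;
		vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
		return 0;
	}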
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vivek Kasireddy <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Dave Airlie <[email protected]>
Acked-by: Gerd Hoffmann <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
There is no !CONFIG_MMU version of vmf_insert_pfn():
arm-linux-gnueabi-ld: drivers/dma-buf/udmabuf.o: in function `udmabuf_vm_fault':
udmabuf.c:(.text+0xaa): undefined reference to `vmf_insert_pfn'
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Vivek Kasireddy <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Gerd Hoffmann <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
For drivers that would like to longterm-pin the folios associated with a
memfd, the memfd_pin_folios() API provides an option to not only pin the
folios via FOLL_PIN but also to check and migrate them if they reside in
the movable zone or a CMA block. This API currently works with memfds,
but it should work with any file that belongs to either shmemfs or
hugetlbfs. Files belonging to other filesystems are rejected for now.
The folios need to be located first before pinning them via FOLL_PIN. If
they are found in the page cache, they can be immediately pinned.
Otherwise, they need to be allocated using the filesystem specific APIs
and then pinned.
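A driver-side usage sketch (hedged: the surrounding helper is hypothetical;
see the actual memfd_pin_folios() declaration in this patch for the
authoritative signature):

	/* Pin the folios backing [start, end] of a memfd for long-term DMA. */
	static long pin_memfd_range(struct file *memfd, loff_t start, loff_t end,
				    struct folio **folios, unsigned int max_folios)
	{
		pgoff_t pgoff; /* page offset of the first pinned folio's start */
		long nr;

		nr = memfd_pin_folios(memfd, start, end, folios, max_folios, &pgoff);
		if (nr <= 0)
			return nr;

		/* folios[0..nr-1] are now FOLL_PIN pinned, guaranteed not to
		 * sit in the movable zone or a CMA block; release them later
		 * with unpin_folios(folios, nr) */
		return nr;
	}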
[[email protected]: improve the CONFIG_MMU=n situation, per SeongJae]
[[email protected]: return -EINVAL if the end offset is greater than the size of memfd]
Link: https://lkml.kernel.org/r/IA0PR11MB71850525CBC7D541CAB45DF1F8DB2@IA0PR11MB7185.namprd11.prod.outlook.com
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vivek Kasireddy <[email protected]>
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]> (v2)
Reviewed-by: David Hildenbrand <[email protected]> (v3)
Reviewed-by: Christoph Hellwig <[email protected]> (v6)
Acked-by: Dave Airlie <[email protected]>
Acked-by: Gerd Hoffmann <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This helper is the folio equivalent of check_and_migrate_movable_pages().
Therefore, all the rules that apply to check_and_migrate_movable_pages()
also apply to this one. Currently, this helper is only used by
memfd_pin_folios().
This patch also includes changes to rename and convert the internal
functions collect_longterm_unpinnable_pages() and
migrate_longterm_unpinnable_pages() to work on folios. As a result,
check_and_migrate_movable_pages() is now a wrapper around
check_and_migrate_movable_folios().
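The resulting wrapper relationship can be sketched like this (close to the
mm/gup.c shape, though hedged as illustrative rather than verbatim):

	static long check_and_migrate_movable_pages(unsigned long nr_pages,
						    struct page **pages)
	{
		struct folio **folios;
		long i, ret;

		folios = kmalloc_array(nr_pages, sizeof(*folios), GFP_KERNEL);
		if (!folios)
			return -ENOMEM;

		/* translate each page to its containing folio */
		for (i = 0; i < nr_pages; i++)
			folios[i] = page_folio(pages[i]);

		ret = check_and_migrate_movable_folios(nr_pages, folios);

		kfree(folios);
		return ret;
	}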
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vivek Kasireddy <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Dave Airlie <[email protected]>
Acked-by: Gerd Hoffmann <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm/gup: Introduce memfd_pin_folios() for pinning memfd
folios", v16.
Currently, some drivers (e.g., udmabuf) that want to longterm-pin the
pages/folios associated with a memfd do so by simply taking a reference on
them. This is not desirable because the pages/folios may reside in the
Movable zone or a CMA block.
Therefore, having drivers use the memfd_pin_folios() API ensures that the
folios are appropriately pinned via FOLL_PIN for longterm DMA.
This patchset also introduces a few helpers and converts the Udmabuf
driver to use folios and memfd_pin_folios() API to longterm-pin the folios
for DMA. Two new Udmabuf selftests are also included to test the driver
and the new API.
This patch (of 9):
These helpers are the folio versions of unpin_user_page/unpin_user_pages.
They are currently only useful for unpinning folios pinned by
memfd_pin_folios() or other associated routines. However, they could find
new uses in the future, when more and more folio-only helpers are added to
GUP.
We should probably sanity check the folio as part of unpin, similar to how
it is done in unpin_user_page/unpin_user_pages, but we cannot cleanly do
that at the moment without also checking the subpage. Therefore, sanity
checking needs to be added to these routines once we have a way to
determine whether any given folio is anon-exclusive (via a per-folio
AnonExclusive flag).
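The shape of the helpers (a hedged sketch; gup_put_folio() is GUP's
internal pin-dropping primitive):

	/* Release one FOLL_PIN reference on a folio pinned by, e.g.,
	 * memfd_pin_folios(); no sanity checks yet, as noted above. */
	void unpin_folio(struct folio *folio)
	{
		gup_put_folio(folio, 1, FOLL_PIN);
	}

	/* Release pins on an array of folios, coalescing consecutive
	 * entries for the same folio into one refcount update. */
	void unpin_folios(struct folio **folios, unsigned long nfolios)
	{
		unsigned long i = 0, j;

		while (i < nfolios) {
			for (j = i + 1; j < nfolios; j++)
				if (folios[i] != folios[j])
					break;
			if (folios[i])
				gup_put_folio(folios[i], j - i, FOLL_PIN);
			i = j;
		}
	}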
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vivek Kasireddy <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Dave Airlie <[email protected]>
Acked-by: Gerd Hoffmann <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Zswap uses 32 pools to work around the locking scalability problem in
zswap backends (mainly zsmalloc nowadays), which brings its own problems
like memory waste and more memory fragmentation.
Testing results show that we can get nearly the same performance with only
one pool in zswap after changing zsmalloc to use a per-size_class lock
instead of the pool spinlock.
Testing kernel build (make bzImage -j32) on tmpfs with memory.max=1GB, and
zswap shrinker enabled with 10GB swapfile on ext4.
                             real      user       sys
6.10.0-rc3                   138.18    1241.38    1452.73
6.10.0-rc3-onepool           149.45    1240.45    1844.69
6.10.0-rc3-onepool-perclass  138.23    1242.37    1469.71
Doing the same testing using zbud shows slightly worse performance, as
expected, since we don't do any locking optimization for zbud. I think
that's acceptable since zsmalloc has become a lot more popular than other
backends, and we may want to support only zsmalloc in the future.
                             real      user       sys
6.10.0-rc3-zbud              138.23    1239.58    1430.09
6.10.0-rc3-onepool-zbud      139.64    1241.37    1516.59
[[email protected]: fix error handling in zswap_pool_create(), per Dan Carpenter]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix error handling again in zswap_pool_create(), per Yosry]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chengming Zhou <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm/zsmalloc: change back to per-size_class lock, v2".
Commit c0547d0b6a4b ("zsmalloc: consolidate zs_pool's migrate_lock and
size_class's locks") changed per-size_class lock to pool spinlock to
prepare reclaim support in zsmalloc. Then reclaim support in zsmalloc had
been dropped in favor of LRU reclaim in zswap, but this locking change had
been left there.
Obviously, the scalability of the pool spinlock is worse than that of
per-size_class locking. And we have a workaround of using 32 pools in
zswap to avoid this scalability problem, which brings its own problems
like memory waste and more memory fragmentation.
So this series changes back to the per-size_class lock and uses testing
data from a heavily stressed situation to verify that we can use only one
pool in zswap. Note that we only test and care about the zsmalloc
backend, which makes sense now since zsmalloc has become a lot more
popular than other backends.
Testing kernel build (make bzImage -j32) on tmpfs with memory.max=1GB, and
zswap shrinker enabled with 10GB swapfile on ext4.
                             real      user       sys
6.10.0-rc3                   138.18    1241.38    1452.73
6.10.0-rc3-onepool           149.45    1240.45    1844.69
6.10.0-rc3-onepool-perclass  138.23    1242.37    1469.71
We can see from "sys" column that per-size_class locking with only one
pool in zswap can have near performance with the current 32 pools.
This patch (of 2):
This patch is almost a revert of commit c0547d0b6a4b ("zsmalloc:
consolidate zs_pool's migrate_lock and size_class's locks"), which changed
to a global pool->lock instead of the per-size_class lock and
pool->migrate_lock as preparation for supporting reclaim in zsmalloc.
Reclaim in zsmalloc has since been dropped in favor of LRU reclaim in
zswap.
In theory, per-size_class locking is more fine-grained than the
pool->lock, since a pool can have many size_classes. As for the
additional pool->migrate_lock, only free() and map() need to grab it, to
access a stable handle and get the zspage, and only in read lock mode.
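Conceptually, the locking moves from the pool into each class; an
illustrative sketch (constants and fields simplified, not the verbatim
zsmalloc structures):

	#define ZS_SIZE_CLASSES     255 /* illustrative count */
	#define NR_FULLNESS_GROUPS    4 /* illustrative count */

	struct size_class {
		spinlock_t lock;	/* protects only this class's zspage lists */
		struct list_head fullness_list[NR_FULLNESS_GROUPS];
		int size;		/* object size served by this class */
	};

	struct zs_pool {
		struct size_class *size_class[ZS_SIZE_CLASSES];
		rwlock_t migrate_lock;	/* read-held by free()/map() so that
					 * handle -> zspage lookups stay stable */
	};

With this layout, allocations and frees that hit different size classes
contend on different locks, which is where the "sys" time reduction in the
table above comes from.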
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chengming Zhou <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
During conflict resolution a line was unintentionally removed by a ksm.c
patch.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: ac90c56bbd73 ("mm/ksm: refactor out try_to_merge_with_zero_page()")
Reported-by: Hugh Dickins <[email protected]>
Cc: Aristeu Rozanski <[email protected]>
Cc: Chengming Zhou <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
If CONFIG_ZSWAP is set to N, it means zswap cannot be enabled.
zswap_never_enabled() should return true.
The only effect of this issue is that, with Barry's latest large folio
swapin patches for zram ("mm: support mTHP swap-in for zRAM-like
swapfile"), we will always fall back to order-0 swapin, mistakenly even
when !CONFIG_ZSWAP.
Basically this bug makes Barry's in-progress patches not work at all.
The API was created to inform the mm core that zswap has never been
enabled, allowing the mm core to perform mTHP swap-in. This is a
transitional solution until zswap supports mTHP. If zswap has been
enabled, performing mTHP swap-in will result in corrupted data. You
may find the answer in the mTHP swap-in series:
https://lore.kernel.org/linux-mm/CAJD7tkZ4FQr6HZpduOdvmqgg_-whuZYE-Bz5O2t6yzw6Yg+v1A@mail.gmail.com/
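The essence of the fix, as a sketch of the include/linux/zswap.h stubs
(hedged: the CONFIG_ZSWAP branch and static key name follow the
zswap_never_enabled() introduction and may differ in detail):

	#ifdef CONFIG_ZSWAP
	static inline bool zswap_never_enabled(void)
	{
		return !static_branch_maybe(CONFIG_ZSWAP_DEFAULT_ON,
					    &zswap_ever_enabled);
	}
	#else
	static inline bool zswap_never_enabled(void)
	{
		return true;	/* the stub mistakenly returned false before */
	}
	#endif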
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0300e17d67c3 ("mm: zswap: add zswap_never_enabled()")
Signed-off-by: Barry Song <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Acked-by: Chris Li <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The optimization of list_empty(&folio->_deferred_list) aimed to prevent
increasing the PTL duration when a large folio is partially unmapped, for
example, from subpage 0 to subpage (nr - 2).
But Ryan's commit 5ed890ce5147 ("mm: vmscan: avoid split during
shrink_folio_list()") actually splits this kind of large folio. This
makes the "optimization" useless.
Additionally, the list_empty() technically required a data_race()
annotation.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Barry Song <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The prefetchw() was introduced by an ancient patch[1].
The change log says:
The basic idea is to free higher order pages instead of going
through every single one. Also, some unnecessary atomic operations
are done away with and replaced with non-atomic equivalents, and
prefetching is done where it helps the most. For a more in-depth
discusion of this patch, please see the linux-ia64 archives (topic
is "free bootmem feedback patch").
So there are several changes improving the bootmem freeing, of which the
most basic idea is freeing higher order pages. And as Matthew says,
"Itanium CPUs of this era had no prefetchers."
I did 10 rounds of bootup tests before and after this change; the data
doesn't show that prefetchw() helps speed up bootmem freeing. The sum of
the 10 rounds' bootmem freeing time after prefetchw() removal was even
5.2% faster than before.
[1]: https://lore.kernel.org/linux-ia64/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Reviewed-by: Matthew Wilcox <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The functions set_max_threads() and task_struct_whitelist() are only used
by fork_init() during bootup.
Let's add __init tag to them.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Suggested-by: Oleg Nesterov <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Since we plan to move the accounting into __free_pages_core(),
totalram_pages may not represent the total usable pages on the system at
this point when defer_init is enabled.
Instead, we can get the total usable pages from memblock directly.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
CONFIG_MEMCG_KMEM used to be a user-visible option for whether slab
tracking is enabled. It has been default-enabled and equivalent to
CONFIG_MEMCG for almost a decade. We've only grown more kernel memory
accounting sites since, and there is no imaginable cgroup usecase going
forward that wants to track user pages but not the multitude of
user-drivable kernel allocations.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Memcg v1-specific fields serve a buffering function between the
read-mostly and update-often parts of the mem_cgroup_per_node structure.
If CONFIG_MEMCG_V1 is not set and these fields are not present, explicit
cacheline padding is needed.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Suggested-by: Shakeel Butt <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
After the grouping of the cgroup v1-related fields and the corresponding
reorganization of struct mem_cgroup, the existing cache line padding
doesn't make much sense anymore. Let's drop it for now and put it back in
new places, if necessary.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Suggested-by: Shakeel Butt <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Readers of DAMON subsystem documents index would want to further learn how
they can use DAMON from the user-space. Add the link to the admin guide.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
DAMON subsystem documents index page provides a short intro of DAMON core
concepts. Add links to sections of the design document to let users
easily browse to the details.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Readers of the design document would wonder how they can configure and use
specific DAMON features. Add links to sections of DAMON sysfs interface
usage document that provides the answers for easier browsing.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The 'Programmable Modules' section provides high-level descriptions of the
DAMON API-based kernel modules layer. But the 'Modules' section, which is
at the end of the document, provides every detail about the layer,
including everything in the 'Programmable Modules' section.
Since the brief summary of the layers at the beginning of the document has
a link to the 'Modules' section, browsing to that section is not
difficult. Remove the 'Programmable Modules' section in favor of the
'Modules' section, reducing duplication.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The 'Configurable Operations Set' section provides a description of the
pluggability of the operations set layer. Just after it follows the
'Operations Set Layer' section, which is dedicated to the entire layer.
The layout is odd, and some descriptions are duplicated. Move the
'Configurable Operations Set' section into 'Operations Set Layer' and
rewrite some of the detailed descriptions.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
DAMON design document briefly explains the overall layers architecture
first, and then provides detailed explanations of each layer with
dedicated sections. Letting readers go directly to the detailed sections
for specific layers could help easy browsing of the not-very-short
document. Add links from the overall summary to the sections of details.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The DAMON user-space tool (damo) provides an access pattern snapshot
feature, which is expected to be frequently used for real-time access
pattern analysis. The snapshot output also shows what DAMON provides on
its own, including the 'age' information.
In contrast, the recorded access patterns, shown as an example usage in
the quick start section, show what users can build from what DAMON
provides. They include information generated outside of DAMON and make
the 'age' concept a bit unclear. Hence the snapshot output is easier for
understanding the raw real-time output of DAMON. Add a snapshot usage
example to the quick start section.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The DAMON design document does not explain how the min_nr_regions limit is
kept, or what happens if the number of regions exceeds max_nr_regions.
Add clarification for those.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Docs/damon: minor fixups and improvements".
Fixup typos, clarify regions merging operation design with recent change,
add access pattern snapshot example use case, and improve readability of
the design document and subsystem documents index by
reorganizing/wordsmithing and adding links to other sections and/or
documents for easy browsing.
This patch (of 9):
Fix two typos. The first one is just a simple typo: s/accurach/accuracy/
The second one was made by the author being out of their mind. The
'Region Based Sampling' section of the doc mistakenly calls the access
frequency counter of a region 'nr_regions'. Fix it with the correct name,
'nr_accesses'.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Commit 19eaf44954df ("mm: thp: support allocation of anonymous multi-size
THP") added mTHP support for anonymous shmem. We can configure different
policies through the multi-size THP sysfs interface for anonymous shmem.
But when we configure the "advise" policy via
/sys/kernel/mm/transparent_hugepage/hugepages-xxxkB/shmem_enabled, we
cannot write "advise"; we have to write "madvise" instead, which is
unreasonable. We should keep the output and input values consistent,
which is more convenient for users.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 61a57f1b1da9 ("mm: shmem: add multi-size THP sysfs interface for anonymous shmem")
Signed-off-by: Bang Li <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Cc: Bang Li <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Centralize the _GNU_SOURCE definition to CFLAGS in lib.mk. Remove
redundant defines from Makefiles that import lib.mk. Convert any usage of
"#define _GNU_SOURCE 1" to "#define _GNU_SOURCE".
This uses the form "-D_GNU_SOURCE=", which is equivalent to
"#define _GNU_SOURCE".
Otherwise using "-D_GNU_SOURCE" is equivalent to "-D_GNU_SOURCE=1" and
"#define _GNU_SOURCE 1", which is less commonly seen in source code and
would require many changes in selftests to avoid redefinition warnings.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Edward Liaw <[email protected]>
Suggested-by: John Hubbard <[email protected]>
Acked-by: Shuah Khan <[email protected]>
Reviewed-by: Muhammad Usama Anjum <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: André Almeida <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Jarkko Sakkinen <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Reinette Chatre <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Both Ryan and Chris have been utilizing the small test program to aid in
debugging and identifying issues with swap entry allocation. While a real
or intricate workload might be more suitable for assessing the correctness
and effectiveness of the swap allocation policy, a small test program
presents a simpler means of understanding the problem and initially
verifying the improvements being made.
Let's endeavor to integrate it into tools/mm. Although it presently only
accommodates 64KB and 4KB, I'm optimistic that we can expand its
capabilities to support multiple sizes and simulate more complex systems
in the future as required.
Basically, we have
1. Use MADV_PAGEOUT for rapid swap-out, putting the swap allocation
code under high exercise in a short time (see the sketch after this
description).
2. Use MADV_DONTNEED to simulate the behavior of libc and Java heap in
freeing memory, as well as for munmap, app exits, or OOM killer
scenarios. This ensures new mTHP is always generated, released or
swapped out, similar to the behavior on a PC or Android phone where
many applications are frequently started and terminated.
3. Swap in with or without the "-a" option to observe how fragments
due to swap-in and the incoming swap-in of large folios will impact
swap-out fallback.
Due to 2, we ensure a certain proportion of mTHP. Similarly, because of
3, we maintain a certain proportion of small folios, as we don't support
large folios swap-in, meaning any swap-in will immediately result in small
folios. Therefore, with both 2 and 3, we automatically achieve a system
containing both mTHP and small folios. Additionally, 1 provides the
ability to continuously swap them out.
We can also use "-s" to add a dedicated small folios memory area.
[[email protected]: thp_swap_allocator_test.c needs mman.h, per Kairui Song]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Barry Song <[email protected]>
Acked-by: Chris Li <[email protected]>
Tested-by: Chris Li <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Tested-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Kalesh Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
folio_migrate_copy() is just a wrapper around folio_copy() and
folio_migrate_flags(). It is simple, and only aio uses it for now, so
unfold it and remove folio_migrate_copy().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Jane Chu <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Similar to __migrate_folio(), use folio_mc_copy() in HugeTLB folio
migration to avoid a panic when copying from a poisoned folio.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Folio migration is widely used in the kernel: memory compaction, memory
hotplug, soft offlining of pages, NUMA balancing, memory
demotion/promotion, etc. But if a poisoned source folio is accessed while
migrating, the kernel will panic.
There is a mechanism in the kernel to recover from uncorrectable memory
errors, ARCH_HAS_COPY_MC, which is already used in other core-mm paths,
e.g., CoW, khugepaged, coredump, and ksm copy; see the
copy_mc_to_{user,kernel} and copy_mc_{user_}highpage callers.
In order to support poisoned folio copy recovery in folio migration, we
chose to make folio migration tolerant of memory failures and return an
error from folio migration; since folio migration has no guarantee of
success anyway, this avoids panics like the one shown below.
CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0
pc : copy_page+0x10/0xc0
lr : copy_highpage+0x38/0x50
...
Call trace:
copy_page+0x10/0xc0
folio_copy+0x78/0x90
migrate_folio_extra+0x54/0xa0
move_to_new_folio+0xd8/0x1f0
migrate_folio_move+0xb8/0x300
migrate_pages_batch+0x528/0x788
migrate_pages_sync+0x8c/0x258
migrate_pages+0x440/0x528
soft_offline_in_use_page+0x2ec/0x3c0
soft_offline_page+0x238/0x310
soft_offline_page_store+0x6c/0xc0
dev_attr_store+0x20/0x40
sysfs_kf_write+0x4c/0x68
kernfs_fop_write_iter+0x130/0x1c8
new_sync_write+0xa4/0x138
vfs_write+0x238/0x2d8
ksys_write+0x74/0x110
Note: the folio copy is moved to the beginning of __migrate_folio(), which
simplifies the error handling since there is no turning back once
folio_migrate_mapping() returns success. The downside is that the folio
is copied even when folio_migrate_mapping() fails; an optimization is to
check that the source folio does not have extra refs before we do the
folio copy.
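The reordered flow can be sketched like this (simplified; the real code
also handles folio private data and further metadata details):

	static int __migrate_folio(struct address_space *mapping,
				   struct folio *dst, struct folio *src,
				   enum migrate_mode mode)
	{
		int rc, expected_count = folio_expected_refs(mapping, src);

		/* cheap ref check up front, so we usually don't copy in vain */
		if (folio_ref_count(src) != expected_count)
			return -EAGAIN;

		rc = folio_mc_copy(dst, src);	/* may return -EHWPOISON */
		if (unlikely(rc))
			return rc;

		rc = folio_migrate_mapping(mapping, dst, src, 0);
		if (rc != MIGRATEPAGE_SUCCESS)
			return rc;	/* no turning back after this succeeds */

		folio_migrate_flags(dst, src);
		return MIGRATEPAGE_SUCCESS;
	}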
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The folio refcount check is moved out for both the !mapping and mapping
cases; also update the comments in folio_migrate_mapping() from page to
folio.
No functional change intended.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add a #MC variant of folio_copy() which uses copy_mc_highpage() to support
#MC handling during folio copy; it will be used in folio migration soon.
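The helper's shape (a sketch close to the proposed implementation):

	/* Like folio_copy(), but bail out with -EHWPOISON if a machine
	 * check fires while copying any page of the source folio. */
	int folio_mc_copy(struct folio *dst, struct folio *src)
	{
		long nr = folio_nr_pages(src);
		long i = 0;

		for (;;) {
			if (copy_mc_highpage(folio_page(dst, i),
					     folio_page(src, i)))
				return -EHWPOISON;
			if (++i == nr)
				break;
			cond_resched();	/* large folios can span many pages */
		}
		return 0;
	}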
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Jane Chu <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm: migrate: support poison recover from migrate folio", v5.
Folio migration is widely used in the kernel: memory compaction, memory
hotplug, soft offlining of pages, NUMA balancing, memory
demotion/promotion, etc. But if a poisoned source folio is accessed while
migrating, the kernel will panic.
There is a mechanism in the kernel to recover from uncorrectable memory
errors, ARCH_HAS_COPY_MC (e.g., Machine Check Safe Memory Copy on x86),
which is already used in NVDIMM and core-mm paths (e.g., CoW, khugepaged,
coredump, ksm copy); see the copy_mc_to_{user,kernel} and
copy_mc_{user_}highpage callers.
This series of patches provides a recovery mechanism from folio copy for
the widely used folio migration. Please note that because folio migration
has no guarantee of success, we can choose to make folio migration
tolerant of memory failures by adding folio_mc_copy(), a #MC version of
folio_copy(). On accessing a poisoned source folio, we return an error
and make the folio migration fail, which avoids panics like the one shown
below.
CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0
pc : copy_page+0x10/0xc0
lr : copy_highpage+0x38/0x50
...
Call trace:
copy_page+0x10/0xc0
folio_copy+0x78/0x90
migrate_folio_extra+0x54/0xa0
move_to_new_folio+0xd8/0x1f0
migrate_folio_move+0xb8/0x300
migrate_pages_batch+0x528/0x788
migrate_pages_sync+0x8c/0x258
migrate_pages+0x440/0x528
soft_offline_in_use_page+0x2ec/0x3c0
soft_offline_page+0x238/0x310
soft_offline_page_store+0x6c/0xc0
dev_attr_store+0x20/0x40
sysfs_kf_write+0x4c/0x68
kernfs_fop_write_iter+0x130/0x1c8
new_sync_write+0xa4/0x138
vfs_write+0x238/0x2d8
ksys_write+0x74/0x110
This patch (of 5):
There is a memory_failure_queue() call after copy_mc_[user]_highpage() in
its callers (e.g. the CoW and KSM page copies); it is used to mark the
source page as hardware poisoned and unmap it from other tasks. The
upcoming poison recovery for folio migration will need to do the same
thing, so move the memory_failure_queue() call into
copy_mc_[user]_highpage() itself instead of duplicating it in each caller.
This should also improve the handling of poisoned pages in khugepaged.
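A hedged sketch of the resulting helper, simplified from the generic
highmem.h version (the kmsan annotation and the fallback when
copy_mc_to_kernel() is unavailable are omitted):

    static inline int copy_mc_user_highpage(struct page *to, struct page *from,
            unsigned long vaddr, struct vm_area_struct *vma)
    {
        char *vfrom = kmap_local_page(from);
        char *vto = kmap_local_page(to);
        unsigned long ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);

        kunmap_local(vto);
        kunmap_local(vfrom);

        /* report the poisoned source here, once, for all callers */
        if (ret)
            memory_failure_queue(page_to_pfn(from), 0);
        return ret ? -EHWPOISON : 0;
    }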
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Jane Chu <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
crashes from deferred split racing folio migration", needed by "mm:
migrate: split folio_migrate_mapping()".
|
|
Now working at Oracle.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lorenzo Stoakes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
flags when freeing, yet the flags shown are not bad: PG_locked had been
set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
symptoms implying double free by deferred split and large folio migration.
6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large
folio migration") was right to fix the memcg-dependent locking broken in
85ce2c517ade ("memcontrol: only transfer the memcg data for migration"),
but missed a subtlety of deferred_split_scan(): it moves folios to its own
local list to work on them without split_queue_lock, during which time
folio->_deferred_list is not empty, but even the "right" lock does nothing
to secure the folio and the list it is on.
Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
folio_migrate_mapping() can avoid the race by calling
folio_undo_large_rmappable() while the old folio's reference count is
temporarily frozen to 0 - adding such a freeze in the !mapping case too
(originally, folio lock and unmapping and no swap cache left an anon folio
unreachable, so no freezing was needed there: but the deferred split queue
offers a way to reach it).
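A hedged sketch of the !mapping side of that freeze (names as in 6.10-era
mm/migrate.c; expected_count stands for the reference count migration
already expects):

    if (!mapping) {
        /* anonymous folio without mapping */
        if (folio_ref_count(folio) != expected_count)
            return -EAGAIN;

        /* take it off the deferred split queue while frozen */
        if (folio_test_large(folio) && folio_test_large_rmappable(folio)) {
            if (!folio_ref_freeze(folio, expected_count))
                return -EAGAIN;    /* a racing folio_try_get() won */
            folio_undo_large_rmappable(folio);
            folio_ref_unfreeze(folio, expected_count);
        }
        /* mapping-less migration continues as before */
    }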
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration")
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
On a system with Perl 5.12.1, commit 5ef6dc08cfde
("lib/build_OID_registry: don't mention the full path of the script in
output") causes the build to fail with the error below.
Bareword found where operator expected at ./lib/build_OID_registry line 41, near "s#^\Q$abs_srctree/\E##r"
syntax error at ./lib/build_OID_registry line 41, near "s#^\Q$abs_srctree/\E##r"
Execution of ./lib/build_OID_registry aborted due to compilation errors.
make[3]: *** [lib/Makefile:352: lib/oid_registry_data.c] Error 255
Ahmad Fatoum analyzed that the non-destructive substitution modifier `r`
is only supported since Perl 5.13.2. Instead of dropping `r` and thereby
having the side effect of modifying `$0`, introduce a dedicated variable to
support older Perl versions.
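A hedged sketch of the described fix (the variable name is illustrative,
not necessarily the one used in the patch):

    # /r (non-destructive substitution) needs Perl >= 5.13.2, so copy
    # $0 into a dedicated variable and modify the copy in place instead.
    my $script_name = $0;
    $script_name =~ s#^\Q$abs_srctree/\E##;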
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 5ef6dc08cfde ("lib/build_OID_registry: don't mention the full path of the script in output")
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Paul Menzel <[email protected]>
Suggested-by: Ahmad Fatoum <[email protected]>
Cc: Uwe Kleine-König <[email protected]>
Cc: Nicolas Schier <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Ahmad Fatoum <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
A kernel warning was reported when pinning a folio in CMA memory while
launching an SEV virtual machine. The splat looks like:
[ 464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520
[ 464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6
[ 464.325477] RIP: 0010:__get_user_pages+0x423/0x520
[ 464.325515] Call Trace:
[ 464.325520] <TASK>
[ 464.325523] ? __get_user_pages+0x423/0x520
[ 464.325528] ? __warn+0x81/0x130
[ 464.325536] ? __get_user_pages+0x423/0x520
[ 464.325541] ? report_bug+0x171/0x1a0
[ 464.325549] ? handle_bug+0x3c/0x70
[ 464.325554] ? exc_invalid_op+0x17/0x70
[ 464.325558] ? asm_exc_invalid_op+0x1a/0x20
[ 464.325567] ? __get_user_pages+0x423/0x520
[ 464.325575] __gup_longterm_locked+0x212/0x7a0
[ 464.325583] internal_get_user_pages_fast+0xfb/0x190
[ 464.325590] pin_user_pages_fast+0x47/0x60
[ 464.325598] sev_pin_memory+0xca/0x170 [kvm_amd]
[ 464.325616] sev_mem_enc_register_region+0x81/0x130 [kvm_amd]
Per the analysis done by yangge, when starting the SEV virtual machine, it
will call pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin the memory.
But the page is in a CMA area, so fast GUP will fail and fall back to the
slow path due to the longterm pinnable check in try_grab_folio().
The slow path will try to pin the pages, then migrate them out of the CMA
area. But the slow path also uses try_grab_folio() to pin the page; it
fails due to the same check, and the above warning is then triggered.
In addition, try_grab_folio() is supposed to be used on the fast path, so
it elevates the folio refcount with an add-ref-unless-zero operation. On
the slow path we are guaranteed to have at least one stable reference, so
a simple atomic add could be used instead. The performance difference
should be trivial, but the misuse is confusing and misleading.
Rename try_grab_folio() to try_grab_folio_fast() and try_grab_page() to
try_grab_folio(), and use each on the proper path. This solves both the
misuse and the kernel warning.
The proper naming makes their use cases clearer and should prevent misuse
in the future.
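A hedged sketch of the refcount strategy the rename reflects (simplified
to the single-reference forms; the real helpers take a refs count and FOLL
flags):

    /* fast path, now try_grab_folio_fast(): the folio may be freed
     * under us, so only take a reference if the count is non-zero. */
    if (!folio_try_get(folio))
        return NULL;

    /* slow path, now try_grab_folio(): the caller already holds a
     * stable reference, so a plain atomic add is safe and cheaper. */
    folio_ref_add(folio, refs);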
peterx said:
: The user will see the pin fail; for gup-slow it further triggers the WARN
: right below that failure (as in the original report):
:
: folio = try_grab_folio(page, page_increm - 1,
: foll_flags);
: if (WARN_ON_ONCE(!folio)) { <------------------------ here
: /*
: * Release the 1st page ref if the
: * folio is problematic, fail hard.
: */
: gup_put_folio(page_folio(page), 1,
: foll_flags);
: ret = -EFAULT;
: goto out;
: }
[1] https://lore.kernel.org/linux-mm/[email protected]/
[[email protected]: fix implicit declaration of function try_grab_folio_fast]
Link: https://lkml.kernel.org/r/CAHbLzkowMSso-4Nufc9hcMehQsK9PNz3OSu-+eniU-2Mm-xjhA@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"")
Signed-off-by: Yang Shi <[email protected]>
Reported-by: yangge <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: <[email protected]> [6.6+]
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add documentation for soft offline behaviors / costs, and what the new
enable_soft_offline sysctl is for.
[[email protected]: fix kerneldoc warnings]
Link: https://lkml.kernel.org/r/CACw3F52=GxTCDw-PqFh3-GDM-fo3GbhGdu0hedxYXOTT4TQSTg@mail.gmail.com
[[email protected]: there are more blank lines needed]
Link: https://lkml.kernel.org/r/CACw3F52_obAB742XeDRNun4BHBYtrxtbvp5NkUincXdaob0j1g@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiaqi Yan <[email protected]>
Acked-by: Oscar Salvador <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Frank van der Linden <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add regression and new tests for when a hugepage has correctable memory
errors, covering how userspace may want to deal with it:
* if enable_soft_offline=1, the mapped hugepage is soft offlined
* if enable_soft_offline=0, the mapped hugepage is left intact
The free hugepage case is not explicitly covered by the tests.
A hugepage having corrected memory errors is emulated with
MADV_SOFT_OFFLINE.
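A hedged sketch of that emulation step (the selftest's actual helpers,
sizes, and error handling may differ; MADV_SOFT_OFFLINE requires
CAP_SYS_ADMIN and a reserved hugepage):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 2UL << 20;    /* one 2M hugepage */
        char *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB, -1, 0);

        if (addr == MAP_FAILED)
            return 1;
        memset(addr, 0, len);    /* fault the hugepage in */
        if (madvise(addr, 4096, MADV_SOFT_OFFLINE))
            /* with enable_soft_offline=0, EOPNOTSUPP is expected */
            printf("not offlined: %s\n", strerror(errno));
        else
            printf("hugepage was soft offlined\n");
        return 0;
    }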
[[email protected]: v7]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiaqi Yan <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Frank van der Linden <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Correctable memory errors are very common on servers with large amounts of
memory, and are corrected by ECC. Soft offline is the kernel's additional
recovery handling for memory pages having (excessive) corrected memory
errors. The impacted page is migrated to a healthy page if it is in use;
the original page is then discarded from any future use.
The actual policy on whether (and when) to soft offline should be
maintained by userspace, especially in the case of a 1G HugeTLB page.
Soft-offline dissolves the HugeTLB page, whether in use or free, into
chunks of 4K pages, reducing HugeTLB pool capacity by one hugepage. If
userspace has not acknowledged such behavior, it may be surprised when a
later mmap of hugepages fails due to lack of hugepages. In the case of a
transparent hugepage, it will be split into 4K pages as well; userspace
will stop enjoying the transparent performance.
In addition, discarding an entire 1G HugeTLB page only because of
corrected memory errors sounds very costly, and the kernel had better not
do it under the hood. But today there are at least 2 such cases doing so:
1. when the GHES driver sees both GHES_SEV_CORRECTED and
CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER.
2. when the RAS Correctable Errors Collector counts correctable errors
per PFN and the counter for a PFN reaches its threshold.
In both cases, userspace has no control of the soft offline performed by
the kernel's memory failure recovery.
This commit gives userspace control over soft-offlining any page: the
kernel only soft offlines a raw page / transparent hugepage / HugeTLB
hugepage if userspace has agreed to it. The interface to userspace is a
new sysctl at /proc/sys/vm/enable_soft_offline. By default its value is
set to 1 to preserve existing kernel behavior. When set to 0, soft-offline
(e.g. MADV_SOFT_OFFLINE) will fail with EOPNOTSUPP.
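A hedged sketch of where such a gate would sit in mm/memory-failure.c (the
sysctl's backing variable name is an assumption):

    int soft_offline_page(unsigned long pfn, int flags)
    {
        /* refuse to soft offline unless userspace has opted in */
        if (!sysctl_enable_soft_offline) {
            pr_info_once("%#lx: soft offline is disabled\n", pfn);
            return -EOPNOTSUPP;
        }
        /* ... existing soft-offline logic ... */
    }

Userspace would opt out with, for example:
    echo 0 > /proc/sys/vm/enable_soft_offline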
[[email protected]: v7]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiaqi Yan <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Frank van der Linden <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Userspace controls soft-offline pages", v6.
Correctable memory errors are very common on servers with large amounts of
memory, and are corrected by ECC, but they come with two pain points for
users:
1. Correction usually happens on the fly and adds latency overhead.
2. A not-fully-proven theory states that excessive correctable memory
errors can develop into an uncorrectable memory error.
Soft offline is the kernel's additional solution for memory pages having
(excessive) corrected memory errors. The impacted page is migrated to a
healthy page if it is in use; the original page is then discarded from any
future use.
The actual policy on whether (and when) to soft offline should be
maintained by userspace, especially in the case of a 1G HugeTLB page.
Soft-offline dissolves the HugeTLB page, whether in use or free, into
chunks of 4K pages, reducing HugeTLB pool capacity by one hugepage. If
userspace has not acknowledged such behavior, it may be surprised when a
later mmap of hugepages fails with MAP_FAILED due to lack of hugepages.
In the case of a transparent hugepage, it will be split into 4K pages as
well; userspace will stop enjoying the transparent performance.
In addition, discarding an entire 1G HugeTLB page only because of
corrected memory errors sounds very costly, and the kernel had better not
do it under the hood. But today there are at least 2 such cases:
1. the GHES driver sees both GHES_SEV_CORRECTED and
CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER.
2. the RAS Correctable Errors Collector counts correctable errors per
PFN, and the counter for a PFN reaches its threshold.
In both cases, userspace has no control of the soft offline performed by
the kernel's memory failure recovery.
This patch series gives userspace control over soft-offlining any page:
the kernel only soft offlines a raw page / transparent hugepage / HugeTLB
hugepage if userspace has agreed to it. The interface to userspace is a
new sysctl called enable_soft_offline under /proc/sys/vm. By default
enable_soft_offline is 1 to preserve existing kernel behavior.
This patch (of 4):
Logs from soft_offline_page and soft_offline_in_use_page have a different
format from the majority of the memory failure code:
"Memory failure: 0x${pfn}: ${lower_case_message}"
Convert them to the following format:
"Soft offline: 0x${pfn}: ${lower_case_message}"
No functional change in this commit.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiaqi Yan <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Reviewed-by: Lance Yang <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Frank van der Linden <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Currently a WARN_ON_ONCE() is issued if the seq_buf overflows when a user
reads memory.stat. The only advantage of WARN_ON_ONCE is that the splat is
so verbose that it gets noticed; the downside is that it panics the system
if panic_on_warn is enabled. The warning seems like an overreaction, and a
simple pr_warn() achieves a similar effect.
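A hedged sketch of the change; the surrounding function is assumed to be
memcg's stat-formatting code, with s the seq_buf being filled:

    /* before: */
    WARN_ON_ONCE(seq_buf_has_overflowed(&s));

    /* after: overflow is still reported, but without the splat or the
     * panic_on_warn side effect */
    if (seq_buf_has_overflowed(&s))
        pr_warn("%s: Warning, stat buffer overflow, please report\n",
                __func__);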
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Xiu Jianfeng <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|