Age | Commit message | Author | Files | Lines
2024-07-12  fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps  (Andrii Nakryiko; 2 files changed, -1/+364)
/proc/<pid>/maps file is extremely useful in practice for various tasks involving figuring out process memory layout, what files are backing any given memory range, etc. One important class of applications that absolutely rely on this are profilers/stack symbolizers (the perf tool being one of them).

Patterns of use differ, but they generally fall into two categories. In the on-demand pattern, a profiler/symbolizer would normally capture a stack trace containing absolute memory addresses of some functions, and would then use the /proc/<pid>/maps file to find the corresponding backing ELF files (normally, only executable VMAs are of interest), file offsets within them, and then continue from there to get yet more information (ELF symbols, DWARF information) to get human-readable symbolic information. This pattern is used by Meta's fleet-wide profiler, as one example.

In the preprocessing pattern, the application doesn't know the set of addresses of interest, so it has to fetch all relevant VMAs (again, probably only executable ones), store or cache them, then proceed with profiling and stack trace capture. Once done, it would do symbolization based on the stored VMA information. This can happen at a much later point in time. This pattern is used by the perf tool, as an example.

In either case, there are both performance and correctness requirements involved. This address-to-VMA-information translation has to be done as efficiently as possible, but also not miss any VMA (especially in the case of loading/unloading shared libraries). In practice, correctness can't be guaranteed (due to the process dying before VMA data can be captured, or a shared library being unloaded, etc.), but any effort to maximize the chance of finding the VMA is appreciated.

Unfortunately, for all the /proc/<pid>/maps file's universality and usefulness, it doesn't fit the above use cases 100%. First, its main purpose is to emit all VMAs sequentially, but in practice captured addresses would fall into only a smaller subset of all the process' VMAs, mainly containing executable text. Yet, a library would need to parse most or all of the contents to find the needed VMAs, as there is no way to skip VMAs that are of no use. An efficient library can do this linear pass relatively cheaply, but it's definitely overhead that can be avoided, if there was a way to do more targeted querying of the relevant VMA information.

Second, it's a text-based interface, which makes its programmatic use from applications and libraries more cumbersome and inefficient due to the need to handle text parsing to get the necessary pieces of information. The overhead is actually paid both by the kernel, formatting originally binary VMA data into text, and then by the user space application, parsing it back into binary data for further use.

For the on-demand pattern of usage described above, another problem when writing a generic stack trace symbolization library is an unfortunate performance-vs-correctness tradeoff that needs to be made. The library has to decide whether to cache parsed contents of /proc/<pid>/maps (after initial processing) to service future requests (if the application asks to symbolize another set of addresses for the same process, captured at some later time, which is typical for periodic/continuous profiling cases) to avoid the higher costs of re-parsing this file, or to re-open and re-parse the file's contents on each symbolization request.
In the former case, more memory is used for the cache and there is a risk of getting stale data if the application loads or unloads shared libraries, or otherwise changes its set of VMAs somehow, e.g., through additional mmap() calls. In the latter case, it's the performance hit that comes from re-opening the file and re-parsing its contents all over again.

This patch aims to solve this problem by providing a new API built on top of /proc/<pid>/maps. It's meant to address both the non-selectiveness and the text nature of /proc/<pid>/maps, by giving the user more control over what sort of VMA(s) need to be queried, and by being a binary-based interface, eliminating the overhead of text formatting (on the kernel side) and parsing (on the user space side). It's also designed to be extensible and forward/backward compatible by including a required struct size field, which the user has to provide. We use the established copy_struct_from_user() approach to handle extensibility.

The user has a choice to pick either getting the VMA that covers the provided address, or -ENOENT if none is found (the exact, least surprising case). Or, with an extra query flag (PROCMAP_QUERY_COVERING_OR_NEXT_VMA), they can get either the VMA that covers the address (if there is one), or the closest next VMA (i.e., the VMA with the smallest vm_start > addr). The latter allows more efficient use, but, given it could be surprising behavior, requires an explicit opt-in.

There is another query flag that is useful for some use cases. PROCMAP_QUERY_FILE_BACKED_VMA instructs this API to only return file-backed VMAs. Combining this with PROCMAP_QUERY_COVERING_OR_NEXT_VMA makes it possible to efficiently iterate over only the file-backed VMAs of the process, which is what profilers/symbolizers are normally interested in.

All the above querying flags can be combined with an (also optional) set of desired VMA permission flags. This allows one to, for example, iterate over only the executable subset of VMAs, which is what the preprocessing pattern, used by the perf tool, would benefit from, as the assumption is that captured stack traces would have addresses of executable code. This saves time by skipping non-executable VMAs altogether, efficiently. All these querying flags (modifiers) are orthogonal and can be combined in a semantically meaningful and natural way.

Basing this ioctl()-based API on top of the /proc/<pid>/maps FD makes sense given it's querying the same set of VMA data. It's also beneficial because permission checks for /proc/<pid>/maps are performed once at open time, and the actual data reads of the text contents of /proc/<pid>/maps are done without further permission checks. We piggyback on this pattern with the ioctl()-based API as well, as that's a desired property, both for performance reasons and for security and flexibility reasons.

Allowing an application to open an FD for /proc/self/maps without any extra capabilities, and then pass it to some sort of profiling agent through a Unix-domain socket, would allow such a profiling agent to not require some of the capabilities that are otherwise expected when opening the /proc/<pid>/maps file for *another* process. This is a desirable property for some more restricted setups.

This new ioctl-based implementation doesn't interfere with the seq_file-based implementation of the /proc/<pid>/maps textual interface, and so they could be used together or independently without paying any price for that. Note also that fetching the VMA name (e.g., the backing file path, or special hard-coded or user-provided names) is optional, just like the build ID.
If the user sets vma_name_size to zero, kernel code won't attempt to retrieve it, saving resources.

Earlier versions of this patch set were adding per-VMA locking, which is why we have a code structure that is ready for abstracting mmap_lock vs vm_lock differences (query_vma_setup(), query_vma_teardown(), and query_vma_find_by_addr()), but given anon_vma_name() is not yet compatible with per-VMA locking, the initial implementation sticks to using only mmap_lock for now. It will be easy to add per-VMA locking back once all the pieces are ready later on, which is why we keep the existing code structure with setup/teardown/query helper functions.

[[email protected]: improve PROCMAP_QUERY's compat mode handling]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Liam R. Howlett <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
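As a rough illustration of how the new API is meant to be used from user space, here is a minimal single-address query sketch. The struct and flag names follow the commit text above; the exact field layout and the availability of PROCMAP_QUERY in <linux/fs.h> are assumptions to be checked against the merged UAPI header.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>   /* PROCMAP_QUERY, struct procmap_query (assumed) */

    /* Query the VMA covering 'addr'; returns -1 (errno set) if none. */
    static int query_covering_vma(int maps_fd, unsigned long addr)
    {
            struct procmap_query q;
            char name[256];

            memset(&q, 0, sizeof(q));
            q.size = sizeof(q);             /* required: enables compat handling */
            q.query_addr = addr;
            q.query_flags = 0;              /* exact covering VMA or -ENOENT */
            q.vma_name_addr = (__u64)(unsigned long)name;
            q.vma_name_size = sizeof(name); /* 0 would skip name retrieval */

            if (ioctl(maps_fd, PROCMAP_QUERY, &q) < 0)
                    return -1;

            printf("%llx-%llx %s\n", (unsigned long long)q.vma_start,
                   (unsigned long long)q.vma_end,
                   q.vma_name_size ? name : "");
            return 0;
    }

An FD from open("/proc/self/maps", O_RDONLY) is all this needs; as the commit message notes, the permission check happened once at open time.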
2024-07-12  fs/procfs: extract logic for getting VMA name constituents  (Andrii Nakryiko; 1 file changed, -54/+71)
Patch series "ioctl()-based API to query VMAs from /proc/<pid>/maps", v6. Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow applications to query VMA information more efficiently than reading *all* VMAs nonselectively through text-based interface of /proc/<pid>/maps file. Patch #2 goes into a lot of details and background on some common patterns of using /proc/<pid>/maps in the area of performance profiling and subsequent symbolization of captured stack traces. As mentioned in that patch, patterns of VMA querying can differ depending on specific use case, but can generally be grouped into two main categories: the need to query a small subset of VMAs covering a given batch of addresses, or reading/storing/caching all (typically, executable) VMAs upfront for later processing. The new PROCMAP_QUERY ioctl() API added in this patch set was motivated by the former pattern of usage. Earlier revisions had a patch adding a tool that faithfully reproduces an efficient VMA matching pass of a symbolizer, collecting a subset of covering VMAs for a given set of addresses as efficiently as possible. This tool served both as a testing ground, as well as a benchmarking tool. It implements everything both for currently existing text-based /proc/<pid>/maps interface, as well as for newly-added PROCMAP_QUERY ioctl(). This revision dropped the tool from the patch set and, once the API lands upstream, this tool might be added separately on Github as an example. Based on discussion on earlier revisions of this patch set, it turned out that this ioctl() API is competitive with highly-optimized text-based pre-processing pattern that perf tool is using. Based on perf discussion, this revision adds more flexibility in specifying a subset of VMAs that are of interest. Now it's possible to specify desired permissions of VMAs (e.g., request only executable ones) and/or restrict to only a subset of VMAs that have file backing. This further improves the efficiency when using this new API thanks to more selective (executable VMAs only) querying. In addition to a custom benchmarking tool, and experimental perf integration (available at [0]), Daniel Mueller has since also implemented an experimental integration into blazesym (see [1]), a library used for stack trace symbolization by our server fleet-wide profiler and another on-device profiler agent that runs on weaker ARM devices. The latter ARM-based device profiler is especially sensitive to performance, and so we benchmarked and compared text-based /proc/<pid>/maps solution to the equivalent one using PROCMAP_QUERY ioctl(). Results are very encouraging, giving us 5x improvement for end-to-end so-called "address normalization" pass, which is the part of the symbolization process that happens locally on ARM device, before being sent out for further heavier-weight processing on more powerful remote server. Note that this is not an artificial microbenchmark. It's a full end-to-end API call being measured with real-world data on real-world device. TEXT-BASED ========== Benchmarking main/normalize_process_no_build_ids_uncached_maps main/normalize_process_no_build_ids_uncached_maps time: [49.777 µs 49.982 µs 50.250 µs] IOCTL-BASED =========== Benchmarking main/normalize_process_no_build_ids_uncached_maps main/normalize_process_no_build_ids_uncached_maps time: [10.328 µs 10.391 µs 10.457 µs] change: [−79.453% −79.304% −79.166%] (p = 0.00 < 0.02) Performance has improved. 
You can see above the drop from 50µs down to 10µs for exactly the same amount of work, with the same data and target process.

With the aforementioned custom tool, we see about a ~40x improvement (it might vary a bit, depending on the specific captured set of addresses). And even for the perf-based benchmark it's on par or slightly ahead when using permission-based filtering (fetching only executable VMAs).

Earlier revisions attempted to use per-VMA locking, if the kernel was compiled with CONFIG_PER_VMA_LOCK=y, but it turned out that anon_vma_name() is not yet compatible with per-VMA locking and assumes mmap_lock to be taken, which makes the use of per-VMA locking for this API premature. It was agreed ([2]) to continue for now with just mmap_lock, but the code structure is such that it should be easy to add per-VMA locking support once all the pieces are ready.

One thing that did not change was basing this new API as an ioctl() command on the /proc/<pid>/maps file. An ioctl-based API on top of pidfd was considered, but has its own downsides. Implementing ioctl() directly on pidfd will cause access permission checks on every single ioctl(), which leads to performance concerns and potential spam of capable() audit messages. It also prevents a nice pattern, possible with /proc/<pid>/maps, in which an application opens a /proc/self/maps FD (requiring no additional capabilities) and passes this FD to a profiling agent for querying. To achieve a similar pattern, a new file would have to be created from a pidfd just for VMA querying, which is considered inferior to just querying a /proc/<pid>/maps FD as proposed in the current approach. These aspects were discussed in the hallway track at the recent LSF/MM/BPF 2024, and sticking to the procfs ioctl() was the final agreement we arrived at.

[0] https://github.com/anakryiko/linux/commits/procfs-proc-maps-ioctl-v2/
[1] https://github.com/libbpf/blazesym/pull/675
[2] https://lore.kernel.org/bpf/7rm3izyq2vjp5evdjc7c6z4crdd3oerpiknumdnmmemwyiwx7t@hleldw7iozi3/

This patch (of 6):

Extract the generic logic to fetch the relevant pieces of data that describe a VMA's name. This could be just some string (either a special constant or user-provided), or a string with some formatted wrapping text (e.g., "[anon_shmem:<something>]"), or, commonly, a file path. The seq_file-based logic has different methods to handle all three cases, but they are currently mixed in with extracting the underlying sources of data. This patch splits this into data fetching and data formatting, so that data fetching can be reused later on. There should be no functional changes.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Liam R. Howlett <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
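To illustrate the preprocessing pattern the cover letter describes, iterating all executable, file-backed VMAs with the new flags might look like the following sketch (flag names are taken from the text above; treat the details as assumptions until checked against the final UAPI):

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    /* Walk every executable, file-backed VMA of the process behind maps_fd. */
    static void walk_exec_file_vmas(int maps_fd)
    {
            struct procmap_query q;
            unsigned long addr = 0;

            for (;;) {
                    memset(&q, 0, sizeof(q));
                    q.size = sizeof(q);
                    q.query_addr = addr;
                    q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA |
                                    PROCMAP_QUERY_FILE_BACKED_VMA |
                                    PROCMAP_QUERY_VMA_EXECUTABLE;
                    if (ioctl(maps_fd, PROCMAP_QUERY, &q) < 0)
                            break;          /* -ENOENT: no more matching VMAs */

                    /* record q.vma_start, q.vma_end, q.vma_offset, ... */
                    addr = q.vma_end;       /* resume search past this VMA */
            }
    }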
2024-07-12  selftests/udmabuf: add tests to verify data after page migration  (Vivek Kasireddy; 1 file changed, -31/+183)
Since the memfd pages associated with a udmabuf may be migrated as part of udmabuf create, we need to verify the data coherency after successful migration. The new tests added in this patch try to do just that using 4k sized pages and also 2 MB sized huge pages for the memfd.

Successful completion of the tests would mean that there is no disconnect between the memfd pages and the ones associated with a udmabuf. And, these tests can also be augmented in the future to test newer udmabuf features (such as handling memfd hole punch).

The idea for these tests comes from a patch by Mike Kravetz here:
https://lists.freedesktop.org/archives/dri-devel/2023-June/410623.html

v1->v2: (suggestions from Shuah)
- Use ksft_* functions to print and capture results of tests
- Use appropriate KSFT_* status codes for exit()
- Add Mike Kravetz's suggested-by tag

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vivek Kasireddy <[email protected]>
Suggested-by: Mike Kravetz <[email protected]>
Acked-by: Dave Airlie <[email protected]>
Acked-by: Gerd Hoffmann <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  udmabuf: pin the pages using memfd_pin_folios() API  (Vivek Kasireddy; 1 file changed, -75/+80)
Using memfd_pin_folios() will ensure that the pages are pinned correctly using FOLL_PIN. And, this also ensures that we don't accidentally break features such as memory hotunplug, as it would not allow pinning pages in the movable zone.

Using this new API also simplifies the code, as we no longer have to deal with extracting individual pages from their mappings or handle shmem and hugetlb cases separately.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vivek Kasireddy <[email protected]>
Acked-by: Dave Airlie <[email protected]>
Acked-by: Gerd Hoffmann <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  udmabuf: convert udmabuf driver to use folios  (Vivek Kasireddy; 1 file changed, -56/+83)
This is mainly a preparatory patch to use the memfd_pin_folios() API for pinning folios. Using folios instead of pages makes sense as the udmabuf driver needs to handle both shmem and hugetlb cases. And, using the memfd_pin_folios() API makes this easier as we no longer need to separately handle shmem vs hugetlb cases in the udmabuf driver.

Note that the function vmap_udmabuf() still needs a list of pages; so we collect all the head pages into a local array in this case.

Other changes in this patch include the addition of helpers for checking the memfd seals and exporting the dmabuf. Moving code from udmabuf_create() into these helpers improves readability, given that udmabuf_create() is a bit long.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vivek Kasireddy <[email protected]>
Acked-by: Dave Airlie <[email protected]>
Acked-by: Gerd Hoffmann <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  udmabuf: add back support for mapping hugetlb pages  (Vivek Kasireddy; 1 file changed, -21/+101)
A user or admin can configure a VMM (Qemu) Guest's memory to be backed by hugetlb pages for various reasons. However, a Guest OS would still allocate (and pin) buffers that are backed by regular 4k sized pages. In order to map these buffers and create dma-bufs for them on the Host, we first need to find the hugetlb pages where the buffer allocations are located and then determine the offsets of individual chunks (within those pages) and use this information to eventually populate a scatterlist.

Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options were passed to the Host kernel and Qemu was launched with these relevant options:

qemu-system-x86_64 -m 4096m....
  -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
  -display gtk,gl=on
  -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
  -machine memory-backend=mem1

Replacing -display gtk,gl=on with -display gtk,gl=off above would exercise the mmap handler.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vivek Kasireddy <[email protected]>
Acked-by: Mike Kravetz <[email protected]> (v2)
Acked-by: Dave Airlie <[email protected]>
Acked-by: Gerd Hoffmann <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  udmabuf: use vmf_insert_pfn and VM_PFNMAP for handling mmap  (Vivek Kasireddy; 1 file changed, -3/+5)
Add VM_PFNMAP to vm_flags in the mmap handler to ensure that the mappings would be managed without using struct page. And, in the vm_fault handler, use vmf_insert_pfn to share the page's pfn with userspace instead of directly sharing the page (via struct page *).

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vivek Kasireddy <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Dave Airlie <[email protected]>
Acked-by: Gerd Hoffmann <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
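In shape, this is the standard VM_PFNMAP pattern; a minimal sketch, assuming a driver-private object carrying a pages[] array (the real udmabuf code differs in details):

    #include <linux/mm.h>

    static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
    {
            struct vm_area_struct *vma = vmf->vma;
            struct udmabuf *ubuf = vma->vm_private_data;

            if (vmf->pgoff >= ubuf->pagecount)
                    return VM_FAULT_SIGBUS;

            /* share the pfn with userspace, not the struct page itself */
            return vmf_insert_pfn(vma, vmf->address,
                                  page_to_pfn(ubuf->pages[vmf->pgoff]));
    }

    static const struct vm_operations_struct udmabuf_vm_ops = {
            .fault = udmabuf_vm_fault,
    };

    static int mmap_udmabuf(struct udmabuf *ubuf, struct vm_area_struct *vma)
    {
            vma->vm_ops = &udmabuf_vm_ops;
            vma->vm_private_data = ubuf;
            /* VM_PFNMAP: the mapping is managed without struct page */
            vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
            return 0;
    }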
2024-07-12  udmabuf: add CONFIG_MMU dependency  (Arnd Bergmann; 1 file changed, -0/+1)
There is no !CONFIG_MMU version of vmf_insert_pfn():

arm-linux-gnueabi-ld: drivers/dma-buf/udmabuf.o: in function `udmabuf_vm_fault':
udmabuf.c:(.text+0xaa): undefined reference to `vmf_insert_pfn'

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Vivek Kasireddy <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Gerd Hoffmann <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
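The fix itself is a one-line Kconfig dependency; roughly as follows, with the surrounding udmabuf Kconfig lines approximated:

    config UDMABUF
            bool "userspace dmabuf misc driver"
            depends on DMA_SHARED_BUFFER
            depends on MEMFD_CREATE || COMPILE_TEST
            depends on MMU          # new: vmf_insert_pfn() needs an MMU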
2024-07-12  mm/gup: introduce memfd_pin_folios() for pinning memfd folios  (Vivek Kasireddy; 4 files changed, -0/+192)
For drivers that would like to longterm-pin the folios associated with a memfd, the memfd_pin_folios() API provides an option to not only pin the folios via FOLL_PIN but also to check and migrate them if they reside in the movable zone or a CMA block. This API currently works with memfds, but it should work with any files that belong to either shmemfs or hugetlbfs. Files belonging to other filesystems are rejected for now.

The folios need to be located first before pinning them via FOLL_PIN. If they are found in the page cache, they can be immediately pinned. Otherwise, they need to be allocated using the filesystem-specific APIs and then pinned.

[[email protected]: improve the CONFIG_MMU=n situation, per SeongJae]
[[email protected]: return -EINVAL if the end offset is greater than the size of memfd]
Link: https://lkml.kernel.org/r/IA0PR11MB71850525CBC7D541CAB45DF1F8DB2@IA0PR11MB7185.namprd11.prod.outlook.com
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vivek Kasireddy <[email protected]>
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]> (v2)
Reviewed-by: David Hildenbrand <[email protected]> (v3)
Reviewed-by: Christoph Hellwig <[email protected]> (v6)
Acked-by: Dave Airlie <[email protected]>
Acked-by: Gerd Hoffmann <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
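A sketch of the driver-side calling convention, based on the signature this patch introduces (the meaning of the offset out-parameter is an assumption here; the kerneldoc in mm/gup.c is the authoritative contract):

    #include <linux/memfd.h>
    #include <linux/mm.h>

    /* Longterm-pin the folios backing [start, end) of a memfd. */
    static long pin_memfd_range(struct file *memfd, loff_t start, loff_t end,
                                struct folio **folios, unsigned int max_folios)
    {
            pgoff_t offset;         /* offset of 'start' within folios[0] */
            long nr;

            nr = memfd_pin_folios(memfd, start, end, folios, max_folios,
                                  &offset);
            if (nr < 0)
                    return nr;      /* e.g. -EINVAL for unsupported filesystems */

            /*
             * folios[0..nr-1] are now FOLL_PIN-pinned and have been migrated
             * out of the movable zone / CMA if necessary.
             */
            return nr;
    }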
2024-07-12  mm/gup: introduce check_and_migrate_movable_folios()  (Vivek Kasireddy; 1 file changed, -47/+77)
This helper is the folio equivalent of check_and_migrate_movable_pages(). Therefore, all the rules that apply to check_and_migrate_movable_pages() also apply to this one as well. Currently, this helper is only used by memfd_pin_folios().

This patch also includes changes to rename and convert the internal functions collect_longterm_unpinnable_pages() and migrate_longterm_unpinnable_pages() to work on folios. As a result, check_and_migrate_movable_pages() is now a wrapper around check_and_migrate_movable_folios().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vivek Kasireddy <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Dave Airlie <[email protected]>
Acked-by: Gerd Hoffmann <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm/gup: introduce unpin_folio/unpin_folios helpers  (Vivek Kasireddy; 2 files changed, -0/+49)
Patch series "mm/gup: Introduce memfd_pin_folios() for pinning memfd folios", v16. Currently, some drivers (e.g, Udmabuf) that want to longterm-pin the pages/folios associated with a memfd, do so by simply taking a reference on them. This is not desirable because the pages/folios may reside in Movable zone or CMA block. Therefore, having drivers use memfd_pin_folios() API ensures that the folios are appropriately pinned via FOLL_PIN for longterm DMA. This patchset also introduces a few helpers and converts the Udmabuf driver to use folios and memfd_pin_folios() API to longterm-pin the folios for DMA. Two new Udmabuf selftests are also included to test the driver and the new API. This patch (of 9): These helpers are the folio versions of unpin_user_page/unpin_user_pages. They are currently only useful for unpinning folios pinned by memfd_pin_folios() or other associated routines. However, they could find new uses in the future, when more and more folio-only helpers are added to GUP. We should probably sanity check the folio as part of unpin similar to how it is done in unpin_user_page/unpin_user_pages but we cannot cleanly do that at the moment without also checking the subpage. Therefore, sanity checking needs to be added to these routines once we have a way to determine if any given folio is anon-exclusive (via a per folio AnonExclusive flag). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vivek Kasireddy <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Dave Airlie <[email protected]> Acked-by: Gerd Hoffmann <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Peter Xu <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Dongwon Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Junxiao Chang <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm/zswap: use only one pool in zswap  (Chengming Zhou; 1 file changed, -41/+20)
Zswap uses 32 pools to work around the locking scalability problem in zswap backends (mainly zsmalloc nowadays), which brings its own problems, like memory waste and more memory fragmentation. Testing results show that we can get nearly the same performance with only one pool in zswap after changing zsmalloc to use a per-size_class lock instead of the pool spinlock.

Testing kernel build (make bzImage -j32) on tmpfs with memory.max=1GB, and zswap shrinker enabled with a 10GB swapfile on ext4:

                              real    user     sys
6.10.0-rc3                    138.18  1241.38  1452.73
6.10.0-rc3-onepool            149.45  1240.45  1844.69
6.10.0-rc3-onepool-perclass   138.23  1242.37  1469.71

Doing the same testing with zbud shows a little worse performance, as expected, since we don't do any locking optimization for zbud. I think that's acceptable, since zsmalloc became a lot more popular than the other backends, and we may want to support only zsmalloc in the future.

                              real    user     sys
6.10.0-rc3-zbud               138.23  1239.58  1430.09
6.10.0-rc3-onepool-zbud       139.64  1241.37  1516.59

[[email protected]: fix error handling in zswap_pool_create(), per Dan Carpenter]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix error handling again in zswap_pool_create(), per Yosry]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chengming Zhou <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm/zsmalloc: change back to per-size_class lock  (Chengming Zhou; 1 file changed, -35/+50)
Patch series "mm/zsmalloc: change back to per-size_class lock, v2". Commit c0547d0b6a4b ("zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks") changed per-size_class lock to pool spinlock to prepare reclaim support in zsmalloc. Then reclaim support in zsmalloc had been dropped in favor of LRU reclaim in zswap, but this locking change had been left there. Obviously, the scalability of pool spinlock is worse than per-size_class. And we have a workaround that using 32 pools in zswap to avoid this scalability problem, which brings its own problems like memory waste and more memory fragmentation. So this series changes back to use per-size_class lock and using testing data in much stressed situation to verify that we can use only one pool in zswap. Note we only test and care about the zsmalloc backend, which makes sense now since zsmalloc became a lot more popular than other backends. Testing kernel build (make bzImage -j32) on tmpfs with memory.max=1GB, and zswap shrinker enabled with 10GB swapfile on ext4. real user sys 6.10.0-rc3 138.18 1241.38 1452.73 6.10.0-rc3-onepool 149.45 1240.45 1844.69 6.10.0-rc3-onepool-perclass 138.23 1242.37 1469.71 We can see from "sys" column that per-size_class locking with only one pool in zswap can have near performance with the current 32 pools. This patch (of 2): This patch is almost the revert of the commit c0547d0b6a4b ("zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks"), which changed to use a global pool->lock instead of per-size_class lock and pool->migrate_lock, was preparation for suppporting reclaim in zsmalloc. Then reclaim in zsmalloc had been dropped in favor of LRU reclaim in zswap. In theory, per-size_class is more fine-grained than the pool->lock, since a pool can have many size_classes. As for the additional pool->migrate_lock, only free() and map() need to grab it to access stable handle to get zspage, and only in read lock mode. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Cc: Chengming Zhou <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm/hugetlb.c: undo errant change  (Andrew Morton; 1 file changed, -0/+1)
During conflict resolution a line was unintentionally removed by a ksm.c patch.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: ac90c56bbd73 ("mm/ksm: refactor out try_to_merge_with_zero_page()")
Reported-by: Hugh Dickins <[email protected]>
Cc: Aristeu Rozanski <[email protected]>
Cc: Chengming Zhou <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-10  mm: zswap: fix zswap_never_enabled() for CONFIG_ZSWAP==N  (Barry Song; 1 file changed, -1/+1)
If CONFIG_ZSWAP is set to N, it means zswap cannot be enabled, so zswap_never_enabled() should return true.

The only effect of this issue is that with Barry's latest large folio swapin patches for zram ("mm: support mTHP swap-in for zRAM-like swapfile"), we will always fall back to order-0 swapin, even, mistakenly, when !CONFIG_ZSWAP. Basically this bug makes Barry's in-progress patches not work at all.

The API was created to inform the mm core that zswap has never been enabled, allowing the mm core to perform mTHP swap-in. This is a transitional solution until zswap supports mTHP. If zswap has been enabled, performing mTHP swap-in will result in corrupted data. You may find the answer in the mTHP swap-in series:
https://lore.kernel.org/linux-mm/CAJD7tkZ4FQr6HZpduOdvmqgg_-whuZYE-Bz5O2t6yzw6Yg+v1A@mail.gmail.com/

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0300e17d67c3 ("mm: zswap: add zswap_never_enabled()")
Signed-off-by: Barry Song <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Acked-by: Yosry Ahmed <[email protected]>
Acked-by: Chris Li <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
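The fix is confined to the !CONFIG_ZSWAP stub in include/linux/zswap.h; in essence (a sketch of the corrected stub):

    #ifdef CONFIG_ZSWAP
    bool zswap_never_enabled(void);         /* real implementation */
    #else
    static inline bool zswap_never_enabled(void)
    {
            return true;    /* was 'false'; zswap can never be enabled here */
    }
    #endif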
2024-07-10  mm/vmscan: drop checking if _deferred_list is empty before using TTU_SYNC  (Barry Song; 1 file changed, -1/+1)
The optimization of list_empty(&folio->_deferred_list) aimed to prevent increasing the PTL duration when a large folio is partially unmapped, for example, from subpage 0 to subpage (nr - 2). But Ryan's commit 5ed890ce5147 ("mm: vmscan: avoid split during shrink_folio_list()") actually splits this kind of large folio. This makes the "optimization" useless. Additionally, the list_empty() technically required a data_race() annotation.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Barry Song <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-10  mm/page_alloc: remove prefetchw() on freeing page to buddy system  (Wei Yang; 1 file changed, -11/+2)
The prefetchw() was introduced by an ancient patch [1]. The change log says:

    The basic idea is to free higher order pages instead of going through
    every single one. Also, some unnecessary atomic operations are done
    away with and replaced with non-atomic equivalents, and prefetching
    is done where it helps the most. For a more in-depth discusion of
    this patch, please see the linux-ia64 archives (topic is "free
    bootmem feedback patch").

So there are several changes to improve bootmem freeing, of which the most basic idea is freeing higher order pages. And, as Matthew says, "Itanium CPUs of this era had no prefetchers."

I did 10 rounds of bootup tests before and after this change; the data doesn't prove that prefetchw() helps speed up bootmem freeing. The sum of the 10 rounds of bootmem freeing time after prefetchw() removal was even 5.2% faster than before.

[1]: https://lore.kernel.org/linux-ia64/[email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Reviewed-by: Matthew Wilcox <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-10  kernel/fork.c: put set_max_threads()/task_struct_whitelist() in __init section  (Wei Yang; 1 file changed, -2/+2)
The functions set_max_threads() and task_struct_whitelist() are only used by fork_init() during bootup. Let's add the __init tag to them.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Suggested-by: Oleg Nesterov <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
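The change is pure annotation; the pattern is simply:

    /* Only called from fork_init() at boot; text is discarded after init. */
    static void __init set_max_threads(unsigned int max_threads_suggested)
    {
            /* ... compute and clamp max_threads ... */
    }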
2024-07-10  kernel/fork.c: get totalram_pages from memblock to calculate max_threads  (Wei Yang; 1 file changed, -1/+2)
Since we plan to move the accounting into __free_pages_core(), totalram_pages may not represent the total usable pages on the system at this point when defer_init is enabled. Instead, we can get the total usable pages from memblock directly.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-10  mm: remove CONFIG_MEMCG_KMEM  (Johannes Weiner; 20 files changed, -131/+59)
CONFIG_MEMCG_KMEM used to be a user-visible option for whether slab tracking is enabled. It has been default-enabled and equivalent to CONFIG_MEMCG for almost a decade. We've only grown more kernel memory accounting sites since, and there is no imaginable cgroup use case going forward that wants to track user pages but not the multitude of user-drivable kernel allocations.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-10  mm: memcg: add cache line padding to mem_cgroup_per_node  (Roman Gushchin; 1 file changed, -2/+4)
Memcg v1-specific fields serve a buffer function between the read-mostly and update-often parts of the mem_cgroup_per_node structure. If CONFIG_MEMCG_V1 is not set and these fields are not present, explicit cacheline padding is needed.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Suggested-by: Shakeel Butt <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-10  mm: memcg: drop obsolete cache line padding in struct mem_cgroup  (Roman Gushchin; 1 file changed, -4/+0)
After the grouping of the cgroup v1-related fields and the corresponding reorganization of struct mem_cgroup, the existing cache line padding doesn't make much sense anymore. Let's drop it for now and put it back in new places, if necessary.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Suggested-by: Shakeel Butt <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-10  Docs/mm/damon/index: add links to admin-guide doc  (SeongJae Park; 1 file changed, -7/+10)
Readers of the DAMON subsystem documents index would want to further learn how they can use DAMON from user space. Add the link to the admin guide.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-10  Docs/mm/damon/index: add links to design  (SeongJae Park; 2 files changed, -5/+7)
The DAMON subsystem documents index page provides a short intro to DAMON core concepts. Add links to sections of the design document to let users easily browse to the details.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-10  Docs/mm/damon/design: add links to sections of DAMON sysfs interface usage doc  (SeongJae Park; 1 file changed, -0/+48)
Readers of the design document would wonder how they can configure and use specific DAMON features. Add links to the sections of the DAMON sysfs interface usage document that provide the answers, for easier browsing.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-10  Docs/mm/damon/design: remove 'Programmable Modules' section in favor of 'Modules' section  (SeongJae Park; 1 file changed, -10/+0)
The 'Programmable Modules' section provides high-level descriptions of the DAMON API-based kernel modules layer. But the 'Modules' section, which is at the end of the document, provides every detail about the layer, including that of the 'Programmable Modules' section. Since the brief summary of the layers at the beginning of the document has a link to the 'Modules' section, browsing to the section is not that difficult. Remove the 'Programmable Modules' section in favor of the 'Modules' section, reducing duplication.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-10  Docs/mm/damon/design: move 'Configurable Operations Set' section into 'Operations Set Layer' section  (SeongJae Park; 1 file changed, -25/+22)
The 'Configurable Operations Set' section provides a description of the pluggability of the operations set layer. Just after it, the 'Operations Set Layer' section, which is dedicated to the entire layer, follows. The layout is odd, and some descriptions are duplicated. Move the 'Configurable Operations Set' section into the 'Operations Set Layer' section and rewrite some of the detailed descriptions.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-10  Docs/mm/damon/design: add links from overall architecture to sections of details  (SeongJae Park; 1 file changed, -7/+13)
The DAMON design document briefly explains the overall layered architecture first, and then provides detailed explanations of each layer in dedicated sections. Letting readers go directly to the detailed sections for specific layers could help easy browsing of this not-very-short document. Add links from the overall summary to the sections of details.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-10  Docs/admin-guide/mm/damon/start: add access pattern snapshot example  (SeongJae Park; 1 file changed, -4/+42)
The DAMON user-space tool (damo) provides an access pattern snapshot feature, which is expected to be frequently used for real-time access pattern analysis. The snapshot output also shows what DAMON provides on its own, including the 'age' information.

In contrast, the recorded access patterns, which are shown as an example usage in the quick start section, show what users can make from what DAMON provides. They include information generated outside of DAMON and make the 'age' concept a bit unclear. Hence the snapshot output is easier for understanding the raw real-time output of DAMON. Add the snapshot usage example to the quick start section.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-10  Docs/mm/damon/design: clarify regions merging operation  (SeongJae Park; 1 file changed, -5/+12)
The DAMON design document does not explain how the min_nr_regions limit is kept, or what happens if the number of regions exceeds max_nr_regions. Add more clarification for those.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-10  Docs/mm/damon/design: fix two typos  (SeongJae Park; 1 file changed, -2/+2)
Patch series "Docs/damon: minor fixups and improvements". Fixup typos, clarify regions merging operation design with recent change, add access pattern snapshot example use case, and improve readability of the design document and subsystem documents index by reorganizing/wordsmithing and adding links to other sections and/or documents for easy browsing. This patch (of 9): Fix two typos. The first one is just a simple typo: s/accurach/accuracy/ The second one is made by the author being out of their mind. 'Region Based Sampling' section of the doc is mistakenly calling the access frequency counter of region as 'nr_regions'. Fix it with the correct name, 'nr_accesses'. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-10  mm/shmem: fix input and output inconsistencies  (Bang Li; 1 file changed, -1/+1)
Commit 19eaf44954df ("mm: thp: support allocation of anonymous multi-size THP") added mTHP support for anonymous shmem. We can configure different policies through the multi-size THP sysfs interface for anonymous shmem. But when we configure the "advise" policy of /sys/kernel/mm/transparent_hugepage/hugepages-xxxkB/shmem_enabled, we cannot write "advise"; we have to write "madvise" instead, which is unreasonable. We should keep the output and input values consistent, which is more convenient for users.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 61a57f1b1da9 ("mm: shmem: add multi-size THP sysfs interface for anonymous shmem")
Signed-off-by: Bang Li <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Cc: Bang Li <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-10  selftests: centralize -D_GNU_SOURCE= to CFLAGS in lib.mk  (Edward Liaw; 15 files changed, -15/+12)
Centralize the _GNU_SOURCE definition to CFLAGS in lib.mk. Remove redundant defines from Makefiles that import lib.mk. Convert any usage of "#define _GNU_SOURCE 1" to "#define _GNU_SOURCE".

This uses the form "-D_GNU_SOURCE=", which is equivalent to "#define _GNU_SOURCE". Otherwise, using "-D_GNU_SOURCE" is equivalent to "-D_GNU_SOURCE=1" and "#define _GNU_SOURCE 1", which is less commonly seen in source code and would require many changes in selftests to avoid redefinition warnings.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Edward Liaw <[email protected]>
Suggested-by: John Hubbard <[email protected]>
Acked-by: Shuah Khan <[email protected]>
Reviewed-by: Muhammad Usama Anjum <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: André Almeida <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Jarkko Sakkinen <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Reinette Chatre <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
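The centralized definition is a single Makefile line; its exact placement within lib.mk is an assumption here:

    # tools/testing/selftests/lib.mk (sketch)
    CFLAGS += -D_GNU_SOURCE=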
2024-07-10  tools/mm: introduce a tool to assess swap entry allocation for thp_swapout  (Barry Song; 2 files changed, -1/+235)
Both Ryan and Chris have been utilizing this small test program to aid in debugging and identifying issues with swap entry allocation. While a real or intricate workload might be more suitable for assessing the correctness and effectiveness of the swap allocation policy, a small test program presents a simpler means of understanding the problem and initially verifying the improvements being made. Let's endeavor to integrate it into tools/mm.

Although it presently only accommodates 64KB and 4KB, I'm optimistic that we can expand its capabilities to support multiple sizes and simulate more complex systems in the future as required.

Basically, we have:

1. Use MADV_PAGEOUT for rapid swap-out, putting the swap allocation code under high exercise in a short time.

2. Use MADV_DONTNEED to simulate the behavior of libc and the Java heap in freeing memory, as well as for munmap, app exits, or OOM killer scenarios. This ensures new mTHP is always generated, released or swapped out, similar to the behavior on a PC or Android phone where many applications are frequently started and terminated.

3. Swap in with or without the "-a" option to observe how fragmentation due to swap-in, and the incoming swap-in of large folios, will impact swap-out fallback.

Due to 2, we ensure a certain proportion of mTHP. Similarly, because of 3, we maintain a certain proportion of small folios, as we don't support large folio swap-in, meaning any swap-in will immediately result in small folios. Therefore, with both 2 and 3, we automatically achieve a system containing both mTHP and small folios. Additionally, 1 provides the ability to continuously swap them out. We can also use "-s" to add a dedicated small-folios memory area.

[[email protected]: thp_swap_allocator_test.c needs mman.h, per Kairui Song]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Barry Song <[email protected]>
Acked-by: Chris Li <[email protected]>
Tested-by: Chris Li <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Tested-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Kalesh Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
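The core of the exercise loop can be pictured with a few lines of user-space code (a simplified sketch; the real tool adds alignment handling, multiple sizes, and the -a/-s modes):

    #include <string.h>
    #include <sys/mman.h>

    #define REGION_SZ (64 * 1024)

    /* One round: dirty a 64KB region, force it out, then "free" it. */
    static void exercise_once(void *buf)
    {
            memset(buf, 0xaa, REGION_SZ);           /* fault in / dirty */
            madvise(buf, REGION_SZ, MADV_PAGEOUT);  /* rapid swap-out (1) */
            madvise(buf, REGION_SZ, MADV_DONTNEED); /* simulate freeing (2) */
    }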
2024-07-06  mm: migrate: remove folio_migrate_copy()  (Kefeng Wang; 3 files changed, -9/+2)
folio_migrate_copy() is just a wrapper around folio_copy() and folio_migrate_flags(); it is simple, and only aio uses it for now. Unfold it there and remove folio_migrate_copy().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Jane Chu <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-06  fs: hugetlbfs: support poisoned recover from hugetlbfs_migrate_folio()  (Kefeng Wang; 2 files changed, -3/+9)
This is similar to __migrate_folio(): use folio_mc_copy() in HugeTLB folio migration to avoid a panic when copying from a poisoned folio.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-06  mm: migrate: support poisoned recover from migrate folio  (Kefeng Wang; 1 file changed, -3/+11)
Folio migration is widely used in the kernel: memory compaction, memory hotplug, soft offlining of pages, NUMA balancing, memory demotion/promotion, etc. But once a poisoned source folio is accessed when migrating, the kernel will panic.

There is a mechanism in the kernel to recover from uncorrectable memory errors, ARCH_HAS_COPY_MC, which is already used in other core-mm paths, e.g., CoW, khugepaged, coredump, ksm copy; see the copy_mc_to_{user,kernel} and copy_mc_{user_}highpage callers.

In order to support poisoned folio copy recovery in folio migration, we chose to make folio migration tolerant of memory failures and return an error for the migration, because folio migration has no guarantee of success anyway. This avoids panics similar to the one shown below.

CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0
pc : copy_page+0x10/0xc0
lr : copy_highpage+0x38/0x50
...
Call trace:
 copy_page+0x10/0xc0
 folio_copy+0x78/0x90
 migrate_folio_extra+0x54/0xa0
 move_to_new_folio+0xd8/0x1f0
 migrate_folio_move+0xb8/0x300
 migrate_pages_batch+0x528/0x788
 migrate_pages_sync+0x8c/0x258
 migrate_pages+0x440/0x528
 soft_offline_in_use_page+0x2ec/0x3c0
 soft_offline_page+0x238/0x310
 soft_offline_page_store+0x6c/0xc0
 dev_attr_store+0x20/0x40
 sysfs_kf_write+0x4c/0x68
 kernfs_fop_write_iter+0x130/0x1c8
 new_sync_write+0xa4/0x138
 vfs_write+0x238/0x2d8
 ksys_write+0x74/0x110

Note: the folio copy is moved to the beginning of __migrate_folio(), which simplifies the error handling since there is no turning back once folio_migrate_mapping() returns success. The downside is that the folio gets copied even if folio_migrate_mapping() ends up failing; an optimization is to check whether the source folio has no extra refs before we do the folio copy.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
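The resulting control flow inside __migrate_folio() is roughly the following (a sketch of the ordering this patch establishes, not the verbatim code; folio_expected_refs() stands in for the internal refcount helper in mm/migrate.c):

    static int __migrate_folio(struct address_space *mapping, struct folio *dst,
                               struct folio *src, void *src_private,
                               enum migrate_mode mode)
    {
            int rc, expected_count = folio_expected_refs(mapping, src);

            /* optimization: skip the copy if src has unexpected extra refs */
            if (folio_ref_count(src) != expected_count)
                    return -EAGAIN;

            /* copy first: a #MC hit here can still fail migration cleanly */
            rc = folio_mc_copy(dst, src);
            if (unlikely(rc))
                    return rc;

            /* no turning back once the mapping has been switched over */
            rc = folio_migrate_mapping(mapping, dst, src, expected_count);
            if (rc != MIGRATEPAGE_SUCCESS)
                    return rc;

            folio_migrate_flags(dst, src);
            return MIGRATEPAGE_SUCCESS;
    }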
2024-07-06  mm: migrate: split folio_migrate_mapping()  (Kefeng Wang; 1 file changed, -16/+22)
The folio refcount check is moved out for both the !mapping and mapping cases, and the comment for folio_migrate_mapping() is updated from page to folio. No functional change intended.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-06mm: add folio_mc_copy()Kefeng Wang2-0/+18
Add an #MC variant of folio_copy(), folio_mc_copy(), which uses copy_mc_highpage() so that a machine check hit during the folio copy is handled rather than fatal; it will be used in folio migration soon.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Jane Chu <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
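A minimal sketch of what such a helper can look like (the in-tree body may differ in details):

int folio_mc_copy(struct folio *dst, struct folio *src)
{
        long nr = folio_nr_pages(src);
        long i = 0;

        for (;;) {
                /* copy_mc_highpage() returns non-zero if it hit a #MC */
                if (copy_mc_highpage(folio_page(dst, i), folio_page(src, i)))
                        return -EHWPOISON;
                if (++i == nr)
                        break;
                cond_resched();
        }

        return 0;
}

Returning -EHWPOISON lets the migration core fail the whole folio migration cleanly instead of taking the panic path.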
2024-07-06mm: move memory_failure_queue() into copy_mc_[user]_highpage()Kefeng Wang3-10/+9
Patch series "mm: migrate: support poison recover from migrate folio", v5. The folio migration is widely used in kernel, memory compaction, memory hotplug, soft offline page, numa balance, memory demote/promotion, etc, but once access a poisoned source folio when migrating, the kernel will panic. There is a mechanism in the kernel to recover from uncorrectable memory errors, ARCH_HAS_COPY_MC(eg, Machine Check Safe Memory Copy on x86), which is already used in NVDIMM or core-mm paths(eg, CoW, khugepaged, coredump, ksm copy), see copy_mc_to_{user,kernel}, copy_mc_{user_}highpage callers. This series of patches provide the recovery mechanism from folio copy for the widely used folio migration. Please note, because folio migration is no guarantee of success, so we could chose to make folio migration tolerant of memory failures, adding folio_mc_copy() which is a #MC versions of folio_copy(), once accessing a poisoned source folio, we could return error and make the folio migration fail, and this could avoid the similar panic shown below. CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0 pc : copy_page+0x10/0xc0 lr : copy_highpage+0x38/0x50 ... Call trace: copy_page+0x10/0xc0 folio_copy+0x78/0x90 migrate_folio_extra+0x54/0xa0 move_to_new_folio+0xd8/0x1f0 migrate_folio_move+0xb8/0x300 migrate_pages_batch+0x528/0x788 migrate_pages_sync+0x8c/0x258 migrate_pages+0x440/0x528 soft_offline_in_use_page+0x2ec/0x3c0 soft_offline_page+0x238/0x310 soft_offline_page_store+0x6c/0xc0 dev_attr_store+0x20/0x40 sysfs_kf_write+0x4c/0x68 kernfs_fop_write_iter+0x130/0x1c8 new_sync_write+0xa4/0x138 vfs_write+0x238/0x2d8 ksys_write+0x74/0x110 This patch (of 5): There is a memory_failure_queue() call after copy_mc_[user]_highpage(), see callers, eg, CoW/KSM page copy, it is used to mark the source page as h/w poisoned and unmap it from other tasks, and the upcomming poison recover from migrate folio will do the similar thing, so let's move the memory_failure_queue() into the copy_mc_[user]_highpage() instead of adding it into each user, this should also enhance the handling of poisoned page in khugepaged. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Reviewed-by: Jane Chu <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Benjamin LaHaise <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Jiaqi Yan <[email protected]> Cc: Lance Yang <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vishal Moola (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-06Merge branch 'mm-hotfixes-stable' into mm-stable to pick up "mm: fixAndrew Morton22-274/+323
crashes from deferred split racing folio migration", needed by "mm: migrate: split folio_migrate_mapping()".
2024-07-06MAINTAINERS: mailmap: update Lorenzo Stoakes's email addressLorenzo Stoakes2-1/+2
Now working at Oracle.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lorenzo Stoakes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-06mm: fix crashes from deferred split racing folio migrationHugh Dickins2-11/+13
Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on flags when freeing, yet the flags shown are not bad: PG_locked had been set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN symptoms implying a double free by deferred split and large folio migration.

6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration") was right to fix the memcg-dependent locking broken in 85ce2c517ade ("memcontrol: only transfer the memcg data for migration"), but missed a subtlety of deferred_split_scan(): it moves folios to its own local list to work on them without split_queue_lock, during which time folio->_deferred_list is not empty, but even the "right" lock does nothing to secure the folio and the list it is on.

Fortunately, deferred_split_scan() is careful to use folio_try_get(): so folio_migrate_mapping() can avoid the race by calling folio_undo_large_rmappable() while the old folio's reference count is temporarily frozen to 0 - adding such a freeze in the !mapping case too (originally, folio lock and unmapping and no swap cache left an anon folio unreachable, so no freezing was needed there: but the deferred split queue offers a way to reach it).

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration")
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
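The core move of the fix in the !mapping case, sketched (helper names per 6.10-era mm; surrounding context approximate):

        /* Take the folio off the deferred split queue while its refcount
         * is frozen to 0, so deferred_split_scan()'s folio_try_get()
         * can no longer revive it mid-migration. */
        if (folio_test_large(folio) && folio_test_large_rmappable(folio)) {
                if (!folio_ref_freeze(folio, expected_count))
                        return -EAGAIN;         /* raced with a new reference */
                folio_undo_large_rmappable(folio);
                folio_ref_unfreeze(folio, expected_count);
        }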
2024-07-06lib/build_OID_registry: avoid non-destructive substitution for Perl < 5.13.2 ↵Paul Menzel1-1/+3
compat

On a system with Perl 5.12.1, commit 5ef6dc08cfde ("lib/build_OID_registry: don't mention the full path of the script in output") causes the build to fail with the error below.

Bareword found where operator expected at ./lib/build_OID_registry line 41, near "s#^\Q$abs_srctree/\E##r"
syntax error at ./lib/build_OID_registry line 41, near "s#^\Q$abs_srctree/\E##r"
Execution of ./lib/build_OID_registry aborted due to compilation errors.
make[3]: *** [lib/Makefile:352: lib/oid_registry_data.c] Error 255

Ahmad Fatoum analyzed that non-destructive substitution (the /r modifier) is only supported since Perl 5.13.2. Instead of dropping the `r` and thereby modifying `$0` as a side effect, introduce a dedicated variable to support older Perl versions.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 5ef6dc08cfde ("lib/build_OID_registry: don't mention the full path of the script in output")
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Paul Menzel <[email protected]>
Suggested-by: Ahmad Fatoum <[email protected]>
Cc: Uwe Kleine-König <[email protected]>
Cc: Nicolas Schier <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Ahmad Fatoum <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-06mm: gup: stop abusing try_grab_folioYang Shi3-139/+156
A kernel warning was reported when pinning a folio in CMA memory while launching an SEV virtual machine. The splat looks like:

[ 464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520
[ 464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6
[ 464.325477] RIP: 0010:__get_user_pages+0x423/0x520
[ 464.325515] Call Trace:
[ 464.325520]  <TASK>
[ 464.325523]  ? __get_user_pages+0x423/0x520
[ 464.325528]  ? __warn+0x81/0x130
[ 464.325536]  ? __get_user_pages+0x423/0x520
[ 464.325541]  ? report_bug+0x171/0x1a0
[ 464.325549]  ? handle_bug+0x3c/0x70
[ 464.325554]  ? exc_invalid_op+0x17/0x70
[ 464.325558]  ? asm_exc_invalid_op+0x1a/0x20
[ 464.325567]  ? __get_user_pages+0x423/0x520
[ 464.325575]  __gup_longterm_locked+0x212/0x7a0
[ 464.325583]  internal_get_user_pages_fast+0xfb/0x190
[ 464.325590]  pin_user_pages_fast+0x47/0x60
[ 464.325598]  sev_pin_memory+0xca/0x170 [kvm_amd]
[ 464.325616]  sev_mem_enc_register_region+0x81/0x130 [kvm_amd]

Per the analysis done by yangge, when starting the SEV virtual machine it calls pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin the memory. But the page is in a CMA area, so fast GUP fails and falls back to the slow path due to the longterm-pinnable check in try_grab_folio(). The slow path tries to pin the pages and then migrate them out of the CMA area. But the slow path also uses try_grab_folio() to pin the page, so it fails due to the same check, and the above warning is triggered.

In addition, try_grab_folio() is supposed to be used in the fast path: it elevates the folio refcount with an add-ref-unless-zero operation. In the slow path we are guaranteed to have at least one stable reference, so a simple atomic add can be used. The performance difference should be trivial, but the misuse is confusing and misleading.

Rename try_grab_folio() to try_grab_folio_fast() and try_grab_page() to try_grab_folio(), and use each in the proper path. This solves both the abuse and the kernel warning. The proper naming makes their use cases clearer and should prevent such abuse in the future.

peterx said:

: The user will see the pin fail, for gup-slow it further triggers the WARN
: right below that failure (as in the original report):
:
:         folio = try_grab_folio(page, page_increm - 1,
:                                foll_flags);
:         if (WARN_ON_ONCE(!folio)) { <------------------------ here
:                 /*
:                  * Release the 1st page ref if the
:                  * folio is problematic, fail hard.
:                  */
:                 gup_put_folio(page_folio(page), 1,
:                               foll_flags);
:                 ret = -EFAULT;
:                 goto out;
:         }

[1] https://lore.kernel.org/linux-mm/[email protected]/

[[email protected]: fix implicit declaration of function try_grab_folio_fast]
Link: https://lkml.kernel.org/r/CAHbLzkowMSso-4Nufc9hcMehQsK9PNz3OSu-+eniU-2Mm-xjhA@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"")
Signed-off-by: Yang Shi <[email protected]>
Reported-by: yangge <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: <[email protected]> [6.6+]
Signed-off-by: Andrew Morton <[email protected]>
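The semantic split, sketched with approximate signatures (pin-specific handling such as FOLL_PIN accounting and the longterm-pinnable check is omitted): the fast path may race with concurrent freeing and must take its reference with add-unless-zero, while the slow path already holds a stable reference and can use a plain atomic add.

/* fast path: the folio may be freed concurrently, so the first
 * reference must be taken with add-unless-zero and can fail */
static struct folio *try_grab_folio_fast(struct page *page, int refs,
                                         unsigned int flags)
{
        struct folio *folio = page_folio(page);

        if (!folio_try_get(folio))      /* add ref unless zero */
                return NULL;
        if (refs > 1)
                folio_ref_add(folio, refs - 1); /* safe: we now hold a ref */
        return folio;
}

/* slow path: the caller guarantees at least one stable reference,
 * so a simple atomic add suffices and cannot observe a zero count */
static int try_grab_folio(struct folio *folio, int refs, unsigned int flags)
{
        folio_ref_add(folio, refs);
        return 0;
}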
2024-07-04docs: mm: add enable_soft_offline sysctlJiaqi Yan1-0/+38
Add the documentation for soft offline behaviors / costs, and what the new enable_soft_offline sysctl is for.

[[email protected]: fix kerneldoc warnings]
Link: https://lkml.kernel.org/r/CACw3F52=GxTCDw-PqFh3-GDM-fo3GbhGdu0hedxYXOTT4TQSTg@mail.gmail.com
[[email protected]: there are more blank lines needed]
Link: https://lkml.kernel.org/r/CACw3F52_obAB742XeDRNun4BHBYtrxtbvp5NkUincXdaob0j1g@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiaqi Yan <[email protected]>
Acked-by: Oscar Salvador <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Frank van der Linden <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-04selftest/mm: test enable_soft_offline behaviorsJiaqi Yan4-0/+236
Add regression and new tests for when a hugepage has correctable memory errors, covering how userspace wants to deal with it:

* if enable_soft_offline=1, the mapped hugepage is soft offlined
* if enable_soft_offline=0, the mapped hugepage stays intact

The free-hugepages case is not explicitly covered by the tests. A hugepage with corrected memory errors is emulated with MADV_SOFT_OFFLINE, as sketched after the tag block below.

[[email protected]: v7]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiaqi Yan <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Frank van der Linden <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
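A minimal standalone sketch of that emulation technique (illustrative, not code from the selftest; needs CAP_SYS_ADMIN, a reserved 2M hugepage, and a kernel with this series):

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE 101   /* from the uapi madvise definitions */
#endif

#define HUGEPAGE_SIZE (2UL << 20)       /* assumes 2M default hugepages */

int main(void)
{
        char *addr = mmap(NULL, HUGEPAGE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (addr == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        addr[0] = 1;    /* fault the hugepage in so it is mapped */

        /* Emulate a corrected memory error on the mapped hugepage.
         * With enable_soft_offline=1 the page is soft offlined;
         * with enable_soft_offline=0 this fails with EOPNOTSUPP. */
        if (madvise(addr, HUGEPAGE_SIZE, MADV_SOFT_OFFLINE))
                fprintf(stderr, "madvise: %s\n", strerror(errno));

        munmap(addr, HUGEPAGE_SIZE);
        return 0;
}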
2024-07-04mm/memory-failure: userspace controls soft-offlining pagesJiaqi Yan1-2/+20
Correctable memory errors are very common on servers with large amounts of memory, and are corrected by ECC. Soft offline is the kernel's additional recovery handling for memory pages having (excessive) corrected memory errors: the impacted page is migrated to a healthy page if in use, and the original page is discarded from any future use.

The actual policy on whether (and when) to soft offline should be maintained by userspace, especially in the case of a 1G HugeTLB page. Soft offline dissolves the HugeTLB page, either in-use or free, into chunks of 4K pages, reducing the HugeTLB pool capacity by one hugepage. If userspace has not acknowledged such behavior, it may be surprised when a later mmap of hugepages fails due to the shrunken pool. In the case of a transparent hugepage, it will be split into 4K pages as well, and userspace stops enjoying the transparent performance.

In addition, discarding an entire 1G HugeTLB page only because of corrected memory errors is very costly, and the kernel had better not do it under the hood. But today there are at least two cases that do:

1. the GHES driver sees both GHES_SEV_CORRECTED and CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing the CPER
2. the RAS Correctable Errors Collector counts correctable errors per PFN, and acts when the counter for a PFN reaches its threshold

In both cases, userspace has no control over the soft offline performed by the kernel's memory failure recovery.

This commit gives userspace control over soft-offlining any page: the kernel only soft offlines a raw page / transparent hugepage / HugeTLB hugepage if userspace has agreed to. The interface to userspace is a new sysctl at /proc/sys/vm/enable_soft_offline. By default its value is set to 1 to preserve the existing kernel behavior. When set to 0, soft offline (e.g. MADV_SOFT_OFFLINE) fails with EOPNOTSUPP.

[[email protected]: v7]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiaqi Yan <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Frank van der Linden <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
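A minimal sketch of how a privileged tool might opt out (the sysctl path and semantics are as described above; error handling kept deliberately simple):

#include <stdio.h>

int main(void)
{
        /* requires privileges to write /proc/sys/vm entries */
        FILE *f = fopen("/proc/sys/vm/enable_soft_offline", "w");

        if (!f) {
                perror("fopen");
                return 1;
        }
        /* 0: subsequent soft offline requests fail with EOPNOTSUPP */
        fputs("0", f);
        fclose(f);
        return 0;
}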
2024-07-04mm/memory-failure: refactor log format in soft offline codeJiaqi Yan1-6/+9
Patch series "Userspace controls soft-offline pages", v6. Correctable memory errors are very common on servers with large amount of memory, and are corrected by ECC, but with two pain points to users: 1. Correction usually happens on the fly and adds latency overhead 2. Not-fully-proved theory states excessive correctable memory errors can develop into uncorrectable memory error. Soft offline is kernel's additional solution for memory pages having (excessive) corrected memory errors. Impacted page is migrated to healthy page if it is in use, then the original page is discarded for any future use. The actual policy on whether (and when) to soft offline should be maintained by userspace, especially in case of an 1G HugeTLB page. Soft-offline dissolves the HugeTLB page, either in-use or free, into chunks of 4K pages, reducing HugeTLB pool capacity by 1 hugepage. If userspace has not acknowledged such behavior, it may be surprised when later mmap hugepages MAP_FAILED due to lack of hugepages. In case of a transparent hugepage, it will be split into 4K pages as well; userspace will stop enjoying the transparent performance. In addition, discarding the entire 1G HugeTLB page only because of corrected memory errors sounds very costly and kernel better not doing under the hood. But today there are at least 2 such cases: 1. GHES driver sees both GHES_SEV_CORRECTED and CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER. 2. RAS Correctable Errors Collector counts correctable errors per PFN and when the counter for a PFN reaches threshold In both cases, userspace has no control of the soft offline performed by kernel's memory failure recovery. This patch series give userspace the control of softofflining any page: kernel only soft offlines raw page / transparent hugepage / HugeTLB hugepage if userspace has agreed to. The interface to userspace is a new sysctl called enable_soft_offline under /proc/sys/vm. By default enable_soft_line is 1 to preserve existing behavior in kernel. This patch (of 4): Logs from soft_offline_page and soft_offline_in_use_page have different formats than majority of the memory failure code: "Memory failure: 0x${pfn}: ${lower_case_message}" Convert them to the following format: "Soft offline: 0x${pfn}: ${lower_case_message}" No functional change in this commit. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jiaqi Yan <[email protected]> Acked-by: Miaohe Lin <[email protected]> Reviewed-by: Lance Yang <[email protected]> Cc: David Rientjes <[email protected]> Cc: Frank van der Linden <[email protected]> Cc: Jane Chu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-04mm: memcg: adjust the warning when seq_buf overflowsXiu Jianfeng1-1/+2
Currently it uses WARN_ON_ONCE() if the seq_buf overflows when the user reads memory.stat. The only advantage of WARN_ON_ONCE() is that the splat is so verbose that it gets noticed; but it also panics the system if panic_on_warn is enabled. The warning looks like an overreaction, and a simple pr_warn() achieves a similar effect.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Xiu Jianfeng <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
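Sketched, the softer diagnostic looks like this (placement and exact message approximate):

        /* was: WARN_ON_ONCE(seq_buf_has_overflowed(&s)); */
        if (seq_buf_has_overflowed(&s))
                pr_warn("%s: Warning, stat buffer overflow, please report\n",
                        __func__);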