2021-07-01  drm/amd/pm: skip PrepareMp1ForUnload message in s0ix  (Shyam Sundar S K)  1 file changed, -1/+2
The documentation around the PrepareMp1ForUnload message says that anything sent to the SMU after this command will stall, as the PMFW is no longer in a state to take further job requests. Technically this is right for the S3 scenario. But this is not the case during s0ix, where the PMC driver is the last to send the OS_HINT to the SMU. If the SMU gets a PrepareMp1ForUnload message before the OS_HINT, the entire s0ix sequence stalls. Results show that this message to the SMU is not required during s0ix, so skip it. Reviewed-by: Prike Liang <[email protected]> Signed-off-by: Shyam Sundar S K <[email protected]> Acked-by: Huang Rui <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
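A minimal sketch of the kind of gate the changelog describes; the surrounding function is not shown in this log, and the adev/smu variable names and exact call site are assumptions, not the actual diff:

    /* Only ask MP1 to prepare for unload on a full suspend; during s0ix
     * the PMC driver sends the final OS_HINT, so this message would
     * stall the sequence. */
    if (!adev->in_s0ix)
            ret = smu_send_smc_msg(smu, SMU_MSG_PrepareMp1ForUnload, NULL);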
2021-07-01  drm/amdgpu: move apu flags initialization to the start of device init  (Huang Rui)  3 files changed, -10/+37
In some ASICs, we need to adjust the behavior according to the APU flags at a very early stage. Signed-off-by: Huang Rui <[email protected]> Reviewed-by: Aaron Liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-07-01  drm/amd/display: Respect CONFIG_FRAME_WARN=0 in dml Makefile  (Reka Norman)  1 file changed, -2/+6
Setting CONFIG_FRAME_WARN=0 should disable 'stack frame larger than' warnings. This is useful for example in KASAN builds. Make the dml Makefile respect this config. Fixes the following build warnings with CONFIG_KASAN=y and CONFIG_FRAME_WARN=0: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_mode_vba_30.c:3642:6: warning: stack frame size of 2216 bytes in function 'dml30_ModeSupportAndSystemConfigurationFull' [-Wframe-larger-than=] drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn31/display_mode_vba_31.c:3957:6: warning: stack frame size of 2568 bytes in function 'dml31_ModeSupportAndSystemConfigurationFull' [-Wframe-larger-than=] Reviewed-by: Harry Wentland <[email protected]> Signed-off-by: Reka Norman <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-07-01  drm/radeon: Add the missed drm_gem_object_put() in radeon_user_framebuffer_create()  (Jing Xiangfeng)  1 file changed, -0/+1
radeon_user_framebuffer_create() misses a call to drm_gem_object_put() in an error path. Add the missing call to fix it. Reviewed-by: Christian König <[email protected]> Signed-off-by: Jing Xiangfeng <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2021-07-01  drm/amdgpu/dc: Really fix DCN3.1 Makefile for PPC64  (Michal Suchanek)  1 file changed, -0/+2
Also copy over the part that makes old gcc handling cross-platform. Fixes: df7a1658f257 ("drm/amdgpu/dc: fix DCN3.1 Makefile for PPC64") Fixes: 926d6972efb6 ("drm/amd/display: Add DCN3.1 blocks to the DC Makefile") Reviewed-by: Harry Wentland <[email protected]> Signed-off-by: Michal Suchanek <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-07-01  drm/radeon: Call radeon_suspend_kms() in radeon_pci_shutdown() for Loongson64  (Tiezhu Yang)  1 file changed, -4/+4
On the Loongson64 platform with a Radeon GPU, shutdown or reboot fails when console=tty is in the boot cmdline. radeon_suspend_kms() puts the hw into the suspend state and, in particular, sets the fb state to FBINFO_STATE_SUSPENDED:

    if (fbcon) {
            console_lock();
            radeon_fbdev_set_suspend(rdev, 1);
            console_unlock();
    }

The related functions then avoid doing any further fb operations:

    if (p->state != FBINFO_STATE_RUNNING)
            return;

So call radeon_suspend_kms() in radeon_pci_shutdown() for Loongson64 to fix this issue; this is a workaround similar to what powerpc does. Co-developed-by: Jianmin Lv <[email protected]> Signed-off-by: Jianmin Lv <[email protected]> Signed-off-by: Tiezhu Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2021-07-01  drm/amdgpu: Set ttm caching flags during bo allocation  (Oak Zeng)  2 files changed, -4/+5
The ttm caching flags (ttm_cached, ttm_write_combined, etc.) are used to determine a buffer object's mapping attributes in both the CPU page table and the GPU page table (when that buffer is also accessed by the GPU). Currently the ttm caching flags are set in amdgpu_ttm_io_mem_reserve(), which is called during the DRM_AMDGPU_GEM_MMAP ioctl. This is a problem because the GPU mapping of the buffer object (ioctl DRM_AMDGPU_GEM_VA) can happen earlier than the mmap, so the GPU page table update code can't pick up the right ttm caching flags to decide the right GPU page table attributes. This patch moves the setting of the ttm caching flags to amdgpu_vram_mgr_new(), which is called during the first step of buffer object creation (e.g. DRM_AMDGPU_GEM_CREATE), so that both the later CPU and GPU mapping calls pick up this flag for CPU/GPU page table setup. v2: rebase (Alex) Signed-off-by: Oak Zeng <[email protected]> Suggested-by: Christian Koenig <[email protected]> Reviewed-by: Christian Koenig <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Tested-by: Po Huang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-07-01  drm/amd/display: fix null pointer access in gpu reset  (Guchun Chen)  1 file changed, -2/+2
During GPU reset, when a DMCUB OUTBOX0 interrupt is received, DAL code treats it as an OUTBOX interrupt and sets the hw interrupt. However, the OUTBOX interrupt is not registered yet, so a NULL pointer is dereferenced.

Call Trace:
 dal_irq_service_set+0x30/0x90 [amdgpu]
 dc_interrupt_set+0x24/0x30 [amdgpu]
 amdgpu_dm_set_dmub_outbox_irq_state+0x22/0x30 [amdgpu]
 amdgpu_irq_update+0x77/0xa0 [amdgpu]
 amdgpu_irq_gpu_reset_resume_helper+0x67/0xa0 [amdgpu]
 amdgpu_do_asic_reset+0x219/0x260 [amdgpu]
 amdgpu_device_gpu_recover.cold+0x8c5/0xb64 [amdgpu]
 amdgpu_debugfs_gpu_recover_show+0x2c/0x60 [amdgpu]
 seq_read_iter+0xc2/0x450
 ? do_anonymous_page+0x22c/0x3b0
 seq_read+0xf9/0x140
 full_proxy_read+0x5c/0x90
 vfs_read+0xaa/0x190
 ksys_read+0x67/0xe0
 __x64_sys_read+0x1a/0x20

Fixes: effbf6ca7eafda ("drm/amdgpu/display: remove an old DCN3 guard") Signed-off-by: Guchun Chen <[email protected]> Reviewed-and-tested-by: Evan Quan <[email protected]> Reviewed-by: Harry Wentland <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-07-01  drm/amd/display: fix incorrect valid irq check  (Guchun Chen)  1 file changed, -1/+1
valid DAL irq should be < DAL_IRQ_SOURCES_NUMBER. Signed-off-by: Guchun Chen <[email protected]> Reviewed-and-tested-by: Evan Quan <[email protected]> Reviewed-by: Harry Wentland <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2021-07-01  drm/amdgpu: enable sdma0 tmz for Raven/Renoir (V2)  (Aaron Liu)  1 file changed, -2/+2
Without the driver loaded, SDMA0_UTCL1_PAGE.TMZ_ENABLE is set to 1 by default on all ASICs. On Raven/Renoir, the SDMA golden settings change SDMA0_UTCL1_PAGE.TMZ_ENABLE to 0. This patch restores SDMA0_UTCL1_PAGE.TMZ_ENABLE to 1. Signed-off-by: Aaron Liu <[email protected]> Acked-by: Luben Tuikov <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2021-06-30  riscv: Introduce set_kernel_memory helper  (Alexandre Ghiti)  1 file changed, -0/+16
This helper should be used for setting permissions on the kernel mapping: it takes pointers as arguments and thus avoids the explicit casts to unsigned long needed for the set_memory_* API. Suggested-by: Christoph Hellwig <[email protected]> Signed-off-by: Alexandre Ghiti <[email protected]> Reviewed-by: Anup Patel <[email protected]> Reviewed-by: Atish Patra <[email protected]> Reviewed-by: Jisheng Zhang <[email protected]> Signed-off-by: Palmer Dabbelt <[email protected]>
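A sketch of the likely shape of such a helper, assuming the usual set_memory_*() callback signature of (unsigned long start, int num_pages); the real riscv definition may differ in detail:

    static __always_inline int set_kernel_memory(char *startp, char *endp,
                                                 int (*set_memory)(unsigned long start,
                                                                   int num_pages))
    {
            unsigned long start = (unsigned long)startp;
            unsigned long end = (unsigned long)endp;
            int num_pages = PAGE_ALIGN(end - start) >> PAGE_SHIFT;

            /* Callers can pass section start/end symbols directly,
             * without casting to unsigned long themselves. */
            return set_memory(start, num_pages);
    }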
2021-06-30  riscv: Enable KFENCE for riscv64  (Liu Shixin)  3 files changed, -1/+74
Add architecture specific implementation details for KFENCE and enable KFENCE for the riscv64 architecture. In particular, this implements the required interface in <asm/kfence.h>. KFENCE requires that attributes for pages from its memory pool can individually be set. Therefore, force the kfence pool to be mapped at page granularity. This patch was tested with the test cases in kfence_test.c, and all of them passed. Signed-off-by: Liu Shixin <[email protected]> Acked-by: Marco Elver <[email protected]> Reviewed-by: Kefeng Wang <[email protected]> Signed-off-by: Palmer Dabbelt <[email protected]>
2021-06-30  RISC-V: Use asm-generic for {in,out}{bwlq}  (Palmer Dabbelt)  1 file changed, -13/+0
The asm-generic implementation is functionally identical to the RISC-V version. Signed-off-by: Palmer Dabbelt <[email protected]> Reviewed-by: Anup Patel <[email protected]> Signed-off-by: Palmer Dabbelt <[email protected]>
2021-06-30  riscv: add ASID-based tlbflushing methods  (Guo Ren)  3 files changed, -8/+43
Implement optimized version of the tlb flushing routines for systems using ASIDs. These are behind the use_asid_allocator static branch to not affect existing systems not using ASIDs. Signed-off-by: Guo Ren <[email protected]> [hch: rebased on top of previous cleanups, use the same algorithm as the non-ASID based code for local vs global flushes, keep functions as local as possible] Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Guo Ren <[email protected]> Signed-off-by: Palmer Dabbelt <[email protected]>
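A sketch of what an ASID-scoped local flush looks like on RISC-V, with the helper name taken as an assumption; sfence.vma with rs1=x0 and rs2=asid invalidates only the translations tagged with that ASID:

    static inline void local_flush_tlb_all_asid(unsigned long asid)
    {
            /* Flush every address, but only for this ASID. */
            __asm__ __volatile__ ("sfence.vma x0, %0"
                                  : : "r" (asid)
                                  : "memory");
    }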
2021-06-30  riscv: pass the mm_struct to __sbi_tlb_flush_range  (Christoph Hellwig)  1 file changed, -9/+6
Move the mm_cpumask() call from the callers into __sbi_tlb_flush_range() to reduce a bit of duplicate code and prepare for future changes. Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Guo Ren <[email protected]> Signed-off-by: Palmer Dabbelt <[email protected]>
2021-06-30  mm/zswap.c: fix two bugs in zswap_writeback_entry()  (Miaohe Lin)  1 file changed, -10/+7
In the ZSWAP_SWAPCACHE_FAIL and ZSWAP_SWAPCACHE_EXIST case, we forgot to call zpool_unmap_handle() when zpool can't sleep. And we might sleep in zswap_get_swap_cache_page() while zpool can't sleep. To fix all of these, zpool_unmap_handle() should be done before zswap_get_swap_cache_page() when zpool can't sleep. Link: https://lkml.kernel.org/r/[email protected] Fixes: fc6697a89f56 ("mm/zswap: add the flag can_sleep_mapped") Signed-off-by: Miaohe Lin <[email protected]> Cc: Colin Ian King <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Tian Tao <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
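A sketch of the reordering described above; the entry/pool field names are assumptions based on zswap's data structures, not the exact diff:

    /* If the zpool mapping cannot be held across a sleep, drop it before
     * zswap_get_swap_cache_page(), which may sleep. */
    if (!zpool_can_sleep_mapped(pool))
            zpool_unmap_handle(pool, entry->handle);

    ret = zswap_get_swap_cache_page(swpentry, &page);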
2021-06-30  mm/zswap.c: avoid unnecessary copy-in at map time  (Miaohe Lin)  1 file changed, -1/+1
The buf mapped via zpool_map_handle() is only used to store compressed page buffer and there is no information to extract from it. So we could use ZPOOL_MM_WO instead to avoid unnecessary copy-in at map time. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Colin Ian King <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Tian Tao <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
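A one-line sketch of the change, with field names assumed from zswap's data structures: since the buffer is only written at map time, a write-only mapping avoids the copy-in.

    /* Write-only mapping: zswap only stores into this buffer here, so
     * there is nothing to copy in from the zpool backend. */
    dst = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_WO);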
2021-06-30  mm/zswap.c: remove unused function zswap_debugfs_exit()  (Miaohe Lin)  1 file changed, -7/+0
Patch series "Cleanup and fixup for zswap". This series contains cleanups to remove unused function and avoid unnecessary copy-in at map time. Also this fixes two bugs in the function zswap_writeback_entry(). More details can be found in the respective changelogs. This patch (of 3): zswap_debugfs_exit() is unused, remove it. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Colin Ian King <[email protected]> Cc: Tian Tao <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm,memory_hotplug: drop unneeded locking  (Oscar Salvador)  1 file changed, -15/+1
Currently, memory-hotplug code takes zone's span_writelock and pgdat's resize_lock when resizing the node/zone's spanned pages via {move_pfn_range_to_zone(),remove_pfn_range_from_zone()} and when resizing node and zone's present pages via adjust_present_page_count(). These locks are also taken during the initialization of the system at boot time, where it protects parallel struct page initialization, but they should not really be needed in memory-hotplug where all operations are a) synchronized on device level and b) serialized by the mem_hotplug_lock lock. [[email protected]: remove now-unused locals] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Pavel Tatashin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm/memory_hotplug: rate limit page migration warnings  (Liam Mark)  1 file changed, -5/+11
When offlining memory the system can attempt to migrate a lot of pages; if there are problems with migration this can flood the logs. Printing all the data hogs the CPU and causes some RT threads to run for a long time, which may have some bad consequences. Rate limit the page migration warnings in order to avoid this. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam Mark <[email protected]> Signed-off-by: Georgi Djakov <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
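A sketch of the pattern, with a hypothetical message text; the point is only the _ratelimited variant, which drops excess messages instead of flooding the log:

    pr_warn_ratelimited("failed to migrate pfn %lx, ret %d\n",
                        page_to_pfn(page), ret);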
2021-06-30  selftests/vm: add test for MADV_POPULATE_(READ|WRITE)  (David Hildenbrand)  4 files changed, -0/+360
Let's add a simple test for MADV_POPULATE_READ and MADV_POPULATE_WRITE, verifying some error handling, that population works, and that softdirty tracking works as expected. For now, limit the test to private anonymous memory. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Michael S. Tsirkin <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Helge Deller <[email protected]> Cc: Chris Zankel <[email protected]> Cc: Max Filippov <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Peter Xu <[email protected]> Cc: Rolf Eike Beer <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Ram Pai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  selftests/vm: add protection_keys_32 / protection_keys_64 to gitignore  (David Hildenbrand)  1 file changed, -0/+2
We missed adding two binaries to gitignore. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ram Pai <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Chris Zankel <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Helge Deller <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael S. Tsirkin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Rolf Eike Beer <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  MAINTAINERS: add tools/testing/selftests/vm/ to MEMORY MANAGEMENT  (David Hildenbrand)  1 file changed, -0/+1
MEMORY MANAGEMENT seems to be a good fit. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Mike Rapoport <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Peter Xu <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Chris Zankel <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Helge Deller <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael S. Tsirkin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Ram Pai <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Rolf Eike Beer <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables  (David Hildenbrand)  8 files changed, -0/+142
I. Background: Sparse Memory Mappings

When we manage sparse memory mappings dynamically in user space - also sometimes involving MAP_NORESERVE - we want to dynamically populate/discard memory inside such a sparse memory region. Example users are hypervisors (especially implementing memory ballooning or similar technologies like virtio-mem) and memory allocators. In addition, we want to fail in a nice way (instead of generating SIGBUS) if populating does not succeed because we are out of backend memory (which can happen easily with file-based mappings, especially tmpfs and hugetlbfs).

While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for reliably discarding memory for most mapping types, there is no generic approach to populate page tables and preallocate memory. Although mmap() supports MAP_POPULATE, it is not applicable to the concept of sparse memory mappings, where we want to populate/discard dynamically and avoid expensive/problematic remappings. In addition, we never actually report errors during the final populate phase - it is best-effort only.

fallocate() can be used to preallocate file-based memory and fail in a safe way. However, it cannot really be used for any private mappings on anonymous files via memfd due to COW semantics. In addition, fallocate() does not actually populate page tables, so we still always get pagefaults on first access - which is sometimes undesired (i.e., real-time workloads) and requires real prefaulting of page tables, not just a preallocation of backend storage. There might be interesting use cases for sparse memory regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy, as it does not prefault page tables.

II. On preallocation/prefaulting from user space

Because we don't have a proper interface, what applications (like QEMU and databases) end up doing is touching (i.e., reading+writing one byte to not overwrite existing data) all individual pages. However, that approach
1) Can result in wear on storage backing, because we end up reading/writing each page; this is especially a problem for dax/pmem.
2) Can result in mmap_sem contention when prefaulting via multiple threads.
3) Requires expensive signal handling, especially to catch SIGBUS in case of hugetlbfs/shmem/file-backed memory. For example, this is problematic in hypervisors like QEMU where SIGBUS handlers might already be used by other subsystems concurrently to e.g., handle hardware errors. "Simply" doing preallocation concurrently from another thread is not that easy.

III. On MADV_WILLNEED

Extending MADV_WILLNEED is not an option because
1. It would change the semantics: "Expect access in the near future." and "might be a good idea to read some pages" vs. "Definitely populate/preallocate all memory and definitely fail on errors.".
2. Existing users (like virtio-balloon in QEMU when deflating the balloon) don't want populate/prealloc semantics. They treat this rather as a hint to give a little performance boost without too much overhead - and don't expect that a lot of memory might get consumed or a lot of time might be spent.

IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE

Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by MAP_POPULATE, with the following semantics:
1. MADV_POPULATE_READ can be used to prefault page tables just like manually reading each individual page. This will not break any COW mappings. The shared zero page might get mapped and no backend storage might get preallocated -- allocation might be deferred to write-fault time. Especially shared file mappings require an explicit fallocate() upfront to actually preallocate backend memory (blocks in the file system) in case the file might have holes.
2. If MADV_POPULATE_READ succeeds, all page tables have been populated (prefaulted) readable once.
3. MADV_POPULATE_WRITE can be used to preallocate backend memory and prefault page tables just like manually writing (or reading+writing) each individual page. This will break any COW mappings -- e.g., the shared zeropage is never populated.
4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated (prefaulted) writable once.
5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special mappings marked with VM_PFNMAP and VM_IO. Also, proper access permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such mapping is encountered, madvise() fails with -EINVAL.
6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables might have been populated.
7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON when encountering a HW poisoned page in the range.
8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot protect from the OOM (Out Of Memory) handler killing the process.

While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e., preallocate memory and prefault page tables for VMs), one issue is that whenever we prefault pages writable, the pages have to be marked dirty, because the CPU could dirty them any time. While not a real problem for hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each page will be marked dirty and has to be written back later when evicting. MADV_POPULATE_READ allows for optimizing this scenario: pre-read a whole mapping from backend storage without marking it dirty, such that eviction won't have to write it back. As discussed above, shared file mappings might require an explicit fallocate() upfront to achieve preallocation+prepopulation.

Although sparse memory mappings are the primary use case, this will also be useful for other preallocate/prefault use cases where MAP_POPULATE is not desired or the semantics of MAP_POPULATE are not sufficient: as one example, QEMU users can trigger preallocation/prefaulting of guest RAM after the mapping was created -- and don't want errors to be silently suppressed. Looking at the history, MADV_POPULATE was already proposed in 2013 [1]; however, the main motivation back then was performance improvements -- which should also still be the case.

V. Single-threaded performance comparison

I did a short experiment, prefaulting page tables on completely *empty mappings/files* and repeated the experiment 10 times. The results correspond to the shortest execution time. In general, the performance benefit for huge pages is negligible with small mappings.

V.1: Private mappings

POPULATE_READ and POPULATE_WRITE are fastest. Note that Reading/POPULATE_READ will populate the shared zeropage where applicable -- which results in short population times. The fastest way to allocate backend storage (here: swap or huge pages) and prefault page tables is POPULATE_WRITE.

V.2: Shared mappings

fallocate() is fastest, however, it doesn't prefault page tables. POPULATE_WRITE is faster than simple writes and read/writes. POPULATE_READ is faster than simple reads. Without a fd, the fastest way to allocate backend storage and prefault page tables is POPULATE_WRITE.
With an fd, the fastest way is usually FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one exception are actual files: FALLOCATE+Read is slightly faster than FALLOCATE+POPULATE_READ. The fastest way to allocate backend storage prefault page tables is FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then, FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as dirty. v.3: Detailed results ================================================== 2 MiB MAP_PRIVATE: ************************************************** Anon 4 KiB : Read : 0.119 ms Anon 4 KiB : Write : 0.222 ms Anon 4 KiB : Read/Write : 0.380 ms Anon 4 KiB : POPULATE_READ : 0.060 ms Anon 4 KiB : POPULATE_WRITE : 0.158 ms Memfd 4 KiB : Read : 0.034 ms Memfd 4 KiB : Write : 0.310 ms Memfd 4 KiB : Read/Write : 0.362 ms Memfd 4 KiB : POPULATE_READ : 0.039 ms Memfd 4 KiB : POPULATE_WRITE : 0.229 ms Memfd 2 MiB : Read : 0.030 ms Memfd 2 MiB : Write : 0.030 ms Memfd 2 MiB : Read/Write : 0.030 ms Memfd 2 MiB : POPULATE_READ : 0.030 ms Memfd 2 MiB : POPULATE_WRITE : 0.030 ms tmpfs : Read : 0.033 ms tmpfs : Write : 0.313 ms tmpfs : Read/Write : 0.406 ms tmpfs : POPULATE_READ : 0.039 ms tmpfs : POPULATE_WRITE : 0.285 ms file : Read : 0.033 ms file : Write : 0.351 ms file : Read/Write : 0.408 ms file : POPULATE_READ : 0.039 ms file : POPULATE_WRITE : 0.290 ms hugetlbfs : Read : 0.030 ms hugetlbfs : Write : 0.030 ms hugetlbfs : Read/Write : 0.030 ms hugetlbfs : POPULATE_READ : 0.030 ms hugetlbfs : POPULATE_WRITE : 0.030 ms ************************************************** 4096 MiB MAP_PRIVATE: ************************************************** Anon 4 KiB : Read : 237.940 ms Anon 4 KiB : Write : 708.409 ms Anon 4 KiB : Read/Write : 1054.041 ms Anon 4 KiB : POPULATE_READ : 124.310 ms Anon 4 KiB : POPULATE_WRITE : 572.582 ms Memfd 4 KiB : Read : 136.928 ms Memfd 4 KiB : Write : 963.898 ms Memfd 4 KiB : Read/Write : 1106.561 ms Memfd 4 KiB : POPULATE_READ : 78.450 ms Memfd 4 KiB : POPULATE_WRITE : 805.881 ms Memfd 2 MiB : Read : 357.116 ms Memfd 2 MiB : Write : 357.210 ms Memfd 2 MiB : Read/Write : 357.606 ms Memfd 2 MiB : POPULATE_READ : 356.094 ms Memfd 2 MiB : POPULATE_WRITE : 356.937 ms tmpfs : Read : 137.536 ms tmpfs : Write : 954.362 ms tmpfs : Read/Write : 1105.954 ms tmpfs : POPULATE_READ : 80.289 ms tmpfs : POPULATE_WRITE : 822.826 ms file : Read : 137.874 ms file : Write : 987.025 ms file : Read/Write : 1107.439 ms file : POPULATE_READ : 80.413 ms file : POPULATE_WRITE : 857.622 ms hugetlbfs : Read : 355.607 ms hugetlbfs : Write : 355.729 ms hugetlbfs : Read/Write : 356.127 ms hugetlbfs : POPULATE_READ : 354.585 ms hugetlbfs : POPULATE_WRITE : 355.138 ms ************************************************** 2 MiB MAP_SHARED: ************************************************** Anon 4 KiB : Read : 0.394 ms Anon 4 KiB : Write : 0.348 ms Anon 4 KiB : Read/Write : 0.400 ms Anon 4 KiB : POPULATE_READ : 0.326 ms Anon 4 KiB : POPULATE_WRITE : 0.273 ms Anon 2 MiB : Read : 0.030 ms Anon 2 MiB : Write : 0.030 ms Anon 2 MiB : Read/Write : 0.030 ms Anon 2 MiB : POPULATE_READ : 0.030 ms Anon 2 MiB : POPULATE_WRITE : 0.030 ms Memfd 4 KiB : Read : 0.412 ms Memfd 4 KiB : Write : 0.372 ms Memfd 4 KiB : Read/Write : 0.419 ms Memfd 4 KiB : POPULATE_READ : 0.343 ms Memfd 4 KiB : POPULATE_WRITE : 0.288 ms Memfd 4 KiB : FALLOCATE : 0.137 ms Memfd 4 KiB : FALLOCATE+Read : 0.446 ms Memfd 4 KiB : FALLOCATE+Write : 0.330 ms Memfd 4 KiB : FALLOCATE+Read/Write : 0.454 ms Memfd 4 KiB : FALLOCATE+POPULATE_READ : 0.379 ms 
Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 0.268 ms Memfd 2 MiB : Read : 0.030 ms Memfd 2 MiB : Write : 0.030 ms Memfd 2 MiB : Read/Write : 0.030 ms Memfd 2 MiB : POPULATE_READ : 0.030 ms Memfd 2 MiB : POPULATE_WRITE : 0.030 ms Memfd 2 MiB : FALLOCATE : 0.030 ms Memfd 2 MiB : FALLOCATE+Read : 0.031 ms Memfd 2 MiB : FALLOCATE+Write : 0.031 ms Memfd 2 MiB : FALLOCATE+Read/Write : 0.031 ms Memfd 2 MiB : FALLOCATE+POPULATE_READ : 0.030 ms Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 0.030 ms tmpfs : Read : 0.416 ms tmpfs : Write : 0.369 ms tmpfs : Read/Write : 0.425 ms tmpfs : POPULATE_READ : 0.346 ms tmpfs : POPULATE_WRITE : 0.295 ms tmpfs : FALLOCATE : 0.139 ms tmpfs : FALLOCATE+Read : 0.447 ms tmpfs : FALLOCATE+Write : 0.333 ms tmpfs : FALLOCATE+Read/Write : 0.454 ms tmpfs : FALLOCATE+POPULATE_READ : 0.380 ms tmpfs : FALLOCATE+POPULATE_WRITE : 0.272 ms file : Read : 0.191 ms file : Write : 0.511 ms file : Read/Write : 0.524 ms file : POPULATE_READ : 0.196 ms file : POPULATE_WRITE : 0.434 ms file : FALLOCATE : 0.004 ms file : FALLOCATE+Read : 0.197 ms file : FALLOCATE+Write : 0.554 ms file : FALLOCATE+Read/Write : 0.480 ms file : FALLOCATE+POPULATE_READ : 0.201 ms file : FALLOCATE+POPULATE_WRITE : 0.381 ms hugetlbfs : Read : 0.030 ms hugetlbfs : Write : 0.030 ms hugetlbfs : Read/Write : 0.030 ms hugetlbfs : POPULATE_READ : 0.030 ms hugetlbfs : POPULATE_WRITE : 0.030 ms hugetlbfs : FALLOCATE : 0.030 ms hugetlbfs : FALLOCATE+Read : 0.031 ms hugetlbfs : FALLOCATE+Write : 0.031 ms hugetlbfs : FALLOCATE+Read/Write : 0.030 ms hugetlbfs : FALLOCATE+POPULATE_READ : 0.030 ms hugetlbfs : FALLOCATE+POPULATE_WRITE : 0.030 ms ************************************************** 4096 MiB MAP_SHARED: ************************************************** Anon 4 KiB : Read : 1053.090 ms Anon 4 KiB : Write : 913.642 ms Anon 4 KiB : Read/Write : 1060.350 ms Anon 4 KiB : POPULATE_READ : 893.691 ms Anon 4 KiB : POPULATE_WRITE : 782.885 ms Anon 2 MiB : Read : 358.553 ms Anon 2 MiB : Write : 358.419 ms Anon 2 MiB : Read/Write : 357.992 ms Anon 2 MiB : POPULATE_READ : 357.533 ms Anon 2 MiB : POPULATE_WRITE : 357.808 ms Memfd 4 KiB : Read : 1078.144 ms Memfd 4 KiB : Write : 942.036 ms Memfd 4 KiB : Read/Write : 1100.391 ms Memfd 4 KiB : POPULATE_READ : 925.829 ms Memfd 4 KiB : POPULATE_WRITE : 804.394 ms Memfd 4 KiB : FALLOCATE : 304.632 ms Memfd 4 KiB : FALLOCATE+Read : 1163.359 ms Memfd 4 KiB : FALLOCATE+Write : 933.186 ms Memfd 4 KiB : FALLOCATE+Read/Write : 1187.304 ms Memfd 4 KiB : FALLOCATE+POPULATE_READ : 1013.660 ms Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 794.560 ms Memfd 2 MiB : Read : 358.131 ms Memfd 2 MiB : Write : 358.099 ms Memfd 2 MiB : Read/Write : 358.250 ms Memfd 2 MiB : POPULATE_READ : 357.563 ms Memfd 2 MiB : POPULATE_WRITE : 357.334 ms Memfd 2 MiB : FALLOCATE : 356.735 ms Memfd 2 MiB : FALLOCATE+Read : 358.152 ms Memfd 2 MiB : FALLOCATE+Write : 358.331 ms Memfd 2 MiB : FALLOCATE+Read/Write : 358.018 ms Memfd 2 MiB : FALLOCATE+POPULATE_READ : 357.286 ms Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 357.523 ms tmpfs : Read : 1087.265 ms tmpfs : Write : 950.840 ms tmpfs : Read/Write : 1107.567 ms tmpfs : POPULATE_READ : 922.605 ms tmpfs : POPULATE_WRITE : 810.094 ms tmpfs : FALLOCATE : 306.320 ms tmpfs : FALLOCATE+Read : 1169.796 ms tmpfs : FALLOCATE+Write : 933.730 ms tmpfs : FALLOCATE+Read/Write : 1191.610 ms tmpfs : FALLOCATE+POPULATE_READ : 1020.474 ms tmpfs : FALLOCATE+POPULATE_WRITE : 798.945 ms file : Read : 654.101 ms file : Write : 1259.142 ms file : Read/Write : 1289.509 ms file : POPULATE_READ : 
661.642 ms file : POPULATE_WRITE : 1106.816 ms file : FALLOCATE : 1.864 ms file : FALLOCATE+Read : 656.328 ms file : FALLOCATE+Write : 1153.300 ms file : FALLOCATE+Read/Write : 1180.613 ms file : FALLOCATE+POPULATE_READ : 668.347 ms file : FALLOCATE+POPULATE_WRITE : 996.143 ms hugetlbfs : Read : 357.245 ms hugetlbfs : Write : 357.413 ms hugetlbfs : Read/Write : 357.120 ms hugetlbfs : POPULATE_READ : 356.321 ms hugetlbfs : POPULATE_WRITE : 356.693 ms hugetlbfs : FALLOCATE : 355.927 ms hugetlbfs : FALLOCATE+Read : 357.074 ms hugetlbfs : FALLOCATE+Write : 357.120 ms hugetlbfs : FALLOCATE+Read/Write : 356.983 ms hugetlbfs : FALLOCATE+POPULATE_READ : 356.413 ms hugetlbfs : FALLOCATE+POPULATE_WRITE : 356.266 ms ************************************************** [1] https://lkml.org/lkml/2013/6/27/698 [[email protected]: coding style fixes] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Michael S. Tsirkin <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Helge Deller <[email protected]> Cc: Chris Zankel <[email protected]> Cc: Max Filippov <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Peter Xu <[email protected]> Cc: Rolf Eike Beer <[email protected]> Cc: Ram Pai <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
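A self-contained user-space sketch of the interface described above. The MADV_POPULATE_* fallback values below match the numbers added to the uapi headers by this series; defining them locally is only a convenience for building against older headers, and the mapping size is arbitrary:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_POPULATE_READ
    #define MADV_POPULATE_READ  22
    #endif
    #ifndef MADV_POPULATE_WRITE
    #define MADV_POPULATE_WRITE 23
    #endif

    int main(void)
    {
            size_t len = 256UL << 20;   /* 256 MiB sparse anonymous mapping */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return EXIT_FAILURE;
            }

            /* Prefault page tables readable: does not break COW, may map
             * the shared zeropage instead of allocating backend storage. */
            if (madvise(p, len, MADV_POPULATE_READ))
                    fprintf(stderr, "MADV_POPULATE_READ: %s\n", strerror(errno));

            /* Preallocate backend storage and prefault writable, failing
             * gracefully (e.g. ENOMEM) instead of taking SIGBUS later. */
            if (madvise(p, len, MADV_POPULATE_WRITE))
                    fprintf(stderr, "MADV_POPULATE_WRITE: %s\n", strerror(errno));

            munmap(p, len);
            return EXIT_SUCCESS;
    }

On kernels without the feature, madvise() fails with EINVAL, which the sketch simply reports.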
2021-06-30  mm: make variable names for populate_vma_page_range() consistent  (David Hildenbrand)  1 file changed, -1/+1
Patch series "mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables", v2. Excessive details on MADV_POPULATE_(READ|WRITE) can be found in patch #2. This patch (of 5): Let's make the variable names in the function declaration match the variable names used in the definition. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Peter Xu <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Chris Zankel <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Helge Deller <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael S. Tsirkin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Ram Pai <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Rolf Eike Beer <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm: generalize ZONE_[DMA|DMA32]  (Kefeng Wang)  14 files changed, -60/+23
The ZONE_[DMA|DMA32] configs have duplicate definitions on the platforms that subscribe to them. Instead, just make them generic options which can be selected on applicable platforms. Also, only the x86/arm64 architectures can enable both ZONE_DMA and ZONE_DMA32 if EXPERT; add ARCH_HAS_ZONE_DMA_SET to make the DMA zone configurable and visible on those two architectures. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Acked-by: Catalin Marinas <[email protected]> [arm64] Acked-by: Geert Uytterhoeven <[email protected]> [m68k] Acked-by: Mike Rapoport <[email protected]> Acked-by: Palmer Dabbelt <[email protected]> [RISC-V] Acked-by: Michal Simek <[email protected]> [microblaze] Acked-by: Michael Ellerman <[email protected]> [powerpc] Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Russell King <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm/nommu: unexport do_munmap()  (Liam Howlett)  1 file changed, -1/+0
do_munmap() does not take the mmap_write_lock(). vm_munmap() should be used instead. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  nommu: remove __GFP_HIGHMEM in vmalloc/vzalloc  (Chen Li)  1 file changed, -2/+2
mm/nommu.c:

    void *__vmalloc(unsigned long size, gfp_t gfp_mask)
    {
            /*
             * You can't specify __GFP_HIGHMEM with kmalloc() since kmalloc()
             * returns only a logical address.
             */
            return kmalloc(size, (gfp_mask | __GFP_COMP) & ~__GFP_HIGHMEM);
    }

nommu's __vmalloc just uses kmalloc internally and eliminates __GFP_HIGHMEM, so it makes no sense to add __GFP_HIGHMEM for nommu's vmalloc/vzalloc. [[email protected]: coding style fixes] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chen Li <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
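A sketch of what the nommu variants look like with the flag dropped; the shapes are inferred from the snippet above, not copied from the patch:

    /* nommu: __vmalloc() is kmalloc() underneath, which can never return
     * highmem, so don't pass __GFP_HIGHMEM in the first place. */
    void *vmalloc(unsigned long size)
    {
            return __vmalloc(size, GFP_KERNEL);
    }

    void *vzalloc(unsigned long size)
    {
            return __vmalloc(size, GFP_KERNEL | __GFP_ZERO);
    }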
2021-06-30  mm/thp: fix strncpy warning  (Matthew Wilcox (Oracle))  1 file changed, -1/+1
Using MAX_INPUT_BUF_SZ as the maximum length of the string makes fortify complain as it thinks the string might be longer than the buffer, and if it is, we will end up with a "string" that is missing a NUL terminator. It's trivial to show that 'tok' points to a NUL-terminated string which is less than MAX_INPUT_BUF_SZ in length, so we may as well just use strcpy() and avoid the warning. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
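A sketch of the resulting call; the destination buffer name is illustrative, not necessarily the one used in mm/huge_memory.c:

    /* 'tok' is known to be NUL-terminated and shorter than the destination
     * buffer, so a plain strcpy() is safe and silences fortify. */
    strcpy(file_path, tok);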
2021-06-30  mm: hwpoison_user_mappings() try_to_unmap() with TTU_SYNC  (Hugh Dickins)  1 file changed, -1/+1
TTU_SYNC prevents an unlikely race, when try_to_unmap() returns shortly before the page is accounted as unmapped. It is unlikely to coincide with hwpoisoning, but now that we have the flag, hwpoison_user_mappings() would do well to use it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jue Wang <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Wang Yugui <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm/thp: remap_page() is only needed on anonymous THP  (Hugh Dickins)  1 file changed, -0/+5
THP splitting's unmap_page() only sets TTU_SPLIT_FREEZE when PageAnon, and migration entries are only inserted when TTU_MIGRATION (unused here) or TTU_SPLIT_FREEZE is set: so it's just a waste of time for remap_page() to search for migration entries to remove when !PageAnon. Link: https://lkml.kernel.org/r/[email protected] Fixes: baa355fd3314 ("thp: file pages support for split_huge_page()") Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: Yang Shi <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jue Wang <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Wang Yugui <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm: rmap: make try_to_unmap() void function  (Yang Shi)  4 files changed, -21/+14
Currently try_to_unmap() returns a bool value by checking page_mapcount(); however, this may return a false positive since page_mapcount() doesn't check all subpages of a compound page. total_mapcount() could be used instead, but its cost is higher since it traverses all subpages. Actually most callers of try_to_unmap() don't care about the return value at all. So we just need to check whether the page is still mapped via page_mapped() when necessary. And page_mapped() does bail out early when it finds a mapped subpage. Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Hugh Dickins <[email protected]> Signed-off-by: Yang Shi <[email protected]> Acked-by: Minchan Kim <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jue Wang <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Wang Yugui <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
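A sketch of the new caller pattern the changelog describes:

    /* try_to_unmap() no longer reports success; callers that care check
     * the cheaper page_mapped() afterwards. */
    try_to_unmap(page, flags);
    if (!page_mapped(page)) {
            /* all mappings gone, safe to proceed (split, reclaim, ...) */
    }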
2021-06-30  mm/thp: make ARCH_ENABLE_SPLIT_PMD_PTLOCK dependent on PGTABLE_LEVELS > 2  (Anshuman Khandual)  2 files changed, -2/+2
ARCH_ENABLE_SPLIT_PMD_PTLOCK is irrelevant unless there are more than two page table levels including PMD (also per Documentation/vm/split_page_table_lock.rst). Make this dependency explicit on remaining platforms i.e x86 and s390 where ARCH_ENABLE_SPLIT_PMD_PTLOCK is subscribed. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Acked-by: Gerald Schaefer <[email protected]> # s390 Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm: thp: skip make PMD PROT_NONE if THP migration is not supported  (Yang Shi)  1 file changed, -0/+4
A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390 support both NUMA balancing and THP. But S390 doesn't support THP migration, so NUMA balancing actually can't migrate any misplaced pages. Skip making the PMD PROT_NONE in that case, otherwise CPU cycles may be wasted by pointless NUMA hinting faults on S390. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huang Ying <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
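A sketch of the early bail-out; where exactly it sits in the PMD protection path, and the precise return value, are not shown in this log:

    /* NUMA balancing cannot migrate the THP anyway (e.g. on s390), so
     * don't turn the PMD into PROT_NONE and take pointless hinting faults. */
    if (prot_numa && !thp_migration_supported())
            return 0;               /* exact return path illustrative */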
2021-06-30  mm: migrate: check mapcount for THP instead of refcount  (Yang Shi)  1 file changed, -12/+4
The generic migration path will check the refcount, so there is no need to check it here. But the old code actually prevents migrating a shared THP (mapped by multiple processes), so bail out early if the mapcount is > 1 to keep that behavior. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huang Ying <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
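A sketch of the early bail-out for shared THPs described above:

    /* Keep the old behaviour: don't try to migrate a THP that is mapped
     * by more than one process; the generic path checks the refcount. */
    if (page_mapcount(page) > 1)
            return 0;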
2021-06-30  mm: migrate: don't split THP for misplaced NUMA page  (Yang Shi)  1 file changed, -1/+3
The old behavior didn't split the THP if migration failed due to lack of memory on the target node. But THP migration does split the THP, so keep the old behavior for misplaced NUMA page migration. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huang Ying <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm: migrate: account THP NUMA migration counters correctly  (Yang Shi)  1 file changed, -3/+4
Now that both base page and THP NUMA migration are done via migrate_misplaced_page(), keep the counters correct for THP. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huang Ying <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm: thp: refactor NUMA fault handling  (Yang Shi)  4 files changed, -286/+77
When THP NUMA fault support was added, THP migration was not supported yet, so ad hoc THP migration was implemented in the NUMA fault handling. Since v4.14 THP migration has been supported, so it doesn't make much sense to keep another THP migration implementation rather than using the generic migration code. This patch reworks the NUMA fault handling to use the generic migration implementation to migrate misplaced pages. There is no functional change. After the refactor the flow of NUMA fault handling looks just like its PTE counterpart:

  Acquire ptl
  Prepare for migration (elevate page refcount)
  Release ptl
  Isolate page from lru and elevate page refcount
  Migrate the misplaced THP

If migration fails, just restore the old normal PMD. In the old code the anon_vma lock was needed to serialize THP migration against THP split, but since then the THP code has been reworked a lot and it seems the anon_vma lock is not required anymore to avoid the race; the page refcount elevation while holding the ptl should prevent THP split. Use migrate_misplaced_page() for both base page and THP NUMA hinting faults and remove all the dead and duplicate code. [[email protected]: fix a double unlock bug] Link: https://lkml.kernel.org/r/YLX8uYN01JmfLnlK@mwanda Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Signed-off-by: Dan Carpenter <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huang Ying <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm: memory: make numa_migrate_prep() non-static  (Yang Shi)  2 files changed, -3/+5
numa_migrate_prep() will be used by the huge NUMA fault as well in the following patch, so make it non-static. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huang Ying <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm: memory: add orig_pmd to struct vm_fault  (Yang Shi)  4 files changed, -22/+29
Pach series "mm: thp: use generic THP migration for NUMA hinting fault", v3. When the THP NUMA fault support was added THP migration was not supported yet. So the ad hoc THP migration was implemented in NUMA fault handling. Since v4.14 THP migration has been supported so it doesn't make too much sense to still keep another THP migration implementation rather than using the generic migration code. It is definitely a maintenance burden to keep two THP migration implementation for different code paths and it is more error prone. Using the generic THP migration implementation allows us remove the duplicate code and some hacks needed by the old ad hoc implementation. A quick grep shows x86_64, PowerPC (book3s), ARM64 ans S390 support both THP and NUMA balancing. The most of them support THP migration except for S390. Zi Yan tried to add THP migration support for S390 before but it was not accepted due to the design of S390 PMD. For the discussion, please see: https://lkml.org/lkml/2018/4/27/953. Per the discussion with Gerald Schaefer in v1 it is acceptible to skip huge PMD for S390 for now. I saw there were some hacks about gup from git history, but I didn't figure out if they have been removed or not since I just found FOLL_NUMA code in the current gup implementation and they seems useful. Patch #1 ~ #2 are preparation patches. Patch #3 is the real meat. Patch #4 ~ #6 keep consistent counters and behaviors with before. Patch #7 skips change huge PMD to prot_none if thp migration is not supported. Test ---- Did some tests to measure the latency of do_huge_pmd_numa_page. The test VM has 80 vcpus and 64G memory. The test would create 2 processes to consume 128G memory together which would incur memory pressure to cause THP splits. And it also creates 80 processes to hog cpu, and the memory consumer processes are bound to different nodes periodically in order to increase NUMA faults. The below test script is used: echo 3 > /proc/sys/vm/drop_caches # Run stress-ng for 24 hours ./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h & PID=$! ./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h & # Wait for vm stressors forked sleep 5 PID_1=`pgrep -P $PID | awk 'NR == 1'` PID_2=`pgrep -P $PID | awk 'NR == 2'` JOB1=`pgrep -P $PID_1` JOB2=`pgrep -P $PID_2` # Bind load jobs to different nodes periodically to force generate # cross node memory access while [ -d "/proc/$PID" ] do taskset -apc 8 $JOB1 taskset -apc 8 $JOB2 sleep 300 taskset -apc 58 $JOB1 taskset -apc 58 $JOB2 sleep 300 done With the above test the histogram of latency of do_huge_pmd_numa_page is as shown below. Since the number of do_huge_pmd_numa_page varies drastically for each run (should be due to scheduler), so I converted the raw number to percentage. patched base @us[stress-ng]: [0] 3.57% 0.16% [1] 55.68% 18.36% [2, 4) 10.46% 40.44% [4, 8) 7.26% 17.82% [8, 16) 21.12% 13.41% [16, 32) 1.06% 4.27% [32, 64) 0.56% 4.07% [64, 128) 0.16% 0.35% [128, 256) < 0.1% < 0.1% [256, 512) < 0.1% < 0.1% [512, 1K) < 0.1% < 0.1% [1K, 2K) < 0.1% < 0.1% [2K, 4K) < 0.1% < 0.1% [4K, 8K) < 0.1% < 0.1% [8K, 16K) < 0.1% < 0.1% [16K, 32K) < 0.1% < 0.1% [32K, 64K) < 0.1% < 0.1% Per the result, patched kernel is even slightly better than the base kernel. I think this is because the lock contention against THP split is less than base kernel due to the refactor. To exclude the affect from THP split, I also did test w/o memory pressure. No obvious regression is spotted. The below is the test result *w/o* memory pressure. 
patched base @us[stress-ng]: [0] 7.97% 18.4% [1] 69.63% 58.24% [2, 4) 4.18% 2.63% [4, 8) 0.22% 0.17% [8, 16) 1.03% 0.92% [16, 32) 0.14% < 0.1% [32, 64) < 0.1% < 0.1% [64, 128) < 0.1% < 0.1% [128, 256) < 0.1% < 0.1% [256, 512) 0.45% 1.19% [512, 1K) 15.45% 17.27% [1K, 2K) < 0.1% < 0.1% [2K, 4K) < 0.1% < 0.1% [4K, 8K) < 0.1% < 0.1% [8K, 16K) 0.86% 0.88% [16K, 32K) < 0.1% 0.15% [32K, 64K) < 0.1% < 0.1% [64K, 128K) < 0.1% < 0.1% [128K, 256K) < 0.1% < 0.1% The series also survived a series of tests that exercise NUMA balancing migrations by Mel. This patch (of 7): Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by huge page fault could be removed, just like its PTE counterpart does. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Zi Yan <[email protected]> Cc: Huang Ying <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Christian Borntraeger <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs  (Collin Fijalkovich)  2 files changed, -3/+26
Transparent huge pages are supported for read-only non-shmem files, but are only used for vmas with VM_DENYWRITE. This condition ensures that file THPs are protected from writes while an application is running (ETXTBSY). Any existing file THPs are then dropped from the page cache when a file is opened for write in do_dentry_open(). Since sys_mmap ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas produced by execve(). Systems that make heavy use of shared libraries (e.g. Android) are unable to apply VM_DENYWRITE through the dynamic linker, preventing them from benefiting from the resultant reduced contention on the TLB. This patch reduces the constraint on file THPs, allowing use with any executable mapping from a file not opened for write (see inode_is_open_for_write()). It also introduces additional conditions to ensure that files opened for write will never be backed by file THPs. Restricting the use of THPs to executable mappings eliminates the risk that a read-only file later opened for write would encounter significant latencies due to page cache truncation. The ld linker flag '-z max-page-size=(hugepage size)' can be used to produce executables with the necessary layout. The dynamic linker must map these files' segments at a hugepage-size-aligned vma for the mapping to be backed with THPs. Comparison of the performance characteristics of 4KB and 2MB-backed libraries follows; the Android dex2oat tool was used to AOT compile an example application on a single ARM core.

4KB Pages:
==========
count               event_name            # count / runtime
598,995,035,942     cpu-cycles            # 1.800861 GHz
81,195,620,851      raw-stall-frontend    # 244.112 M/sec
347,754,466,597     iTLB-loads            # 1.046 G/sec
2,970,248,900       iTLB-load-misses      # 0.854122% miss rate
Total test time: 332.854998 seconds.

2MB Pages:
==========
count               event_name            # count / runtime
592,872,663,047     cpu-cycles            # 1.800358 GHz
76,485,624,143      raw-stall-frontend    # 232.261 M/sec
350,478,413,710     iTLB-loads            # 1.064 G/sec
803,233,322         iTLB-load-misses      # 0.229182% miss rate
Total test time: 329.826087 seconds

A check of /proc/$(pidof dex2oat64)/smaps shows THPs in use:
/apex/com.android.art/lib64/libart.so
FilePmdMapped:  4096 kB
/apex/com.android.art/lib64/libart-compiler.so
FilePmdMapped:  2048 kB

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Collin Fijalkovich <[email protected]> Acked-by: Hugh Dickins <[email protected]> Reviewed-by: William Kucharski <[email protected]> Acked-by: Song Liu <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Hridya Valsaraju <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Tim Murray <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Alexander Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
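A sketch of the relaxed eligibility test implied by the changelog; the surrounding function and the exact condition are not shown here:

    /* Executable file mapping whose inode nobody has open for write:
     * eligible for read-only file-backed THPs. */
    bool thp_ok = (vma->vm_flags & VM_EXEC) && vma->vm_file &&
                  !inode_is_open_for_write(file_inode(vma->vm_file));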
2021-06-30mm: migrate: fix missing update page_private to hugetlb_page_subpoolMuchun Song2-1/+6
Commit d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags") converted page.private to hold hugetlb specific page flags, so hugetlb_page_subpool() should be used to get the subpool pointer instead of page_private(). The commit forgot to update the page migration routine, and this could prevent the migration of hugetlb pages.

page_private(hpage) is now used for hugetlb page specific flags. At migration time, the only flag which could be set is HPageVmemmapOptimized, and that flag is only set if the new vmemmap reduction feature is enabled. In addition, !page_mapping() implies an anonymous mapping, so the stale check would prevent migration of hugetlb pages in anonymous mappings whenever the vmemmap reduction feature is enabled. The same if statement also checks for the rare race condition of a page being migrated while in the process of being freed; since that check is now wrong, hugetlb subpool usage counts could be leaked.

Switch the migration routine to hugetlb_page_subpool() to fix this. [[email protected]: fix compiler error when !CONFIG_HUGETLB_PAGE reported by Randy] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags") Signed-off-by: Muchun Song <[email protected]> Reported-by: Anshuman Khandual <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Acked-by: Michal Hocko <[email protected]> Tested-by: Anshuman Khandual <[email protected]> [arm64] Cc: Oscar Salvador <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
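A minimal sketch of the fix in the hugetlb migration path; the surrounding error handling is assumed from the description above, not quoted verbatim:

    /*
     * A hugetlb page that still has a subpool attached but no mapping is in
     * the process of being freed: do not migrate it. page_private() can no
     * longer be used to detect the subpool after d6995da31122, so go through
     * the accessor.
     */
    if (hugetlb_page_subpool(hpage) && !page_mapping(hpage)) {
            rc = -EBUSY;    /* was: if (page_private(hpage) && !page_mapping(hpage)) */
            goto out_unlock;
    }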
2021-06-30arm64/mm: drop HAVE_ARCH_PFN_VALIDAnshuman Khandual4-39/+9
CONFIG_SPARSEMEM_VMEMMAP is now the only available memory model on arm64 platforms and free_unused_memmap() would just return without creating any holes in the memmap mapping. There is no need for any special handling in pfn_valid() and HAVE_ARCH_PFN_VALID can just be dropped. This also moves the pfn upper bits sanity check into generic pfn_valid(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Mike Rapoport <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
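Roughly, the generic SPARSEMEM pfn_valid() with the pfn upper-bits sanity check folded in looks like the sketch below (not the verbatim upstream code):

    static inline int pfn_valid(unsigned long pfn)
    {
            struct mem_section *ms;

            /*
             * Reject pfns whose upper bits do not survive a round trip
             * through a physical address; this is the check that used to
             * live in the arm64-specific implementation.
             */
            if (PHYS_PFN(PFN_PHYS(pfn)) != pfn)
                    return 0;

            if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
                    return 0;
            ms = __nr_to_section(pfn_to_section_nr(pfn));
            if (!valid_section(ms))
                    return 0;
            /* early sections report the whole section-sized span as valid */
            return early_section(ms) || pfn_section_valid(ms, pfn);
    }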
2021-06-30arm64: drop pfn_valid_within() and simplify pfn_valid()Mike Rapoport2-2/+1
The arm64 version of pfn_valid() differs from the generic one for two reasons:

* Parts of the memory map are freed during boot. This makes it necessary to verify that there is actual physical memory corresponding to a pfn, which is done by querying memblock.

* There are NOMAP memory regions. These regions are not mapped in the linear map, and until the previous commit the struct pages representing these areas held default values.

Because NOMAP regions received no special treatment in the memory map, it was necessary to use memblock_is_map_memory() in pfn_valid() and to alias pfn_valid_within() to pfn_valid() so that generic mm functionality would not treat a NOMAP page as a normal page.

Since the NOMAP regions are now marked as PageReserved(), pfn walkers and the rest of core mm treat them as unusable memory, so pfn_valid_within() is no longer required at all and can be disabled on arm64. pfn_valid() can also be slightly simplified by replacing memblock_is_map_memory() with memblock_is_memory(). [[email protected]: fix merge fix] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Ard Biesheuvel <[email protected]> Reviewed-by: Kefeng Wang <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
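The simplification itself amounts to a one-line change in the tail of the arm64 pfn_valid() (sketch, assuming the shape of the function at the time):

    /*
     * NOMAP ranges now have properly initialized, PageReserved struct pages,
     * so pfn_valid() only needs to know whether the pfn is covered by memory
     * at all, not whether it is in the linear map.
     */
    return memblock_is_memory(addr);        /* was: memblock_is_map_memory(addr) */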
2021-06-30arm64: decouple check whether pfn is in linear map from pfn_valid()Mike Rapoport6-6/+19
The intended semantics of pfn_valid() is to verify whether there is a struct page for the pfn in question, and nothing else. Yet, on arm64 it is also used to distinguish memory areas that are mapped in the linear map from those that require ioremap() to access. Introduce a dedicated pfn_is_map_memory() wrapper around memblock_is_map_memory() to perform this check, and use it where appropriate. Using a wrapper avoids cyclic include dependencies. While here, also update the style of pfn_valid() so that the pfn_valid() and pfn_is_map_memory() declarations are consistent. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Ard Biesheuvel <[email protected]> Reviewed-by: Kefeng Wang <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
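The wrapper is small; a sketch along the lines of the arm64 implementation described above (not guaranteed verbatim):

    /* arch/arm64/mm/init.c */
    int pfn_is_map_memory(unsigned long pfn)
    {
            phys_addr_t addr = PFN_PHYS(pfn);

            /* avoid false positives for bogus pfns, same guard as pfn_valid() */
            if (PHYS_PFN(addr) != pfn)
                    return 0;

            return memblock_is_map_memory(addr);
    }
    EXPORT_SYMBOL(pfn_is_map_memory);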
2021-06-30memblock: update initialization of reserved pagesMike Rapoport2-3/+29
The struct pages representing a reserved memory region are initialized by the reserve_bootmem_region() function. This function is called for each reserved region just before the memory is freed from memblock to the buddy page allocator. The struct pages for MEMBLOCK_NOMAP regions, however, are left with the default values set by the memory map initialization, which makes it necessary to treat such pages specially in pfn_valid() and pfn_valid_within(). Split the initialization of the reserved pages out into a function with a meaningful name, treat the MEMBLOCK_NOMAP regions the same way as the reserved regions, and mark the struct pages for the NOMAP regions as PageReserved. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Acked-by: Ard Biesheuvel <[email protected]> Reviewed-by: Kefeng Wang <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
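A sketch of the split-out initializer described above; the shape is inferred from the description, using the memblock iterators that exist upstream:

    static void __init memmap_init_reserved_pages(void)
    {
            struct memblock_region *region;
            phys_addr_t start, end;
            u64 i;

            /* initialize struct pages for the reserved regions */
            for_each_reserved_mem_range(i, &start, &end)
                    reserve_bootmem_region(start, end);

            /* and also mark struct pages for the NOMAP regions as PageReserved */
            for_each_mem_region(region) {
                    if (memblock_is_nomap(region)) {
                            start = region->base;
                            end = start + region->size;
                            reserve_bootmem_region(start, end);
                    }
            }
    }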
2021-06-30include/linux/mmzone.h: add documentation for pfn_valid()Mike Rapoport1-0/+11
Patch series "arm64: drop pfn_valid_within() and simplify pfn_valid()", v4. These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire pfn_valid_within() to 1. The idea is to mark NOMAP pages as reserved in the memory map and restore the intended semantics of pfn_valid() to designate availability of struct page for a pfn. With this the core mm will be able to cope with the fact that it cannot use NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER blocks will be treated correctly even without the need for pfn_valid_within. This patch (of 4): Add comment describing the semantics of pfn_valid() that clarifies that pfn_valid() only checks for availability of a memory map entry (i.e. struct page) for a PFN rather than availability of usable memory backing that PFN. The most "generic" version of pfn_valid() used by the configurations with SPARSEMEM enabled resides in include/linux/mmzone.h so this is the most suitable place for documentation about semantics of pfn_valid(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Suggested-by: Anshuman Khandual <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Acked-by: Ard Biesheuvel <[email protected]> Reviewed-by: Kefeng Wang <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm/mempolicy: use unified 'nodes' for bind/interleave/prefer policiesBen Widawsky2-57/+46
The current 'mempolicy' structure uses a union to store the node info for the bind/interleave/prefer policies:

    union {
            short preferred_node;   /* preferred */
            nodemask_t nodes;       /* interleave/bind */
            /* undefined for default */
    } v;

Since the preferred node can also be represented by a nodemask_t with only one bit set, unify these policies around a single nodemask_t 'nodes'. This removes the union, simplifies the code, and makes it easier to carry node info for any new policies added in the future. Link: https://lore.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Co-developed-by: Feng Tang <[email protected]> Signed-off-by: Ben Widawsky <[email protected]> Signed-off-by: Feng Tang <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
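After the change, the structure and the preferred-node lookup are roughly as sketched below (not the full upstream definition):

    struct mempolicy {
            atomic_t refcnt;
            unsigned short mode;    /* MPOL_* policy */
            unsigned short flags;   /* MPOL_F_* flags */
            nodemask_t nodes;       /* bind/interleave/preferred */
            /* ... cpuset/user nodemask bookkeeping unchanged ... */
    };

    /* the preferred node is simply the single bit set in 'nodes': */
    int nid = first_node(pol->nodes);       /* was: pol->v.preferred_node */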
2021-06-30mm: mempolicy: don't have to split pmd for huge zero pageYang Shi1-4/+5
When trying to migrate pages to obey a mempolicy, the huge zero page is split by inserting the base zero pfn into all PTEs; the page table walk then falls back to PTE level and simply skips the zero page. Skipping the zero page for mempolicy has been the kernel's behavior since v2.6.16, due to commit f4598c8b3678 ("[PATCH] migration: make sure there is no attempt to migrate reserved pages."). So it seems pointless to split the huge zero page; it can just be skipped like the base zero page. Set ACTION_CONTINUE to prevent walk_page_range() from splitting the pmd in this case. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Zi Yan <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
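A minimal sketch of the change in the mempolicy page walk's PMD handler (surrounding locking elided, labels assumed):

    page = pmd_page(*pmd);
    if (is_huge_zero_page(page)) {
            /*
             * Do not split the huge zero page; tell walk_page_range() to
             * carry on at PMD granularity instead.
             *
             * was: __split_huge_pmd(walk->vma, pmd, addr, false, NULL);
             */
            walk->action = ACTION_CONTINUE;
            goto unlock;
    }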
2021-06-30mm/mempolicy: unify the parameter sanity check for mbind and set_mempolicyFeng Tang1-18/+30
Currently kernel_mbind() and kernel_set_mempolicy() perform almost the same parameter sanity checks. Add a helper function to unify the code, reduce the redundancy, and make it easier to change the sanity check logic in the future. [thanks to David Rientjes for suggesting using a helper function instead of a macro]. [[email protected]: add comment] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Feng Tang <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ben Widawsky <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Dan Williams <[email protected]> Cc: Huang Ying <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
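The shared helper looks roughly like the sketch below; the exact name and placement are assumptions based on the description above:

    static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
    {
            /* split the mode-embedded flag bits out of the user-supplied mode */
            *flags = *mode & MPOL_MODE_FLAGS;
            *mode &= ~MPOL_MODE_FLAGS;

            if ((unsigned int)(*mode) >= MPOL_MAX)
                    return -EINVAL;
            if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
                    return -EINVAL;
            return 0;
    }

    /* both kernel_mbind() and kernel_set_mempolicy() would then start with: */
    err = sanitize_mpol_flags(&mode, &mode_flags);
    if (err)
            return err;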