2021-06-30  mm/mempolicy: unify the parameter sanity check for mbind and set_mempolicy  (Feng Tang, 1 file, -18/+30)
Currently, kernel_mbind() and kernel_set_mempolicy() perform almost the same parameter sanity checks. Add a helper function to unify the code, reducing the redundancy and making it easier to change the sanity-check code in the future. [Thanks to David Rientjes for suggesting a helper function instead of a macro.] [[email protected]: add comment] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Feng Tang <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ben Widawsky <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Dan Williams <[email protected]> Cc: Huang Ying <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
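As a rough illustration of the shape such a helper can take (the name sanitize_mpol_flags() and the exact checks are assumptions for this sketch, not a verbatim copy of the patch):

    /*
     * Split the user value into mode and mode flags, and reject
     * combinations the kernel cannot honor.
     */
    static int sanitize_mpol_flags(int *mode, unsigned short *flags)
    {
        *flags = *mode & MPOL_MODE_FLAGS;
        *mode &= ~MPOL_MODE_FLAGS;

        if ((unsigned int)(*mode) >= MPOL_MAX)
            return -EINVAL;
        /* The two nodemask interpretation flags are mutually exclusive */
        if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
            return -EINVAL;
        return 0;
    }

Both kernel_mbind() and kernel_set_mempolicy() can then call this once instead of open-coding the same checks.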
2021-06-30  mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy  (Feng Tang, 2 files, -81/+56)
The MPOL_LOCAL policy has been set up as a real policy, but it is still handled like a fake MPOL_PREFERRED policy with the internal MPOL_F_LOCAL flag bit set, and many places have to distinguish the real 'prefer' policy from the 'local' one, which is quite confusing. In the current code, MPOL_LOCAL is used in 4 cases: 1. the user specifies the 'local' policy 2. the user specifies the 'prefer' policy with an empty nodemask 3. the system 'default' policy is used 4. the 'prefer' policy + a valid 'preferred' node with the MPOL_F_STATIC_NODES flag set; when it is 'rebind'-ed to a nodemask which doesn't contain the 'preferred' node, it behaves as the 'local' policy So make 'local' a real policy instead of a fake 'prefer' one, and kill the MPOL_F_LOCAL bit, which greatly reduces the confusion when reading the code. For case 4, the logic of mpol_rebind_preferred() is confusing, as Michal Hocko pointed out: : I do believe that rebinding preferred policy is just bogus and it should : be dropped altogether on the ground that a preference is a mere hint from : userspace where to start the allocation. Unless I am missing something : cpusets will be always authoritative for the final placement. The : preferred node just acts as a starting point and it should be really : preserved when cpusets changes. Otherwise we have a very subtle behavior : corner cases. So drop all the tricky transformations between 'prefer' and 'local', and just record the new nodemask of the rebinding. [[email protected]: fix a problem in mpol_set_nodemask(), per Michal Hocko] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: refine code and comments of mpol_set_nodemask(), per Michal] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Feng Tang <[email protected]> Suggested-by: Michal Hocko <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Ben Widawsky <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Huang Ying <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm/mempolicy: cleanup nodemask intersection check for oom  (Feng Tang, 3 files, -27/+11)
Patch series "mm/mempolicy: some fix and semantics cleanup", v4. Current memory policy code has some confusing and ambiguous part about MPOL_LOCAL policy, as it is handled as a faked MPOL_PREFERRED one, and there are many places having to distinguish them. Also the nodemask intersection check needs cleanup to be more explicit for OOM use, and handle MPOL_INTERLEAVE correctly. This patchset cleans up these and unifies the parameter sanity check for mbind() and set_mempolicy(). This patch (of 3): mempolicy_nodemask_intersects seem to be a general purpose mempolicy function. In fact it is partially tailored for the OOM purpose instead. The oom proper is the only existing user so rename the function to make that purpose explicit. While at it drop the MPOL_INTERLEAVE as those allocations never has a nodemask defined (see alloc_page_interleave) so this is a dead code and a confusing one because MPOL_INTERLEAVE is a hint rather than a hard requirement so it shouldn't be considered during the OOM. The final code can be reduced to a check for MPOL_BIND which is the only memory policy that is a hard requirement and thus relevant to a constrained OOM logic. [[email protected]: changelog edits] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Feng Tang <[email protected]> Suggested-by: Michal Hocko <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Ben Widawsky <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Huang Ying <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm/compaction: fix 'limit' in fast_isolate_freepages  (Wonhyuk Yang, 1 file, -3/+3)
Because of 'min(1, ...)', fast_isolate_freepages() sets 'limit' to 0 or 1. This takes away the opportunity to find candidate pages; making enough scans available increases the probability of finding an appropriate free page. Tested on thpscale; the results are as follows:

                              5.12.0                 5.12.0
                             vanilla                patched
Amean  fault-both-1      598.15 (  0.00%)      592.56 (  0.93%)
Amean  fault-both-3     1494.47 (  0.00%)     1514.35 ( -1.33%)
Amean  fault-both-5     2519.48 (  0.00%)     2471.76 (  1.89%)
Amean  fault-both-7     3173.85 (  0.00%)     3079.19 (  2.98%)
Amean  fault-both-12    8063.83 (  0.00%)     7858.29 (  2.55%)
Amean  fault-both-18    8781.20 (  0.00%)     7827.70 * 10.86%*
Amean  fault-both-24   12576.44 (  0.00%)    12250.20 (  2.59%)
Amean  fault-both-30   18503.27 (  0.00%)    17528.11 *  5.27%*
Amean  fault-both-32   16133.69 (  0.00%)    13874.24 * 14.00%*

                                     5.12.0         5.12.0
                                    vanilla        patched
Ops Compaction migrate scanned   6547133.00     5963901.00
Ops Compaction free scanned     32452453.00    26609101.00

                       5.12           5.12
                    vanilla        patched
Duration User         27.99          28.84
Duration System      244.08         236.76
Duration Elapsed      78.27          78.38

Link: https://lkml.kernel.org/r/[email protected] Fixes: 5a811889de10f ("mm, compaction: use free lists to quickly locate a migration target") Signed-off-by: Wonhyuk Yang <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
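In code terms, the problem and fix described above look roughly like this (the freelist_scan_limit() expression is assumed for illustration):

    /* Before: min() clamps the scan budget to at most 1 */
    unsigned int limit = min(1U, freelist_scan_limit(cc) >> 1);

    /* After: max() guarantees at least one scan while allowing more */
    unsigned int limit = max(1U, freelist_scan_limit(cc) >> 1);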
2021-06-30  mm: compaction: remove duplicate !list_empty(&sublist) check  (Liu Xiang, 1 file, -4/+2)
list_splice_tail(&sublist, freelist) already does the !list_empty(&sublist) check itself, so remove the duplicate check. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liu Xiang <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
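For reference, list_splice_tail() bails out on an empty source list by itself, so the pattern collapses like this (sketch):

    /* Before: redundant outer check */
    if (!list_empty(&sublist))
        list_splice_tail(&sublist, freelist);

    /* After: the helper already handles the empty case */
    list_splice_tail(&sublist, freelist);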
2021-06-30  mm/compaction: use DEVICE_ATTR_WO macro  (YueHaibing, 1 file, -4/+4)
Use DEVICE_ATTR_WO helper instead of plain DEVICE_ATTR, which makes the code a bit shorter and easier to read. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: YueHaibing <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
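As an example of the transformation (the attribute name 'compact' is illustrative, not necessarily the one touched by the patch):

    /* Before: mode and callbacks spelled out by hand */
    static DEVICE_ATTR(compact, 0200, NULL, compact_store);

    /* After: expands to the same write-only attribute, wiring up
     * <name>_store automatically */
    static DEVICE_ATTR_WO(compact);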
2021-06-30  mm/zbud: don't export any zbud API  (Miaohe Lin, 4 files, -138/+110)
zbud doesn't need to export any API; it is meant to be used via the zpool API since commit 12d79d64bfd3 ("mm/zpool: update zswap to use zpool"). So we can remove the unneeded zbud.h and move the zpool API down to avoid any forward declarations. [[email protected]: fix unused function warnings when CONFIG_ZPOOL is disabled] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Nathan Chancellor <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm/zbud: reuse unbuddied[0] as buddied in zbud_pool  (Miaohe Lin, 1 file, -2/+8)
Patch series "Cleanups for zbud", v2. This series contains just cleanups to save some possible memory in zbud_pool and avoid exporting any unneeded zbud API. More details can be found in the respective changelogs This patch (of 2): Since commit 9d8c5b5284e4 ("mm: zbud: fix condition check on allocation size"), zbud_pool.unbuddied[0] is always unused. We can reuse it as buddied field to save some possible memory. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Dan Streetman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm/z3fold: use release_z3fold_page_locked() to release locked z3fold page  (Miaohe Lin, 1 file, -1/+1)
We should use release_z3fold_page_locked() to release z3fold page when it's locked, although it looks harmless to use release_z3fold_page() now. Link: https://lkml.kernel.org/r/[email protected] Fixes: dcf5aedb24f8 ("z3fold: stricter locking and more careful reclaim") Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vitaly Wool <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm/z3fold: fix potential memory leak in z3fold_destroy_pool()  (Miaohe Lin, 1 file, -0/+1)
There is a memory leak in z3fold_destroy_pool(), as it forgets to free pool->unbuddied. Call free_percpu() on pool->unbuddied to fix this issue. Link: https://lkml.kernel.org/r/[email protected] Fixes: d30561c56f41 ("z3fold: use per-cpu unbuddied lists") Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vitaly Wool <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
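The fix amounts to one extra line in the teardown path (sketch; the surrounding cleanup is elided):

    static void z3fold_destroy_pool(struct z3fold_pool *pool)
    {
        /* ... existing teardown ... */
        free_percpu(pool->unbuddied);  /* previously leaked */
        kfree(pool);
    }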
2021-06-30  mm/z3fold: remove unused function handle_to_z3fold_header()  (Miaohe Lin, 1 file, -18/+4)
handle_to_z3fold_header() is unused now. So we can remove it. As a result, get_z3fold_header() becomes the only caller of __get_z3fold_header() and the argument lock is always true. Therefore we could further fold the __get_z3fold_header() into get_z3fold_header() with lock = true. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vitaly Wool <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm/z3fold: remove magic number in z3fold_create_pool()  (Miaohe Lin, 1 file, -1/+2)
It's meaningless to pass the magic number 2 to __alloc_percpu(), as it has a minimum alignment size of PCPU_MIN_ALLOC_SIZE (> 2). Also there is no special alignment requirement for unbuddied. So we can replace this magic number with the natural alignment, i.e. __alignof__(struct list_head), to improve readability. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vitaly Wool <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
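The change is mechanical (sketch):

    /* Before: magic alignment of 2, rounded up internally anyway */
    pool->unbuddied = __alloc_percpu(sizeof(struct list_head) * NCHUNKS, 2);

    /* After: natural alignment of the element type */
    pool->unbuddied = __alloc_percpu(sizeof(struct list_head) * NCHUNKS,
                                     __alignof__(struct list_head));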
2021-06-30  mm/z3fold: avoid possible underflow in z3fold_alloc()  (Miaohe Lin, 1 file, -2/+5)
It is not enough to just make sure the z3fold header is not larger than the page size. When the z3fold header equals PAGE_SIZE, we would underflow when checking the allocation size against PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE in z3fold_alloc(). Make sure there is remaining space for its buddy to fix this theoretical issue. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vitaly Wool <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
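A sketch of the strengthened check; whether it lives in a BUILD_BUG_ON or a runtime check is an assumption here:

    /* Before: only rules out a header larger than the page */
    BUILD_BUG_ON(ZHDR_SIZE_ALIGNED > PAGE_SIZE);

    /* After: also leave room for at least one chunk of buddy space, so
     * PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE cannot underflow */
    BUILD_BUG_ON(ZHDR_SIZE_ALIGNED > PAGE_SIZE - CHUNK_SIZE);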
2021-06-30  mm/z3fold: define macro NCHUNKS as TOTAL_CHUNKS - ZHDR_CHUNKS  (Miaohe Lin, 1 file, -1/+1)
Patch series "Cleanup and fixup for z3fold". This series contains cleanups to remove unused function, redefine macro to improve readability and so on. Also this fixes several bugs in z3fold, such as memory leak in z3fold_destroy_pool(). More details can be found in the respective changelogs. This patch (of 6): To improve code readability, we could define macro NCHUNKS as TOTAL_CHUNKS - ZHDR_CHUNKS. No functional change intended. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vitaly Wool <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  fs/proc/kcore: use page_offline_(freeze|thaw)  (David Hildenbrand, 1 file, -0/+13)
Let's properly synchronize with drivers that set PageOffline(). Unfreeze/thaw every now and then, so drivers that want to set PageOffline() can make progress. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Mike Rapoport <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Aili Yao <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Alex Shi <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jiri Bohac <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Steven Price <[email protected]> Cc: Wei Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  virtio-mem: use page_offline_(start|end) when setting PageOffline()  (David Hildenbrand, 1 file, -0/+2)
Let's properly use page_offline_(start|end) to synchronize setting PageOffline(), so we won't have valid page access to unplugged memory regions from /proc/kcore. Existing balloon implementations usually allow reading inflated memory; doing so might result in unnecessary overhead in the hypervisor, which is currently the case with virtio-mem. For future virtio-mem use cases, it will be different when using shmem, huge pages, !anonymous private mappings, ... as backing storage for a VM. virtio-mem unplugged memory must no longer be accessed and access might result in undefined behavior. There will be a virtio spec extension to document this change, including a new feature flag indicating the changed behavior. We really don't want to race against PFN walkers reading random page content. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Acked-by: Mike Rapoport <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Aili Yao <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Alex Shi <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jiri Bohac <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Steven Price <[email protected]> Cc: Wei Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline()  (David Hildenbrand, 2 files, -0/+50)
A driver might set a page logically offline -- PageOffline() -- and turn the page inaccessible in the hypervisor; after that, access to page content can be fatal. One example is virtio-mem; while unplugged memory -- marked as PageOffline() -- can currently be read in the hypervisor, this will no longer be the case in the future; for example, when having a virtio-mem device backed by huge pages in the hypervisor. Some special PFN walkers -- i.e., /proc/kcore -- read content of random pages after checking PageOffline(); however, these PFN walkers can race with drivers that set PageOffline(). Let's introduce page_offline_(begin|end|freeze|thaw) for synchronizing. page_offline_freeze()/page_offline_thaw() allows for a subsystem to synchronize with such drivers, achieving that a page cannot be set PageOffline() while frozen. page_offline_begin()/page_offline_end() is used by drivers that care about such races when setting a page PageOffline(). For simplicity, use a rwsem for now; neither drivers nor users are performance sensitive. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Aili Yao <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Alex Shi <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jiri Bohac <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Steven Price <[email protected]> Cc: Wei Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
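With a plain rwsem as described, the whole mechanism is small: PFN walkers take the lock for reading, drivers flipping PageOffline() take it for writing (a minimal sketch of the described design):

    static DECLARE_RWSEM(page_offline_rwsem);

    /* Used by PFN walkers such as /proc/kcore */
    void page_offline_freeze(void)
    {
        down_read(&page_offline_rwsem);
    }

    void page_offline_thaw(void)
    {
        up_read(&page_offline_rwsem);
    }

    /* Used by drivers while they set/clear PageOffline() */
    void page_offline_begin(void)
    {
        down_write(&page_offline_rwsem);
    }

    void page_offline_end(void)
    {
        up_write(&page_offline_rwsem);
    }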
2021-06-30  fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages  (David Hildenbrand, 2 files, -1/+25)
Let's avoid reading: 1) Offline memory sections: the content of offline memory sections is stale as the memory is effectively unused by the kernel. On s390x with standby memory, offline memory sections (belonging to offline storage increments) are not accessible. With virtio-mem and the hyper-v balloon, we can have unavailable memory chunks that should not be accessed inside offline memory sections. Last but not least, offline memory sections might contain hwpoisoned pages which we can no longer identify because the memmap is stale. 2) PG_offline pages: logically offline pages that are documented as "The content of these pages is effectively stale. Such pages should not be touched (read/write/dump/save) except by their owner.". Examples include pages inflated in a balloon or unavailable memory ranges inside hotplugged memory sections with virtio-mem or the hyper-v balloon. 3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal. As documented: "Accessing is not safe since it may cause another machine check. Don't touch!" Introduce is_page_hwpoison(), adding a comment that it is inherently racy but the best we can really do. Reading /proc/kcore now performs similar checks as when reading /proc/vmcore for kdump via makedumpfile: problematic pages are excluded. It's also similar to hibernation code, however, we don't skip hwpoisoned pages when processing pages in kernel/power/snapshot.c:saveable_page() yet. Note 1: we can race against memory offlining code, especially memory going offline and getting unplugged: however, we will properly tear down the identity mapping and handle faults gracefully when accessing this memory from kcore code. Note 2: we can race against drivers setting PageOffline() and turning memory inaccessible in the hypervisor. We'll handle this in a follow-up patch. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Aili Yao <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Alex Shi <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jiri Bohac <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Steven Price <[email protected]> Cc: Wei Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
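Conceptually, the kcore read path gains per-page checks along these lines before copying anything out (a simplified sketch; variable names and control flow are assumed):

    struct page *page = pfn_to_online_page(pfn);

    if (!page || PageOffline(page) || is_page_hwpoison(page)) {
        /* Offline section, logically offline page, or hwpoisoned
         * page: zero-fill instead of touching the memory. */
        if (clear_user(buffer, tsz))
            return -EFAULT;
    } else if (copy_to_user(buffer, __va(PFN_PHYS(pfn)), tsz)) {
        return -EFAULT;
    }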
2021-06-30  fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM  (David Hildenbrand, 1 file, -8/+27)
Let's restructure the code using a switch-case, checking pfn_is_ram() only when dealing with KCORE_RAM. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Cc: Aili Yao <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Alex Shi <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jiri Bohac <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Steven Price <[email protected]> Cc: Wei Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
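The resulting shape (a simplified sketch):

    switch (m->type) {
    case KCORE_VMALLOC:
        /* vread() copes with unmapped holes on its own */
        break;
    case KCORE_RAM:
        /* the pfn_is_ram() check only makes sense for RAM entries */
        if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
            /* skip: zero-fill this chunk instead */
            break;
        }
        fallthrough;
    case KCORE_VMEMMAP:
    case KCORE_TEXT:
        /* read via the kernel mapping */
        break;
    }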
2021-06-30  fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER  (David Hildenbrand, 2 files, -8/+2)
Patch series "fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages", v3. Looking for places where the kernel might unconditionally read PageOffline() pages, I stumbled over /proc/kcore; turns out /proc/kcore needs some more love to not touch some other pages we really don't want to read -- i.e., hwpoisoned ones. Examples for PageOffline() pages are pages inflated in a balloon, memory unplugged via virtio-mem, and partially-present sections in memory added by the Hyper-V balloon. When reading pages inflated in a balloon, we essentially produce unnecessary load in the hypervisor; holes in partially present sections in case of Hyper-V are not accessible and already were a problem for /proc/vmcore, fixed in makedumpfile by detecting PageOffline() pages. In the future, virtio-mem might disallow reading unplugged memory -- marked as PageOffline() -- in some environments, resulting in undefined behavior when accessed; therefore, I'm trying to identify and rework all these (corner) cases. With this series, there is really only access via /dev/mem, /proc/vmcore and kdb left after I ripped out /dev/kmem. kdb is an advanced corner-case use case -- we won't care for now if someone explicitly tries to do nasty things by reading from/writing to physical addresses we better not touch. /dev/mem is a use case we won't support for virtio-mem, at least for now, so we'll simply disallow mapping any virtio-mem memory via /dev/mem next. /proc/vmcore is really only a problem when dumping the old kernel via something that's not makedumpfile (read: basically never), however, we'll try sanitizing that as well in the second kernel in the future. Tested via kcore_dump: https://github.com/schlafwandler/kcore_dump This patch (of 6): Commit db779ef67ffe ("proc/kcore: Remove unused kclist_add_remap()") removed the last user of KCORE_REMAP. Commit 595dd46ebfc1 ("vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when dumping vsyscall user page") removed the last user of KCORE_OTHER. Let's drop both types. While at it, also drop vaddr in "struct kcore_list", used by KCORE_REMAP only. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Jason Wang <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Alex Shi <[email protected]> Cc: Steven Price <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Aili Yao <[email protected]> Cc: Jiri Bohac <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Wei Liu <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  docs: proc.rst: meminfo: briefly describe gaps in memory accounting  (Mike Rapoport, 1 file, -2/+9)
Add a paragraph that explains that it may happen that the counters in /proc/meminfo do not add up to the overall memory usage. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Eric Dumazet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm/kconfig: move HOLES_IN_ZONE into mm  (Kefeng Wang, 4 files, -9/+4)
commit a55749639dc1 ("ia64: drop marked broken DISCONTIGMEM and VIRTUAL_MEM_MAP") drop VIRTUAL_MEM_MAP, so there is no need HOLES_IN_ZONE on ia64. Also move HOLES_IN_ZONE into mm/Kconfig, select it if architecture needs this feature. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Acked-by: Catalin Marinas <[email protected]> [arm64] Cc: Will Deacon <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm: workingset: define macro WORKINGSET_SHIFT  (Miaohe Lin, 1 file, -4/+6)
The magic number 1 is used in several places in workingset.c. Define a macro WORKINGSET_SHIFT for it to improve code readability. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
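For instance, the eviction-entry packing is one spot where the bare 1 lived (a before/after sketch; the surrounding definition is assumed from context):

    /* Before: a magic 1 for the workingset bit */
    #define EVICTION_SHIFT  ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \
                             1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT)

    /* After */
    #define WORKINGSET_SHIFT 1
    #define EVICTION_SHIFT  ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \
                             WORKINGSET_SHIFT + NODES_SHIFT + \
                             MEM_CGROUP_ID_SHIFT)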
2021-06-30  include/trace/events/vmscan.h: remove mm_vmscan_inactive_list_is_low  (Yu Zhao, 1 file, -41/+0)
mm_vmscan_inactive_list_is_low has no users after commit b91ac374346b ("mm: vmscan: enforce inactive:active ratio at the reclaim root"). Remove it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yu Zhao <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm/vmscan.c: fix potential deadlock in reclaim_pages()  (Yu Zhao, 1 file, -0/+15)
Theoretically, without the protection of memalloc_noreclaim_save() and memalloc_noreclaim_restore(), reclaim_pages() can recurse into the block I/O layer and deadlock. Querying 'reclaim_pages' in our kernel crash databases didn't yield any results, so the deadlock seems unlikely to happen. A possible explanation is that the only user of reclaim_pages(), i.e., MADV_PAGEOUT, is usually called before memory pressure builds up, e.g., on Android and Chrome OS. Under such a condition, allocations in the block I/O layer can be fulfilled without diverting to direct reclaim, and therefore the recursion is avoided. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yu Zhao <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
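The fix brackets the reclaim with the noreclaim scope API (sketch; do_the_reclaim() is a placeholder for the existing shrink logic):

    unsigned long reclaim_pages(struct list_head *page_list)
    {
        unsigned int noreclaim_flag;
        unsigned long nr_reclaimed;

        /* PF_MEMALLOC scope: allocations made by the block I/O layer
         * below us must not recurse into direct reclaim */
        noreclaim_flag = memalloc_noreclaim_save();
        nr_reclaimed = do_the_reclaim(page_list);  /* placeholder */
        memalloc_noreclaim_restore(noreclaim_flag);

        return nr_reclaimed;
    }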
2021-06-30  userfaultfd/selftests: exercise minor fault handling shmem support  (Axel Rasmussen, 1 file, -4/+25)
Enable test_uffdio_minor for test_type == TEST_SHMEM, and modify the test slightly to pass in / check for the right feature flags. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Axel Rasmussen <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  userfaultfd/selftests: reinitialize test context in each test  (Axel Rasmussen, 1 file, -105/+117)
Currently, the context (fds, mmap-ed areas, etc.) are global. Each test mutates this state in some way, in some cases really "clobbering it" (e.g., the events test mremap-ing area_dst over the top of area_src, or the minor faults tests overwriting the count_verify values in the test areas). We run the tests in a particular order, each test is careful to make the right assumptions about its starting state, etc. But, this is fragile. It's better for a test's success or failure to not depend on what some other prior test case did to the global state. To that end, clear and reinitialize the test context at the start of each test case, so whatever prior test cases did doesn't affect future tests. This is particularly relevant to this series because the events test's mremap of area_dst screws up assumptions the minor fault test was relying on. This wasn't a problem for hugetlb, as we don't mremap in that case. [[email protected]: fix conflict between this patch and the uffd pagemap series] Link: https://lkml.kernel.org/r/YKQqKrl+/cQ1utrb@t490s Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Axel Rasmussen <[email protected]> Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  userfaultfd/selftests: create alias mappings in the shmem test  (Axel Rasmussen, 1 file, -3/+19)
Previously, we just allocated two shm areas: area_src and area_dst. With this commit, change this so we also allocate area_src_alias, and area_dst_alias. area_*_alias and area_* (respectively) point to the same underlying physical pages, but are different VMAs. In a future commit in this series, we'll leverage this setup to exercise minor fault handling support for shmem, just like we do in the hugetlb_shared test. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Axel Rasmussen <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  userfaultfd/selftests: use memfd_create for shmem test type  (Axel Rasmussen, 1 file, -1/+15)
This is a preparatory commit. In the future, we want to be able to setup alias mappings for area_src and area_dst in the shmem test, like we do in the hugetlb_shared test. With a VMA obtained via mmap(MAP_ANONYMOUS | MAP_SHARED), it isn't clear how to do this. So, mmap() with an fd, so we can create alias mappings. Use memfd_create instead of actually passing in a tmpfs path like hugetlb does, since it's more convenient / simpler to run, and works just as well. Future commits will: 1. Setup the alias mappings. 2. Extend our tests to actually take advantage of this, to test new userfaultfd behavior being introduced in this series. Also, a small fix in the area we're changing: when the hugetlb setup fails in main(), pass in the right argv[] so we actually print out the hugetlb file path. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Axel Rasmussen <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
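The gist in userspace terms (a standalone sketch, not the selftest's exact code):

    #define _GNU_SOURCE
    #include <err.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char *shmem_area(size_t nr_pages)
    {
        long page_size = sysconf(_SC_PAGESIZE);
        int mem_fd = memfd_create("uffd-shmem", 0);  /* name is arbitrary */
        char *area;

        if (mem_fd < 0)
            err(1, "memfd_create");
        if (ftruncate(mem_fd, nr_pages * page_size))
            err(1, "ftruncate");
        /* fd-backed MAP_SHARED: mmap()ing the same fd again later yields
         * an alias mapping over the same physical pages */
        area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, mem_fd, 0);
        if (area == MAP_FAILED)
            err(1, "mmap");
        return area;
    }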
2021-06-30  userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte()  (Axel Rasmussen, 3 files, -57/+23)
In a previous commit, we added the mfill_atomic_install_pte() helper. This helper does the job of setting up PTEs for an existing page, to map it into a given VMA. It deals with both the anon and shmem cases, as well as the shared and private cases. In other words, shmem_mfill_atomic_pte() duplicates a case it already handles. So, expose it, and let shmem_mfill_atomic_pte() use it directly, to reduce code duplication. This requires that we refactor shmem_mfill_atomic_pte() a bit: Instead of doing accounting (shmem_recalc_inode() et al) part-way through the PTE setup, do it afterward. This frees up mfill_atomic_install_pte() from having to care about this accounting, and means we don't need to e.g. shmem_uncharge() in the error path. A side effect is this switches shmem_mfill_atomic_pte() to use lru_cache_add_inactive_or_unevictable() instead of just lru_cache_add(). This wrapper does some extra accounting in an exceptional case, if appropriate, so it's actually the more correct thing to use. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Axel Rasmussen <[email protected]> Reviewed-by: Peter Xu <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  userfaultfd/shmem: advertise shmem minor fault support  (Axel Rasmussen, 3 files, -3/+10)
Now that the feature is fully implemented (the faulting path hooks exist so userspace is notified, and the ioctl to resolve such faults is available), advertise this as a supported feature. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Axel Rasmussen <[email protected]> Acked-by: Hugh Dickins <[email protected]> Acked-by: Peter Xu <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  userfaultfd/shmem: support UFFDIO_CONTINUE for shmem  (Axel Rasmussen, 1 file, -45/+127)
With this change, userspace can resolve a minor fault within a shmem-backed area with a UFFDIO_CONTINUE ioctl. The semantics for this match those for hugetlbfs - we look up the existing page in the page cache, and install a PTE for it. This commit introduces a new helper: mfill_atomic_install_pte. Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in shmem.c? The existing userfault implementation only relies on shmem.c for VM_SHARED VMAs. However, minor fault handling / CONTINUE work just fine for !VM_SHARED VMAs as well. We'd prefer to handle CONTINUE for shmem in one place, regardless of shared/private (to reduce code duplication). Why add a new mfill_atomic_install_pte helper? A problem we have with continue is that shmem_mfill_atomic_pte() and mcopy_atomic_pte() are *close* to what we want, but not exactly. We do want to setup the PTEs in a CONTINUE operation, but we don't want to e.g. allocate a new page, charge it (e.g. to the shmem inode), manipulate various flags, etc. Also we have the problem stated above: shmem_mfill_atomic_pte() and mcopy_atomic_pte() both handle one-half of the problem (shared / private) continue cares about. So, introduce mcontinue_atomic_pte(), to handle all of the shmem continue cases. Introduce the helper so it doesn't duplicate code with mcopy_atomic_pte(). In a future commit, shmem_mfill_atomic_pte() will also be modified to use this new helper. However, since this is a bigger refactor, it seems most clear to do it as a separate change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Axel Rasmussen <[email protected]> Acked-by: Hugh Dickins <[email protected]> Acked-by: Peter Xu <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  userfaultfd/shmem: support minor fault registration for shmem  (Axel Rasmussen, 3 files, -6/+17)
This patch allows shmem-backed VMAs to be registered for minor faults. Minor faults are appropriately relayed to userspace in the fault path, for VMAs with the relevant flag. This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed minor faults, though, so userspace doesn't yet have a way to resolve such faults. Because of this, we also don't yet advertise this as a supported feature. That will be done in a separate commit when the feature is fully implemented. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Axel Rasmussen <[email protected]> Acked-by: Peter Xu <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte  (Axel Rasmussen, 3 files, -54/+27)
Patch series "userfaultfd: add minor fault handling for shmem", v6. Overview ======== See the series which added minor faults for hugetlbfs [3] for a detailed overview of minor fault handling in general. This series adds the same support for shmem-backed areas. This series is structured as follows: - Commits 1 and 2 are cleanups. - Commits 3 and 4 implement the new feature (minor fault handling for shmem). - Commit 5 advertises that the feature is now available since at this point it's fully implemented. - Commit 6 is a final cleanup, modifying an existing code path to re-use a new helper we've introduced. - Commits 7, 8, 9, 10 update the userfaultfd selftest to exercise the feature. Use Case ======== In some cases it is useful to have VM memory backed by tmpfs instead of hugetlbfs. So, this feature will be used to support the same VM live migration use case described in my original series. Additionally, Android folks (Lokesh Gidra <[email protected]>) hope to optimize the Android Runtime garbage collector using this feature: "The plan is to use userfaultfd for concurrently compacting the heap. With this feature, the heap can be shared-mapped at another location where the GC-thread(s) could continue the compaction operation without the need to invoke userfault ioctl(UFFDIO_COPY) each time. OTOH, if and when Java threads get faults on the heap, UFFDIO_CONTINUE can be used to resume execution. Furthermore, this feature enables updating references in the 'non-moving' portion of the heap efficiently. Without this feature, uneccessary page copying (ioctl(UFFDIO_COPY)) would be required." [1] https://lore.kernel.org/patchwork/cover/1388144/ [2] https://lore.kernel.org/patchwork/patch/1408161/ [3] https://lore.kernel.org/linux-fsdevel/[email protected]/T/#t This patch (of 9): Previously, we did a dance where we had one calling path in userfaultfd.c (mfill_atomic_pte), but then we split it into two in shmem_fs.h (shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined into a single shared function in shmem.c (shmem_mfill_atomic_pte). This is all a bit overly complex. Just call the single combined shmem function directly, allowing us to clean up various branches, boilerplate, etc. While we're touching this function, two other small cleanup changes: - offset is equivalent to pgoff, so we can get rid of offset entirely. - Split two VM_BUG_ON cases into two statements. This means the line number reported when the BUG is hit specifies exactly which condition was true. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Axel Rasmussen <[email protected]> Reviewed-by: Peter Xu <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  userfaultfd/selftests: add pagemap uffd-wp test  (Peter Xu, 1 file, -0/+154)
Add one anonymous specific test to start using pagemap. With pagemap support, we can directly read the uffd-wp bit from pgtable without triggering any fault, so it's easier to do sanity checks in unit tests. Meanwhile this test also leverages the newly introduced MADV_PAGEOUT madvise function to test swap ptes with uffd-wp bit set, and across fork()s. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm/pagemap: export uffd-wp protection information  (Peter Xu, 2 files, -0/+11)
Export the PTE/PMD status of uffd-wp to pagemap too. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
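In pagemap terms this is one new bit per entry; the sketch below assumes bit 57, which is what the pagemap documentation reserved for uffd-wp at the time:

    #define PM_UFFD_WP      BIT_ULL(57)

    /* pte path; the pmd path is analogous with pmd_uffd_wp() and
     * pmd_swp_uffd_wp() */
    if (pte_present(pte)) {
        if (pte_uffd_wp(pte))
            flags |= PM_UFFD_WP;
    } else if (is_swap_pte(pte)) {
        if (pte_swp_uffd_wp(pte))
            flags |= PM_UFFD_WP;
    }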
2021-06-30  mm/userfaultfd: fail uffd-wp registration if not supported  (Peter Xu, 1 file, -1/+8)
We should fail uffd-wp registration immediately if the arch does not even have CONFIG_HAVE_ARCH_USERFAULTFD_WP defined. That'll also block the relevant ioctls, e.g. UFFDIO_WRITEPROTECT, because they check against VM_UFFD_WP, which can only be applied after a successful registration. Also remove the WP feature bit for those archs when handling the UFFDIO_API ioctl. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm/userfaultfd: fix uffd-wp special cases for fork()  (Peter Xu, 4 files, -26/+30)
We tried to do something similar in b569a1760782 ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork") previously, but it didn't get it all right. A few fixes around the code path: 1. We were referencing the VM_UFFD_WP vm_flags of the _old_ vma rather than the new vma. That was overlooked in b569a1760782, so it won't work as expected. Thanks to the recent rework of the fork code (7a4830c380f3a8b3), we can easily get the new vma now, so switch the checks to that. 2. Dropping the uffd-wp bit in copy_huge_pmd() could be wrong if the huge pmd is a migration huge pmd. When that happens, instead of using pmd_uffd_wp(), we should use pmd_swp_uffd_wp(). The fix is simply to handle them separately. 3. We forgot to carry over the uffd-wp bit for a write migration huge pmd entry. This also happens in copy_huge_pmd(), where we convert a write huge migration entry into a read one. 4. In copy_nonpresent_pte(), drop uffd-wp if necessary for swap ptes. 5. In copy_present_page(), when COW is enforced during fork(), we also need to pass over the uffd-wp bit if VM_UFFD_WP is armed on the new vma and the pte to be copied has the uffd-wp bit set. Remove the comment in copy_present_pte() about this: commenting only there wouldn't help much, and commenting everywhere would be overkill. Let's assume the commit messages will help. [[email protected]: fix a few thp pmd missing uffd-wp bit] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: b569a1760782f ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork") Signed-off-by: Peter Xu <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  mm/thp: simplify copying of huge zero page pmd when fork  (Peter Xu, 1 file, -6/+3)
Patch series "mm/uffd: Misc fix for uffd-wp and one more test". This series tries to fix some corner case bugs for uffd-wp on either thp or fork(). Then it introduced a new test with pagemap/pageout. Patch layout: Patch 1: cleanup for THP, it'll slightly simplify the follow up patches Patch 2-4: misc fixes for uffd-wp here and there; please refer to each patch Patch 5: add pagemap support for uffd-wp Patch 6: add pagemap/pageout test for uffd-wp The last test introduced can also verify some of the fixes in previous patches, as the test will fail without the fixes. However it's not easy to verify all the changes in patch 2-4, but hopefully they can still be properly reviewed. Note that if considering the ongoing uffd-wp shmem & hugetlbfs work, patch 5 will be incomplete as it's missing e.g. hugetlbfs part or the special swap pte detection. However that's not needed in this series, and since that series is still during review, this series does not depend on that one (the last test only runs with anonymous memory, not file-backed). So this series can be merged even before that series. This patch (of 6): Huge zero page is handled in a special path in copy_huge_pmd(), however it should share most codes with a normal thp page. Trying to share more code with it by removing the special path. The only leftover so far is the huge zero page refcounting (mm_get_huge_zero_page()), because that's separately done with a global counter. This prepares for a future patch to modify the huge pmd to be installed, so that we don't need to duplicate it explicitly into huge zero page case too. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mike Kravetz <[email protected]>, [email protected] Cc: Mike Rapoport <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Joe Perches <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30  userfaultfd/selftests: unify error handling  (Peter Xu, 1 file, -369/+187)
Introduce err()/_err() and replace all the different ways to fail the program, mostly "fprintf" and "perror" with tons of exit() calls. Always stop the test program at any failure. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Axel Rasmussen <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
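A sketch of the helper pair, close in spirit to what the changelog describes (details assumed):

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define _err(fmt, ...)                                          \
        do {                                                        \
            int ret = errno;                                        \
            fprintf(stderr, "ERROR: " fmt, ##__VA_ARGS__);          \
            fprintf(stderr, " (errno=%d, line=%d)\n",               \
                    ret, __LINE__);                                 \
        } while (0)

    /* err() always terminates the test program, per the new policy
     * of stopping at any failure */
    #define err(fmt, ...)                   \
        do {                                \
            _err(fmt, ##__VA_ARGS__);       \
            exit(1);                        \
        } while (0)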
2021-06-30  userfaultfd/selftests: only dump counts if mode enabled  (Peter Xu, 1 file, -10/+20)
WP and MINOR modes are conditionally enabled on specific memory types. This patch avoids dumping tons of zeros for those cases when the modes are not supported at all. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Axel Rasmussen <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30userfaultfd/selftests: dropping VERIFY check in locking_threadPeter Xu1-54/+1
The VERIFY check tests the page against all zeros, looping quite a few times. However, we then verify the same page with count_verify, and count_verify can never be zero, so a zero page would be detected anyway by the code below. There's yet another place where we conditionally check the fault flag - just do it unconditionally.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Axel Rasmussen <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: "Dr . David Alan Gilbert" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Mina Almasry <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Wang Qing <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
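The surviving check that already catches a zero page looks roughly like this (an illustrative sketch; area_count(), count_verify[] and err() are names from the selftest, but the exact code may differ):

    /* The expected count is never zero, so an all-zero page
     * fails this comparison and is reported anyway. */
    unsigned long long count = *area_count(area_dst, nr);

    if (count != count_verify[nr])
            err("page %lu: corrupted count %llu, expected %llu",
                nr, count, count_verify[nr]);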
2021-06-30userfaultfd/selftests: remove the time() check on delayed uffdPeter Xu1-8/+0
There is no guarantee that time() will return the same value for two calls even when there is no real delay, e.g. when a fault accidentally crosses the boundary between two seconds. Meanwhile, this message is also not very helpful, since a delay could happen for many reasons, e.g. scheduling latency of the resolving thread; it may not indicate an issue with uffd at all. Nor have I seen this error trigger in past runs. Even if it did trigger, it would be drowned out by the rest of the test logs. Remove it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Axel Rasmussen <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: "Dr . David Alan Gilbert" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Mina Almasry <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Wang Qing <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
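The false positive is easy to demonstrate in isolation (a standalone example, not code from the selftest):

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
            time_t a = time(NULL);
            usleep(1000);           /* ~1ms of "work" */
            time_t b = time(NULL);
            /*
             * b - a can be 1 even though almost no time passed,
             * whenever the two calls straddle a second boundary.
             */
            printf("delta = %ld\n", (long)(b - a));
            return 0;
    }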
2021-06-30userfaultfd/selftests: use user mode onlyPeter Xu1-1/+1
Patch series "userfaultfd/selftests: A few cleanups", v2. I wanted to cleanup userfaultfd.c fault handling for a long time. If it's not cleaned, when the new code grows the file it'll also grow the size that needs to be cleaned... This is my attempt to cleanup the userfaultfd selftest on fault handling, to use an err() macro instead of either fprintf() or perror() then another exit() call. The huge cleanup is done in the last patch. The first 4 patches are some other standalone cleanups for the same file, so I put them together. This patch (of 5): Userfaultfd selftest does not need to handle kernel initiated fault. Set user mode so it can be run even if unprivileged_userfaultfd=0 (which is the default). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Axel Rasmussen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm/hwpoison: disable pcp for page_handle_poison()Naoya Horiguchi1-3/+16
Recent changes from the patch "mm/page_alloc: allow high-order pages to be stored on the per-cpu lists" make the kernel determine whether to use pcp via pcp_allowed_order(), which breaks soft-offline for hugetlb pages.

Soft-offline dissolves a migration source page and then removes it from the buddy free list, so it is assumed that any subpage of the soft-offlined hugepage is recognized as a buddy page just after returning from dissolve_free_huge_page(). pcp_allowed_order() returns true for hugetlb, so this assumption no longer holds.

So disable pcp during dissolve_free_huge_page() and take_page_off_buddy() to prevent soft-offlined hugepages from being linked to pcp lists. Soft-offline should not be a common event, so the performance impact should be minimal. And since the optimization in Mel's patch could also benefit hugetlb, zone_pcp_disable() is called only in the hwpoison context.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Naoya Horiguchi <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
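The shape of the fix, as an illustrative sketch (zone_pcp_disable(), dissolve_free_huge_page() and take_page_off_buddy() are real kernel functions, but the glue code here is simplified, not the actual diff):

    /* Keep freed subpages off the pcp lists so they land in the
     * buddy allocator, where take_page_off_buddy() can find them. */
    zone_pcp_disable(page_zone(page));
    ret = dissolve_free_huge_page(page);
    if (!ret && !take_page_off_buddy(page))
            ret = -EBUSY;
    zone_pcp_enable(page_zone(page));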
2021-06-30hugetlb: address ref count racing in prep_compound_gigantic_pageMike Kravetz2-9/+64
In [1], Jann Horn points out a possible race between prep_compound_gigantic_page and __page_cache_add_speculative. The root cause of the possible race is that prep_compound_gigantic_page unconditionally sets the ref count of pages to zero. It does this because prep_compound_gigantic_page is handed a 'group' of pages from an allocator and needs to convert that group of pages to a compound page. The ref count of each page in this 'group' is one, as set by the allocator. However, the ref count of compound page tail pages must be zero.

The potential race comes about when ref-counted pages are returned from the allocator. When this happens, other mm code could also take a reference on the page. __page_cache_add_speculative is one such example. Therefore, prep_compound_gigantic_page cannot just set the ref count of pages to zero as it does today. Doing so would lose the reference taken by any other code. This would lead to BUGs in code checking ref counts and could possibly even lead to memory corruption.

There are two possible ways to address this issue.

1) Make all allocators of gigantic groups of pages be able to return a
   properly constructed compound page.

2) Make prep_compound_gigantic_page be more careful when constructing a
   compound page.

This patch takes approach 2.

In prep_compound_gigantic_page, use cmpxchg to set the ref count to zero only if it is one. If the cmpxchg fails, call synchronize_rcu() in the hope that the extra ref count will be dropped during an rcu grace period. This is not a performance-critical code path and the wait should be acceptable. If the ref count is still inflated after the grace period, then undo any modifications made and return an error.

Currently prep_compound_gigantic_page has a void return type and does not return errors. Modify the two callers to check for and handle error returns. On error, the caller must free the 'group' of pages, as they cannot be used to form a gigantic page. After freeing the pages, the runtime caller (alloc_fresh_huge_page) will retry the allocation once. Boot-time allocations cannot be retried.

The routine prep_compound_page also unconditionally sets the ref count of compound page tail pages to zero. However, in this case the buddy allocator is constructing a compound page from freshly allocated pages. The ref count on those freshly allocated pages is already zero, so the set_page_count(p, 0) is unnecessary and could lead to confusion. Just remove it.

[1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 58a84aa92723 ("thp: set compound tail page _count to zero")
Signed-off-by: Mike Kravetz <[email protected]>
Reported-by: Jann Horn <[email protected]>
Cc: Youquan Song <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
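The careful freeze can be sketched with the kernel's page_ref_freeze() helper, which wraps exactly this cmpxchg (an illustrative sketch of the per-tail-page logic, not the actual diff):

    /*
     * Freeze the refcount from 1 to 0. If a speculative user such
     * as __page_cache_add_speculative() holds a transient extra
     * reference, wait one RCU grace period and retry once.
     */
    if (!page_ref_freeze(p, 1)) {
            synchronize_rcu();
            if (!page_ref_freeze(p, 1))
                    goto out_error;   /* undo earlier tail pages */
    }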
2021-06-30hugetlb: remove prep_compound_huge_page cleanupMike Kravetz1-19/+10
Patch series "Fix prep_compound_gigantic_page ref count adjustment". These patches address the possible race between prep_compound_gigantic_page and __page_cache_add_speculative as described by Jann Horn in [1]. The first patch simply removes the unnecessary/obsolete helper routine prep_compound_huge_page to make the actual fix a little simpler. The second patch is the actual fix and has a detailed explanation in the commit message. This potential issue has existed for almost 10 years and I am unaware of anyone actually hitting the race. I did not cc stable, but would be happy to squash the patches and send to stable if anyone thinks that is a good idea. [1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/ This patch (of 2): I could not think of a reliable way to recreate the issue for testing. Rather, I 'simulated errors' to exercise all the error paths. The routine prep_compound_huge_page is a simple wrapper to call either prep_compound_gigantic_page or prep_compound_page. However, it is only called from gather_bootmem_prealloc which only processes gigantic pages. Eliminate the routine and call prep_compound_gigantic_page directly. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jann Horn <[email protected]> Cc: John Hubbard <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Youquan Song <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ONMuchun Song3-2/+17
When using HUGETLB_PAGE_FREE_VMEMMAP, freeing the unused vmemmap pages associated with each HugeTLB page is off by default. Now that the vmemmap is PMD-mapped, there is no side effect when this feature is enabled but there are no HugeTLB pages in the system. Someone may want to enable this feature at compile time instead of via the boot command line, so add a config option to make it default to on for those who do not want to enable it via the command line.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Cc: Chen Huang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
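In code, a symbol like this plausibly just flips the default of the existing runtime knob, roughly as follows (an assumed sketch, not the verified diff):

    /* Defaults to true when the new Kconfig option is set; the
     * "hugetlb_free_vmemmap" boot parameter still overrides it. */
    bool hugetlb_free_vmemmap_enabled =
            IS_ENABLED(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON);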
2021-06-30mm: sparsemem: use huge PMD mapping for vmemmap pagesMuchun Song4-33/+9
The preparation for splitting the huge PMD mapping of vmemmap pages is complete, so switch the vmemmap mapping from PTE back to PMD.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Chen Huang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm: sparsemem: split the huge PMD mapping of vmemmap pagesMuchun Song3-43/+129
Patch series "Split huge PMD mapping of vmemmap pages", v4. In order to reduce the difficulty of code review in series[1]. We disable huge PMD mapping of vmemmap pages when that feature is enabled. In this series, we do not disable huge PMD mapping of vmemmap pages anymore. We will split huge PMD mapping when needed. When HugeTLB pages are freed from the pool we do not attempt coalasce and move back to a PMD mapping because it is much more complex. [1] https://lore.kernel.org/linux-doc/[email protected]/ This patch (of 3): In [1], PMD mappings of vmemmap pages were disabled if the the feature hugetlb_free_vmemmap was enabled. This was done to simplify the initial implementation of vmmemap freeing for hugetlb pages. Now, remove this simplification by allowing PMD mapping and switching to PTE mappings as needed for allocated hugetlb pages. When a hugetlb page is allocated, the vmemmap page tables are walked to free vmemmap pages. During this walk, split huge PMD mappings to PTE mappings as required. In the unlikely case PTE pages can not be allocated, return error(ENOMEM) and do not optimize vmemmap of the hugetlb page. When HugeTLB pages are freed from the pool, we do not attempt to coalesce and move back to a PMD mapping because it is much more complex. [1] https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Chen Huang <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>