2021-05-05  mm,hugetlb: split prep_new_huge_page functionality  (Oscar Salvador, 1 file changed, -3/+17)
Currently, prep_new_huge_page() performs two functions. It sets the right state for a new hugetlb page, and increases the hstate's counters to account for the new page. Let us split its functionality into two separate functions, decoupling the handling of the counters from initializing a hugepage. The outcome is having __prep_new_huge_page(), which only initializes the page, and __prep_account_new_huge_page(), which adds the new page to the hstate's counters. This allows us to set up a hugetlb page without having to worry about the counters/locking. It will prove useful in the next patch. prep_new_huge_page() still calls both functions. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Muchun Song <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
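As a rough sketch (illustrative only, not the exact mm/hugetlb.c code; details and helper calls differ), the resulting split looks like this:

    static void __prep_new_huge_page(struct page *page)
    {
            /* state initialization only; no counters are touched */
            INIT_LIST_HEAD(&page->lru);
            set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
            set_hugetlb_cgroup(page, NULL);
    }

    static void __prep_account_new_huge_page(struct hstate *h, int nid)
    {
            /* counter updates; the caller must hold hugetlb_lock */
            h->nr_huge_pages++;
            h->nr_huge_pages_node[nid]++;
    }

    static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
    {
            __prep_new_huge_page(page);
            spin_lock_irq(&hugetlb_lock);
            __prep_account_new_huge_page(h, nid);
            spin_unlock_irq(&hugetlb_lock);
    }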
2021-05-05  mm,hugetlb: drop clearing of flag from prep_new_huge_page  (Oscar Salvador, 1 file changed, -1/+0)
Pages allocated via the page allocator or CMA get their private field cleared by means of post_alloc_hook(). Pages allocated during boot, that is directly from the memblock allocator, get cleared by paging_init()-> .. ->memmap_init_zone-> .. ->__init_single_page() before any memblock allocation. Given this, let us remove the clearing of the flag from prep_new_huge_page() as it is not needed. This was a leftover from commit 6c0371490140 ("hugetlb: convert PageHugeFreed to HPageFreed flag"). Previously the explicit clearing was necessary because compound allocations do not get this initialization (see prep_compound_page). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm,compaction: let isolate_migratepages_{range,block} return error codes  (Oscar Salvador, 3 files changed, -33/+36)
Currently, isolate_migratepages_{range,block} and their callers use a pfn == 0 vs pfn != 0 scheme to let the caller know whether there was any error during isolation. This stops working as soon as we need to start reporting different error codes and make sure we pass them down the chain, so they are properly interpreted by functions such as alloc_contig_range. Let us rework isolate_migratepages_{range,block} so we can report error codes. Since isolate_migratepages_block will stop returning the next pfn to be scanned, we reuse the cc->migrate_pfn field to keep track of that. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Mike Kravetz <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
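A sketch of how a caller consumes the new convention (illustrative only; the function and field names follow the description above, the wrapper function is hypothetical):

    static int demo_isolate(struct compact_control *cc,
                            unsigned long start_pfn, unsigned long end_pfn)
    {
            int ret;

            ret = isolate_migratepages_range(cc, start_pfn, end_pfn);
            if (ret)        /* e.g. -EINTR or -ENOMEM */
                    return ret;
            /* on success, cc->migrate_pfn holds the next pfn to be scanned */
            return 0;
    }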
2021-05-05  mm,page_alloc: bail out earlier on -ENOMEM in alloc_contig_migrate_range  (Oscar Salvador, 1 file changed, -1/+8)
Patch series "Make alloc_contig_range handle Hugetlb pages", v10. alloc_contig_range lacks the ability to handle HugeTLB pages. This can be problematic for some users, e.g: CMA and virtio-mem, where those users will fail the call if alloc_contig_range ever sees a HugeTLB page, even when those pages lay in ZONE_MOVABLE and are free. That problem can be easily solved by replacing the page in the free hugepage pool. In-use HugeTLB are no exception though, as those can be isolated and migrated as any other LRU or Movable page. This aims to improve alloc_contig_range->isolate_migratepages_block, so that HugeTLB pages can be recognized and handled. Since we also need to start reporting errors down the chain (e.g: -ENOMEM due to not be able to allocate a new hugetlb page), isolate_migratepages_{range,block} interfaces need to change to start reporting error codes instead of the pfn == 0 vs pfn != 0 scheme it is using right now. From now on, isolate_migratepages_block will not return the next pfn to be scanned anymore, but -EINTR, -ENOMEM or 0, so we the next pfn to be scanned will be recorded in cc->migrate_pfn field (as it is already done in isolate_migratepages_range()). Below is an insight from David (thanks), where the problem can clearly be seen: "Start a VM with 4G. Hotplug 1G via virtio-mem and online it to ZONE_MOVABLE. Allocate 512 huge pages. [root@localhost ~]# cat /proc/meminfo MemTotal: 5061512 kB MemFree: 3319396 kB MemAvailable: 3457144 kB ... HugePages_Total: 512 HugePages_Free: 512 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB The huge pages get partially allocate from ZONE_MOVABLE. Try unplugging 1G via virtio-mem (remember, all ZONE_MOVABLE). Inside the guest: [ 180.058992] alloc_contig_range: [1b8000, 1c0000) PFNs busy [ 180.060531] alloc_contig_range: [1b8000, 1c0000) PFNs busy [ 180.061972] alloc_contig_range: [1b8000, 1c0000) PFNs busy [ 180.063413] alloc_contig_range: [1b8000, 1c0000) PFNs busy [ 180.064838] alloc_contig_range: [1b8000, 1c0000) PFNs busy [ 180.065848] alloc_contig_range: [1bfc00, 1c0000) PFNs busy [ 180.066794] alloc_contig_range: [1bfc00, 1c0000) PFNs busy [ 180.067738] alloc_contig_range: [1bfc00, 1c0000) PFNs busy [ 180.068669] alloc_contig_range: [1bfc00, 1c0000) PFNs busy [ 180.069598] alloc_contig_range: [1bfc00, 1c0000) PFNs busy" And then with this patchset running: "Same experiment with ZONE_MOVABLE: a) Free huge pages: all memory can get unplugged again. b) Allocated/populated but idle huge pages: all memory can get unplugged again. c) Allocated/populated but all 512 huge pages are read/written in a loop: all memory can get unplugged again, but I get a single [ 121.192345] alloc_contig_range: [180000, 188000) PFNs busy Most probably because it happened to try migrating a huge page while it was busy. As virtio-mem retries on ZONE_MOVABLE a couple of times, it can deal with this temporary failure. Last but not least, I did something extreme: # cat /proc/meminfo MemTotal: 5061568 kB MemFree: 186560 kB MemAvailable: 354524 kB ... HugePages_Total: 2048 HugePages_Free: 2048 HugePages_Rsvd: 0 HugePages_Surp: 0 Triggering unplug would require to dissolve+alloc - which now fails when trying to allocate an additional ~512 huge pages (1G). As expected, I can properly see memory unplug not fully succeeding. + I get a fairly continuous stream of [ 226.611584] alloc_contig_range: [19f400, 19f800) PFNs busy ... 
But more importantly, the hugepage count remains stable, as configured by the admin (me):

    HugePages_Total: 2048
    HugePages_Free: 2048
    HugePages_Rsvd: 0
    HugePages_Surp: 0"

This patch (of 7): Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or -EBUSY, and report them down the chain. The problem is that when migrate_pages() reports -ENOMEM, we keep going till we exhaust all the try-attempts (5 at the moment) instead of bailing out. migrate_pages() bails out right away on -ENOMEM because it is considered a fatal error. Do the same here instead of continuing to retry. Note that this is not fixing a real issue, just a cosmetic change, although we can save some cycles by backing off earlier. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
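A simplified sketch of the retry loop with the early bail-out (illustrative; the real __alloc_contig_migrate_range() differs in detail):

    while (pfn < end || !list_empty(&cc->migratepages)) {
            /* ... isolate a batch of pages into cc->migratepages ... */
            ret = migrate_pages(&cc->migratepages, alloc_migration_target,
                                NULL, (unsigned long)&mtc, cc->mode,
                                MR_CONTIG_RANGE);
            /*
             * migrate_pages() already treats -ENOMEM as fatal and gives up,
             * so retrying here cannot help; bail out immediately.
             */
            if (ret == -ENOMEM)
                    break;
            if (ret && ++tries == 5) {
                    ret = -EBUSY;
                    break;
            }
    }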
2021-05-05  hugetlb: add lockdep_assert_held() calls for hugetlb_lock  (Mike Kravetz, 1 file changed, -0/+9)
After making hugetlb lock irq safe and separating some functionality done under the lock, add some lockdep_assert_held to help verify locking. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Reviewed-by: Muchun Song <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Barry Song <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Cc: HORIGUCHI NAOYA <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Waiman Long <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  hugetlb: make free_huge_page irq safe  (Mike Kravetz, 2 files changed, -110/+67)
Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task context") was added to address the issue of free_huge_page being called from irq context. That commit hands off free_huge_page processing to a workqueue if !in_task. However, this doesn't cover all the cases as pointed out by the 0day bot lockdep report [1]:

    Possible interrupt unsafe locking scenario:

          CPU0                    CPU1
          ----                    ----
     lock(hugetlb_lock);
                                  local_irq_disable();
                                  lock(slock-AF_INET);
                                  lock(hugetlb_lock);
     <Interrupt>
       lock(slock-AF_INET);

Shakeel later explained that this is very likely the TCP TX zerocopy from hugetlb pages scenario, where the networking code drops the last reference to a hugetlb page while having IRQs disabled. The hugetlb freeing path doesn't disable IRQs while holding hugetlb_lock, so such a lock dependency chain can lead to a deadlock. This commit addresses the issue by doing the following:
- Make hugetlb_lock irq safe. This is mostly a simple process of changing spin_*lock calls to spin_*lock_irq* calls.
- Make subpool lock irq safe in a similar manner.
- Revert the !in_task check and workqueue handoff.
[1] https://lore.kernel.org/linux-mm/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Muchun Song <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Barry Song <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Cc: HORIGUCHI NAOYA <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Waiman Long <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
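The conversion itself is mostly mechanical; for example, free_huge_page() can now take the lock directly from any context, along these lines (sketch, not the exact resulting code):

    void free_huge_page(struct page *page)
    {
            struct hstate *h = page_hstate(page);
            unsigned long flags;

            /* may be reached from hard/soft IRQ context, so save/restore flags */
            spin_lock_irqsave(&hugetlb_lock, flags);
            /* ... adjust counters and either enqueue or free the page ... */
            spin_unlock_irqrestore(&hugetlb_lock, flags);
    }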
2021-05-05  hugetlb: change free_pool_huge_page to remove_pool_huge_page  (Mike Kravetz, 1 file changed, -42/+51)
free_pool_huge_page was called with hugetlb_lock held. It would remove a hugetlb page, and then free the corresponding pages to the lower level allocators such as buddy. free_pool_huge_page was called in a loop to remove hugetlb pages and these loops could hold the hugetlb_lock for a considerable time. Create new routine remove_pool_huge_page to replace free_pool_huge_page. remove_pool_huge_page will remove the hugetlb page, and it must be called with the hugetlb_lock held. It will return the removed page and it is the responsibility of the caller to free the page to the lower level allocators. The hugetlb_lock is dropped before freeing to these allocators which results in shorter lock hold times. Add new helper routine to call update_and_free_page for a list of pages. Note: Some changes to the routine return_unused_surplus_pages are in need of explanation. Commit e5bbc8a6c992 ("mm/hugetlb.c: fix reservation race when freeing surplus pages") modified this routine to address a race which could occur when dropping the hugetlb_lock in the loop that removes pool pages. Accounting changes introduced in that commit were subtle and took some thought to understand. This commit removes the cond_resched_lock() and the potential race. Therefore, remove the subtle code and restore the more straight forward accounting effectively reverting the commit. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Muchun Song <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Barry Song <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Cc: HORIGUCHI NAOYA <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Waiman Long <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
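The resulting caller pattern looks roughly like this (sketch; the bulk-free helper name is an assumption and other details differ in the real code):

    LIST_HEAD(page_list);

    spin_lock(&hugetlb_lock);
    while (min_count < persistent_huge_pages(h)) {
            struct page *page = remove_pool_huge_page(h, nodes_allowed, 0);

            if (!page)
                    break;
            list_add(&page->lru, &page_list);
    }
    spin_unlock(&hugetlb_lock);

    /* free to the lower level allocators without holding the lock */
    update_and_free_pages_bulk(h, &page_list);

    spin_lock(&hugetlb_lock);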
2021-05-05  hugetlb: call update_and_free_page without hugetlb_lock  (Mike Kravetz, 1 file changed, -5/+26)
With the introduction of remove_hugetlb_page(), there is no need for update_and_free_page to hold the hugetlb lock. Change all callers to drop the lock before calling. With additional code modifications, this will allow loops which decrease the huge page pool to drop the hugetlb_lock with each page to reduce long hold times. The ugly unlock/lock cycle in free_pool_huge_page will be removed in a subsequent patch which restructures free_pool_huge_page. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Muchun Song <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Barry Song <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Cc: HORIGUCHI NAOYA <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Waiman Long <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  hugetlb: create remove_hugetlb_page() to separate functionality  (Mike Kravetz, 1 file changed, -25/+40)
The new remove_hugetlb_page() routine is designed to remove a hugetlb page from hugetlbfs processing. It will remove the page from the active or free list, update global counters and set the compound page destructor to NULL so that PageHuge() will return false for the 'page'. After this call, the 'page' can be treated as a normal compound page or a collection of base size pages. update_and_free_page no longer decrements h->nr_huge_pages{_node} as this is performed in remove_hugetlb_page. The only functionality performed by update_and_free_page is to free the base pages to the lower level allocators. update_and_free_page is typically called after remove_hugetlb_page. remove_hugetlb_page is to be called with the hugetlb_lock held. Creating this routine and separating functionality is in preparation for restructuring code to reduce lock hold times. This commit should not introduce any changes to functionality. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Reviewed-by: Muchun Song <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Barry Song <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Cc: HORIGUCHI NAOYA <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Waiman Long <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
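A rough sketch of what remove_hugetlb_page() does (illustrative; the real function handles further bookkeeping and flag details):

    static void remove_hugetlb_page(struct hstate *h, struct page *page,
                                    bool adjust_surplus)
    {
            int nid = page_to_nid(page);

            list_del(&page->lru);           /* off the free or active list */
            if (adjust_surplus) {
                    h->surplus_huge_pages--;
                    h->surplus_huge_pages_node[nid]--;
            }
            /* PageHuge() becomes false once the compound destructor is cleared */
            set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
            set_page_refcounted(page);
            h->nr_huge_pages--;
            h->nr_huge_pages_node[nid]--;
    }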
2021-05-05  hugetlb: add per-hstate mutex to synchronize user adjustments  (Mike Kravetz, 2 files changed, -0/+9)
The helper routine hstate_next_node_to_alloc accesses and modifies the hstate variable next_nid_to_alloc. The helper is used by the routines alloc_pool_huge_page and adjust_pool_surplus. adjust_pool_surplus is called with hugetlb_lock held. However, alloc_pool_huge_page cannot be called with the hugetlb lock held as it will call the page allocator. Two instances of alloc_pool_huge_page could be run in parallel, or alloc_pool_huge_page could run in parallel with adjust_pool_surplus, which may result in the variable next_nid_to_alloc becoming invalid for the caller and pages being allocated on the wrong node. Both alloc_pool_huge_page and adjust_pool_surplus are only called from the routine set_max_huge_pages after boot. set_max_huge_pages is only called as the result of a user writing to the proc/sysfs nr_hugepages or nr_hugepages_mempolicy file to adjust the number of hugetlb pages. It makes little sense to allow multiple adjustments to the number of hugetlb pages in parallel. Add a mutex to the hstate and use it to only allow one hugetlb page adjustment at a time. This will synchronize modifications to the next_nid_to_alloc variable. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Reviewed-by: Muchun Song <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Barry Song <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Cc: HORIGUCHI NAOYA <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Waiman Long <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
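Sketch of the serialization this adds (illustrative; the mutex name and surrounding details are assumptions, not the literal patch):

    static int set_max_huge_pages(struct hstate *h, unsigned long count,
                                  int nid, nodemask_t *nodes_allowed)
    {
            /* only one user-initiated pool adjustment at a time */
            mutex_lock(&h->resize_lock);
            spin_lock(&hugetlb_lock);
            /*
             * ... grow or shrink the pool; next_nid_to_alloc is only
             * advanced while the mutex is held ...
             */
            spin_unlock(&hugetlb_lock);
            mutex_unlock(&h->resize_lock);
            return 0;
    }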
2021-05-05  hugetlb: no need to drop hugetlb_lock to call cma_release  (Mike Kravetz, 1 file changed, -6/+0)
Now that cma_release is non-blocking and irq safe, there is no need to drop hugetlb_lock before calling. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Barry Song <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Cc: HORIGUCHI NAOYA <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Muchun Song <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Waiman Long <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/cma: change cma mutex to irq safe spinlock  (Mike Kravetz, 3 files changed, -14/+14)
Patch series "make hugetlb put_page safe for all calling contexts", v5. This effort is the result a recent bug report [1]. Syzbot found a potential deadlock in the hugetlb put_page/free_huge_page_path. WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected Since the free_huge_page_path already has code to 'hand off' page free requests to a workqueue, a suggestion was proposed to make the in_irq() detection accurate by always enabling PREEMPT_COUNT [2]. The outcome of that discussion was that the hugetlb put_page path (free_huge_page) path should be properly fixed and safe for all calling contexts. [1] https://lore.kernel.org/linux-mm/[email protected]/ [2] http://lkml.kernel.org/r/[email protected] This patch (of 8): cma_release is currently a sleepable operatation because the bitmap manipulation is protected by cma->lock mutex. Hugetlb code which relies on cma_release for CMA backed (giga) hugetlb pages, however, needs to be irq safe. The lock doesn't protect any sleepable operation so it can be changed to a (irq aware) spin lock. The bitmap processing should be quite fast in typical case but if cma sizes grow to TB then we will likely need to replace the lock by a more optimized bitmap implementation. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Muchun Song <[email protected]> Cc: David Rientjes <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: HORIGUCHI NAOYA <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Waiman Long <[email protected]> Cc: Peter Xu <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Barry Song <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/hugetlb: remove unused variable pseudo_vma in remove_inode_hugepages()  (Miaohe Lin, 1 file changed, -3/+0)
The local variable pseudo_vma is not used anymore. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Feilong Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/hugeltb: handle the error case in hugetlb_fix_reserve_counts()  (Miaohe Lin, 1 file changed, -2/+9)
A rare out of memory error would prevent removal of the reserve map region for a page. hugetlb_fix_reserve_counts() handles this rare case to avoid being left with incorrect counts. Unfortunately, hugepage_subpool_get_pages() and hugetlb_acct_memory() could possibly fail too. We should correctly handle these cases. Link: https://lkml.kernel.org/r/[email protected] Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages") Signed-off-by: Miaohe Lin <[email protected]> Cc: Feilong Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/hugeltb: clarify (chg - freed) won't go negative in hugetlb_unreserve_pages()  (Miaohe Lin, 1 file changed, -0/+3)
The resv_map could be NULL since this routine can be called in the evict inode path for all hugetlbfs inodes and we will have chg = 0 in this case. But (chg - freed) won't go negative as Mike pointed out: "If resv_map is NULL, then no hugetlb pages can be allocated/associated with the file. As a result, remove_inode_hugepages will never find any huge pages associated with the inode and the passed value 'freed' will always be zero." Add a comment clarifying this to make it clear and also avoid confusion. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Feilong Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/hugeltb: simplify the return code of __vma_reservation_common()  (Miaohe Lin, 1 file changed, -21/+20)
It's guaranteed that the vma is associated with a resv_map, i.e. either VM_MAYSHARE or HPAGE_RESV_OWNER, when the code reaches here or we would have returned via !resv check above. So it's unneeded to check whether HPAGE_RESV_OWNER is set here. Simplify the return code to make it more clear. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Feilong Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/hugeltb: remove redundant VM_BUG_ON() in region_add()  (Miaohe Lin, 1 file changed, -1/+0)
Patch series "Cleanup and fixup for hugetlb", v2. This series contains cleanups to remove redundant VM_BUG_ON() and simplify the return code. Also this handles the error case in hugetlb_fix_reserve_counts() correctly. More details can be found in the respective changelogs. This patch (of 5): The same VM_BUG_ON() check is already done in the callee. Remove this extra one to simplify the code slightly. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Feilong Lin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm: huge_memory: debugfs for file-backed THP split  (Zi Yan, 2 files changed, -6/+166)
Further extend <debugfs>/split_huge_pages to accept "<path>,<pgoff_start>,<pgoff_end>" for file-backed THP split tests since tmpfs may have file backed by THP that mapped nowhere. Update selftest program to test file-backed THP split too. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zi Yan <[email protected]> Suggested-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Yang Shi <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Shuah Khan <[email protected]> Cc: John Hubbard <[email protected]> Cc: Sandipan Das <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Mika Penttila <[email protected]> Cc: David Rientjes <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm: huge_memory: a new debugfs interface for splitting THP tests  (Zi Yan, 4 files changed, -8/+467)
We did not have a direct user interface of splitting the compound page backing a THP and there is no need unless we want to expose the THP implementation details to users. Make <debugfs>/split_huge_pages accept a new command to do that. By writing "<pid>,<vaddr_start>,<vaddr_end>" to <debugfs>/split_huge_pages, THPs within the given virtual address range from the process with the given pid are split. It is used to test split_huge_page function. In addition, a selftest program is added to tools/testing/selftests/vm to utilize the interface by splitting PMD THPs and PTE-mapped THPs. This does not change the old behavior, i.e., writing 1 to the interface to split all THPs in the system. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zi Yan <[email protected]> Reviewed-by: Yang Shi <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: John Hubbard <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mika Penttila <[email protected]> Cc: Sandipan Das <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
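A hypothetical userspace usage sketch (the debugfs path and the "<pid>,<vaddr_start>,<vaddr_end>" format follow the description above; the hex formatting of the addresses is an assumption):

    #include <stdio.h>

    /* ask the kernel to split THPs mapped in [start, end) of process pid */
    static int split_thp_range(int pid, unsigned long start, unsigned long end)
    {
            FILE *f = fopen("/sys/kernel/debug/split_huge_pages", "w");

            if (!f)
                    return -1;
            fprintf(f, "%d,0x%lx,0x%lx", pid, start, end);
            return fclose(f);
    }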
2021-05-05  khugepaged: remove meaningless !pte_present() check in khugepaged_scan_pmd()  (Miaohe Lin, 1 file changed, -4/+0)
We know it must meet the !is_swap_pte() and !pte_none() condition if we reach here. Since !is_swap_pte() indicates pte_none() or pte_present() is met, it's guaranteed that pte must be present here. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  khugepaged: remove unnecessary out label in collapse_huge_page()  (Miaohe Lin, 1 file changed, -5/+3)
The out label here is unneeded because it just goes to out_up_write label. Remove it to make code more concise. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  khugepaged: use helper function range_in_vma() in collapse_pte_mapped_thp()  (Miaohe Lin, 1 file changed, -1/+1)
Patch series "Cleanup for khugepaged". This series contains cleanups to remove unnecessary out label and meaningless !pte_present() check. Also use helper function to simplify the code. More details can be found in the respective changelogs. This patch (of 3): We could use helper function range_in_vma() to check whether the desired range is inside the vma to simplify the code. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/khugepaged.c: replace barrier() with READ_ONCE() for a selective variable  (Yanfei Xu, 1 file changed, -3/+1)
READ_ONCE() is more selective and lightweight. It is more appropriate to use READ_ONCE() for the particular variable in question to prevent the compiler from reordering its load. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yanfei Xu <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
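The pattern, illustratively (hypothetical variable and function names, not the exact khugepaged code):

    static int read_flag(void)
    {
            /*
             * before:
             *      int val = shared_flag;
             *      barrier();
             * after: a single annotated load is enough to stop the compiler
             * from reordering or refetching it.
             */
            return READ_ONCE(shared_flag);
    }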
2021-05-05  mm/huge_memory.c: use helper function migration_entry_to_page()  (Miaohe Lin, 1 file changed, -2/+2)
It's more recommended to use helper function migration_entry_to_page() to get the page via migration entry. We can also enjoy the PageLocked() check there. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Thomas Hellstrm (Intel) <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: William Kucharski <[email protected]> Cc: Yang Shi <[email protected]> Cc: yuleixzhang <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/huge_memory.c: remove unused macro TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG  (Miaohe Lin, 1 file changed, -3/+0)
Commit 4958e4d86ecb ("mm: thp: remove debug_cow switch") forgot to remove TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG macro. Remove it here. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Zi Yan <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Thomas Hellstrm (Intel) <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: William Kucharski <[email protected]> Cc: Yang Shi <[email protected]> Cc: yuleixzhang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/huge_memory.c: remove redundant PageCompound() check  (Miaohe Lin, 1 file changed, -1/+1)
The !PageCompound() check limits the page must be head or tail while !PageHead() further limits it to page head only. So !PageHead() check is equivalent here. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Thomas Hellstrm (Intel) <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: William Kucharski <[email protected]> Cc: Yang Shi <[email protected]> Cc: yuleixzhang <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/huge_memory.c: rework the function do_huge_pmd_numa_page() slightly  (Miaohe Lin, 1 file changed, -6/+5)
The current code that checks if migrating misplaced transhuge page is needed is pretty hard to follow. Rework it and add a comment to make its logic more clear and improve readability. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Zi Yan <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Thomas Hellstrm (Intel) <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: William Kucharski <[email protected]> Cc: Yang Shi <[email protected]> Cc: yuleixzhang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/huge_memory.c: make get_huge_zero_page() return bool  (Miaohe Lin, 1 file changed, -4/+4)
It's guaranteed that huge_zero_page will not be NULL if huge_zero_refcount is increased successfully. When READ_ONCE(huge_zero_page) is returned, there must be a huge_zero_page and it can be replaced with returning 'true' when we do not care about the value of huge_zero_page. We can thus make it return bool to save READ_ONCE cpu cycles as the return value is just used to check if huge_zero_page exists. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Zi Yan <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Thomas Hellstrm (Intel) <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: William Kucharski <[email protected]> Cc: Yang Shi <[email protected]> Cc: yuleixzhang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/huge_memory.c: rework the function vma_adjust_trans_huge()  (Miaohe Lin, 1 file changed, -25/+19)
Patch series "Some cleanups for huge_memory", v3. This series contains cleanups to rework some function logics to make it more readable, use helper function and so on. More details can be found in the respective changelogs. This patch (of 6): The current implementation of vma_adjust_trans_huge() contains some duplicated codes. Add helper function to get rid of these codes to make it more succinct. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Zi Yan <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: William Kucharski <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Peter Xu <[email protected]> Cc: yuleixzhang <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Thomas Hellstrm (Intel) <[email protected]> Cc: Yang Shi <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/huge_memory.c: remove unnecessary local variable ret2  (Miaohe Lin, 1 file changed, -5/+3)
There is no need to use a new local variable ret2 to get the return value of handle_userfault(). Use ret directly to make code more succinct. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  khugepaged: fix wrong result value for trace_mm_collapse_huge_page_isolate()  (Miaohe Lin, 1 file changed, -9/+9)
In writable and !referenced case, the result value should be SCAN_LACK_REFERENCED_PAGE for trace_mm_collapse_huge_page_isolate() instead of default 0 (SCAN_FAIL) here. Link: https://lkml.kernel.org/r/[email protected] Fixes: 7d2eba0557c1 ("mm: add tracepoint for scanning pages") Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Ebru Akagunduz <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  khugepaged: use helper khugepaged_test_exit() in __khugepaged_enter()  (Miaohe Lin, 1 file changed, -1/+1)
Commit 4d45e75a9955 ("mm: remove the now-unnecessary mmget_still_valid() hack") made khugepaged_test_exit() suitable for checking mm->mm_users against 0. Use this helper here. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Ebru Akagunduz <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  khugepaged: reuse the smp_wmb() inside __SetPageUptodate()  (Miaohe Lin, 1 file changed, -7/+6)
smp_wmb() is needed to prevent the copy_huge_page() writes from becoming visible after the set_pmd_at() write here. But we can reuse the smp_wmb() inside __SetPageUptodate() to remove this redundant one. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Ebru Akagunduz <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
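For reference, __SetPageUptodate() (include/linux/page-flags.h) already contains the write barrier being relied upon (sketch of its definition at the time):

    static inline void __SetPageUptodate(struct page *page)
    {
            VM_BUG_ON_PAGE(PageTail(page), page);
            smp_wmb();      /* orders the preceding page-content writes */
            __set_bit(PG_uptodate, &page->flags);
    }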
2021-05-05  khugepaged: remove unneeded return value of khugepaged_collapse_pte_mapped_thps()  (Miaohe Lin, 1 file changed, -6/+4)
Patch series "Cleanup and fixup for khugepaged", v2. This series contains cleanups to remove an unneeded return value, use a helper function and so on. And there is one fix to correct the wrong result value for trace_mm_collapse_huge_page_isolate(). This patch (of 4): The return value of khugepaged_collapse_pte_mapped_thps() has never been checked since it was introduced. We should remove this unneeded return value. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ebru Akagunduz <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/hugetlb: avoid calculating fault_mutex_hash in truncate_op case  (Miaohe Lin, 1 file changed, -2/+2)
The fault_mutex hashing overhead can be avoided in truncate_op case because page faults can not race with truncation in this routine. So calculate hash for fault_mutex only in !truncate_op case to save some cpu cycles. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/hugetlb: simplify the code when alloc_huge_page() failed in hugetlb_no_page()  (Miaohe Lin, 1 file changed, -6/+3)
Rework the error handling code for the case where alloc_huge_page() fails, removing some duplicated code and simplifying the code slightly. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/hugetlb_cgroup: remove unnecessary VM_BUG_ON_PAGE in hugetlb_cgroup_migrate()  (Miaohe Lin, 1 file changed, -1/+0)
!PageHuge(oldhpage) is implicitly checked in page_hstate() above, so we remove this explicit one. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/hugetlb: optimize the surplus state transfer code in move_hugetlb_state()  (Miaohe Lin, 1 file changed, -0/+6)
We should not transfer the per-node surplus state when we do not cross the node in order to save some cpu cycles Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/hugetlb: use some helper functions to cleanup code  (Miaohe Lin, 2 files changed, -4/+4)
Patch series "Some cleanups for hugetlb". This series contains cleanups to remove unnecessary VM_BUG_ON_PAGE, use helper function and so on. I also collect some previous patches into this series in case they are forgotten. This patch (of 5): We could use pages_per_huge_page to get the number of pages per hugepage, use get_hstate_idx to calculate hstate index, and use hstate_is_gigantic to check if a hstate is gigantic to make code more succinct. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm: generalize HUGETLB_PAGE_SIZE_VARIABLE  (Anshuman Khandual, 3 files changed, -10/+9)
HUGETLB_PAGE_SIZE_VARIABLE need not be defined for each individual platform subscribing it. Instead just make it generic. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Suggested-by: Christoph Hellwig <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Michael Ellerman <[email protected]> [powerpc] Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/hugetlb: remove redundant reservation check condition in alloc_huge_page()  (Miaohe Lin, 1 file changed, -1/+1)
vma_resv_map(vma) checks if a reserve map is associated with the vma. The routine vma_needs_reservation() will check vma_resv_map(vma) and return 1 if no reserve map is present. map_chg is set to the return value of vma_needs_reservation(). Therefore, !vma_resv_map(vma) is redundant in the expression: map_chg || avoid_reserve || !vma_resv_map(vma); Remove the redundant check. [Thanks Mike Kravetz for reshaping this commit message!] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  hugetlb/userfaultfd: unshare all pmds for hugetlbfs when register wp  (Peter Xu, 3 files changed, -0/+58)
Huge pmd sharing for hugetlbfs is racy with userfaultfd-wp because userfaultfd-wp is always based on pgtable entries, so they cannot be shared. Walk the hugetlb range and unshare all such mappings if there is, right before UFFDIO_REGISTER will succeed and return to userspace. This will pair with want_pmd_share() in hugetlb code so that huge pmd sharing is completely disabled for userfaultfd-wp registered range. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Peter Xu <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Adam Ruprecht <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Cannon Matthews <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chinwen Chang <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Huang Ying <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: "Michal Koutn" <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shawn Anastasio <[email protected]> Cc: Steven Price <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm/hugetlb: move flush_hugetlb_tlb_range() into hugetlb.h  (Peter Xu, 2 files changed, -8/+8)
Prepare for it to be called outside of mm/hugetlb.c. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Axel Rasmussen <[email protected]> Cc: Adam Ruprecht <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Cannon Matthews <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chinwen Chang <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Huang Ying <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: "Michal Koutn" <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shawn Anastasio <[email protected]> Cc: Steven Price <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  hugetlb/userfaultfd: forbid huge pmd sharing when uffd enabled  (Peter Xu, 4 files changed, -8/+28)
Huge pmd sharing could bring problem to userfaultfd. The thing is that userfaultfd is running its logic based on the special bits on page table entries, however the huge pmd sharing could potentially share page table entries for different address ranges. That could cause issues on either: - When sharing huge pmd page tables for an uffd write protected range, the newly mapped huge pmd range will also be write protected unexpectedly, or, - When we try to write protect a range of huge pmd shared range, we'll first do huge_pmd_unshare() in hugetlb_change_protection(), however that also means the UFFDIO_WRITEPROTECT could be silently skipped for the shared region, which could lead to data loss. While at it, a few other things are done altogether: - Move want_pmd_share() from mm/hugetlb.c into linux/hugetlb.h, because that's definitely something that arch code would like to use too - ARM64 currently directly check against CONFIG_ARCH_WANT_HUGE_PMD_SHARE when trying to share huge pmd. Switch to the want_pmd_share() helper. - Move vma_shareable() from huge_pmd_share() into want_pmd_share(). [[email protected]: fix build with !ARCH_WANT_HUGE_PMD_SHARE] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Axel Rasmussen <[email protected]> Tested-by: Naresh Kamboju <[email protected]> Cc: Adam Ruprecht <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Cannon Matthews <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chinwen Chang <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Huang Ying <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: "Michal Koutn" <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shawn Anastasio <[email protected]> Cc: Steven Price <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  hugetlb: pass vma into huge_pte_alloc() and huge_pmd_share()  (Peter Xu, 11 files changed, -20/+24)
Patch series "hugetlb: Disable huge pmd unshare for uffd-wp", v4. This series tries to disable huge pmd unshare of hugetlbfs backed memory for uffd-wp. Although uffd-wp of hugetlbfs is still during rfc stage, the idea of this series may be needed for multiple tasks (Axel's uffd minor fault series, and Mike's soft dirty series), so I picked it out from the larger series. This patch (of 4): It is a preparation work to be able to behave differently in the per architecture huge_pte_alloc() according to different VMA attributes. Pass it deeper into huge_pmd_share() so that we can avoid the find_vma() call. [[email protected]: build fix] Link: https://lkml.kernel.org/r/20210304164653.GB397383@xz-x1Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Suggested-by: Mike Kravetz <[email protected]> Cc: Adam Ruprecht <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Cannon Matthews <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chinwen Chang <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Huang Ying <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: "Michal Koutn" <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shawn Anastasio <[email protected]> Cc: Steven Price <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm: remove nrexceptional from inode: remove BUG_ON  (Hugh Dickins, 1 file changed, -1/+8)
clear_inode()'s BUG_ON(!mapping_empty(&inode->i_data)) is unsafe: we know of two ways in which nodes can and do (on rare occasions) get left behind. Until those are fixed, do not BUG_ON() nor even WARN_ON(). Yes, this will then leak those nodes (or the next user of the struct inode may use them); but this has been happening for years, and the new BUG_ON(!mapping_empty) was only guilty of revealing that. A proper fix will follow, but no hurry. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm: remove nrexceptional from inode  (Matthew Wilcox (Oracle), 2 files changed, -3/+1)
We no longer track anything in nrexceptional, so remove it, saving 8 bytes per inode. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Tested-by: Vishal Verma <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  dax: account DAX entries as nrpages  (Matthew Wilcox (Oracle), 2 files changed, -6/+3)
Simplify mapping_needs_writeback() by accounting DAX entries as pages instead of exceptional entries. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Tested-by: Vishal Verma <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm: stop accounting shadow entries  (Matthew Wilcox (Oracle), 4 files changed, -19/+0)
We no longer need to keep track of how many shadow entries are present in a mapping. This saves a few writes to the inode and memory barriers. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Tested-by: Vishal Verma <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-05  mm: introduce and use mapping_empty()  (Matthew Wilcox (Oracle), 5 files changed, -19/+11)
Patch series "Remove nrexceptional tracking", v2. We actually use nrexceptional for very little these days. It's a minor pain to keep in sync with nrpages, but the pain becomes much bigger with the THP patches because we don't know how many indices a shadow entry occupies. It's easier to just remove it than keep it accurate. Also, we save 8 bytes per inode which is nothing to sneeze at; on my laptop, it would improve shmem_inode_cache from 22 to 23 objects per 16kB, and inode_cache from 26 to 27 objects. Combined, that saves a megabyte of memory from a combined usage of 25MB for both caches. Unfortunately, ext4 doesn't cross a magic boundary, so it doesn't save any memory for ext4. This patch (of 4): Instead of checking the two counters (nrpages and nrexceptional), we can just check whether i_pages is empty. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Tested-by: Vishal Verma <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>