Age | Commit message | Author | Files | Lines
2022-07-17 | mm: kfence: pass a pointer to virt_to_page() | Linus Walleij | 1 | -2/+2
Functions that work on a pointer to virtual memory such as virt_to_pfn() and users of that function such as virt_to_page() are supposed to pass a pointer to virtual memory, ideally a (void *) or other pointer. However since many architectures implement virt_to_pfn() as a macro, this function becomes polymorphic and accepts both a (unsigned long) and a (void *). If we instead implement a proper virt_to_pfn(void *addr) function the following happens (occurred on arch/arm): mm/kfence/core.c:558:30: warning: passing argument 1 of 'virt_to_pfn' makes pointer from integer without a cast [-Wint-conversion] In one case we can refer to __kfence_pool directly (and that is a proper (char *) pointer) and in the other call site we use an explicit cast. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Walleij <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Jason Gunthorpe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | mm/highmem: pass a pointer to virt_to_page() | Linus Walleij | 1 | -1/+1
Functions that work on a pointer to virtual memory such as virt_to_pfn() and users of that function such as virt_to_page() are supposed to pass a pointer to virtual memory, ideally a (void *) or other pointer. However since many architectures implement virt_to_pfn() as a macro, this function becomes polymorphic and accepts both a (unsigned long) and a (void *). If we instead implement a proper virt_to_pfn(void *addr) function the following happens (occurred on arch/arm): mm/highmem.c:153:29: warning: passing argument 1 of 'virt_to_pfn' makes pointer from integer without a cast [-Wint-conversion] We already have a proper void * pointer in the scope of this function named "vaddr" so pass that instead. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Walleij <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Marco Elver <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | lib/test_free_pages.c: pass a pointer to virt_to_page() | Linus Walleij | 1 | -1/+1
In a recent change to the Arm architecture with the end goal of removing highmem we need to convert virt_to_phys() and virt_to_pfn() to static inline functions. This will make them strongly typed. However since virt_to_* is always implemented as macros they have become polymorphic and accept both (void *) and e.g. unsigned long as arguments. Other functions such as virt_to_page() simply wrap virt_to_pfn() and get affected indirectly. To be able to proceed, patch mm to use (void *) as argument to affected functions in all instances. This patch (of 5): A pointer into virtual memory is represented by a (void *) not an u32, so the compiler warns: lib/test_free_pages.c:20:50: warning: passing argument 1 of 'virt_to_pfn' makes pointer from integer without a cast [-Wint-conversion] Fix this with an explicit cast. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Walleij <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Marco Elver <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | mm/memcontrol.c: replace cgroup_memory_nokmem with mem_cgroup_kmem_disabled() | Xiang Yang | 1 | -2/+2
mem_cgroup_kmem_disabled() checks whether the kmem accounting is off. Therefore, replace cgroup_memory_nokmem with mem_cgroup_kmem_disabled(), which does the same work and is what percpu.c and slab_common.c already use. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Xiang Yang <[email protected]> Reviewed-by: Muchun Song <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Souptick Joarder (HPE) <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | mm/page_alloc: replace local_lock with normal spinlock | Mel Gorman | 1 | -45/+95
struct per_cpu_pages is no longer strictly local as PCP lists can be drained remotely using a lock for protection. While the use of local_lock works, it goes against the intent of local_lock which is for "pure CPU local concurrency control mechanisms and not suited for inter-CPU concurrency control" (Documentation/locking/locktypes.rst) local_lock protects against migration between when the percpu pointer is accessed and the pcp->lock acquired. The lock acquisition is a preemption point so in the worst case, a task could migrate to another NUMA node and accidentally allocate remote memory. The main requirement is to pin the task to a CPU that is suitable for PREEMPT_RT and !PREEMPT_RT. Replace local_lock with helpers that pin a task to a CPU, lookup the per-cpu structure and acquire the embedded lock. It's similar to local_lock without breaking the intent behind the API. It is not a complete API as only the parts needed for PCP-alloc are implemented but in theory, the generic helpers could be promoted to a general API if there was demand for an embedded lock within a per-cpu struct with a guarantee that the per-cpu structure locked matches the running CPU and cannot use get_cpu_var due to RT concerns. PCP requires these semantics to avoid accidentally allocating remote memory. [[email protected]: use pcp_spin_trylock_irqsave instead of pcpu_spin_trylock_irqsave] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Tested-by: Yu Zhao <[email protected]> Reviewed-by: Nicolas Saenz Julienne <[email protected]> Tested-by: Nicolas Saenz Julienne <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Tested-by: Yu Zhao <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
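As an illustration of the pattern described above (pin the task, look up the per-cpu structure, take its embedded lock), a minimal sketch follows. The helper name and details are made up for illustration and are not the exact code merged in mm/page_alloc.c:

    /* Illustrative sketch only; not the exact mm/page_alloc.c helpers. */
    static struct per_cpu_pages *pcp_spin_lock_irqsave_sketch(struct zone *zone,
                                                              unsigned long *flags)
    {
            struct per_cpu_pages *pcp;

            migrate_disable();                         /* pin the task to this CPU */
            pcp = this_cpu_ptr(zone->per_cpu_pageset); /* look up this CPU's PCP */
            spin_lock_irqsave(&pcp->lock, *flags);     /* take the embedded lock */
            return pcp;
    }

The unlock side would release the spinlock and then re-enable migration, mirroring the acquire order.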
2022-07-17 | mm/page_alloc: remotely drain per-cpu lists | Nicolas Saenz Julienne | 1 | -54/+4
Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu drain work queued by __drain_all_pages(). So introduce a new mechanism to remotely drain the per-cpu lists. It is made possible by remotely locking 'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this new scheme is that drain operations are now migration safe. There was no observed performance degradation vs. the previous scheme. Both netperf and hackbench were run in parallel with triggering the __drain_all_pages(NULL, true) code path around ~100 times per second. The new scheme performs a bit better (~5%), although the important point here is that there are no performance regressions vs. the previous mechanism. Per-cpu lists draining happens only in slow paths. Minchan Kim tested an earlier version and reported:

    My workload is not NOHZ CPUs but run apps under heavy memory
    pressure so they goes to direct reclaim and be stuck on
    drain_all_pages until work on workqueue run.

        unit: nanosecond
        max(dur)     avg(dur)              count(dur)
        166713013    487511.77786438033    1283

    From traces, system encountered the drain_all_pages 1283 times and
    worst case was 166ms and avg was 487us.

    The other problem was alloc_contig_range in CMA. The PCP draining
    takes several hundred millisecond sometimes though there is no
    memory pressure or a few of pages to be migrated out but CPU were
    fully booked.

    Your patch perfectly removed those wasted time.

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nicolas Saenz Julienne <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Tested-by: Yu Zhao <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | mm/page_alloc: protect PCP lists with a spinlock | Mel Gorman | 2 | -21/+99
Currently the PCP lists are protected by using local_lock_irqsave to prevent migration and IRQ reentrancy, but this is inconvenient. Remote draining of the lists is impossible, a workqueue is required, and every task allocation/free must disable and then re-enable interrupts, which is expensive. As preparation for dealing with both of those problems, protect the lists with a spinlock. The IRQ-unsafe version of the lock is used because IRQs are already disabled by local_lock_irqsave. spin_trylock is used in combination with local_lock_irqsave() but will later be replaced with a spin_trylock_irqsave when the local_lock is removed. The per_cpu_pages still fits within the same number of cache lines after this patch relative to before the series:

    struct per_cpu_pages {
            spinlock_t                 lock;            /*     0     4 */
            int                        count;           /*     4     4 */
            int                        high;            /*     8     4 */
            int                        batch;           /*    12     4 */
            short int                  free_factor;     /*    16     2 */
            short int                  expire;          /*    18     2 */

            /* XXX 4 bytes hole, try to pack */

            struct list_head           lists[13];       /*    24   208 */

            /* size: 256, cachelines: 4, members: 7 */
            /* sum members: 228, holes: 1, sum holes: 4 */
            /* padding: 24 */
    } __attribute__((__aligned__(64)));

There is overhead in the fast path due to acquiring the spinlock even though the spinlock is per-cpu and uncontended in the common case. The Page Fault Test (PFT) reported the following results on a 1-socket machine:

                                     5.19.0-rc3               5.19.0-rc3
                                        vanilla      mm-pcpspinirq-v5r16
    Hmean  faults/sec-1    869275.7381 (  0.00%)    874597.5167 *  0.61%*
    Hmean  faults/sec-3   2370266.6681 (  0.00%)   2379802.0362 *  0.40%*
    Hmean  faults/sec-5   2701099.7019 (  0.00%)   2664889.7003 * -1.34%*
    Hmean  faults/sec-7   3517170.9157 (  0.00%)   3491122.8242 * -0.74%*
    Hmean  faults/sec-8   3965729.6187 (  0.00%)   3939727.0243 * -0.66%*

There is a small hit in the number of faults per second but given that the results are more stable, it's borderline noise. [[email protected]: add missing local_unlock_irqrestore() on contention path] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Tested-by: Yu Zhao <[email protected]> Reviewed-by: Nicolas Saenz Julienne <[email protected]> Tested-by: Nicolas Saenz Julienne <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | mm/page_alloc: remove mistaken page == NULL check in rmqueue | Mel Gorman | 1 | -3/+1
If a page allocation fails, the ZONE_BOOSTER_WATERMARK should be tested, cleared and kswapd woken whether the allocation attempt was via the PCP or directly via the buddy list. Remove the page == NULL so the ZONE_BOOSTED_WATERMARK bit is checked unconditionally. As it is unlikely that ZONE_BOOSTED_WATERMARK is set, mark the branch accordingly. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Tested-by: Yu Zhao <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nicolas Saenz Julienne <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | mm/page_alloc: split out buddy removal code from rmqueue into separate helper | Mel Gorman | 1 | -34/+47
This is a preparation page to allow the buddy removal code to be reused in a later patch. No functional change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Tested-by: Minchan Kim <[email protected]> Acked-by: Minchan Kim <[email protected]> Reviewed-by: Nicolas Saenz Julienne <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Tested-by: Yu Zhao <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | mm/page_alloc: use only one PCP list for THP-sized allocations | Mel Gorman | 2 | -6/+9
The per_cpu_pages is cache-aligned on a standard x86-64 distribution configuration but a later patch will add a new field which would push the structure into the next cache line. Use only one list to store THP-sized pages on the per-cpu list. This assumes that the vast majority of THP-sized allocations are GFP_MOVABLE but even if it was another type, it would not contribute to serious fragmentation that potentially causes a later THP allocation failure. Align per_cpu_pages on the cacheline boundary to ensure there is no false cache sharing. After this patch, the structure sizing is:

    struct per_cpu_pages {
            int                        count;           /*     0     4 */
            int                        high;            /*     4     4 */
            int                        batch;           /*     8     4 */
            short int                  free_factor;     /*    12     2 */
            short int                  expire;          /*    14     2 */
            struct list_head           lists[13];       /*    16   208 */

            /* size: 256, cachelines: 4, members: 6 */
            /* padding: 32 */
    } __attribute__((__aligned__(64)));

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Tested-by: Minchan Kim <[email protected]> Acked-by: Minchan Kim <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Tested-by: Yu Zhao <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nicolas Saenz Julienne <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | mm/page_alloc: add page->buddy_list and page->pcp_list | Mel Gorman | 2 | -12/+17
Patch series "Drain remote per-cpu directly", v5. Some setups, notably NOHZ_FULL CPUs, may be running realtime or latency-sensitive applications that cannot tolerate interference due to per-cpu drain work queued by __drain_all_pages(). Introduce a new mechanism to remotely drain the per-cpu lists. It is made possible by remotely locking 'struct per_cpu_pages' new per-cpu spinlocks. This has two advantages, the time to drain is more predictable and other unrelated tasks are not interrupted. This series has the same intent as Nicolas' series "mm/page_alloc: Remote per-cpu lists drain support" -- avoid interference of a high priority task due to a workqueue item draining per-cpu page lists. While many workloads can tolerate a brief interruption, it may cause a real-time task running on a NOHZ_FULL CPU to miss a deadline and at minimum, the draining is non-deterministic. Currently an IRQ-safe local_lock protects the page allocator per-cpu lists. The local_lock on its own prevents migration and the IRQ disabling protects from corruption due to an interrupt arriving while a page allocation is in progress. This series adjusts the locking. A spinlock is added to struct per_cpu_pages to protect the list contents while local_lock_irq is ultimately replaced by just the spinlock in the final patch. This allows a remote CPU to safely. Follow-on work should allow the spin_lock_irqsave to be converted to spin_lock to avoid IRQs being disabled/enabled in most cases. The follow-on patch will be one kernel release later as it is relatively high risk and it'll make bisections more clear if there are any problems. Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages and when it is storing per-cpu pages. Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking this is not necessary but it avoids per_cpu_pages consuming another cache line. Patch 3 is a preparation patch to avoid code duplication. Patch 4 is a minor correction. Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still relying on local_lock to prevent migration, stabilise the pcp lookup and prevent IRQ reentrancy. Patch 6 remote drains per-cpu pages directly instead of using a workqueue. Patch 7 uses a normal spinlock instead of local_lock for remote draining This patch (of 7): The page allocator uses page->lru for storing pages on either buddy or PCP lists. Create page->buddy_list and page->pcp_list as a union with page->lru. This is simply to clarify what type of list a page is on in the page allocator. No functional change intended. [[email protected]: fix page lru fields in macros] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Tested-by: Minchan Kim <[email protected]> Acked-by: Minchan Kim <[email protected]> Reviewed-by: Nicolas Saenz Julienne <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Tested-by: Yu Zhao <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Marek Szyprowski <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | hugetlb: lazy page table copies in fork() | Mike Kravetz | 1 | -1/+1
Lazy page table copying at fork time was introduced with commit d992895ba2b2 ("[PATCH] Lazy page table copies in fork()"). At the time, hugetlb was very new and did not support page faulting. As a result, it was excluded. When full page fault support was added for hugetlb, the exclusion was not removed. Simply remove the check that prevents lazy copying of hugetlb page tables at fork. Of course, like other mappings this only applies to shared mappings. Lazy page table copying at fork will be less advantageous for hugetlb mappings because: - There are fewer page table entries with hugetlb - hugetlb pmds can be shared instead of copied In any case, completely eliminating the copy at fork time should speed things up. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Muchun Song <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: James Houghton <[email protected]> Cc: kernel test robot <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Peter Xu <[email protected]> Cc: Rolf Eike Beer <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | hugetlb: do not update address in huge_pmd_unshare | Mike Kravetz | 3 | -31/+21
As an optimization for loops sequentially processing hugetlb address ranges, huge_pmd_unshare would update a passed address if it unshared a pmd. Updating a loop control variable outside the loop like this is generally a bad idea. These loops are now using hugetlb_mask_last_page to optimize scanning when non-present ptes are discovered. The same can be done when huge_pmd_unshare returns 1 indicating a pmd was unshared. Remove address update from huge_pmd_unshare. Change the passed argument type and update all callers. In loops sequentially processing addresses use hugetlb_mask_last_page to update address if pmd is unshared. [[email protected]: fix an unused variable warning/error] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]> Acked-by: Muchun Song <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: James Houghton <[email protected]> Cc: kernel test robot <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Peter Xu <[email protected]> Cc: Rolf Eike Beer <[email protected]> Cc: Will Deacon <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
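A simplified fragment illustrating the caller-side convention described above; the real unmap/fork/remap loops differ in detail:

    /* Simplified; real callers (e.g. the unmap loop) differ in detail. */
    unsigned long last_addr_mask = hugetlb_mask_last_page(h);

    for (address = start; address < end; address += sz) {
            ptep = huge_pte_offset(mm, address, sz);
            if (!ptep) {
                    /* No page table page here: skip to the end of its range. */
                    address |= last_addr_mask;
                    continue;
            }
            if (huge_pmd_unshare(mm, vma, address, ptep)) {
                    /* pmd was unshared: the rest of this PT page is gone too. */
                    address |= last_addr_mask;
                    continue;
            }
            /* ... process the pte at this address ... */
    }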
2022-07-17 | arm64/hugetlb: implement arm64 specific hugetlb_mask_last_page | Baolin Wang | 1 | -0/+22
The HugeTLB address ranges are linearly scanned during fork, unmap and remap operations, and the linear scan can skip to the end of range mapped by the page table page if hitting a non-present entry, which can help to speed linear scanning of the HugeTLB address ranges. So hugetlb_mask_last_page() is introduced to help to update the address in the loop of HugeTLB linear scanning with getting the last huge page mapped by the associated page table page[1], when a non-present entry is encountered. Considering ARM64 specific cont-pte/pmd size HugeTLB, this patch implemented an ARM64 specific hugetlb_mask_last_page() to help this case. [1] https://lore.kernel.org/linux-mm/[email protected]/ [[email protected]: fix build] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baolin Wang <[email protected]> Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Muchun Song <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: James Houghton <[email protected]> Cc: kernel test robot <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Peter Xu <[email protected]> Cc: Rolf Eike Beer <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | hugetlb: skip to end of PT page mapping when pte not present | Mike Kravetz | 2 | -5/+52
Patch series "hugetlb: speed up linear address scanning", v2. At unmap, fork and remap time hugetlb address ranges are linearly scanned. We can optimize these scans if the ranges are sparsely populated. Also, enable page table "Lazy copy" for hugetlb at fork. NOTE: Architectures not defining CONFIG_ARCH_WANT_GENERAL_HUGETLB need to add an arch specific version hugetlb_mask_last_page() to take advantage of sparse address scanning improvements. Baolin Wang added the routine for arm64. Other architectures which could be optimized are: ia64, mips, parisc, powerpc, s390, sh and sparc. This patch (of 4): HugeTLB address ranges are linearly scanned during fork, unmap and remap operations. If a non-present entry is encountered, the code currently continues to the next huge page aligned address. However, a non-present entry implies that the page table page for that entry is not present. Therefore, the linear scan can skip to the end of range mapped by the page table page. This can speed operations on large sparsely populated hugetlb mappings. Create a new routine hugetlb_mask_last_page() that will return an address mask. When the mask is ORed with an address, the result will be the address of the last huge page mapped by the associated page table page. Use this mask to update addresses in routines which linearly scan hugetlb address ranges when a non-present pte is encountered. hugetlb_mask_last_page is related to the implementation of huge_pte_offset as hugetlb_mask_last_page is called when huge_pte_offset returns NULL. This patch only provides a complete hugetlb_mask_last_page implementation when CONFIG_ARCH_WANT_GENERAL_HUGETLB is defined. Architectures which provide their own versions of huge_pte_offset can also provide their own version of hugetlb_mask_last_page. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Tested-by: Baolin Wang <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Acked-by: Muchun Song <[email protected]> Reported-by: kernel test robot <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Peter Xu <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: James Houghton <[email protected]> Cc: Mina Almasry <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Rolf Eike Beer <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | kasan: separate double free case from invalid free | Kuan-Ying Lee | 3 | -9/+14
Currently, KASAN describes all invalid-free/double-free bugs as "double-free or invalid-free". This is ambiguous. KASAN should report "double-free" when a double-free is a more likely cause (the address points to the start of an object) and report "invalid-free" otherwise [1]. [1] https://bugzilla.kernel.org/show_bug.cgi?id=212193 Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kuan-Ying Lee <[email protected]> Reviewed-by: Dmitry Vyukov <[email protected]> Reviewed-by: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Matthias Brugger <[email protected]> Cc: Chinwen Chang <[email protected]> Cc: Yee Lee <[email protected]> Cc: Andrew Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
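The distinction boils down to a check along these lines; this is a simplified sketch, and nearest_object_start() is a hypothetical stand-in for the real slab object lookup, not a KASAN API:

    /* Simplified sketch of the reporting decision. */
    static const char *free_bug_type(struct kmem_cache *cache, const void *addr)
    {
            if (addr == nearest_object_start(cache, addr))  /* hypothetical helper */
                    return "double-free";   /* points at the start of an object */
            return "invalid-free";          /* offset or arbitrary pointer */
    }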
2022-07-17 | doc: proc: fix the description to THPeligible | Yang Shi | 1 | -1/+3
The THPeligible bit shows 1 if and only if the VMA is eligible for allocating THP and the THP is also PMD mappable. Some misaligned file VMAs may be eligible for allocating THP but the THP can't be mapped by PMD. Make this more explicit to avoid ambiguity. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Zach O'Keefe <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | mm: khugepaged: reorg some khugepaged helpers | Yang Shi | 4 | -23/+21
The khugepaged_{enabled|always|req_madv} helpers are not khugepaged-only anymore; move them to huge_mm.h, rename them to hugepage_flags_xxx, and remove khugepaged_req_madv since it has no users. Also move khugepaged_defrag to khugepaged.c since its only caller is in that file; it doesn't have to be in a header file. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Zach O'Keefe <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | mm: thp: kill __transhuge_page_enabled() | Yang Shi | 5 | -73/+52
The page fault path checks THP eligibility with __transhuge_page_enabled(), which does a similar check to hugepage_vma_check(), so use hugepage_vma_check() instead. However, the page fault path allows DAX and !anon_vma cases, so add a new flag, in_pf, to hugepage_vma_check() to make page faults work correctly. The in_pf flag is also used to skip shmem and file THP for page faults, since shmem handles THP in its own shmem_fault() and file THP allocation on fault is not supported yet. Also remove hugepage_vma_enabled() since hugepage_vma_check() is now its only caller; it is not necessary to keep a separate helper function. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Zach O'Keefe <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | mm: thp: kill transparent_hugepage_active() | Yang Shi | 5 | -62/+59
transparent_hugepage_active() was introduced to show the THP eligibility bit in smaps in proc; smaps is its only user. But it actually does a similar check to hugepage_vma_check(), which is used by khugepaged. We definitely don't have to maintain two similar checks, so kill transparent_hugepage_active(). This patch also fixes the wrong behavior for VM_NO_KHUGEPAGED vmas. Also move hugepage_vma_check() to huge_memory.c and huge_mm.h since it is not only for khugepaged anymore. [[email protected]: check vma->vm_mm, per Zach] [[email protected]: add comment to vdso check] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Zach O'Keefe <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | mm: khugepaged: better comments for anon vma check in hugepage_vma_revalidate | Yang Shi | 1 | -1/+7
hugepage_vma_revalidate() needs to check whether the vma is still an anonymous vma, since the address may be unmapped and then remapped to a file before khugepaged reacquired the mmap_lock. The old comment is not quite helpful; elaborate on this with a better comment. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Zach O'Keefe <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | mm: thp: consolidate vma size check to transhuge_vma_suitable | Yang Shi | 4 | -28/+18
There are a couple of places that check whether the vma size is OK for THP or whether the address fits; they are open coded and duplicated. Use transhuge_vma_suitable() to do the job by passing in (vma->end - HPAGE_PMD_SIZE). Move the vma size check into hugepage_vma_check(). This makes khugepaged_enter() the same as khugepaged_enter_vma(). There is just one caller of khugepaged_enter(), so replace it with khugepaged_enter_vma() and remove khugepaged_enter(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Zach O'Keefe <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
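The consolidated size/alignment test is conceptually just the following; this is a simplified sketch with an illustrative function name, not the exact transhuge_vma_suitable() body:

    /* Simplified: does a PMD-aligned huge page at addr fit inside the VMA? */
    static bool thp_fits_vma(struct vm_area_struct *vma, unsigned long addr)
    {
            unsigned long haddr = addr & HPAGE_PMD_MASK;

            return haddr >= vma->vm_start &&
                   haddr + HPAGE_PMD_SIZE <= vma->vm_end;
    }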
2022-07-17 | mm: khugepaged: check THP flag in hugepage_vma_check() | Yang Shi | 1 | -0/+3
Patch series "Cleanup transhuge_xxx helpers", v5. This series is the follow-up of the discussion about cleaning up transhuge_xxx helpers at https://lore.kernel.org/linux-mm/[email protected]/. THP has a bunch of helpers that do VMA sanity check for different paths, they do the similar checks for the most callsites and have a lot duplicate codes. And it is confusing what helpers should be used at what conditions. This series reorganized and cleaned up the code so that we could consolidate all the checks into hugepage_vma_check(). The transhuge_vma_enabled(), transparent_hugepage_active() and __transparent_hugepage_enabled() are killed by this series. This patch (of 7): Currently the THP flag check in hugepage_vma_check() will fallthrough if the flag is NEVER and VM_HUGEPAGE is set. This is not a problem for now since all the callers have the flag checked before or can't be invoked if the flag is NEVER. However, the following patch will call hugepage_vma_check() in more places, for example, page fault, so this flag must be checked in hugepge_vma_check(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Zach O'Keefe <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | xfs: add dax dedupe support | Shiyang Ruan | 4 | -7/+69
Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files who are going to be deduped. After that, call compare range function only when files are both DAX or not. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Al Viro <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Ritesh Harjani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | xfs: support CoW in fsdax mode | Shiyang Ruan | 3 | -6/+58
In fsdax mode, WRITE and ZERO on a shared extent need CoW performed. After that, the newly allocated extents need to be remapped to the file. So, add a CoW identification in ->iomap_begin(), and implement ->iomap_end() to do the remapping work. [[email protected]: make xfs_dax_fault() static] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Cc: Al Viro <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Ritesh Harjani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | fsdax: dedup file range to use a compare function | Shiyang Ruan | 5 | -11/+130
With dax we cannot deal with readpage() etc. So, we create a dax comparison function which is similar with vfs_dedupe_file_range_compare(). And introduce dax_remap_file_range_prep() for filesystem use. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Goldwyn Rodrigues <[email protected]> Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Al Viro <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Ritesh Harjani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | fsdax: add dax_iomap_cow_copy() for dax zero | Shiyang Ruan | 1 | -8/+19
Punching a hole on a reflinked file needs dax_iomap_cow_copy() too; otherwise, data in the non-aligned area will not be correct. So, add the CoW operation for the non-aligned case in dax_memzero(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Ritesh Harjani <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Al Viro <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | fsdax: replace mmap entry in case of CoW | Shiyang Ruan | 1 | -35/+42
Replace the existing entry with the newly allocated one in case of CoW. Also, mark the entry as PAGECACHE_TAG_TOWRITE so writeback marks this entry as writeprotected. This helps with snapshots: new write page faults after a snapshot trigger a CoW. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Goldwyn Rodrigues <[email protected]> Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Ritesh Harjani <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Cc: Al Viro <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
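Conceptually, the entry replacement does something like the following with the XArray; this is a sketch, and the real dax code also handles entry locking and encoding:

    /* Sketch: store the new entry and, for CoW, tag it so writeback
     * write-protects it and a later write fault triggers another CoW. */
    xas_store(&xas, new_entry);
    if (cow)
            xas_set_mark(&xas, PAGECACHE_TAG_TOWRITE);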
2022-07-17 | fsdax: introduce dax_iomap_cow_copy() | Shiyang Ruan | 1 | -5/+83
In the case where the iomap is a write operation and iomap is not equal to srcmap after iomap_begin, we consider it is a CoW operation. In this case, the destination (iomap->addr) points to a newly allocated extent. It is needed to copy the data from srcmap to the extent. In theory, it is better to copy the head and tail ranges which is outside of the non-aligned area instead of copying the whole aligned range. But in dax page fault, it will always be an aligned range. So copy the whole range in this case. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Cc: Al Viro <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Ritesh Harjani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | fsdax: output address in dax_iomap_pfn() and rename it | Shiyang Ruan | 1 | -5/+13
Add address output in dax_iomap_pfn() in order to perform a memcpy() in CoW case. Since this function both output address and pfn, rename it to dax_iomap_direct_access(). [[email protected]: initialize `rc', per Dan] Link: https://lore.kernel.org/linux-fsdevel/Yp8FUZnO64Qvyx5G@kili/ Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Ritesh Harjani <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Cc: Al Viro <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | fsdax: set a CoW flag when associate reflink mappings | Shiyang Ruan | 2 | -9/+47
Introduce a PAGE_MAPPING_DAX_COW flag to support association with CoW file mappings. In this case, since the dax-rmap has already taken the responsibility of looking up shared files for a given dax page, page->mapping is no longer used for rmap but for marking that this dax page is shared. And to make sure disassociation works fine, we use page->index as a refcount, and clear page->mapping back to the initial state when page->index is decreased to 0. With the help of this new flag, it is possible to distinguish the normal case from the CoW case, and keep the warning in the normal case. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Cc: Al Viro <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Ritesh Harjani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
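A rough sketch of the association rule described above; this is not the exact fs/dax.c helper, only the flag name comes from this patch and the surrounding variables are assumed from context:

    /* Rough sketch: shared (CoW) dax pages use page->mapping as a marker
     * and page->index as a share count instead of an rmap back-pointer. */
    if (cow) {
            if ((unsigned long)page->mapping != PAGE_MAPPING_DAX_COW) {
                    page->mapping = (struct address_space *)PAGE_MAPPING_DAX_COW;
                    page->index = 0;
            }
            page->index++;           /* one more file mapping shares this page */
    } else {
            WARN_ON_ONCE(page->mapping);
            page->mapping = mapping; /* normal 1:1 association */
            page->index = index;
    }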
2022-07-17 | xfs: implement ->notify_failure() for XFS | Shiyang Ruan | 6 | -3/+238
Introduce xfs_notify_failure.c to handle failure-related work, such as implementing ->notify_failure() and registering/unregistering the dax holder in xfs. If the rmap feature of XFS is enabled, we can query it to find files and metadata which are associated with the corrupt data. For now all we do is kill processes with that file mapped into their address spaces, but future patches could actually do something about corrupt metadata. After that, the memory failure needs to notify the processes who are using those files. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Cc: Al Viro <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Ritesh Harjani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | mm: introduce mf_dax_kill_procs() for fsdax case | Shiyang Ruan | 2 | -10/+88
This new function is a variant of mf_generic_kill_procs that accepts a file, offset pair instead of a struct to support multiple files sharing a DAX mapping. It is intended to be called by the file systems as part of the memory_failure handler after the file system performed a reverse mapping from the storage address to the file and file offset. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: Al Viro <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Ritesh Harjani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | fsdax: introduce dax_lock_mapping_entry() | Shiyang Ruan | 2 | -0/+78
The current dax_lock_page() locks the dax entry by obtaining the mapping and index from the page. To support 1-to-N RMAP in NVDIMM, we need a new function to lock a specific dax entry corresponding to this file's (mapping, index) pair, and to output the page corresponding to that dax entry for the caller's use. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Cc: Al Viro <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Ritesh Harjani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | pagemap,pmem: introduce ->memory_failure() | Shiyang Ruan | 3 | -0/+43
When a memory failure occurs, we call this function, which is implemented by each kind of device. For the fsdax case, the pmem device driver implements it. The pmem device driver will find out the filesystem in which the corrupted page is located. With dax_holder notify support, we are able to notify the memory failure from the pmem driver to the upper layers. If there is something not supported in the notify routine, memory_failure will fall back to the generic handler. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Al Viro <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Ritesh Harjani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
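The new hook lives in the dev_pagemap operations; the shape is roughly the following (a sketch of the prototype, see include/linux/memremap.h for the exact definition):

    struct dev_pagemap_ops {
            /* ... existing callbacks ... */

            /*
             * Called by memory_failure() for poisoned pfns owned by this
             * pagemap; the pmem driver forwards the range to the dax holder.
             */
            int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
                                  unsigned long nr_pages, int mf_flags);
    };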
2022-07-17 | mm: factor helpers for memory_failure_dev_pagemap | Shiyang Ruan | 1 | -76/+95
The memory_failure_dev_pagemap code is a bit complex before introducing the RMAP feature for fsdax, so factor out some helper functions to simplify the code. [[email protected]: fix CONFIG_HUGETLB_PAGE=n build] [[email protected]: fix redefinition of mf_generic_kill_procs] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shiyang Ruan <[email protected]> Signed-off-by: Zheng Bin <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: Al Viro <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Ritesh Harjani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | dax: introduce holder for dax_device | Shiyang Ruan | 7 | -23/+110
Patch series "v14 fsdax-rmap + v11 fsdax-reflink", v2. The patchset fsdax-rmap is aimed to support shared pages tracking for fsdax. It moves owner tracking from dax_assocaite_entry() to pmem device driver, by introducing an interface ->memory_failure() for struct pagemap. This interface is called by memory_failure() in mm, and implemented by pmem device. Then call holder operations to find the filesystem which the corrupted data located in, and call filesystem handler to track files or metadata associated with this page. Finally we are able to try to fix the corrupted data in filesystem and do other necessary processing, such as killing processes who are using the files affected. The call trace is like this: memory_failure() |* fsdax case |------------ |pgmap->ops->memory_failure() => pmem_pgmap_memory_failure() | dax_holder_notify_failure() => | dax_device->holder_ops->notify_failure() => | - xfs_dax_notify_failure() | |* xfs_dax_notify_failure() | |-------------------------- | | xfs_rmap_query_range() | | xfs_dax_failure_fn() | | * corrupted on metadata | | try to recover data, call xfs_force_shutdown() | | * corrupted on file data | | try to recover data, call mf_dax_kill_procs() |* normal case |------------- |mf_generic_kill_procs() The patchset fsdax-reflink attempts to add CoW support for fsdax, and takes XFS, which has both reflink and fsdax features, as an example. One of the key mechanisms needed to be implemented in fsdax is CoW. Copy the data from srcmap before we actually write data to the destination iomap. And we just copy range in which data won't be changed. Another mechanism is range comparison. In page cache case, readpage() is used to load data on disk to page cache in order to be able to compare data. In fsdax case, readpage() does not work. So, we need another compare data with direct access support. With the two mechanisms implemented in fsdax, we are able to make reflink and fsdax work together in XFS. This patch (of 14): To easily track filesystem from a pmem device, we introduce a holder for dax_device structure, and also its operation. This holder is used to remember who is using this dax_device: - When it is the backend of a filesystem, the holder will be the instance of this filesystem. - When this pmem device is one of the targets in a mapped device, the holder will be this mapped device. In this case, the mapped device has its own dax_device and it will follow the first rule. So that we can finally track to the filesystem we needed. The holder and holder_ops will be set when filesystem is being mounted, or an target device is being activated. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Jane Chu <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Al Viro <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Dan Williams <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Ritesh Harjani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | tools: add selftests to hmm for COW in device memory | Alex Sierra | 1 | -0/+80
The objective is to test device migration mechanism in pages marked as COW, for private and coherent device type. In case of writing to COW private page(s), a page fault will migrate pages back to system memory first. Then, these pages will be duplicated. In case of COW device coherent type, pages are duplicated directly from device memory. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Sierra <[email protected]> Acked-by: Felix Kuehling <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | tools: add hmm gup tests for device coherent type | Alex Sierra | 1 | -0/+110
The intention is to test hmm device coherent type under different get user pages paths. Also, test gup with FOLL_LONGTERM flag set in device coherent pages. These pages should get migrated back to system memory. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | tools: update test_hmm script to support SP config | Alex Sierra | 1 | -3/+21
Add two more parameters to set spm_addr_dev0 & spm_addr_dev1 addresses. These two parameters configure the start SP addresses for each device in test_hmm driver. Consequently, this configures zone device type as coherent. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Sierra <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Felix Kuehling <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | tools: update hmm-test to support device coherent type | Alex Sierra | 1 | -21/+100
Test cases such as migrate_fault and migrate_multiple were modified to explicitly migrate from device to system memory without the need for page faults, when using the device coherent type. The snapshot test case was updated to read the memory device type first and, based on that, get the proper returned results. A migrate_ping_pong test case was added to test explicit migration from device to system memory for both private and coherent zone types. Helpers to migrate from device to system memory and vice versa were also added. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Sierra <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Felix Kuehling <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | lib: add support for device coherent type in test_hmm | Alex Sierra | 2 | -61/+196
The Device Coherent type uses device memory that is coherently accessible by the CPU. This could be shown as an SP (special purpose) memory range in the BIOS-e820 memory enumeration. If no SP memory is supported in the system, it can be faked by setting CONFIG_EFI_FAKE_MEMMAP. Currently, test_hmm only supports two different SP ranges of at least 256MB size, which can be specified via the efi_fake_mem kernel parameter. For example, two SP ranges of 1GB starting at physical addresses 0x100000000 & 0x140000000: efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000 Private and coherent device mirror instances can be created in the same probe. This is done by passing the module parameters spm_addr_dev0 & spm_addr_dev1. In this case, it will create four instances of device_mirror. The first two correspond to the private device type, the last two to the coherent type. Then, they can be easily accessed from user space through /dev/hmm_mirror<num_device>. Usually num_device 0 and 1 are for private, and 2 and 3 for coherent types. If no module parameters are passed, only two instances of private type device_mirror will be created. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Sierra <[email protected]> Acked-by: Felix Kuehling <[email protected]> Reviewed-by: Alistair Poppple <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | lib: test_hmm add module param for zone device type | Alex Sierra | 2 | -21/+53
In order to configure device coherent in test_hmm, two module parameters should be passed, which correspond to the SP start address of each device (2) spm_addr_dev0 & spm_addr_dev1. If no parameters are passed, private device type is configured. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Sierra <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Felix Kuehling <[email protected]> Reviewed-by: Alistair Poppple <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | lib: test_hmm add ioctl to get zone device type | Alex Sierra | 2 | -6/+19
Add new ioctl cmd to query zone device type. This will be used once the test_hmm adds zone device coherent type. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Sierra <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Felix Kuehling <[email protected]> Reviewed-by: Alistair Poppple <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | drm/amdkfd: add SPM support for SVM | Alex Sierra | 1 | -13/+21
When the CPU is connected through XGMI, it has coherent access to VRAM resources. In this case that resource is taken from a table in the device gmc aperture base. This resource is used along with the device type, which could be DEVICE_PRIVATE or DEVICE_COHERENT, to create the device page map region. Also, the MIGRATE_VMA_SELECT_DEVICE_COHERENT flag is selected for the coherent type case during migration to the device. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Sierra <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Cc: Alistair Popple <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | mm/gup: migrate device coherent pages when pinning instead of failing | Alistair Popple | 3 | -7/+96
Currently any attempts to pin a device coherent page will fail. This is because device coherent pages need to be managed by a device driver, and pinning them would prevent a driver from migrating them off the device. However this is no reason to fail pinning of these pages. These are coherent and accessible from the CPU so can be migrated just like pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin them first try migrating them out of ZONE_DEVICE. [[email protected]: rebased to the split device memory checks, moved migrate_device_page to migrate_device.c] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alistair Popple <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Felix Kuehling <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17 | mm: add device coherent vma selection for memory migration | Alex Sierra | 2 | -3/+10
This case is used to migrate pages from device memory, back to system memory. Device coherent type memory is cache coherent from device and CPU point of view. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Sierra <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Felix Kuehling <[email protected]> Reviewed-by: Alistair Poppple <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
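Driver-side usage is along these lines; this is an illustrative sketch where vma, start, end, src_pfns, dst_pfns and drvdata are placeholders supplied by the driver:

    /* Illustrative: ask migrate_vma to collect device coherent pages in
     * [start, end) so they can be moved back to system memory. */
    struct migrate_vma args = {
            .vma         = vma,
            .start       = start,
            .end         = end,
            .src         = src_pfns,
            .dst         = dst_pfns,
            .pgmap_owner = drvdata,   /* placeholder owner cookie */
            .flags       = MIGRATE_VMA_SELECT_DEVICE_COHERENT,
    };

    if (migrate_vma_setup(&args))
            return -EFAULT;
    /* ... allocate system pages, copy data, migrate_vma_pages()/finalize() ... */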
2022-07-17 | mm: handling Non-LRU pages returned by vm_normal_pages | Alex Sierra | 10 | -16/+27
With DEVICE_COHERENT, we'll soon have vm_normal_pages() return device-managed anonymous pages that are not LRU pages. Although they behave like normal pages for purposes of mapping in CPU page tables and for COW, they do not support LRU lists, NUMA migration or THP. Callers of follow_page() currently don't expect ZONE_DEVICE pages; however, with DEVICE_COHERENT we might now return ZONE_DEVICE. Check for ZONE_DEVICE pages in applicable users of follow_page() as well. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Sierra <[email protected]> Acked-by: Felix Kuehling <[email protected]> [v2] Reviewed-by: Alistair Popple <[email protected]> [v6] Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
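A typical caller-side adjustment looks like this; it is a simplified fragment of the pattern this patch applies in several follow_page() users:

    /* Simplified pattern: bail out for device-managed (non-LRU) pages. */
    page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
    if (IS_ERR_OR_NULL(page))
            return;
    if (is_zone_device_page(page)) {
            /* ZONE_DEVICE: no LRU, no NUMA migration, no THP handling. */
            put_page(page);
            return;
    }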
2022-07-17 | mm: add zone device coherent type memory support | Alex Sierra | 7 | -17/+53
Device memory that is cache coherent from device and CPU point of view. This is used on platforms that have an advanced system bus (like CAPI or CXL). Any page of a process can be migrated to such memory. However, no one should be allowed to pin such memory so that it can always be evicted. [[email protected]: rebased ontop of the refcount changes, remove is_dev_private_or_coherent_page] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Sierra <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Felix Kuehling <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
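A driver would advertise such memory roughly as follows; this is an illustrative sketch in which pgmap, res, my_pgmap_ops and mydev are placeholder driver structures:

    /* Illustrative: register device memory as cache-coherent with the CPU. */
    void *addr;

    pgmap->type        = MEMORY_DEVICE_COHERENT;
    pgmap->range.start = res->start;     /* placeholder resource */
    pgmap->range.end   = res->end;
    pgmap->nr_range    = 1;
    pgmap->ops         = &my_pgmap_ops;  /* placeholder ops providing page_free() */
    pgmap->owner       = mydev;          /* placeholder owner cookie */

    addr = memremap_pages(pgmap, numa_node_id());
    if (IS_ERR(addr))
            return PTR_ERR(addr);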
2022-07-17 | mm: move page zone helpers from mm.h to mmzone.h | Alex Sierra | 3 | -79/+81
It makes more sense to have these helpers in the zone-specific header file (mmzone.h) rather than in the generic mm.h. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Sierra <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>