aboutsummaryrefslogtreecommitdiff
path: root/mm/rmap.c
AgeCommit message (Collapse)AuthorFilesLines
2020-12-15mm/lru: revise the comments of lru_lockHugh Dickins1-2/+2
Since we changed the pgdat->lru_lock to lruvec->lru_lock, it's time to fix the incorrect comments in code. Also fixed some zone->lru_lock comment error from ancient time. etc. I struggled to understand the comment above move_pages_to_lru() (surely it never calls page_referenced()), and eventually realized that most of it had got separated from shrink_active_list(): move that comment back. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Signed-off-by: Alex Shi <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Jann Horn <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Alexander Duyck <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Chen, Rong A" <[email protected]> Cc: Daniel Jordan <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mika Penttilä <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Wei Yang <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm/rmap: stop store reordering issue on page->mappingAlex Shi1-1/+7
Hugh Dickins and Minchan Kim observed a long time issue which discussed here, but actully the mentioned fix in https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop/ was missed. The store reordering may cause problem in the scenario: CPU 0 CPU1 do_anonymous_page page_add_new_anon_rmap() page->mapping = anon_vma + PAGE_MAPPING_ANON lru_cache_add_inactive_or_unevictable() spin_lock(lruvec->lock) SetPageLRU() spin_unlock(lruvec->lock) /* idletacking judged it as LRU * page so pass the page in * page_idle_clear_pte_refs */ page_idle_clear_pte_refs rmap_walk if PageAnon(page) Johannes give detailed examples how the store reordering could cause trouble: "The concern is the SetPageLRU may get reorder before 'page->mapping' setting, That would make CPU 1 will observe at page->mapping after observing PageLRU set on the page. 1. anon_vma + PAGE_MAPPING_ANON That's the in-order scenario and is fine. 2. NULL That's possible if the page->mapping store gets reordered to occur after SetPageLRU. That's fine too because we check for it. 3. anon_vma without the PAGE_MAPPING_ANON bit That would be a problem and could lead to all kinds of undesirable behavior including crashes and data corruption. Is it possible? AFAICT the compiler is allowed to tear the store to page->mapping and I don't see anything that would prevent it. That said, I also don't see how the reader testing PageLRU under the lru_lock would prevent that in the first place. AFAICT we need that WRITE_ONCE() around the page->mapping assignment." [[email protected]: updated for comments change from Johannes] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Shi <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Alexander Duyck <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: "Chen, Rong A" <[email protected]> Cc: Daniel Jordan <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Jann Horn <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mika Penttilä <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm/rmap: always do TTU_IGNORE_ACCESSShakeel Butt1-9/+0
Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2"), the code to check the secondary MMU's page table access bit is broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the secondary MMU's page table before the check. More specifically for those secondary MMUs which unmap the memory in mmu_notifier_invalidate_range_start() like kvm. However memory reclaim is the only user of !(TTU_IGNORE_ACCESS) or the absence of TTU_IGNORE_ACCESS and it explicitly performs the page table access check before trying to unmap the page. So, at worst the reclaim will miss accesses in a very short window if we remove page table access check in unmapping code. There is an unintented consequence of !(TTU_IGNORE_ACCESS) for the memcg reclaim. From memcg reclaim the page_referenced() only account the accesses from the processes which are in the same memcg of the target page but the unmapping code is considering accesses from all the processes, so, decreasing the effectiveness of memcg reclaim. The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping code. Link: https://lkml.kernel.org/r/[email protected] Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2") Signed-off-by: Shakeel Butt <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-11-14hugetlbfs: fix anon huge page migration raceMike Kravetz1-4/+1
Qian Cai reported the following BUG in [1] LTP: starting move_pages12 BUG: unable to handle page fault for address: ffffffffffffffe0 ... RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63 Call Trace: rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864 try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763 migrate_pages+0x1005/0x1fb0 move_pages_and_store_status.isra.47+0xd7/0x1a0 __x64_sys_move_pages+0xa5c/0x1100 do_syscall_64+0x5f/0x310 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Hugh Dickins diagnosed this as a migration bug caused by code introduced to use i_mmap_rwsem for pmd sharing synchronization. Specifically, the routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED flag to try_to_unmap() while holding i_mmap_rwsem. This is wrong for anon pages as the anon_vma_lock should be held in this case. Further analysis suggested that i_mmap_rwsem was not required to he held at all when calling try_to_unmap for anon pages as an anon page could never be part of a shared pmd mapping. Discussion also revealed that the hack in hugetlb_page_mapping_lock_write to drop page lock and acquire i_mmap_rwsem is wrong. There is no way to keep mapping valid while dropping page lock. This patch does the following: - Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when calling try_to_unmap. - Remove the hacky code in hugetlb_page_mapping_lock_write. The routine will now simply do a 'trylock' while still holding the page lock. If the trylock fails, it will return NULL. This could impact the callers: - migration calling code will receive -EAGAIN and retry up to the hard coded limit (10). - memory error code will treat the page as BUSY. This will force killing (SIGKILL) instead of SIGBUS any mapping tasks. Do note that this change in behavior only happens when there is a race. None of the standard kernel testing suites actually hit this race, but it is possible. [1] https://lore.kernel.org/lkml/[email protected]/ [2] https://lore.kernel.org/linux-mm/[email protected]/ Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization") Reported-by: Qian Cai <[email protected]> Suggested-by: Hugh Dickins <[email protected]> Signed-off-by: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16mm/rmap: fix assumptions of THP sizeMatthew Wilcox (Oracle)1-5/+5
Ask the page what size it is instead of assuming it's PMD size. Do this for anon pages as well as file pages for when someone decides to support that. Leave the assumption alone for pages which are PMD mapped; we don't currently grow THPs beyond PMD size, so we don't need to change this code yet. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Huang Ying <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-09-05mm/rmap: fixup copying of soft dirty and uffd ptesAlistair Popple1-2/+7
During memory migration a pte is temporarily replaced with a migration swap pte. Some pte bits from the existing mapping such as the soft-dirty and uffd write-protect bits are preserved by copying these to the temporary migration swap pte. However these bits are not stored at the same location for swap and non-swap ptes. Therefore testing these bits requires using the appropriate helper function for the given pte type. Unfortunately several code locations were found where the wrong helper function is being used to test soft_dirty and uffd_wp bits which leads to them getting incorrectly set or cleared during page-migration. Fix these by using the correct tests based on pte type. Fixes: a5430dda8a3a ("mm/migrate: support un-addressable ZONE_DEVICE page in migration") Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages") Fixes: f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration") Signed-off-by: Alistair Popple <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Alistair Popple <[email protected]> Cc: <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-08-14mm/rmap: annotate a data race at tlb_flush_batchedQian Cai1-1/+1
mm->tlb_flush_batched could be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one write to 0xffff93f754880bd0 of 1 bytes by task 822 on cpu 6: try_to_unmap_one+0x59a/0x1ab0 set_tlb_ubc_flush_pending at mm/rmap.c:635 (inlined by) try_to_unmap_one at mm/rmap.c:1538 rmap_walk_anon+0x296/0x650 rmap_walk+0xdf/0x100 try_to_unmap+0x18a/0x2f0 shrink_page_list+0xef6/0x2870 shrink_inactive_list+0x316/0x880 shrink_lruvec+0x8dc/0x1380 shrink_node+0x317/0xd80 balance_pgdat+0x652/0xd90 kswapd+0x396/0x8d0 kthread+0x1e0/0x200 ret_from_fork+0x27/0x50 read to 0xffff93f754880bd0 of 1 bytes by task 6364 on cpu 4: flush_tlb_batched_pending+0x29/0x90 flush_tlb_batched_pending at mm/rmap.c:682 change_p4d_range+0x5dd/0x1030 change_pte_range at mm/mprotect.c:44 (inlined by) change_pmd_range at mm/mprotect.c:212 (inlined by) change_pud_range at mm/mprotect.c:240 (inlined by) change_p4d_range at mm/mprotect.c:260 change_protection+0x222/0x310 change_prot_numa+0x3e/0x60 task_numa_work+0x219/0x350 task_work_run+0xed/0x140 prepare_exit_to_usermode+0x2cc/0x2e0 ret_from_intr+0x32/0x42 Reported by Kernel Concurrency Sanitizer on: CPU: 4 PID: 6364 Comm: mtest01 Tainted: G W L 5.5.0-next-20200210+ #5 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 flush_tlb_batched_pending() is under PTL but the write is not, but mm->tlb_flush_batched is only a bool type, so the value is unlikely to be shattered. Thus, mark it as an intentional data race by using the data race macro. Signed-off-by: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Marco Elver <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-08-14mm: replace hpage_nr_pages with thp_nr_pagesMatthew Wilcox (Oracle)1-4/+4
The thp prefix is more frequently used than hpage and we should be consistent between the various functions. [[email protected]: fix mm/migrate.c] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: William Kucharski <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-08-12hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsemMike Kravetz1-1/+1
Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem in at least read mode. This is because the explicit locking in huge_pmd_share (called by huge_pte_alloc) was removed. When restructuring the code, the call to huge_pte_alloc in the else block at the beginning of hugetlb_fault was missed. Unfortunately, that else clause is exercised when there is no page table entry. This will likely lead to a call to huge_pmd_share. If huge_pmd_share thinks pmd sharing is possible, it will traverse the mapping tree (i_mmap) without holding i_mmap_rwsem. If someone else is modifying the tree, bad things such as addressing exceptions or worse could happen. Simply remove the else clause. It should have been removed previously. The code following the else will call huge_pte_alloc with the appropriate locking. To prevent this type of issue in the future, add routines to assert that i_mmap_rwsem is held, and call these routines in huge pmd sharing routines. Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization") Suggested-by: Matthew Wilcox <[email protected]> Signed-off-by: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A.Shutemov" <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Prakash Sangappa <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-09mmap locking API: convert mmap_sem commentsMichel Lespinasse1-6/+6
Convert comments that reference mmap_sem to reference mmap_lock instead. [[email protected]: fix up linux-next leftovers] [[email protected]: s/lockaphore/lock/, per Vlastimil] [[email protected]: more linux-next fixups, per Michel] Signed-off-by: Michel Lespinasse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Daniel Jordan <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ying Han <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: memcontrol: switch to native NR_ANON_THPS counterJohannes Weiner1-3/+3
With rmap memcg locking already in place for NR_ANON_MAPPED, it's just a small step to remove the MEMCG_RSS_HUGE wart and switch memcg to the native NR_ANON_THPS accounting sites. [[email protected]: fixes] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Naresh Kamboju <[email protected]> Reviewed-by: Joonsoo Kim <[email protected]> Acked-by: Randy Dunlap <[email protected]> [build-tested] Cc: Alex Shi <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Balbir Singh <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: memcontrol: switch to native NR_ANON_MAPPED counterJohannes Weiner1-18/+29
Memcg maintains a private MEMCG_RSS counter. This divergence from the generic VM accounting means unnecessary code overhead, and creates a dependency for memcg that page->mapping is set up at the time of charging, so that page types can be told apart. Convert the generic accounting sites to mod_lruvec_page_state and friends to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED. We use lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the same way we do for NR_FILE_MAPPED. With the previous patch removing MEMCG_CACHE and the private NR_SHMEM counter, this patch finally eliminates the need to have page->mapping set up at charge time. However, we need to have page->mem_cgroup set up by the time rmap runs and does the accounting, so switch the commit and the rmap callbacks around. v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo) Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Alex Shi <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Balbir Singh <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-07mm: prevent a warning when casting void* -> enumPalmer Dabbelt1-1/+1
I recently build the RISC-V port with LLVM trunk, which has introduced a new warning when casting from a pointer to an enum of a smaller size. This patch simply casts to a long in the middle to stop the warning. I'd be surprised this is the only one in the kernel, but it's the only one I saw. Signed-off-by: Palmer Dabbelt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-07userfaultfd: wp: support swap and page migrationPeter Xu1-0/+6
For either swap and page migration, we all use the bit 2 of the entry to identify whether this entry is uffd write-protected. It plays a similar role as the existing soft dirty bit in swap entries but only for keeping the uffd-wp tracking for a specific PTE/PMD. Something special here is that when we want to recover the uffd-wp bit from a swap/migration entry to the PTE bit we'll also need to take care of the _PAGE_RW bit and make sure it's cleared, otherwise even with the _PAGE_UFFD_WP bit we can't trap it at all. In change_pte_range() we do nothing for uffd if the PTE is a swap entry. That can lead to data mismatch if the page that we are going to write protect is swapped out when sending the UFFDIO_WRITEPROTECT. This patch also applies/removes the uffd-wp bit even for the swap entries. Signed-off-by: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Bobby Powers <[email protected]> Cc: Brian Geffon <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Denis Plotnikov <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Martin Cracauer <[email protected]> Cc: Marty McFadden <[email protected]> Cc: Maya Gokhale <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Shaohua Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-07mm: remove CONFIG_TRANSPARENT_HUGE_PAGECACHEMatthew Wilcox (Oracle)1-1/+1
Commit e496cf3d7821 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE") notes that it should be reverted when the PowerPC problem was fixed. The commit fixing the PowerPC problem (953c66c2b22a) did not revert the commit; instead setting CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as CONFIG_TRANSPARENT_HUGEPAGE. Checking with Kirill and Aneesh, this was an oversight, so remove the Kconfig symbol and undo the work of commit e496cf3d7821. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Pankaj Gupta <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-07Revert "mm/rmap.c: reuse mergeable anon_vma as parent when fork"Li Xinhai1-13/+0
This reverts commit 4e4a9eb921332b9d1 ("mm/rmap.c: reuse mergeable anon_vma as parent when fork"). In dup_mmap(), anon_vma_fork() is called for attaching anon_vma and parameter 'tmp' (i.e., the new vma of child) has same ->vm_next and ->vm_prev as its parent vma. That causes the anon_vma used by parent been mistakenly shared by child (In anon_vma_clone(), the code added by that commit will do this reuse work). Besides this issue, the design of reusing anon_vma from vma which has gone through fork should be avoided ([1]). So, this patch reverts that commit and maintains the consistent logic of reusing anon_vma for fork/split/merge vma. Reusing anon_vma within the process is fine. But if a vma has gone through fork(), then that vma's anon_vma should not be shared with its neighbor vma. As explained in [1], when vma gone through fork(), the check for list_is_singular(vma->anon_vma_chain) will be false, and don't share anon_vma. With current issue, one example can clarify more. Parent process do below two steps: 1. p_vma_1 is created and p_anon_vma_1 is prepared; 2. p_vma_2 is created and share p_anon_vma_1; (this is allowed, becaues p_vma_1 didn't gothrough fork()); parent process do fork(): 3. c_vma_1 is dup from p_vma_1, and has its own c_anon_vma_1 prepared; at this point, c_vma_1->anon_vma_chain has two items, one for p_anon_vma_1 and one for c_anon_vma_1; 4. c_vma_2 is dup from p_vma_2, it is not allowed to share c_anon_vma_1, because c_vma_1->anon_vma_chain has two items. [1] commit d0e9fe1758f2 ("Simplify and comment on anon_vma re-use for anon_vma_prepare()") explains the test of "list_is_singular()". Fixes: 4e4a9eb92133 ("mm/rmap.c: reuse mergeable anon_vma as parent when fork") Signed-off-by: Li Xinhai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronizationMike Kravetz1-1/+16
Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2. While discussing the issue with huge_pte_offset [1], I remembered that there were more outstanding hugetlb races. These issues are: 1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become invalid via a call to huge_pmd_unshare by another thread. 2) hugetlbfs page faults can race with truncation causing invalid global reserve counts and state. A previous attempt was made to use i_mmap_rwsem in this manner as described at [2]. However, those patches were reverted starting with [3] due to locking issues. To effectively use i_mmap_rwsem to address the above issues it needs to be held (in read mode) during page fault processing. However, during fault processing we need to lock the page we will be adding. Lock ordering requires we take page lock before i_mmap_rwsem. Waiting until after taking the page lock is too late in the fault process for the synchronization we want to do. To address this lock ordering issue, the following patches change the lock ordering for hugetlb pages. This is not too invasive as hugetlbfs processing is done separate from core mm in many places. However, I don't really like this idea. Much ugliness is contained in the new routine hugetlb_page_mapping_lock_write() of patch 1. The only other way I can think of to address these issues is by catching all the races. After catching a race, cleanup, backout, retry ... etc, as needed. This can get really ugly, especially for huge page reservations. At one time, I started writing some of the reservation backout code for page faults and it got so ugly and complicated I went down the path of adding synchronization to avoid the races. Any other suggestions would be welcome. [1] https://lore.kernel.org/linux-mm/[email protected]/ [2] https://lore.kernel.org/linux-mm/[email protected]/ [3] https://lore.kernel.org/linux-mm/[email protected] [4] https://lore.kernel.org/linux-mm/[email protected]/ [5] https://lore.kernel.org/lkml/[email protected]/ This patch (of 2): While looking at BUGs associated with invalid huge page map counts, it was discovered and observed that a huge pte pointer could become 'invalid' and point to another task's page table. Consider the following: A task takes a page fault on a shared hugetlbfs file and calls huge_pte_alloc to get a ptep. Suppose the returned ptep points to a shared pmd. Now, another task truncates the hugetlbfs file. As part of truncation, it unmaps everyone who has the file mapped. If the range being truncated is covered by a shared pmd, huge_pmd_unshare will be called. For all but the last user of the shared pmd, huge_pmd_unshare will clear the pud pointing to the pmd. If the task in the middle of the page fault is not the last user, the ptep returned by huge_pte_alloc now points to another task's page table or worse. This leads to bad things such as incorrect page map/reference counts or invalid memory references. To fix, expand the use of i_mmap_rwsem as follows: - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called. huge_pmd_share is only called via huge_pte_alloc, so callers of huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers of huge_pte_alloc continue to hold the semaphore until finished with the ptep. - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called. One problem with this scheme is that it requires taking i_mmap_rwsem before taking the page lock during page faults. This is not the order specified in the rest of mm code. Handling of hugetlbfs pages is mostly isolated today. Therefore, we use this alternative locking order for PageHuge() pages. mapping->i_mmap_rwsem hugetlb_fault_mutex (hugetlbfs specific page fault mutex) page->flags PG_locked (lock_page) To help with lock ordering issues, hugetlb_page_mapping_lock_write() is introduced to write lock the i_mmap_rwsem associated with a page. In most cases it is easy to get address_space via vma->vm_file->f_mapping. However, in the case of migration or memory errors for anon pages we do not have an associated vma. A new routine _get_hugetlb_page_mapping() will use anon_vma to get address_space in these cases. Signed-off-by: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Prakash Sangappa <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02mm/vma: make is_vma_temporary_stack() available for general useAnshuman Khandual1-15/+1
Currently the declaration and definition for is_vma_temporary_stack() are scattered. Lets make is_vma_temporary_stack() helper available for general use and also drop the declaration from (include/linux/huge_mm.h) which is no longer required. While at this, rename this as vma_is_temporary_stack() in line with existing helpers. This should not cause any functional change. Signed-off-by: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pagesJohn Hubbard1-0/+6
For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily, each tail page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number of huge pages that can be pinned. This patch removes that limitation, by using an exact form of pin counting for compound pages of order > 1. The "order > 1" is required because this approach uses the 3rd struct page in the compound page, and order 1 compound pages only have two pages, so that won't work there. A new struct page field, hpage_pinned_refcount, has been added, replacing a padding field in the union (so no new space is used). This enhancement also has a useful side effect: huge pages and compound pages (of order > 1) do not suffer from the "potential false positives" problem that is discussed in the page_dma_pinned() comment block. That is because these compound pages have extra space for tracking things, so they get exact pin counts instead of overloading page->_refcount. Documentation/core-api/pin_user_pages.rst is updated accordingly. Suggested-by: Jan Kara <[email protected]> Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Al Viro <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Vlastimil Babka <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01mm, thp: do not queue fully unmapped pages for deferred splitKirill A. Shutemov1-4/+10
Adding fully unmapped pages into deferred split queue is not productive: these pages are about to be freed or they are pinned and cannot be split anyway. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Yang Shi <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01mm/rmap.c: use VM_BUG_ON_PAGE() in __page_check_anon_rmap()Yang Shi1-4/+3
The __page_check_anon_rmap() just calls two BUG_ON()s protected by CONFIG_DEBUG_VM, the #ifdef could be eliminated by using VM_BUG_ON_PAGE(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01mm/rmap.c: fix outdated comment in page_get_anon_vma()Miles Chen1-3/+4
Replace DESTROY_BY_RCU with SLAB_TYPESAFE_BY_RCU because SLAB_DESTROY_BY_RCU has been renamed to SLAB_TYPESAFE_BY_RCU by commit 5f0d5a3ae7cf ("mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Miles Chen <[email protected]> Cc: Paul E. McKenney <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01mm/rmap.c: reuse mergeable anon_vma as parent when forkWei Yang1-0/+13
In __anon_vma_prepare(), we will try to find anon_vma if it is possible to reuse it. While on fork, the logic is different. Since commit 5beb49305251 ("mm: change anon_vma linking to fix multi-process server scalability issue"), function anon_vma_clone() tries to allocate new anon_vma for child process. But the logic here will allocate a new anon_vma for each vma, even in parent this vma is mergeable and share the same anon_vma with its sibling. This may do better for scalability issue, while it is not necessary to do so especially after interval tree is used. Commit 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy") tries to reuse some anon_vma by counting child anon_vma and attached vmas. While for those mergeable anon_vmas, we can just reuse it and not necessary to go through the logic. After this change, kernel build test reduces 20% anon_vma allocation. Do the same kernel build test, it shows run time in sys reduced 11.6%. Origin: real 2m50.467s user 17m52.002s sys 1m51.953s real 2m48.662s user 17m55.464s sys 1m50.553s real 2m51.143s user 17m59.687s sys 1m53.600s Patched: real 2m39.933s user 17m1.835s sys 1m38.802s real 2m39.321s user 17m1.634s sys 1m39.206s real 2m39.575s user 17m1.420s sys 1m38.845s Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Acked-by: Konstantin Khlebnikov <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: "Jérôme Glisse" <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Qian Cai <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01mm/rmap.c: don't reuse anon_vma if we just want a copyWei Yang1-9/+15
Before commit 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy"), anon_vma_clone() doesn't change dst->anon_vma. While after this commit, anon_vma_clone() will try to reuse an exist one on forking. But this commit go a little bit further for the case not forking. anon_vma_clone() is called from __vma_split(), __split_vma(), copy_vma() and anon_vma_fork(). For the first three places, the purpose here is get a copy of src and we don't expect to touch dst->anon_vma even it is NULL. While after that commit, it is possible to reuse an anon_vma when dst->anon_vma is NULL. This is not we intend to have. This patch stops reuse of anon_vma for non-fork cases. Link: http://lkml.kernel.org/r/[email protected] Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy") Signed-off-by: Wei Yang <[email protected]> Acked-by: Konstantin Khlebnikov <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: "Jérôme Glisse" <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Qian Cai <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-19mm: include <linux/huge_mm.h> for is_vma_temporary_stackBen Dooks1-0/+1
Include <linux/huge_mm.h> for the definition of is_vma_temporary_stack to fix the following sparse warning: mm/rmap.c:1673:6: warning: symbol 'is_vma_temporary_stack' was not declared. Should it be static? Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ben Dooks <[email protected]> Reviewed-by: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-09-24mm,thp: add read-only THP support for (non-shmem) FSSong Liu1-4/+8
This patch is (hopefully) the first step to enable THP for non-shmem filesystems. This patch enables an application to put part of its text sections to THP via madvise, for example: madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE); We tried to reuse the logic for THP on tmpfs. Currently, write is not supported for non-shmem THP. khugepaged will only process vma with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests (see ksys_mmap_pgoff). The only way to create vma with VM_DENYWRITE is execve(). This requirement limits non-shmem THP to text sections. The next patch will handle writes, which would only happen when the all the vmas with VM_DENYWRITE are unmapped. An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this feature. [[email protected]: fix build without CONFIG_SHMEM] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: fix double unlock in collapse_file()] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: William Kucharski <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-09-24mm: introduce compound_nr()Matthew Wilcox (Oracle)1-2/+1
Replace 1 << compound_order(page) with compound_nr(page). Minor improvements in readability. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-09-24mm: introduce page_size()Matthew Wilcox (Oracle)1-4/+2
Patch series "Make working with compound pages easier", v2. These three patches add three helpers and convert the appropriate places to use them. This patch (of 3): It's unnecessarily hard to find out the size of a potentially huge page. Replace 'PAGE_SIZE << compound_order(page)' with page_size(page). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-09-24mm/rmap.c: remove set but not used variable 'cstart'YueHaibing1-3/+1
Fixes gcc '-Wunused-but-set-variable' warning: mm/rmap.c: In function page_mkclean_one: mm/rmap.c:906:17: warning: variable cstart set but not used [-Wunused-but-set-variable] It is not used any more since commit cdb07bdea28e ("mm/rmap.c: remove redundant variable cend") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: YueHaibing <[email protected]> Reported-by: Hulk Robot <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Kirill Tkhai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-08-13mm/hmm: fix bad subpage pointer in try_to_unmap_oneRalph Campbell1-0/+8
When migrating an anonymous private page to a ZONE_DEVICE private page, the source page->mapping and page->index fields are copied to the destination ZONE_DEVICE struct page and the page_mapcount() is increased. This is so rmap_walk() can be used to unmap and migrate the page back to system memory. However, try_to_unmap_one() computes the subpage pointer from a swap pte which computes an invalid page pointer and a kernel panic results such as: BUG: unable to handle page fault for address: ffffea1fffffffc8 Currently, only single pages can be migrated to device private memory so no subpage computation is needed and it can be set to "page". [[email protected]: add comment] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Fixes: a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration") Signed-off-by: Ralph Campbell <[email protected]> Cc: "Jérôme Glisse" <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jan Kara <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Logan Gunthorpe <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm/rmap.c: use the pra.mapcount to do the checkHuang Shijie1-1/+1
We have the pra.mapcount already, and there is no need to call the page_mapped() which may do some complicated computing for compound page. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Huang Shijie <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm/mmu_notifier: use correct mmu_notifier events for each invalidationJérôme Glisse1-3/+3
This updates each existing invalidation to use the correct mmu notifier event that represent what is happening to the CPU page table. See the patch which introduced the events to see the rational behind this. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Reviewed-by: Ralph Campbell <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Cc: Christian König <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Jan Kara <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Peter Xu <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Dan Williams <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Radim Krcmar <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Christian Koenig <[email protected]> Cc: John Hubbard <[email protected]> Cc: Arnd Bergmann <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm/mmu_notifier: contextual information for event triggering invalidationJérôme Glisse1-2/+4
CPU page table update can happens for many reasons, not only as a result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, ...). Users of mmu notifier API track changes to the CPU page table and take specific action for them. While current API only provide range of virtual address affected by the change, not why the changes is happening. This patchset do the initial mechanical convertion of all the places that calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP event as well as the vma if it is know (most invalidation happens against a given vma). Passing down the vma allows the users of mmu notifier to inspect the new vma page protection. The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier should assume that every for the range is going away when that event happens. A latter patch do convert mm call path to use a more appropriate events for each call. This is done as 2 patches so that no call site is forgotten especialy as it uses this following coccinelle patch: %<---------------------------------------------------------------------- @@ identifier I1, I2, I3, I4; @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1, +enum mmu_notifier_event event, +unsigned flags, +struct vm_area_struct *vma, struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... } @@ @@ -#define mmu_notifier_range_init(range, mm, start, end) +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end) @@ expression E1, E3, E4; identifier I1; @@ <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, I1, I1->vm_mm, E3, E4) ...> @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(..., struct vm_area_struct *VMA, ...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(...) { struct vm_area_struct *VMA; <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN; @@ FN(...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, NULL, E2, E3, E4) ...> } ---------------------------------------------------------------------->% Applied with: spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place spatch --sp-file mmu-notifier.spatch --dir mm --in-place Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Reviewed-by: Ralph Campbell <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Cc: Christian König <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Jan Kara <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Peter Xu <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Dan Williams <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Radim Krcmar <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Christian Koenig <[email protected]> Cc: John Hubbard <[email protected]> Cc: Arnd Bergmann <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm: page_mkclean vs MADV_DONTNEED raceAneesh Kumar K.V1-1/+1
MADV_DONTNEED is handled with mmap_sem taken in read mode. We call page_mkclean without holding mmap_sem. MADV_DONTNEED implies that pages in the region are unmapped and subsequent access to the pages in that range is handled as a new page fault. This implies that if we don't have parallel access to the region when MADV_DONTNEED is run we expect those range to be unallocated. w.r.t page_mkclean() we need to make sure that we don't break the MADV_DONTNEED semantics. MADV_DONTNEED check for pmd_none without holding pmd_lock. This implies we skip the pmd if we temporarily mark pmd none. Avoid doing that while marking the page clean. Keep the sequence same for dax too even though we don't support MADV_DONTNEED for dax mapping The bug was noticed by code review and I didn't observe any failures w.r.t test run. This is similar to commit 58ceeb6bec86d9140f9d91d71a710e963523d063 Author: Kirill A. Shutemov <[email protected]> Date: Thu Apr 13 14:56:26 2017 -0700 thp: fix MADV_DONTNEED vs. MADV_FREE race commit ced108037c2aa542b3ed8b7afd1576064ad1362a Author: Kirill A. Shutemov <[email protected]> Date: Thu Apr 13 14:56:20 2017 -0700 thp: fix MADV_DONTNEED vs. numa balancing race Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Dan Williams <[email protected]> Cc:"Kirill A . Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-03-05mm: remove zone_lru_lock() function, access ->lru_lock directlyAndrey Ryabinin1-1/+1
We have common pattern to access lru_lock from a page pointer: zone_lru_lock(page_zone(page)) Which is silly, because it unfolds to this: &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]->zone_pgdat->lru_lock while we can simply do &NODE_DATA(page_to_nid(page))->lru_lock Remove zone_lru_lock() function, since it's only complicate things. Use 'page_pgdat(page)->lru_lock' pattern instead. [[email protected]: a slightly better version of __split_huge_page()] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Ryabinin <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: William Kucharski <[email protected]> Cc: John Hubbard <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-01-10mm/mmu_notifier: mm/rmap.c: Fix a mmu_notifier range bug in try_to_unmap_oneSean Christopherson1-2/+2
The conversion to use a structure for mmu_notifier_invalidate_range_*() unintentionally changed the usage in try_to_unmap_one() to init the 'struct mmu_notifier_range' with vma->vm_start instead of @address, i.e. it invalidates the wrong address range. Revert to the correct address range. Manifests as KVM use-after-free WARNINGs and subsequent "BUG: Bad page state in process X" errors when reclaiming from a KVM guest due to KVM removing the wrong pages from its own mappings. Reported-by: [email protected] Reported-by: Mike Galbraith <[email protected]> Reported-and-tested-by: Adam Borowski <[email protected]> Reviewed-by: Jérôme Glisse <[email protected]> Reviewed-by: Pankaj gupta <[email protected]> Cc: Christian König <[email protected]> Cc: Jan Kara <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Dan Williams <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: John Hubbard <[email protected]> Cc: Andrew Morton <[email protected]> Fixes: ac46d4f3c432 ("mm/mmu_notifier: use structure for invalidate_range_start/end calls v2") Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-01-08hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization"Mike Kravetz1-4/+0
This reverts b43a9990055958e70347c56f90ea2ae32c67334c The reverted commit caused issues with migration and poisoning of anon huge pages. The LTP move_pages12 test will cause an "unable to handle kernel NULL pointer" BUG would occur with stack similar to: RIP: 0010:down_write+0x1b/0x40 Call Trace: migrate_pages+0x81f/0xb90 __ia32_compat_sys_migrate_pages+0x190/0x190 do_move_pages_to_node.isra.53.part.54+0x2a/0x50 kernel_move_pages+0x566/0x7b0 __x64_sys_move_pages+0x24/0x30 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The purpose of the reverted patch was to fix some long existing races with huge pmd sharing. It used i_mmap_rwsem for this purpose with the idea that this could also be used to address truncate/page fault races with another patch. Further analysis has determined that i_mmap_rwsem can not be used to address all these hugetlbfs synchronization issues. Therefore, revert this patch while working an another approach to the underlying issues. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Reported-by: Jan Stancek <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Prakash Sangappa <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-12-28hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronizationMike Kravetz1-0/+4
While looking at BUGs associated with invalid huge page map counts, it was discovered and observed that a huge pte pointer could become 'invalid' and point to another task's page table. Consider the following: A task takes a page fault on a shared hugetlbfs file and calls huge_pte_alloc to get a ptep. Suppose the returned ptep points to a shared pmd. Now, another task truncates the hugetlbfs file. As part of truncation, it unmaps everyone who has the file mapped. If the range being truncated is covered by a shared pmd, huge_pmd_unshare will be called. For all but the last user of the shared pmd, huge_pmd_unshare will clear the pud pointing to the pmd. If the task in the middle of the page fault is not the last user, the ptep returned by huge_pte_alloc now points to another task's page table or worse. This leads to bad things such as incorrect page map/reference counts or invalid memory references. To fix, expand the use of i_mmap_rwsem as follows: - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called. huge_pmd_share is only called via huge_pte_alloc, so callers of huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers of huge_pte_alloc continue to hold the semaphore until finished with the ptep. - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called. [[email protected]: add explicit check for mapping != null] Link: http://lkml.kernel.org/r/[email protected] Fixes: 39dde65c9940 ("shared page table for hugetlb page") Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Prakash Sangappa <[email protected]> Cc: Colin Ian King <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-12-28mm: remove __hugepage_set_anon_rmap()Kirill Tkhai1-21/+4
This function is identical to __page_set_anon_rmap() since the time, when it was introduced (8 years ago). The patch removes the function, and makes its users to use __page_set_anon_rmap() instead. Link: http://lkml.kernel.org/r/154504875359.30235.6237926369392564851.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Jerome Glisse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-12-28mm/mmu_notifier: use structure for invalidate_range_start/end calls v2Jérôme Glisse1-12/+18
To avoid having to change many call sites everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end cakks. No functional changes with this patch. [[email protected]: coding style fixes] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Acked-by: Christian König <[email protected]> Acked-by: Jan Kara <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Dan Williams <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Radim Krcmar <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: John Hubbard <[email protected]> From: Jérôme Glisse <[email protected]> Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3 fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-11-30mm/huge_memory: rename freeze_page() to unmap_page()Hugh Dickins1-10/+3
The term "freeze" is used in several ways in the kernel, and in mm it has the particular meaning of forcing page refcount temporarily to 0. freeze_page() is just too confusing a name for a function that unmaps a page: rename it unmap_page(), and rename unfreeze_page() remap_page(). Went to change the mention of freeze_page() added later in mm/rmap.c, but found it to be incorrect: ordinary page reclaim reaches there too; but the substance of the comment still seems correct, so edit it down. Link: http://lkml.kernel.org/r/[email protected] Fixes: e9b61f19858a5 ("thp: reintroduce split_huge_page()") Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: <[email protected]> [4.8+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-05mm: migration: fix migration of huge PMD shared pagesMike Kravetz1-3/+39
The page migration code employs try_to_unmap() to try and unmap the source page. This is accomplished by using rmap_walk to find all vmas where the page is mapped. This search stops when page mapcount is zero. For shared PMD huge pages, the page map count is always 1 no matter the number of mappings. Shared mappings are tracked via the reference count of the PMD page. Therefore, try_to_unmap stops prematurely and does not completely unmap all mappings of the source page. This problem can result is data corruption as writes to the original source page can happen after contents of the page are copied to the target page. Hence, data is lost. This problem was originally seen as DB corruption of shared global areas after a huge page was soft offlined due to ECC memory errors. DB developers noticed they could reproduce the issue by (hotplug) offlining memory used to back huge pages. A simple testcase can reproduce the problem by creating a shared PMD mapping (note that this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using migrate_pages() to migrate process pages between nodes while continually writing to the huge pages being migrated. To fix, have the try_to_unmap_one routine check for huge PMD sharing by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a shared mapping it will be 'unshared' which removes the page table entry and drops the reference on the PMD page. After this, flush caches and TLB. mmu notifiers are called before locking page tables, but we can not be sure of PMD sharing until page tables are locked. Therefore, check for the possibility of PMD sharing before locking so that notifiers can prepare for the worst possible case. Link: http://lkml.kernel.org/r/[email protected] [[email protected]: make _range_in_vma() a static inline] Link: http://lkml.kernel.org/r/[email protected] Fixes: 39dde65c9940 ("shared page table for hugetlb page") Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2018-07-14mm: do not drop unused pages when userfaultd is runningChristian Borntraeger1-1/+7
KVM guests on s390 can notify the host of unused pages. This can result in pte_unused callbacks to be true for KVM guest memory. If a page is unused (checked with pte_unused) we might drop this page instead of paging it. This can have side-effects on userfaultd, when the page in question was already migrated: The next access of that page will trigger a fault and a user fault instead of faulting in a new and empty zero page. As QEMU does not expect a userfault on an already migrated page this migration will fail. The most straightforward solution is to ignore the pte_unused hint if a userfault context is active for this VMA. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christian Borntraeger <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Janosch Frank <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Cornelia Huck <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-27Merge tag 'v4.17-rc2' into docs-nextJonathan Corbet1-3/+0
Merge -rc2 to pick up the changes to Documentation/core-api/kernel-api.rst that hit mainline via the networking tree. In their absence, subsequent patches cannot be applied.
2018-04-20mm: enable thp migration for shmem thpNaoya Horiguchi1-3/+0
My testing for the latest kernel supporting thp migration showed an infinite loop in offlining the memory block that is filled with shmem thps. We can get out of the loop with a signal, but kernel should return with failure in this case. What happens in the loop is that scan_movable_pages() repeats returning the same pfn without any progress. That's because page migration always fails for shmem thps. In memory offline code, memory blocks containing unmovable pages should be prevented from being offline targets by has_unmovable_pages() inside start_isolate_page_range(). So it's possible to change migratability for non-anonymous thps to avoid the issue, but it introduces more complex and thp-specific handling in migration code, so it might not good. So this patch is suggesting to fix the issue by enabling thp migration for shmem thp. Both of anon/shmem thp are migratable so we don't need precheck about the type of thps. Link: http://lkml.kernel.org/r/[email protected] Fixes: commit 72b39cfc4d75 ("mm, memory_hotplug: do not fail offlining too early") Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Zi Yan <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-16Merge branch 'mm-rst' into docs-nextJonathan Corbet1-3/+3
Mike Rapoport says: These patches convert files in Documentation/vm to ReST format, add an initial index and link it to the top level documentation. There are no contents changes in the documentation, except few spelling fixes. The relatively large diffstat stems from the indentation and paragraph wrapping changes. I've tried to keep the formatting as consistent as possible, but I could miss some places that needed markup and add some markup where it was not necessary. [jc: significant conflicts in vm/hmm.rst]
2018-04-16docs/vm: rename documentation files to .rstMike Rapoport1-3/+3
Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Jonathan Corbet <[email protected]>
2018-04-11page cache: use xa_lockMatthew Wilcox1-2/+2
Remove the address_space ->tree_lock and use the xa_lock newly added to the radix_tree_root. Rename the address_space ->page_tree to ->i_pages, since we don't really care that it's a tree. [[email protected]: fix nds32, fs/dax.c] Link: http://lkml.kernel.org/r/[email protected]: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Jeff Layton <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Ryusuke Konishi <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm: kernel-doc: add missing parameter descriptionsMike Rapoport1-0/+1
Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-03-18mm, swap: Add infrastructure for saving page metadata on swapKhalid Aziz1-0/+14
If a processor supports special metadata for a page, for example ADI version tags on SPARC M7, this metadata must be saved when the page is swapped out. The same metadata must be restored when the page is swapped back in. This patch adds two new architecture specific functions - arch_do_swap_page() to be called when a page is swapped in, and arch_unmap_one() to be called when a page is being unmapped for swap out. These architecture hooks allow page metadata to be saved if the architecture supports it. Signed-off-by: Khalid Aziz <[email protected]> Cc: Khalid Aziz <[email protected]> Acked-by: Jerome Marchand <[email protected]> Reviewed-by: Anthony Yznaga <[email protected]> Acked-by: Andrew Morton <[email protected]> Signed-off-by: David S. Miller <[email protected]>