aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-03-22mm/thp: refix __split_huge_pmd_locked() for migration PMDHugh Dickins1-2/+2
Migration entries do not contribute to a page's reference count: move __split_huge_pmd_locked()'s page_ref_add() into pmd_migration's else block (along with the page_count() check - a page is quite likely to have reference count frozen to 0 when a migration entry is found). This will fix a very rare anonymous memory leak, after a split_huge_pmd() raced with an anon split_huge_page() or an anon THP migrate_pages(): since the wrongly raised refcount stopped the page (perhaps small, perhaps huge, depending on when the race hit) from ever being freed. At first I thought there were worse risks, from prematurely unfreezing a frozen page: but now think that would only affect page cache pages, which do not come this way (except for anonymous pages in swap cache, perhaps). Link: https://lkml.kernel.org/r/[email protected] Fixes: ec0abae6dcdf ("mm/thp: fix __split_huge_pmd_locked() for migration PMD") Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: Yang Shi <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Zi Yan <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/migrate: fix race between lock page and clear PG_Isolatedandrew.yang2-7/+7
When memory is tight, system may start to compact memory for large continuous memory demands. If one process tries to lock a memory page that is being locked and isolated for compaction, it may wait a long time or even forever. This is because compaction will perform non-atomic PG_Isolated clear while holding page lock, this may overwrite PG_waiters set by the process that can't obtain the page lock and add itself to the waiting queue to wait for the lock to be unlocked. CPU1 CPU2 lock_page(page); (successful) lock_page(); (failed) __ClearPageIsolated(page); SetPageWaiters(page) (may be overwritten) unlock_page(page); The solution is to not perform non-atomic operation on page flags while holding page lock. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: andrew.yang <[email protected]> Cc: Matthias Brugger <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Vlastimil Babka" <[email protected]> Cc: David Howells <[email protected]> Cc: "William Kucharski" <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Yang Shi <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Nicholas Tang <[email protected]> Cc: Kuan-Ying Lee <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm,migrate: fix establishing demotion targetHuang Ying1-2/+5
In commit ac16ec835314 ("mm: migrate: support multiple target nodes demotion"), after the first demotion target node is found, we will continue to check the next candidate obtained via find_next_best_node(). This is to find all demotion target nodes with same NUMA distance. But one side effect of find_next_best_node() is that the candidate node returned will be set in "used" parameter, even if the candidate node isn't passed in the following NUMA distance checking, the candidate node will not be used as demotion target node for the following nodes. For example, for system as follows, node distances: node 0 1 2 3 0: 10 21 17 28 1: 21 10 28 17 2: 17 28 10 28 3: 28 17 28 10 when we establish demotion target node for node 0, in the first round node 2 is added to the demotion target node set. Then in the second round, node 3 is checked and failed because distance(0, 3) > distance(0, 2). But node 3 is set in "used" nodemask too. When we establish demotion target node for node 1, there is no available node. This is wrong, node 3 should be set as the demotion target of node 1. To fix this, if the candidate node is failed to pass the distance checking, it will be cleared in "used" nodemask. So that it can be used for the following node. The bug can be reproduced and fixed with this patch on a 2 socket server machine with DRAM and PMEM. Link: https://lkml.kernel.org/r/[email protected] Fixes: ac16ec835314 ("mm: migrate: support multiple target nodes demotion") Signed-off-by: "Huang, Ying" <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Zi Yan <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Yang Shi <[email protected]> Cc: zhongjiang-ali <[email protected]> Cc: Xunlei Pang <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/oom_kill: remove unneeded is_memcg_oom checkMiaohe Lin1-3/+0
oom_cpuset_eligible() is always called when !is_memcg_oom(). Remove this unnecessary check. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mempolicy: mbind_range() set_policy() after vma_merge()Hugh Dickins1-7/+1
v2.6.34 commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") introduced vma_merge() to mbind_range(); but unlike madvise, mlock and mprotect, it put a "continue" to next vma where its precedents go to update flags on current vma before advancing: that left vma with the wrong setting in the infamous vma_merge() case 8. v3.10 commit 1444f92c8498 ("mm: merging memory blocks resets mempolicy") tried to fix that in vma_adjust(), without fully understanding the issue. v3.11 commit 3964acd0dbec ("mm: mempolicy: fix mbind_range() && vma_adjust() interaction") reverted that, and went about the fix in the right way, but chose to optimize out an unnecessary mpol_dup() with a prior mpol_equal() test. But on tmpfs, that also pessimized out the vital call to its ->set_policy(), leaving the new mbind unenforced. The user visible effect was that the pages got allocated on the local node (happened to be 0), after the mbind() caller had specifically asked for them to be allocated on node 1. There was not any page migration involved in the case reported: the pages simply got allocated on the wrong node. Just delete that optimization now (though it could be made conditional on vma not having a set_policy). Also remove the "next" variable: it turned out to be blameless, but also pointless. Link: https://lkml.kernel.org/r/[email protected] Fixes: 3964acd0dbec ("mm: mempolicy: fix mbind_range() && vma_adjust() interaction") Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: compaction: cleanup the compaction trace eventsBaolin Wang2-19/+16
As Steven suggested [1], we should access the pointers from the trace event to avoid dereferencing them to the tracepoint function when the tracepoint is disabled. [1] https://lkml.org/lkml/2021/11/3/409 Link: https://lkml.kernel.org/r/4cd393b4d57f8f01ed72c001509b28e3a3b1a8c1.1646985115.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: vmscan: fix documentation for page_check_references()Charan Teja Kalla1-1/+1
Commit b518154e59aa ("mm/vmscan: protect the workingset on anonymous LRU") requires to look twice for both mapped anon/file pages are used more than once to take the decission of reclaim or activation. Correct the documentation accordingly. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Charan Teja Kalla <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: workingset: replace IRQ-off check with a lockdep assert.Sebastian Andrzej Siewior1-1/+4
Commit 68d48e6a2df57 ("mm: workingset: add vmstat counter for shadow nodes") introduced an IRQ-off check to ensure that a lock is held which also disabled interrupts. This does not work the same way on PREEMPT_RT because none of the locks, that are held, disable interrupts. Replace this check with a lockdep assert which ensures that the lock is held. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Zefan Li <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: lru_cache_disable: replace work queue synchronization with synchronize_rcuMarcelo Tosatti1-9/+14
On systems that run FIFO:1 applications that busy loop, any SCHED_OTHER task that attempts to execute on such a CPU (such as work threads) will not be scheduled, which leads to system hangs. Commit d479960e44f27e0e5 ("mm: disable LRU pagevec during the migration temporarily") relies on queueing work items on all online CPUs to ensure visibility of lru_disable_count. To fix this, replace the usage of work items with synchronize_rcu, which provides the same guarantees. Readers of lru_disable_count are protected by either disabling preemption or rcu_read_lock: preempt_disable, local_irq_disable [bh_lru_lock()] rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT] preempt_disable [local_lock !CONFIG_PREEMPT_RT] Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on preempt_disable() regions of code. So any CPU which sees lru_disable_count = 0 will have exited the critical section when synchronize_rcu() returns. Link: https://lkml.kernel.org/r/Yin7hDxdt0s/[email protected] Signed-off-by: Marcelo Tosatti <[email protected]> Reviewed-by: Nicolas Saenz Julienne <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Paul E. McKenney <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/list_lru: optimize memcg_reparent_list_lru_node()Waiman Long1-0/+6
Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node() to be race free"), we are tracking the total number of lru entries in a list_lru_node in its nr_items field. In the case of memcg_reparent_list_lru_node(), there is nothing to be done if nr_items is 0. We don't even need to take the nlru->lock as no new lru entry could be added by a racing list_lru_add() to the draining src_idx memcg at this point. On systems that serve a lot of containers, it is possible that there can be thousands of list_lru's present due to the fact that each container may mount its own container specific filesystems. As a typical container uses only a few cpus, it is likely that only the list_lru_node that contains those cpus will be utilized while the rests may be empty. In other words, there can be a lot of list_lru_node with 0 nr_items. By skipping a lock/unlock operation and loading a cacheline from memcg_lrus, a sizeable number of cpu cycles can be saved. That can be substantial if we are talking about thousands of list_lru_node's with 0 nr_items. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Waiman Long <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Cc: Muchun Song <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: __isolate_lru_page_prepare() in isolate_migratepages_block()Hugh Dickins3-91/+62
__isolate_lru_page_prepare() conflates two unrelated functions, with the flags to one disjoint from the flags to the other; and hides some of the important checks outside of isolate_migratepages_block(), where the sequence is better to be visible. It comes from the days of lumpy reclaim, before compaction, when the combination made more sense. Move what's needed by mm/compaction.c isolate_migratepages_block() inline there, and what's needed by mm/vmscan.c isolate_lru_pages() inline there. Shorten "isolate_mode" to "mode", so the sequence of conditions is easier to read. Declare a "mapping" variable, to save one call to page_mapping() (but not another: calling again after page is locked is necessary). Simplify isolate_lru_pages() with a "move_to" list pointer. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Acked-by: David Rientjes <[email protected]> Reviewed-by: Alex Shi <[email protected]> Cc: Alexander Duyck <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/fs: delete PF_SWAPWRITEHugh Dickins5-18/+3
PF_SWAPWRITE has been redundant since v3.2 commit ee72886d8ed5 ("mm: vmscan: do not writeback filesystem pages in direct reclaim"). Coincidentally, NeilBrown's current patch "remove inode_congested()" deletes may_write_to_inode(), which appeared to be the one function which took notice of PF_SWAPWRITE. But if you study the old logic, and the conditions under which may_write_to_inode() was called, you discover that flag and function have been pointless for a decade. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Cc: NeilBrown <[email protected]> Cc: Jan Kara <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22userfaultfd/selftests: fix uninitialized_var.cocci warningGuo Zhengkui1-1/+1
Fix following coccicheck warning: tools/testing/selftests/vm/userfaultfd.c:556:23-24: WARNING this kind of initialization is deprecated `unsigned long page_nr = *(&page_nr)` has the same form of uninitialized_var() macro. I remove the redundant assignement. It has been tested with gcc (Debian 8.3.0-6) 8.3.0. The patch which removed uninitialized_var() is: https://lore.kernel.org/all/[email protected]/ And there is very few "/* GCC */" comments in the Linux kernel code now. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Guo Zhengkui <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22userfaultfd: provide unmasked address on page-faultNadav Amit6-5/+19
Userfaultfd is supposed to provide the full address (i.e., unmasked) of the faulting access back to userspace. However, that is not the case for quite some time. Even running "userfaultfd_demo" from the userfaultfd man page provides the wrong output (and contradicts the man page). Notice that "UFFD_EVENT_PAGEFAULT event" shows the masked address (7fc5e30b3000) and not the first read address (0x7fc5e30b300f). Address returned by mmap() = 0x7fc5e30b3000 fault_handler_thread(): poll() returns: nready = 1; POLLIN = 1; POLLERR = 0 UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fc5e30b3000 (uffdio_copy.copy returned 4096) Read address 0x7fc5e30b300f in main(): A Read address 0x7fc5e30b340f in main(): A Read address 0x7fc5e30b380f in main(): A Read address 0x7fc5e30b3c0f in main(): A The exact address is useful for various reasons and specifically for prefetching decisions. If it is known that the memory is populated by certain objects whose size is not page-aligned, then based on the faulting address, the uffd-monitor can decide whether to prefetch and prefault the adjacent page. This bug has been for quite some time in the kernel: since commit 1a29d85eb0f1 ("mm: use vmf->address instead of of vmf->virtual_address") vmf->virtual_address"), which dates back to 2016. A concern has been raised that existing userspace application might rely on the old/wrong behavior in which the address is masked. Therefore, it was suggested to provide the masked address unless the user explicitly asks for the exact address. Add a new userfaultfd feature UFFD_FEATURE_EXACT_ADDRESS to direct userfaultfd to provide the exact address. Add a new "real_address" field to vmf to hold the unmasked address. Provide the address to userspace accordingly. Initialize real_address in various code-paths to be consistent with address, even when it is not used, to be on the safe side. [[email protected]: initialize real_address on all code paths, per Jan] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix typo in comment, per Jan] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nadav Amit <[email protected]> Acked-by: Peter Xu <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Mike Rapoport <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: remove unneeded local variable follflagsMiaohe Lin2-6/+2
We can pass FOLL_GET | FOLL_DUMP to follow_page directly to simplify the code a bit in add_page_for_migration and split_huge_pages_pid. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/hugetlb.c: export PageHeadHuge()David Howells1-0/+1
Export PageHeadHuge() - it's used by folio_test_hugetlb() and thence by such as folio_file_page() and folio_contains(). Matthew suggested I use the first of those instead of doing the same calculation manually - but I can't call it from a module. Kirill suggested rearranging things to put it in a header, but that introduces header dependencies because of where constants are defined. [[email protected]: s/EXPORT_SYMBOL/EXPORT_SYMBOL_GPL/, per Christoph] Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/163707085314.3221130.14783857863702203440.stgit@warthog.procyon.org.uk/ Signed-off-by: David Howells <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/hugetlb: use helper macro __ATTR_RWMiaohe Lin1-2/+1
Use helper macro __ATTR_RW to define HSTATE_ATTR to make code more clear. Minor readability improvement. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22hugetlb: clean up potential spectre issue warningsMike Kravetz1-3/+4
Recently introduced code allows numa nodes to be specified on the kernel command line for hugetlb allocations or CMA reservations. The node values are user specified and used as indicies into arrays. This generated the following smatch warnings: mm/hugetlb.c:4170 hugepages_setup() warn: potential spectre issue 'default_hugepages_in_node' [w] mm/hugetlb.c:4172 hugepages_setup() warn: potential spectre issue 'parsed_hstate->max_huge_pages_node' [w] mm/hugetlb.c:6898 cmdline_parse_hugetlb_cma() warn: potential spectre issue 'hugetlb_cma_size_in_node' [w] (local cap) Clean up by using array_index_nospec to sanitize array indicies. The routine cmdline_parse_hugetlb_cma has the same overflow/truncation issue addressed in [1]. That is also fixed with this change. [1] https://lore.kernel.org/linux-mm/[email protected]/ As Michal pointed out, this is unlikely to be exploitable because it is __init code. But the patch suppresses the warnings. [[email protected]: v2] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Zhenguo Yao <[email protected]> Cc: Liu Yuntao <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/hugetlb: generalize ARCH_WANT_GENERAL_HUGETLBAnshuman Khandual4-9/+6
ARCH_WANT_GENERAL_HUGETLB config has duplicate definitions on platforms that subscribe it. Instead make it a generic config option which can be selected on applicable platforms when required. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: sparsemem: move vmemmap related to HugeTLB to ↵Muchun Song2-0/+4
CONFIG_HUGETLB_PAGE_FREE_VMEMMAP The vmemmap_remap_free/alloc are relevant to HugeTLB, so move those functiongs to the scope of CONFIG_HUGETLB_PAGE_FREE_VMEMMAP. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Barry Song <[email protected]> Cc: Bodeddula Balasubramaniam <[email protected]> Cc: Chen Huang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22selftests: vm: add a hugetlb test caseMuchun Song4-0/+157
Since the head vmemmap page frame associated with each HugeTLB page is reused, we should hide the PG_head flag of tail struct page from the user. Add a tese case to check whether it is work properly. The test steps are as follows. 1) alloc 2MB hugeTLB 2) get each page frame 3) apply those APIs in each page frame 4) Those APIs work completely the same as before. Reading the flags of a page by /proc/kpageflags is done in stable_page_flags(), which has invoked PageHead(), PageTail(), PageCompound() and compound_head(). If those APIs work properly, the head page must have 15 and 17 bits set. And tail pages must have 16 and 17 bits set but 15 bit unset. Those flags are checked in check_page_flags(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Barry Song <[email protected]> Cc: Bodeddula Balasubramaniam <[email protected]> Cc: Chen Huang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: sparsemem: use page table lock to protect kernel pmd operationsMuchun Song2-20/+43
The init_mm.page_table_lock is used to protect kernel page tables, we can use it to serialize splitting vmemmap PMD mappings instead of mmap write lock, which can increase the concurrency of vmemmap_remap_free(). Actually, It increase the concurrency between allocations of HugeTLB pages. But it is not the only benefit. There are a lot of users of mmap read lock of init_mm. The mmap write lock is holding through vmemmap_remap_free(), removing mmap write lock usage to make it does not affect other users of mmap read lock. It is not making anything worse and always a win to move. Now the kernel page table walker does not hold the page_table_lock when walking pmd entries. There may be consistency issue of a pmd entry, because pmd entry might change from a huge pmd entry to a PTE page table. There is only one user of kernel page table walker, namely ptdump. The ptdump already considers the consistency, which use a local variable to cache the value of pmd entry. But we also need to update ->action to ACTION_CONTINUE to make sure the walker does not walk every pte entry again when concurrent thread has split the huge pmd. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Barry Song <[email protected]> Cc: Bodeddula Balasubramaniam <[email protected]> Cc: Chen Huang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_keyMuchun Song4-15/+21
The page_fixed_fake_head() is used throughout memory management and the conditional check requires checking a global variable, although the overhead of this check may be small, it increases when the memory cache comes under pressure. Also, the global variable will not be modified after system boot, so it is very appropriate to use static key machanism. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Barry Song <[email protected]> Cc: Bodeddula Balasubramaniam <[email protected]> Cc: Chen Huang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: hugetlb: free the 2nd vmemmap page associated with each HugeTLB pageMuchun Song4-33/+130
Patch series "Free the 2nd vmemmap page associated with each HugeTLB page", v7. This series can minimize the overhead of struct page for 2MB HugeTLB pages significantly. It further reduces the overhead of struct page by 12.5% for a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB HugeTLB. It is a nice gain. Comments and reviews are welcome. Thanks. The main implementation and details can refer to the commit log of patch 1. In this series, I have changed the following four helpers, the following table shows the impact of the overhead of those helpers. +------------------+-----------------------+ | APIs | head page | tail page | +------------------+-----------+-----------+ | PageHead() | Y | N | +------------------+-----------+-----------+ | PageTail() | Y | N | +------------------+-----------+-----------+ | PageCompound() | N | N | +------------------+-----------+-----------+ | compound_head() | Y | N | +------------------+-----------+-----------+ Y: Overhead is increased. N: Overhead is _NOT_ increased. It shows that the overhead of those helpers on a tail page don't change between "hugetlb_free_vmemmap=on" and "hugetlb_free_vmemmap=off". But the overhead on a head page will be increased when "hugetlb_free_vmemmap=on" (except PageCompound()). So I believe that Matthew Wilcox's folio series will help with this. The users of PageHead() and PageTail() are much less than compound_head() and most users of PageTail() are VM_BUG_ON(), so I have done some tests about the overhead of compound_head() on head pages. I have tested the overhead of calling compound_head() on a head page, which is 2.11ns (Measure the call time of 10 million times compound_head(), and then average). For a head page whose address is not aligned with PAGE_SIZE or a non-compound page, the overhead of compound_head() is 2.54ns which is increased by 20%. For a head page whose address is aligned with PAGE_SIZE, the overhead of compound_head() is 2.97ns which is increased by 40%. Most pages are the former. I do not think the overhead is significant since the overhead of compound_head() itself is low. This patch (of 5): This patch minimizes the overhead of struct page for 2MB HugeTLB pages significantly. It further reduces the overhead of struct page by 12.5% for a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB HugeTLB (2MB type). After the feature of "Free sonme vmemmap pages of HugeTLB page" is enabled, the mapping of the vmemmap addresses associated with a 2MB HugeTLB page becomes the figure below. HugeTLB struct pages(8 pages) page frame(8 pages) +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+---> PG_head | | | 0 | -------------> | 0 | | | +-----------+ +-----------+ | | | 1 | -------------> | 1 | | | +-----------+ +-----------+ | | | 2 | ----------------^ ^ ^ ^ ^ ^ | | +-----------+ | | | | | | | | 3 | ------------------+ | | | | | | +-----------+ | | | | | | | 4 | --------------------+ | | | | 2MB | +-----------+ | | | | | | 5 | ----------------------+ | | | | +-----------+ | | | | | 6 | ------------------------+ | | | +-----------+ | | | | 7 | --------------------------+ | | +-----------+ | | | | | | +-----------+ As we can see, the 2nd vmemmap page frame (indexed by 1) is reused and remaped. However, the 2nd vmemmap page frame is also can be freed to the buddy allocator, then we can change the mapping from the figure above to the figure below. HugeTLB struct pages(8 pages) page frame(8 pages) +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+---> PG_head | | | 0 | -------------> | 0 | | | +-----------+ +-----------+ | | | 1 | ---------------^ ^ ^ ^ ^ ^ ^ | | +-----------+ | | | | | | | | | 2 | -----------------+ | | | | | | | +-----------+ | | | | | | | | 3 | -------------------+ | | | | | | +-----------+ | | | | | | | 4 | ---------------------+ | | | | 2MB | +-----------+ | | | | | | 5 | -----------------------+ | | | | +-----------+ | | | | | 6 | -------------------------+ | | | +-----------+ | | | | 7 | ---------------------------+ | | +-----------+ | | | | | | +-----------+ After we do this, all tail vmemmap pages (1-7) are mapped to the head vmemmap page frame (0). In other words, there are more than one page struct with PG_head associated with each HugeTLB page. We __know__ that there is only one head page struct, the tail page structs with PG_head are fake head page structs. We need an approach to distinguish between those two different types of page structs so that compound_head(), PageHead() and PageTail() can work properly if the parameter is the tail page struct but with PG_head. The following code snippet describes how to distinguish between real and fake head page struct. if (test_bit(PG_head, &page->flags)) { unsigned long head = READ_ONCE(page[1].compound_head); if (head & 1) { if (head == (unsigned long)page + 1) ==> head page struct else ==> tail page struct } else ==> head page struct } We can safely access the field of the @page[1] with PG_head because the @page is a compound page composed with at least two contiguous pages. [[email protected]: restore lost comment changes] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Barry Song <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Chen Huang <[email protected]> Cc: Bodeddula Balasubramaniam <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Qi Zheng <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/mlock: fix potential imbalanced rlimit ucounts adjustmentMiaohe Lin1-0/+1
user_shm_lock forgets to set allowed to 0 when get_ucounts fails. So the later user_shm_unlock might do the extra dec_rlimit_ucounts. Fix this by resetting allowed to 0. Link: https://lkml.kernel.org/r/[email protected] Fixes: d7c9e99aee48 ("Reimplement RLIMIT_MEMLOCK on top of ucounts") Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Herbert van den Bergh <[email protected]> Cc: Chris Mason <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm, fault-injection: declare should_fail_alloc_page()Vlastimil Babka1-0/+2
The mm/ directory can almost fully be built with W=1, which would help in local development. One remaining issue is missing prototype for should_fail_alloc_page(). Thus add it next to the should_failslab() prototype. Note the previous attempt by commit f7173090033c ("mm/page_alloc: make should_fail_alloc_page() static") had to be reverted by commit 54aa386661fe as it caused an unresolved symbol error with CONFIG_DEBUG_INFO_BTF=y Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memory-failure.c: make non-LRU movable pages unhandlableMiaohe Lin1-7/+13
We can not really handle non-LRU movable pages in memory failure. Typically they are balloon, zsmalloc, etc. Assuming we run into a base (4K) non-LRU movable page, we could reach as far as identify_page_state(), it should not fall into any category except me_unknown. For the non-LRU compound movable pages, they could be taken for transhuge pages but it's unexpected to split non-LRU movable pages using split_huge_page_to_list in memory_failure. So we could just simply make non-LRU movable pages unhandlable to avoid these possible nasty cases. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Suggested-by: Yang Shi <[email protected]> Reviewed-by: Yang Shi <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Tony Luck <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memory-failure.c: avoid calling invalidate_inode_page() with unexpected pagesMiaohe Lin1-1/+1
Since commit 042c4f32323b ("mm/truncate: Inline invalidate_complete_page() into its one caller"), invalidate_inode_page() can invalidate the pages in the swap cache because the check of page->mapping != mapping is removed. But invalidate_inode_page() is not expected to deal with the pages in swap cache. Also non-lru movable page can reach here too. They're not page cache pages. Skip these pages by checking PageSwapCache and PageLRU. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Tony Luck <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memory-failure.c: fix race with changing page compound againMiaohe Lin3-0/+14
Patch series "A few fixup patches for memory failure", v2. This series contains a few patches to fix the race with changing page compound page, make non-LRU movable pages unhandlable and so on. More details can be found in the respective changelogs. There is a race window where we got the compound_head, the hugetlb page could be freed to buddy, or even changed to another compound page just before we try to get hwpoison page. Think about the below race window: CPU 1 CPU 2 memory_failure_hugetlb struct page *head = compound_head(p); hugetlb page might be freed to buddy, or even changed to another compound page. get_hwpoison_page -- page is not what we want now... If this race happens, just bail out. Also MF_MSG_DIFFERENT_PAGE_SIZE is introduced to record this event. [[email protected]: s@/**@/*@, per Naoya Horiguchi] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Tony Luck <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/hwpoison: add in-use hugepage hwpoison filter judgementluofei1-0/+8
After successfully obtaining the reference count of the huge page, it is still necessary to call hwpoison_filter() to make a filter judgement, otherwise the filter hugepage will be unmaped and the related process may be killed. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: luofei <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handlerluofei5-6/+18
When the hwpoison page meets the filter conditions, it should not be regarded as successful memory_failure() processing for mce handler, but should return a distinct value, otherwise mce handler regards the error page has been identified and isolated, which may lead to calling set_mce_nospec() to change page attribute, etc. Here memory_failure() return -EOPNOTSUPP to indicate that the error event is filtered, mce handler should not take any action for this situation and hwpoison injector should treat as correct. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: luofei <[email protected]> Acked-by: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/hwpoison-inject: support injecting hwpoison to free pageMiaohe Lin2-2/+3
memory_failure() can handle free buddy page. Support injecting hwpoison to free page by adding is_free_buddy_page check when hwpoison filter is disabled. [[email protected]: export is_free_buddy_page() to modules] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memory-failure.c: remove unnecessary PageTransTail checkMiaohe Lin1-1/+1
When we reach here, we're guaranteed to have non-compound page as thp is already splited. Remove this unnecessary PageTransTail check. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memory-failure.c: remove obsolete comment in __soft_offline_pageMiaohe Lin1-4/+0
Since commit add05cecef80 ("mm: soft-offline: don't free target page in successful page migration"), set_migratetype_isolate logic is removed. Remove this obsolete comment. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memory-failure.c: rework the try_to_unmap logic in hwpoison_user_mappings()Miaohe Lin1-19/+15
Only for hugetlb pages in shared mappings, try_to_unmap should take semaphore in write mode here. Rework the code to make it clear. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memory-failure.c: remove PageSlab check in hwpoison_filter_devMiaohe Lin1-6/+0
Since commit 03e5ac2fc3bf ("mm: fix crash when using XFS on loopback"), page_mapping() can handle the Slab pages. So remove this unnecessary PageSlab check and obsolete comment. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memory-failure.c: fix race with changing page more robustlyMiaohe Lin1-5/+15
We're only intended to deal with the non-Compound page after we split thp in memory_failure. However, the page could have changed compound pages due to race window. If this happens, we could retry once to hopefully handle the page next round. Also remove unneeded orig_head. It's always equal to the hpage. So we can use hpage directly and remove this redundant one. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memory-failure.c: rework the signaling logic in kill_procMiaohe Lin1-10/+6
BUS_MCEERR_AR code is only sent when MF_ACTION_REQUIRED is set and the target is current. Rework the code to make this clear. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memory-failure.c: catch unexpected -EFAULT from vma_address()Miaohe Lin1-0/+1
It's unexpected to walk the page table when vma_address() return -EFAULT. But dev_pagemap_mapping_shift() is called only when vma associated to the error page is found already in collect_procs_{file,anon}, so vma_address() should not return -EFAULT except with some bug, as Naoya pointed out. We can use VM_BUG_ON_VMA() to catch this bug here. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memory-failure.c: minor clean up for memory_failure_dev_pagemapMiaohe Lin1-2/+2
Patch series "A few cleanup and fixup patches for memory failure", v3. This series contains a few patches to simplify the code logic, remove unneeded variable and remove obsolete comment. Also we fix race changing page more robustly in memory_failure. More details can be found in the respective changelogs. This patch (of 8): The flags always has MF_ACTION_REQUIRED and MF_MUST_KILL set. So we do not need to check these flags again. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: invalidate hwpoison page cache page in fault pathRik van Riel1-2/+7
Sometimes the page offlining code can leave behind a hwpoisoned clean page cache page. This can lead to programs being killed over and over and over again as they fault in the hwpoisoned page, get killed, and then get re-spawned by whatever wanted to run them. This is particularly embarrassing when the page was offlined due to having too many corrected memory errors. Now we are killing tasks due to them trying to access memory that probably isn't even corrupted. This problem can be avoided by invalidating the page from the page fault handler, which already has a branch for dealing with these kinds of pages. With this patch we simply pretend the page fault was successful if the page was invalidated, return to userspace, incur another page fault, read in the file from disk (to a new memory page), and then everything works again. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Rik van Riel <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: John Hubbard <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/hwpoison: fix error page recovered but reported "not recovered"Naoya Horiguchi1-1/+3
When an uncorrected memory error is consumed there is a race between the CMCI from the memory controller reporting an uncorrected error with a UCNA signature, and the core reporting and SRAR signature machine check when the data is about to be consumed. If the CMCI wins that race, the page is marked poisoned when uc_decode_notifier() calls memory_failure() and the machine check processing code finds the page already poisoned. It calls kill_accessing_process() to make sure a SIGBUS is sent. But returns the wrong error code. Console log looks like this: mce: Uncorrected hardware memory error in user-access at 3710b3400 Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered Memory failure: 0x3710b3: already hardware poisoned Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption mce: Memory error not recovered kill_accessing_process() is supposed to return -EHWPOISON to notify that SIGBUS is already set to the process and kill_me_maybe() doesn't have to send it again. But current code simply fails to do this, so fix it to make sure to work as intended. This change avoids the noise message "Memory error not recovered" and skips duplicate SIGBUSs. [[email protected]: reword some parts of commit message] Link: https://lkml.kernel.org/r/[email protected] Fixes: a3f5d80ea401 ("mm,hwpoison: send SIGBUS with error virutal address") Signed-off-by: Naoya Horiguchi <[email protected]> Reported-by: Youquan Song <[email protected]> Cc: Tony Luck <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memory-failure.c: remove obsolete commentNaoya Horiguchi1-6/+0
With the introduction of mf_mutex, most of memory error handling process is mutually exclusive, so the in-line comment about subtlety about double-checking PageHWPoison is no more correct. So remove it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Naoya Horiguchi <[email protected]> Suggested-by: Mike Kravetz <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/page_alloc: check high-order pages for corruption during PCP operationsMel Gorman1-23/+23
Eric Dumazet pointed out that commit 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") only checks the head page during PCP refill and allocation operations. This was an oversight and all pages should be checked. This will incur a small performance penalty but it's necessary for correctness. Link: https://lkml.kernel.org/r/[email protected] Fixes: 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") Signed-off-by: Mel Gorman <[email protected]> Reported-by: Eric Dumazet <[email protected]> Acked-by: Eric Dumazet <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Wei Xu <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/page_alloc: call check_new_pages() while zone spinlock is not heldEric Dumazet1-9/+9
For high order pages not using pcp, rmqueue() is currently calling the costly check_new_pages() while zone spinlock is held, and hard irqs masked. This is not needed, we can release the spinlock sooner to reduce zone spinlock contention. Note that after this patch, we call __mod_zone_freepage_state() before deciding to leak the page because it is in bad state. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Mel Gorman <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Wei Xu <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: count time in drain_all_pages during direct reclaim as memory pressureSuren Baghdasaryan1-4/+6
When page allocation in direct reclaim path fails, the system will make one attempt to shrink per-cpu page lists and free pages from high alloc reserves. Draining per-cpu pages into buddy allocator can be a very slow operation because it's done using workqueues and the task in direct reclaim waits for all of them to finish before proceeding. Currently this time is not accounted as psi memory stall. While testing mobile devices under extreme memory pressure, when allocations are failing during direct reclaim, we notices that psi events which would be expected in such conditions were not triggered. After profiling these cases it was determined that the reason for missing psi events was that a big chunk of time spent in direct reclaim is not accounted as memory stall, therefore psi would not reach the levels at which an event is generated. Further investigation revealed that the bulk of that unaccounted time was spent inside drain_all_pages call. A typical captured case when drain_all_pages path gets activated: __alloc_pages_slowpath took 44.644.613ns __perform_reclaim took 751.668ns (1.7%) drain_all_pages took 43.887.167ns (98.3%) PSI in this case records the time spent in __perform_reclaim but ignores drain_all_pages, IOW it misses 98.3% of the time spent in __alloc_pages_slowpath. Annotate __alloc_pages_direct_reclaim in its entirety so that delays from handling page allocation failure in the direct reclaim path are accounted as memory stall. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Reported-by: Tim Murray <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22arch/x86/mm/numa: Do not initialize nodes twiceOscar Salvador3-15/+21
On x86, prior to ("mm: handle uninitialized numa nodes gracecully"), NUMA nodes could be allocated at three different places. - numa_register_memblks - init_cpu_to_node - init_gi_nodes All these calls happen at setup_arch, and have the following order: setup_arch ... x86_numa_init numa_init numa_register_memblks ... init_cpu_to_node init_memory_less_node alloc_node_data free_area_init_memoryless_node init_gi_nodes init_memory_less_node alloc_node_data free_area_init_memoryless_node numa_register_memblks() is only interested in those nodes which have memory, so it skips over any memoryless node it founds. Later on, when we have read ACPI's SRAT table, we call init_cpu_to_node() and init_gi_nodes(), which initialize any memoryless node we might have that have either CPU or Initiator affinity, meaning we allocate pg_data_t struct for them and we mark them as ONLINE. So far so good, but the thing is that after ("mm: handle uninitialized numa nodes gracefully"), we allocate all possible NUMA nodes in free_area_init(), meaning we have a picture like the following: setup_arch x86_numa_init numa_init numa_register_memblks <-- allocate non-memoryless node x86_init.paging.pagetable_init ... free_area_init free_area_init_memoryless <-- allocate memoryless node init_cpu_to_node alloc_node_data <-- allocate memoryless node with CPU free_area_init_memoryless_node init_gi_nodes alloc_node_data <-- allocate memoryless node with Initiator free_area_init_memoryless_node free_area_init() already allocates all possible NUMA nodes, but init_cpu_to_node() and init_gi_nodes() are clueless about that, so they go ahead and allocate a new pg_data_t struct without checking anything, meaning we end up allocating twice. It should be mad clear that this only happens in the case where memoryless NUMA node happens to have a CPU/Initiator affinity. So get rid of init_memory_less_node() and just set the node online. Note that setting the node online is needed, otherwise we choke down the chain when bringup_nonboot_cpus() ends up calling __try_online_node()->register_one_node()->... and we blow up in bus_add_device(). As can be seen here: BUG: kernel NULL pointer dereference, address: 0000000000000060 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4 RIP: 0010:bus_add_device+0x5a/0x140 Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004 RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19 RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400 RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98 R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0 R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0 Call Trace: device_add+0x4c0/0x910 __register_one_node+0x97/0x2d0 __try_online_node+0x85/0xc0 try_online_node+0x25/0x40 cpu_up+0x4f/0x100 bringup_nonboot_cpus+0x4f/0x60 smp_init+0x26/0x79 kernel_init_freeable+0x130/0x2f1 kernel_init+0x17/0x150 ret_from_fork+0x22/0x30 The reason is simple, by the time bringup_nonboot_cpus() gets called, we did not register the node_subsys bus yet, so we crash when bus_add_device() tries to dereference bus()->p. The following shows the order of the calls: kernel_init_freeable smp_init bringup_nonboot_cpus ... bus_add_device() <- we did not register node_subsys yet do_basic_setup do_initcalls postcore_initcall(register_node_type); register_node_type subsys_system_register subsys_register bus_register <- register node_subsys bus Why setting the node online saves us then? Well, simply because __try_online_node() backs off when the node is online, meaning we do not end up calling register_one_node() in the first place. This is subtle, broken and deserves a deep analysis and thought about how to put this into shape, but for now let us have this easy fix for the leaking memory issue. [[email protected]: add comments] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully") Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Rafael Aquini <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Wei Yang <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Alexey Makhalov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/page_alloc: do not prefetch buddies during bulk freeMel Gorman1-24/+0
free_pcppages_bulk() has taken two passes through the pcp lists since commit 0a5f4e5b4562 ("mm/free_pcppages_bulk: do not hold lock when picking pages to free") due to deferring the cost of selecting PCP lists until the zone lock is held. As the list processing now takes place under the zone lock, it's less clear that this will always benefit for two reasons. 1. There is a guaranteed cost to calculating the buddy which definitely has to be calculated again. However, as the zone lock is held and there is no deferring of buddy merging, there is no guarantee that the prefetch will have completed when the second buddy calculation takes place and buddies are being merged. With or without the prefetch, there may be further stalls depending on how many pages get merged. In other words, a stall due to merging is inevitable and at best only one stall might be avoided at the cost of calculating the buddy location twice. 2. As the zone lock is held, prefetch_nr makes less sense as once prefetch_nr expires, the cache lines of interest have already been merged. The main concern is that there is a definite cost to calculating the buddy location early for the prefetch and it is a "maybe win" depending on whether the CPU prefetch logic and memory is fast enough. Remove the prefetch logic on the basis that reduced instructions in a path is always a saving where as the prefetch might save one memory stall depending on the CPU and memory. In most cases, this has marginal benefit as the calculations are a small part of the overall freeing of pages. However, it was detectable on at least one machine. 5.17.0-rc3 5.17.0-rc3 mm-highpcplimit-v2r1 mm-noprefetch-v1r1 Min elapsed 630.00 ( 0.00%) 610.00 ( 3.17%) Amean elapsed 639.00 ( 0.00%) 623.00 * 2.50%* Max elapsed 660.00 ( 0.00%) 660.00 ( 0.00%) Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Suggested-by: Aaron Lu <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Aaron Lu <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/page_alloc: limit number of high-order pages on PCP during bulk freeMel Gorman1-5/+21
When a PCP is mostly used for frees then high-order pages can exist on PCP lists for some time. This is problematic when the allocation pattern is all allocations from one CPU and all frees from another resulting in colder pages being used. When bulk freeing pages, limit the number of high-order pages that are stored on the PCP lists. Netperf running on localhost exhibits this pattern and while it does not matter for some machines, it does matter for others with smaller caches where cache misses cause problems due to reduced page reuse. Pages freed directly to the buddy list may be reused quickly while still cache hot where as storing on the PCP lists may be cold by the time free_pcppages_bulk() is called. Using perf kmem:mm_page_alloc, the 5 most used page frames were 5.17-rc3 13041 pfn=0x111a30 13081 pfn=0x5814d0 13097 pfn=0x108258 13121 pfn=0x689598 13128 pfn=0x5814d8 5.17-revert-highpcp 192009 pfn=0x54c140 195426 pfn=0x1081d0 200908 pfn=0x61c808 243515 pfn=0xa9dc20 402523 pfn=0x222bb8 5.17-full-series 142693 pfn=0x346208 162227 pfn=0x13bf08 166413 pfn=0x2711e0 166950 pfn=0x2702f8 The spread is wider as there is still time before pages freed to one PCP get released with a tradeoff between fast reuse and reduced zone lock acquisition. On the machine used to gather the traces, the headline performance was equivalent. netperf-tcp 5.17.0-rc3 5.17.0-rc3 5.17.0-rc3 vanilla mm-reverthighpcp-v1r1 mm-highpcplimit-v2 Hmean 64 839.93 ( 0.00%) 840.77 ( 0.10%) 841.02 ( 0.13%) Hmean 128 1614.22 ( 0.00%) 1622.07 * 0.49%* 1636.41 * 1.37%* Hmean 256 2952.00 ( 0.00%) 2953.19 ( 0.04%) 2977.76 * 0.87%* Hmean 1024 10291.67 ( 0.00%) 10239.17 ( -0.51%) 10434.41 * 1.39%* Hmean 2048 17335.08 ( 0.00%) 17399.97 ( 0.37%) 17134.81 * -1.16%* Hmean 3312 22628.15 ( 0.00%) 22471.97 ( -0.69%) 22422.78 ( -0.91%) Hmean 4096 25009.50 ( 0.00%) 24752.83 * -1.03%* 24740.41 ( -1.08%) Hmean 8192 32745.01 ( 0.00%) 31682.63 * -3.24%* 32153.50 * -1.81%* Hmean 16384 39759.59 ( 0.00%) 36805.78 * -7.43%* 38948.13 * -2.04%* On a 1-socket skylake machine with a small CPU cache that suffers more if cache misses are too high netperf-tcp 5.17.0-rc3 5.17.0-rc3 5.17.0-rc3 vanilla mm-reverthighpcp-v1 mm-highpcplimit-v2 Hmean 64 938.95 ( 0.00%) 941.50 * 0.27%* 943.61 * 0.50%* Hmean 128 1843.10 ( 0.00%) 1857.58 * 0.79%* 1861.09 * 0.98%* Hmean 256 3573.07 ( 0.00%) 3667.45 * 2.64%* 3674.91 * 2.85%* Hmean 1024 13206.52 ( 0.00%) 13487.80 * 2.13%* 13393.21 * 1.41%* Hmean 2048 22870.23 ( 0.00%) 23337.96 * 2.05%* 23188.41 * 1.39%* Hmean 3312 31001.99 ( 0.00%) 32206.50 * 3.89%* 31863.62 * 2.78%* Hmean 4096 35364.59 ( 0.00%) 36490.96 * 3.19%* 36112.54 * 2.11%* Hmean 8192 48497.71 ( 0.00%) 49954.05 * 3.00%* 49588.26 * 2.25%* Hmean 16384 58410.86 ( 0.00%) 60839.80 * 4.16%* 62282.96 * 6.63%* Note that this was a machine that did not benefit from caching high-order pages and performance is almost restored with the series applied. It's not fully restored as cache misses are still higher. This is a trade-off between optimising for a workload that does all allocs on one CPU and frees on another or more general workloads that need high-order pages for SLUB and benefit from avoiding zone->lock for every SLUB refill/drain. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Tested-by: Aaron Lu <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/page_alloc: free pages in a single pass during bulk freeMel Gorman1-35/+21
free_pcppages_bulk() has taken two passes through the pcp lists since commit 0a5f4e5b4562 ("mm/free_pcppages_bulk: do not hold lock when picking pages to free") due to deferring the cost of selecting PCP lists until the zone lock is held. Now that list selection is simplier, the main cost during selection is bulkfree_pcp_prepare() which in the normal case is a simple check and prefetching. As the list manipulations have cost in itself, go back to freeing pages in a single pass. The series up to this point was evaulated using a trunc microbenchmark that is truncating sparse files stored in page cache (mmtests config config-io-trunc). Sparse files were used to limit filesystem interaction. The results versus a revert of storing high-order pages in the PCP lists is 1-socket Skylake 5.17.0-rc3 5.17.0-rc3 5.17.0-rc3 vanilla mm-reverthighpcp-v1 mm-highpcpopt-v2 Min elapsed 540.00 ( 0.00%) 530.00 ( 1.85%) 530.00 ( 1.85%) Amean elapsed 543.00 ( 0.00%) 530.00 * 2.39%* 530.00 * 2.39%* Stddev elapsed 4.83 ( 0.00%) 0.00 ( 100.00%) 0.00 ( 100.00%) CoeffVar elapsed 0.89 ( 0.00%) 0.00 ( 100.00%) 0.00 ( 100.00%) Max elapsed 550.00 ( 0.00%) 530.00 ( 3.64%) 530.00 ( 3.64%) BAmean-50 elapsed 540.00 ( 0.00%) 530.00 ( 1.85%) 530.00 ( 1.85%) BAmean-95 elapsed 542.22 ( 0.00%) 530.00 ( 2.25%) 530.00 ( 2.25%) BAmean-99 elapsed 542.22 ( 0.00%) 530.00 ( 2.25%) 530.00 ( 2.25%) 2-socket CascadeLake 5.17.0-rc3 5.17.0-rc3 5.17.0-rc3 vanilla mm-reverthighpcp-v1 mm-highpcpopt-v2 Min elapsed 510.00 ( 0.00%) 500.00 ( 1.96%) 500.00 ( 1.96%) Amean elapsed 529.00 ( 0.00%) 521.00 ( 1.51%) 510.00 * 3.59%* Stddev elapsed 16.63 ( 0.00%) 12.87 ( 22.64%) 11.55 ( 30.58%) CoeffVar elapsed 3.14 ( 0.00%) 2.47 ( 21.46%) 2.26 ( 27.99%) Max elapsed 550.00 ( 0.00%) 540.00 ( 1.82%) 530.00 ( 3.64%) BAmean-50 elapsed 516.00 ( 0.00%) 512.00 ( 0.78%) 500.00 ( 3.10%) BAmean-95 elapsed 526.67 ( 0.00%) 518.89 ( 1.48%) 507.78 ( 3.59%) BAmean-99 elapsed 526.67 ( 0.00%) 518.89 ( 1.48%) 507.78 ( 3.59%) The original motivation for multi-passes was will-it-scale page_fault1 using $nr_cpu processes. 2-socket CascadeLake (40 cores, 80 CPUs HT enabled) 5.17.0-rc3 5.17.0-rc3 vanilla mm-highpcpopt-v2 Hmean page_fault1-processes-2 2694662.26 ( 0.00%) 2695780.35 ( 0.04%) Hmean page_fault1-processes-5 6425819.34 ( 0.00%) 6435544.57 * 0.15%* Hmean page_fault1-processes-8 9642169.10 ( 0.00%) 9658962.39 ( 0.17%) Hmean page_fault1-processes-12 12167502.10 ( 0.00%) 12190163.79 ( 0.19%) Hmean page_fault1-processes-21 15636859.03 ( 0.00%) 15612447.26 ( -0.16%) Hmean page_fault1-processes-30 25157348.61 ( 0.00%) 25169456.65 ( 0.05%) Hmean page_fault1-processes-48 27694013.85 ( 0.00%) 27671111.46 ( -0.08%) Hmean page_fault1-processes-79 25928742.64 ( 0.00%) 25934202.02 ( 0.02%) <-- Hmean page_fault1-processes-110 25730869.75 ( 0.00%) 25671880.65 * -0.23%* Hmean page_fault1-processes-141 25626992.42 ( 0.00%) 25629551.61 ( 0.01%) Hmean page_fault1-processes-172 25611651.35 ( 0.00%) 25614927.99 ( 0.01%) Hmean page_fault1-processes-203 25577298.75 ( 0.00%) 25583445.59 ( 0.02%) Hmean page_fault1-processes-234 25580686.07 ( 0.00%) 25608240.71 ( 0.11%) Hmean page_fault1-processes-265 25570215.47 ( 0.00%) 25568647.58 ( -0.01%) Hmean page_fault1-processes-296 25549488.62 ( 0.00%) 25543935.00 ( -0.02%) Hmean page_fault1-processes-320 25555149.05 ( 0.00%) 25575696.74 ( 0.08%) The differences are mostly within the noise and the difference close to $nr_cpus is negligible. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Tested-by: Aaron Lu <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>