|
Constify the flag tests that aren't automatically generated and the tests
that look like flag tests but are more complicated.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Now that __dump_page() takes a const argument, we can make dump_page()
take a const struct page too.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Turn __dump_page() into a wrapper around __dump_folio(). Snapshot the
page & folio into a stack variable so we don't hit BUG_ON() if an
allocation is freed under us and what was a folio pointer becomes a
pointer to a tail page.
[[email protected]: fix build issue]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix __dump_folio]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix pointer confusion]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: s/printk/pr_warn/]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Optimize the initialization speed of 1G huge pages through parallelization.
1G hugetlbs are allocated from bootmem, a process that is already very
fast and does not currently require optimization. Therefore, we focus on
parallelizing only the initialization phase in `gather_bootmem_prealloc`.
Here are some test results:
test case              no patch(ms)   patched(ms)   saved
--------------------   ------------   -----------   -------
256c2T(4 node) 1G              4745          2024    57.34%
128c1T(2 node) 1G              3358          1712    49.02%
12T            1G             77000         18300    76.23%
[[email protected]: s/initialied/initialized/, per Alexey]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Steffen Klassert <[email protected]>
Cc: Tim Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
By distributing both the allocation and the initialization tasks across
multiple threads, the initialization of 2M hugetlb will be faster, thereby
improving the boot speed.
Here are some test results:
test case              no patch(ms)   patched(ms)   saved
--------------------   ------------   -----------   -------
256c2T(4 node) 2M              3336          1051    68.52%
128c1T(2 node) 2M              1943           716    63.15%
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Steffen Klassert <[email protected]>
Cc: Tim Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When a group of tasks that access different nodes is scheduled on the
same node, they may encounter bandwidth bottlenecks and access latency.
Thus, a numa_aware flag is introduced here, allowing tasks to be
distributed across different nodes to take full advantage of multi-node
systems.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Reviewed-by: Tim Chen <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Steffen Klassert <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
With parallelization of hugetlb allocation across different threads, each
thread works on a different node to allocate pages from, instead of all
threads allocating from a common node, h->next_nid_to_alloc. To support
this, it is necessary to assign a separate next_nid_to_alloc to each
thread. Consequently, hstate_next_node_to_alloc and
for_each_node_mask_to_alloc have been modified to directly accept a
*next_nid_to_alloc parameter, ensuring thread-specific allocation and
avoiding concurrent access issues.
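For illustration, a hedged sketch of the resulting helper, reusing
hugetlb's existing node-rotation helpers (not the literal diff):
```c
/*
 * Sketch only: the helper now advances a caller-owned cursor instead of
 * h->next_nid_to_alloc, so each worker thread can round-robin over nodes
 * without sharing (and racing on) the hstate field.
 */
static int hstate_next_node_to_alloc(int *next_node,
				     nodemask_t *nodes_allowed)
{
	int nid;

	VM_BUG_ON(!nodes_allowed);

	nid = get_valid_node_allowed(*next_node, nodes_allowed);
	*next_node = next_node_allowed(nid, nodes_allowed);

	return nid;
}
```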
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
Reviewed-by: Tim Chen <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Steffen Klassert <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
1G and 2M huge pages have different allocation and initialization logic,
which leads to subtle differences in parallelization. Therefore, it is
appropriate to split hugetlb_hstate_alloc_pages into gigantic and
non-gigantic.
This patch has no functional changes.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
Reviewed-by: Tim Chen <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Steffen Klassert <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "hugetlb: parallelize hugetlb page init on boot", v6.
Introduction
------------
Hugetlb initialization during boot takes up a considerable amount of time.
For instance, on a 2TB system, initializing 1,800 1GB huge pages takes
1-2 seconds out of 10 seconds. Initializing 11,776 1GB pages on a 12TB
Intel host takes more than 1 minute[1]. This is a noteworthy figure.
Inspired by [2] and [3], hugetlb initialization can also be accelerated
through parallelization. The kernel already has infrastructure such as
padata_do_multithreaded; this series uses it to achieve good results with
minimal modification.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/all/[email protected]/
max_threads
-----------
This series uses `padata_do_multithreaded` like this:
```
job.max_threads = num_node_state(N_MEMORY) * multiplier;
padata_do_multithreaded(&job);
```
To fully utilize the CPU, the number of parallel threads needs to be
carefully considered. `max_threads = num_node_state(N_MEMORY)` does not
fully utilize the CPU, so we need to multiply it by a multiplier.
Tests below indicate that a multiplier of 2 significantly improves
performance, and although larger values also provide improvements, the
gains are marginal.
multiplier      1       2       3       4       5
------------   -----   -----   -----   -----   -----
256G 2node     358ms   215ms   157ms   134ms   126ms
2T   4node     979ms   679ms   543ms   489ms   481ms
50G  2node      71ms    44ms    37ms    30ms    31ms
Therefore, choosing 2 as the multiplier strikes a good balance between
enhancing parallel processing capabilities and maintaining efficient
resource management.
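For context, a fuller sketch of such a job description is below; the field
names follow struct padata_mt_job as of this series, while the worker
function name and chunk sizing are illustrative assumptions, not the exact
patch:
```c
/* Illustrative job setup; hugetlb_pages_alloc_boot_node() is a stand-in
 * name for the per-range worker that allocates/initializes the pages. */
struct padata_mt_job job = {
	.thread_fn	= hugetlb_pages_alloc_boot_node,
	.fn_arg		= h,
	.start		= 0,
	.size		= h->max_huge_pages,
	.align		= 1,
	.min_chunk	= h->max_huge_pages / num_node_state(N_MEMORY),
	.max_threads	= num_node_state(N_MEMORY) * 2,	/* multiplier = 2 */
	.numa_aware	= true,
};

padata_do_multithreaded(&job);
```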
Test result
-----------
test case              no patch(ms)   patched(ms)   saved
--------------------   ------------   -----------   -------
256c2T(4 node) 1G              4745          2024    57.34%
128c1T(2 node) 1G              3358          1712    49.02%
12T            1G             77000         18300    76.23%
256c2T(4 node) 2M              3336          1051    68.52%
128c1T(2 node) 2M              1943           716    63.15%
This patch (of 8):
The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning up
the code, its readability can be improved, facilitating future
modifications.
This patch extracts two functions to reduce the complexity of
`hugetlb_hstate_alloc_pages` and has no functional changes.
- hugetlb_hstate_alloc_pages_node_specific() iterates through each
online node and performs allocation if necessary.
- hugetlb_hstate_alloc_pages_report() reports errors during allocation,
and updates h->max_huge_pages accordingly.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Reviewed-by: Tim Chen <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Steffen Klassert <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
There are various users of the get_vm_area() + ioremap_page_range() APIs.
Enforce that the vm_area was requested with the VM_IOREMAP type and that
the range passed to ioremap_page_range() matches the created vm_area, to
avoid accidentally ioremapping the wrong address range.
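A hedged usage sketch of the pair being validated (error handling trimmed;
not taken from the patch itself):
```c
/* Sketch: create the area as VM_IOREMAP and map a range that lies
 * entirely inside it, which is what the new checks enforce. */
static int map_io_region(phys_addr_t phys, unsigned long size, pgprot_t prot)
{
	struct vm_struct *area;
	unsigned long addr;

	area = get_vm_area(size, VM_IOREMAP);
	if (!area)
		return -ENOMEM;
	addr = (unsigned long)area->addr;

	return ioremap_page_range(addr, addr + size, phys, prot);
}
```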
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
When draining a page_frag_cache, most users perform similar steps, so
introduce an API to avoid code duplication.
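A minimal usage sketch, assuming the new helper is page_frag_cache_drain()
taking the cache pointer; the surrounding driver structure is made up:
```c
/* Hypothetical driver state holding a page_frag_cache. */
struct my_rx_queue {
	struct page_frag_cache pf_cache;
};

static void my_rx_queue_teardown(struct my_rx_queue *q)
{
	/* One call replaces the open-coded drain steps each user had. */
	page_frag_cache_drain(&q->pf_cache);
}
```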
Signed-off-by: Yunsheng Lin <[email protected]>
Acked-by: Jason Wang <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
Currently there are three page frag implementations which all try to
allocate an order-3 page; if that fails, they fall back to allocating an
order-0 page. Each of them allows the order-3 allocation to fail under
certain conditions by using specific gfp bits.
The gfp bits used for the order-3 allocation differ between
implementations: __GFP_NOMEMALLOC is or'd in to forbid access to the
emergency reserves in __page_frag_cache_refill(), but not in the other
implementations; __GFP_DIRECT_RECLAIM is masked off to avoid direct
reclaim in vhost_net_page_frag_refill(), but not in
__page_frag_cache_refill().
This patch unifies the gfp bits used by the different implementations by
or'ing in __GFP_NOMEMALLOC and masking off __GFP_DIRECT_RECLAIM for the
order-3 allocation, to avoid putting possible pressure on mm.
Leave the gfp unification for the page frag implementation in sock.c
for now, as suggested by Paolo Abeni.
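A minimal sketch of the unified mask for the order-3 attempt (assuming the
usual page_frag refill gfp handling; not a verbatim copy of the patch):
```c
static inline gfp_t page_frag_order3_gfp(gfp_t gfp)
{
	/* Opportunistic high-order try: no warning, no retry, no direct
	 * reclaim, and keep away from the emergency reserves. */
	return (gfp | __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
		__GFP_NOMEMALLOC) & ~__GFP_DIRECT_RECLAIM;
}
```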
Signed-off-by: Yunsheng Lin <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
CC: Alexander Duyck <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
napi_alloc_frag_align() and netdev_alloc_frag_align() accept align as an
argument. They are thin wrappers around the __napi_alloc_frag_align() and
__netdev_alloc_frag_align() APIs, doing the alignment checking and
align-mask conversion so that page_frag_alloc_align() can be called
directly. The intention is to keep the alignment checking and align-mask
conversion in inline wrappers, avoiding that kind of work at run time
since it can usually be handled at compile time.
We are going to use page_frag_alloc_align() in vhost_net.c, which needs
the same kind of alignment checking and align-mask conversion. So split
page_frag_alloc_align() into an inline wrapper doing the above
operations, and add __page_frag_alloc_align(), which is passed the align
mask the original function expected, as suggested by Alexander.
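The resulting wrapper looks roughly like this (sketch, assuming the
convention of passing -align as the align mask for a power-of-two align):
```c
static inline void *page_frag_alloc_align(struct page_frag_cache *nc,
					  unsigned int fragsz, gfp_t gfp_mask,
					  unsigned int align)
{
	/* Compile-time-friendly checks stay in the inline wrapper ... */
	WARN_ON_ONCE(!is_power_of_2(align));

	/* ... and the real work takes a pre-computed align mask. */
	return __page_frag_alloc_align(nc, fragsz, gfp_mask, -align);
}
```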
Signed-off-by: Yunsheng Lin <[email protected]>
CC: Alexander Duyck <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
The PARTIAL_NODE slab_state has gone with SLAB removed, so just
remove it.
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
We save the allocated tag in the object header to indicate that the
object is allocated:
handle |= OBJ_ALLOCATED_TAG;
So the object header needs to reserve the LSB for this tag bit. But the
handle itself doesn't need to reserve the LSB for the tag, since it is
only used to find the position of the object, via (pfn + obj_idx). So
remove the LSB reservation from the handle; one more bit can be used for
obj_idx.
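Conceptually, the handle encoding becomes the following (sketch; constant
names are taken from zsmalloc's existing code, and the exact shifts should
be treated as an assumption):
```c
static unsigned long location_to_obj(struct page *page, unsigned int obj_idx)
{
	/* No OBJ_TAG_BITS shift any more: the allocated tag lives in the
	 * object header, so all low bits of the handle can index objects. */
	return (page_to_pfn(page) << OBJ_INDEX_BITS) |
	       (obj_idx & OBJ_INDEX_MASK);
}
```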
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chengming Zhou <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
do_numa_page() is reading from the same page table entry, twice, while
holding the page table lock: once while checking that the pte hasn't
changed, and again in order to modify the pte.
Instead, just read the pte once, and save it in the same old_pte variable
that already exists. This has no effect on behavior, other than to
provide a tiny potential improvement to performance, by avoiding the
redundant memory read (which the compiler cannot elide, due to
READ_ONCE()).
Also improve the associated comments nearby.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
alloc_contig_migrate_range has all the information needed to understand
the latency of big contiguous allocations, for example how many pages
were migrated and how many times they needed to be unmapped from page
tables.
This patch adds a trace event to collect the allocation statistics. In
the field, it has been quite useful for understanding CMA allocation
latency.
[[email protected]: a/trace_mm_alloc_config_migrate_range_info_enabled/trace_mm_alloc_contig_migrate_range_info_enabled]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Richard Chang <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
Cc: Martin Liu <[email protected]>
Cc: "Masami Hiramatsu (Google)" <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
We already have a folio; use it instead of the head page where reasonable.
Saves a couple of calls to compound_head() and eliminates a few
references to page->mapping.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Sidhartha Kumar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
All but one caller already has a folio, so convert
free_page_and_swap_cache() to have a folio and remove the call to
page_folio().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
These pages are all chained together through the lru list, so we know
they're folios. Use the folio APIs to save three hidden calls to
compound_head().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Process the pages in batch-sized quantities instead of all-at-once.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
All callers now use free_unref_folios() so we can delete this function.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
All users have been converted to mem_cgroup_uncharge_folios() so we can
remove this API.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The few folios which can't be moved to the LRU list (because their
refcount dropped to zero) used to be returned to the caller to dispose of.
Make this simpler to call by freeing the folios directly through
free_unref_folios().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Use free_unref_page_batch() to free the folios. This may increase the
number of IPIs from calling try_to_unmap_flush() more often, but that's
going to be very workload-dependent. It may even reduce the number of
IPIs as we now batch-free large folios instead of freeing them one at a
time.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Hugetlb folios still get special treatment, but normal large folios can
now be freed by free_unref_folios(). This should have a reasonable
performance impact, TBD.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Call folio_undo_large_rmappable() if needed. free_unref_page_prepare()
destroys the ability to call folio_order(), so stash the order in
folio->private for the benefit of the second loop.
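A hedged sketch of that stash-and-restore pattern (the helper names are
hypothetical and only illustrate the idea):
```c
/* Remember the order before free_unref_page_prepare() destroys the
 * compound metadata that folio_order() relies on ... */
static inline void folio_stash_order(struct folio *folio)
{
	folio->private = (void *)(unsigned long)folio_order(folio);
}

/* ... and read it back in the second loop. */
static inline unsigned int folio_stashed_order(struct folio *folio)
{
	return (unsigned long)folio->private;
}
```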
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Pass a pointer to the lruvec so we can take advantage of the
folio_lruvec_relock_irqsave(). Adjust the calling convention of
folio_lruvec_relock_irqsave() to suit and add a page_cache_release()
wrapper.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Break up the list of folios into batches here so that the folios are more
likely to be cache hot when doing the rest of the processing.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Instead of putting the interesting folios on a list, delete the
uninteresting one from the folio_batch.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Almost identical to mem_cgroup_uncharge_list(), except it takes a
folio_batch instead of a list_head.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
There's no need to indirect through release_pages() and iterate over this
batch of folios an extra time; we can just use the batch that we have.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Iterate over a folio_batch rather than a linked list. This is easier for
the CPU to prefetch and has a batch count naturally built in so we don't
need to track it. Again, this lowers the maximum lock hold time from
32 folios to 15, but I do not expect this to have a significant effect.
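For reference, the loop shape this moves to is a plain indexed walk over a
small array (sketch; the per-folio work shown is a hypothetical stand-in):
```c
/* Sketch of the loop shape: a folio_batch is a small array plus a count,
 * so the walk is indexed rather than list_for_each_entry() over ->lru. */
static void drop_batch_refs(struct folio_batch *fbatch)
{
	unsigned int i;

	for (i = 0; i < folio_batch_count(fbatch); i++)
		folio_put(fbatch->folios[i]);
}
```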
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Most of its callees are not yet ready to accept a folio, but we know all
of the pages passed in are actually folios because they're linked through
->lru.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Rearrange batched folio freeing", v3.
Other than the obvious "remove calls to compound_head" changes, the
fundamental belief here is that iterating a linked list is much slower
than iterating an array (5-15x slower in my testing). There's also an
associated belief that since we iterate the batch of folios three times,
we do better when the array is small (ie 15 entries) than we do with a
batch that is hundreds of entries long, which only gives us the
opportunity for the first pages to fall out of cache by the time we get to
the end.
It is possible we should increase the size of folio_batch. Hopefully the
bots let us know if this introduces any performance regressions.
This patch (of 3):
By making release_pages() call folios_put(), we can get rid of the calls
to compound_head() for the callers that already know they have folios. We
can also get rid of the lock_batch tracking as we know the size of the
batch is limited by folio_batch. This does reduce the maximum number of
pages for which the lruvec lock is held, from SWAP_CLUSTER_MAX (32) to
PAGEVEC_SIZE (15). I do not expect this to make a significant difference,
but if it does, we can increase PAGEVEC_SIZE to 31.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Previously, we removed the mm from mm_slot and dropped mm_count
if the MMF_DISABLE_THP flag was set. However, we didn't add
the mm back after clearing the MMF_DISABLE_THP flag. Additionally,
we add a check for the MMF_DISABLE_THP flag in hugepage_vma_revalidate().
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 879c6000e191 ("mm/khugepaged: bypassing unnecessary scans with MMF_DISABLE_THP check")
Signed-off-by: Lance Yang <[email protected]>
Suggested-by: Yang Shi <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm: remove total_mapcount()", v2.
Let's remove the remaining user from mm/memfd.c so we can get rid of
total_mapcount().
This patch (of 2):
Both functions are the remaining users of total_mapcount(). Let's get rid
of the calls by converting the code to folios.
As it turns out, the code is unnecessarily complicated, especially:
1) We can query the number of pagecache references for a folio simply via
folio_nr_pages(). This will handle other folio sizes in the future
correctly.
2) The xas_set(xas, page->index + cache_count) call to increment the
iterator for large folios is not required. Remove it.
Further, simplify the XA_CHECK_SCHED check, counting each entry exactly
once.
Memfd pages can be swapped out when using shmem; leave xa_is_value()
checks in place.
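A hedged illustration of point 1); the helper name is hypothetical:
```c
/* Number of page cache references a mapping entry contributes: value
 * entries (swapped-out shmem) contribute none, a folio contributes one
 * per page it covers, whatever its size. */
static long memfd_cache_refs(struct folio *folio)
{
	if (xa_is_value(folio))
		return 0;

	return folio_nr_pages(folio);
}
```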
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Co-developed-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
It is used to test split_huge_page_to_list_to_order for pagecache THPs.
Also add test cases for split_huge_page_to_list_to_order via both debugfs.
[[email protected]: fix issue discovered with NFS]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Tested-by: Aishwarya TCV <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michal Koutny <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Cc: Aishwarya TCV <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
To split a THP to any lower order, we need to reform the THP into folios
of the given order on its subpages and adjust the page refcount based on
the new page order. We also need to reinitialize page_deferred_list after
removing a page from the split_queue; otherwise a subsequent split will
see list corruption when checking page_deferred_list again.
Note: Anonymous order-1 folio is not supported because _deferred_list,
which is used by partially mapped folios, is stored in subpage 2 and an
order-1 folio only has subpage 0 and 1. File-backed order-1 folios are
fine, since they do not use _deferred_list.
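A hedged usage sketch of the new entry point introduced by this series
(the wrapper around it is illustrative):
```c
/* Split a locked large folio down to new_order instead of order 0.
 * Return conventions follow split_huge_page(): 0 on success. */
static int try_split_folio_to(struct folio *folio, unsigned int new_order)
{
	return split_huge_page_to_list_to_order(&folio->page, NULL, new_order);
}
```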
[[email protected]: fixup per discussion with Ryan]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michal Koutny <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add a new_order parameter to set the new page order in page owner. This
prepares for upcoming changes to support splitting a huge page to any
lower order.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michal Koutny <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Set the memcg information for the pages after the split. A new parameter,
new_order, is added to specify the order of the subpages in the new page
(always 0 for now). This prepares for upcoming changes to support
splitting a huge page to any lower order.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michal Koutny <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
We do not have non-power-of-two pages; using nr is error prone if nr is
not a power of two. Use page order instead.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michal Koutny <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
We do not have non-power-of-two pages; using nr is error prone if nr is
not a power of two. Use page order instead.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michal Koutny <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Folios of order 1 have no space to store the deferred list. This is not a
problem for the page cache as file-backed folios are never placed on the
deferred list. All we need to do is prevent the core MM from touching the
deferred list for order 1 folios and remove the code which prevented us
from allocating order 1 folios.
Link: https://lore.kernel.org/linux-mm/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Michal Koutny <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Split a folio to any lower order folios", v5.
File folios support any order and multi-size THP has been upstreamed[1],
so both file and anonymous folios can have order >0. Currently,
split_huge_page() only splits a huge page to order-0 pages, but splitting
to orders higher than 0 might better utilize large folios, if done
properly. In addition, Large Block Size support in XFS would benefit from
it during truncate[2].
This patchset adds support for splitting a large folio to any lower order
folios.
In addition to this implementation of split_huge_page_to_list_to_order(),
a possible optimization could be splitting a large folio into arbitrary
smaller folios instead of a single order. As both Hugh and Ryan pointed
out [3,5], splitting to a single order might not be optimal: an order-9
folio might be better split into 1 order-8, 1 order-7, ..., 1 order-1, and
2 order-0 folios, depending on subsequent folio operations. Leave this as
future work.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/
[4] https://lore.kernel.org/linux-mm/[email protected]/
[5] https://lore.kernel.org/linux-mm/[email protected]/
This patch (of 8):
As multi-size THP support has been added, not all THPs are PMD-mapped;
thus, during a huge page split, there is no need to always split the PMD
mapping in unmap_folio(). Make it conditional.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Koutny <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
While doing MADV_PAGEOUT, the current code clears the PTE young bit so
that vmscan won't see the young flag, allowing the reclamation of
madvised folios to go ahead. We can achieve the same by directly ignoring
references, removing the TLB flush in madvise and the rmap overhead in
vmscan.
Regarding the side effect: in the original code, if a parallel thread
accesses the madvised memory while another thread is doing the madvise,
the folios get a chance to be re-activated by vmscan (though the window
is actually quite small, since the PTEs are checked immediately after
their young bits are cleared). With this patch, they will still be
reclaimed. But doing PAGEOUT and accessing the memory at the same time is
self-defeating, much like a DoS, so we probably don't need to care;
ignoring a new access during that very small window is arguably even
better.
For DAMON's DAMOS_PAGEOUT based on physical address regions, we keep the
behaviour as is, since a physical address might be mapped by multiple
processes. MADV_PAGEOUT based on virtual addresses is actually much more
aggressive on reclamation. To leave paddr's DAMOS_PAGEOUT untouched, we
simply pass ignore_references as false in reclaim_pages().
A microbenchmark (below) shows a 6% reduction in MADV_PAGEOUT latency:
#include <stddef.h>
#include <sys/mman.h>

#define PGSIZE 4096
#define SIZE (512UL * 1024 * 1024)

int main(void)
{
	size_t i;
	volatile long *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	for (i = 0; i < SIZE / sizeof(long); i += PGSIZE / sizeof(long))
		p[i] = 0x11;
	madvise((void *)p, SIZE, MADV_PAGEOUT);
	return 0;
}
            w/o patch                            w/ patch
root@10:~# time ./a.out              root@10:~# time ./a.out
real    0m49.634s                    real    0m46.334s
user    0m0.637s                     user    0m0.648s
sys     0m47.434s                    sys     0m44.265s
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Barry Song <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Address the additional feedback since commit 4e76c8cc3378 ("kasan: add
atomic tests") by removing an explicit cast and fixing the size as well
as the check of the allocation of `a2`.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/[email protected]/T/#u
Fixes: 4e76c8cc3378 ("kasan: add atomic tests")
Signed-off-by: Paul Heidekrüger <[email protected]>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=214055
Reviewed-by: Marco Elver <[email protected]>
Tested-by: Marco Elver <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The current implementation of the mark_victim tracepoint provides only the
process ID (pid) of the victim process. This limitation poses challenges
for userspace tools requiring real-time OOM analysis and intervention.
Although this information is available from the kernel logs, it’s not
the appropriate format to provide OOM notifications. In Android, BPF
programs are used with the mark_victim trace events to notify userspace of
an OOM kill. For consistency, update the trace event to include the same
information about the OOMed victim as the kernel logs.
- UID
  In Android each installed application has a unique UID. Including
  the `uid` assists in correlating OOM events with specific apps.
- Process Name (comm)
  Enables identification of the affected process.
- OOM Score
  Allows userspace to gain additional insight into the relative kill
  priority of the OOM victim. In Android, the oom_score_adj is used to
  categorize app state (foreground, background, etc.), which aids in
  analyzing user-perceptible impacts of OOM events [1].
- Total VM, RSS Stats, and pgtables
  Amount of memory used by the victim that will, potentially, be freed
  up by killing it.
[1] https://cs.android.com/android/platform/superproject/main/+/246dc8fc95b6d93afcba5c6d6c133307abb3ac2e:frameworks/base/services/core/java/com/android/server/am/ProcessList.java;l=188-283
Signed-off-by: Carlos Galo <[email protected]>
Reviewed-by: Steven Rostedt <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Masami Hiramatsu (Google)" <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Hugetlb can now safely handle faults under the VMA lock, so allow it to do
so.
This patch may cause ltp hugemmap10 to "fail". Hugemmap10 tests hugetlb
counters, and expects the counters to remain unchanged on failure to
handle a fault.
In hugetlb_no_page(), vmf_anon_prepare() may bailout with no anon_vma
under the VMA lock after allocating a folio for the hugepage. In
free_huge_folio(), this folio is completely freed on bailout iff there is
a surplus of hugetlb pages. This will remove a folio off the freelist and
decrement the number of hugepages while ltp expects these counters to
remain unchanged on failure.
Originally this could only happen due to OOM failures, but now it may also
occur after we allocate a hugetlb folio without a suitable anon_vma under
the VMA lock. This should only happen for the first freshly allocated
hugepage in this vma.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
hugetlb_no_page() and hugetlb_wp() call anon_vma_prepare(). In
preparation for hugetlb to safely handle faults under the VMA lock, use
vmf_anon_prepare() here instead.
Additionally, passing hugetlb_wp() the vm_fault struct from
hugetlb_fault() works toward cleaning up the hugetlb code and function
stack.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|