|
Constify the flag tests that aren't automatically generated and the tests
that look like flag tests but are more complicated.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Now that __dump_page() takes a const argument, we can make dump_page()
take a const struct page too.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Turn __dump_page() into a wrapper around __dump_folio(). Snapshot the
page & folio into a stack variable so we don't hit BUG_ON() if an
allocation is freed under us and what was a folio pointer becomes a
pointer to a tail page.
[[email protected]: fix build issue]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix __dump_folio]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix pointer confusion]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: s/printk/pr_warn/]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Optimize the initialization speed of 1G huge pages through parallelization.
1G hugetlbs are allocated from bootmem, a process that is already very
fast and does not currently require optimization. Therefore, we focus on
parallelizing only the initialization phase in `gather_bootmem_prealloc`.
Here are some test results:
test case              no patch(ms)   patched(ms)   saved
--------------------   ------------   -----------   -------
256c2T(4 node) 1G              4745          2024    57.34%
128c1T(2 node) 1G              3358          1712    49.02%
12T            1G             77000         18300    76.23%
[[email protected]: s/initialied/initialized/, per Alexey]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Steffen Klassert <[email protected]>
Cc: Tim Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
By distributing both the allocation and the initialization tasks across
multiple threads, the initialization of 2M hugetlb will be faster, thereby
improving the boot speed.
Here are some test results:
test case              no patch(ms)   patched(ms)   saved
--------------------   ------------   -----------   -------
256c2T(4 node) 2M              3336          1051    68.52%
128c1T(2 node) 2M              1943           716    63.15%
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Steffen Klassert <[email protected]>
Cc: Tim Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When a group of tasks that access different nodes is scheduled on the
same node, they may encounter bandwidth bottlenecks and access latency.
Thus, a numa_aware flag is introduced here, allowing tasks to be
distributed across different nodes to take full advantage of multi-node
systems.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Reviewed-by: Tim Chen <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Steffen Klassert <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
With parallelization of hugetlb allocation across different threads, each
thread works on a different node to allocate pages from, instead of all
threads allocating from a common node, h->next_nid_to_alloc. To support
this, it is necessary to assign a separate next_nid_to_alloc to each
thread. Consequently, hstate_next_node_to_alloc and
for_each_node_mask_to_alloc have been modified to directly accept a
*next_nid_to_alloc parameter, ensuring thread-specific allocation and
avoiding concurrent access issues.
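For illustration, a hedged sketch of the resulting helper, reusing
hugetlb's existing node-rotation helpers (not the literal diff):
```c
/*
 * Sketch only: the helper now advances a caller-owned cursor instead of
 * h->next_nid_to_alloc, so each worker thread can round-robin over nodes
 * without sharing (and racing on) the hstate field.
 */
static int hstate_next_node_to_alloc(int *next_node,
				     nodemask_t *nodes_allowed)
{
	int nid;

	VM_BUG_ON(!nodes_allowed);

	nid = get_valid_node_allowed(*next_node, nodes_allowed);
	*next_node = next_node_allowed(nid, nodes_allowed);

	return nid;
}
```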
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
Reviewed-by: Tim Chen <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Steffen Klassert <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
1G and 2M huge pages have different allocation and initialization logic,
which leads to subtle differences in parallelization. Therefore, it is
appropriate to split hugetlb_hstate_alloc_pages into gigantic and
non-gigantic.
This patch has no functional changes.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
Reviewed-by: Tim Chen <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Steffen Klassert <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "hugetlb: parallelize hugetlb page init on boot", v6.
Introduction
------------
Hugetlb initialization during boot takes up a considerable amount of time.
For instance, on a 2TB system, initializing 1,800 1GB huge pages takes
1-2 seconds out of 10 seconds. Initializing 11,776 1GB pages on a 12TB
Intel host takes more than 1 minute[1]. This is a noteworthy figure.
Inspired by [2] and [3], hugetlb initialization can also be accelerated
through parallelization. The kernel already has infrastructure such as
padata_do_multithreaded; this series uses it to achieve good results with
minimal modification.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/all/[email protected]/
max_threads
-----------
This series uses `padata_do_multithreaded` like this:
```
job.max_threads = num_node_state(N_MEMORY) * multiplier;
padata_do_multithreaded(&job);
```
To fully utilize the CPU, the number of parallel threads needs to be
carefully considered. `max_threads = num_node_state(N_MEMORY)` does not
fully utilize the CPU, so we need to multiply it by a multiplier.
Tests below indicate that a multiplier of 2 significantly improves
performance, and although larger values also provide improvements, the
gains are marginal.
multiplier      1       2       3       4       5
------------   -----   -----   -----   -----   -----
256G 2node     358ms   215ms   157ms   134ms   126ms
2T   4node     979ms   679ms   543ms   489ms   481ms
50G  2node      71ms    44ms    37ms    30ms    31ms
Therefore, choosing 2 as the multiplier strikes a good balance between
enhancing parallel processing capabilities and maintaining efficient
resource management.
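For context, a fuller sketch of such a job description is below; the field
names follow struct padata_mt_job as of this series, while the worker
function name and chunk sizing are illustrative assumptions, not the exact
patch:
```c
/* Illustrative job setup; hugetlb_pages_alloc_boot_node() is a stand-in
 * name for the per-range worker that allocates/initializes the pages. */
struct padata_mt_job job = {
	.thread_fn	= hugetlb_pages_alloc_boot_node,
	.fn_arg		= h,
	.start		= 0,
	.size		= h->max_huge_pages,
	.align		= 1,
	.min_chunk	= h->max_huge_pages / num_node_state(N_MEMORY),
	.max_threads	= num_node_state(N_MEMORY) * 2,	/* multiplier = 2 */
	.numa_aware	= true,
};

padata_do_multithreaded(&job);
```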
Test result
-----------
test case              no patch(ms)   patched(ms)   saved
--------------------   ------------   -----------   -------
256c2T(4 node) 1G              4745          2024    57.34%
128c1T(2 node) 1G              3358          1712    49.02%
12T            1G             77000         18300    76.23%
256c2T(4 node) 2M              3336          1051    68.52%
128c1T(2 node) 2M              1943           716    63.15%
This patch (of 8):
The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning up
the code, its readability can be improved, facilitating future
modifications.
This patch extracts two functions to reduce the complexity of
`hugetlb_hstate_alloc_pages` and has no functional changes.
- hugetlb_hstate_alloc_pages_node_specific() iterates through each
online node and performs allocation if necessary.
- hugetlb_hstate_alloc_pages_report() reports errors during allocation,
and updates h->max_huge_pages accordingly.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Reviewed-by: Tim Chen <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Steffen Klassert <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
There are various users of the get_vm_area() + ioremap_page_range() APIs.
Enforce that the vm_area was requested with the VM_IOREMAP type and that
the range passed to ioremap_page_range() matches the created vm_area, to
avoid accidentally ioremapping the wrong address range.
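A hedged usage sketch of the pair being validated (error handling trimmed;
not taken from the patch itself):
```c
/* Sketch: create the area as VM_IOREMAP and map a range that lies
 * entirely inside it, which is what the new checks enforce. */
static int map_io_region(phys_addr_t phys, unsigned long size, pgprot_t prot)
{
	struct vm_struct *area;
	unsigned long addr;

	area = get_vm_area(size, VM_IOREMAP);
	if (!area)
		return -ENOMEM;
	addr = (unsigned long)area->addr;

	return ioremap_page_range(addr, addr + size, phys, prot);
}
```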
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
When draining a page_frag_cache, most users perform similar steps, so
introduce an API to avoid code duplication.
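A minimal usage sketch, assuming the new helper is page_frag_cache_drain()
taking the cache pointer; the surrounding driver structure is made up:
```c
/* Hypothetical driver state holding a page_frag_cache. */
struct my_rx_queue {
	struct page_frag_cache pf_cache;
};

static void my_rx_queue_teardown(struct my_rx_queue *q)
{
	/* One call replaces the open-coded drain steps each user had. */
	page_frag_cache_drain(&q->pf_cache);
}
```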
Signed-off-by: Yunsheng Lin <[email protected]>
Acked-by: Jason Wang <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
Currently there are three page frag implementations which all try to
allocate an order-3 page; if that fails, they fall back to allocating an
order-0 page. Each of them allows the order-3 allocation to fail under
certain conditions by using specific gfp bits.
The gfp bits used for the order-3 allocation differ between
implementations: __GFP_NOMEMALLOC is or'd in to forbid access to the
emergency reserves in __page_frag_cache_refill(), but not in the other
implementations; __GFP_DIRECT_RECLAIM is masked off to avoid direct
reclaim in vhost_net_page_frag_refill(), but not in
__page_frag_cache_refill().
This patch unifies the gfp bits used by the different implementations by
or'ing in __GFP_NOMEMALLOC and masking off __GFP_DIRECT_RECLAIM for the
order-3 allocation, to avoid putting possible pressure on mm.
Leave the gfp unification for the page frag implementation in sock.c
for now, as suggested by Paolo Abeni.
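A minimal sketch of the unified mask for the order-3 attempt (assuming the
usual page_frag refill gfp handling; not a verbatim copy of the patch):
```c
static inline gfp_t page_frag_order3_gfp(gfp_t gfp)
{
	/* Opportunistic high-order try: no warning, no retry, no direct
	 * reclaim, and keep away from the emergency reserves. */
	return (gfp | __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
		__GFP_NOMEMALLOC) & ~__GFP_DIRECT_RECLAIM;
}
```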
Signed-off-by: Yunsheng Lin <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
CC: Alexander Duyck <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
napi_alloc_frag_align() and netdev_alloc_frag_align() accept align as an
argument. They are thin wrappers around the __napi_alloc_frag_align() and
__netdev_alloc_frag_align() APIs, doing the alignment checking and
align-mask conversion so that page_frag_alloc_align() can be called
directly. The intention is to keep the alignment checking and align-mask
conversion in inline wrappers, avoiding that kind of work at run time
since it can usually be handled at compile time.
We are going to use page_frag_alloc_align() in vhost_net.c, which needs
the same kind of alignment checking and align-mask conversion. So split
page_frag_alloc_align() into an inline wrapper doing the above
operations, and add __page_frag_alloc_align(), which is passed the align
mask the original function expected, as suggested by Alexander.
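The resulting wrapper looks roughly like this (sketch, assuming the
convention of passing -align as the align mask for a power-of-two align):
```c
static inline void *page_frag_alloc_align(struct page_frag_cache *nc,
					  unsigned int fragsz, gfp_t gfp_mask,
					  unsigned int align)
{
	/* Compile-time-friendly checks stay in the inline wrapper ... */
	WARN_ON_ONCE(!is_power_of_2(align));

	/* ... and the real work takes a pre-computed align mask. */
	return __page_frag_alloc_align(nc, fragsz, gfp_mask, -align);
}
```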
Signed-off-by: Yunsheng Lin <[email protected]>
CC: Alexander Duyck <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
The PARTIAL_NODE slab_state has gone with SLAB removed, so just
remove it.
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
We save the allocated tag in the object header to indicate that the
object is allocated:
handle |= OBJ_ALLOCATED_TAG;
So the object header needs to reserve the LSB for this tag bit. But the
handle itself doesn't need to reserve the LSB for the tag, since it is
only used to find the position of the object, via (pfn + obj_idx). So
remove the LSB reservation from the handle; one more bit can be used for
obj_idx.
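Conceptually, the handle encoding becomes the following (sketch; constant
names are taken from zsmalloc's existing code, and the exact shifts should
be treated as an assumption):
```c
static unsigned long location_to_obj(struct page *page, unsigned int obj_idx)
{
	/* No OBJ_TAG_BITS shift any more: the allocated tag lives in the
	 * object header, so all low bits of the handle can index objects. */
	return (page_to_pfn(page) << OBJ_INDEX_BITS) |
	       (obj_idx & OBJ_INDEX_MASK);
}
```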
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chengming Zhou <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
do_numa_page() is reading from the same page table entry, twice, while
holding the page table lock: once while checking that the pte hasn't
changed, and again in order to modify the pte.
Instead, just read the pte once, and save it in the same old_pte variable
that already exists. This has no effect on behavior, other than to
provide a tiny potential improvement to performance, by avoiding the
redundant memory read (which the compiler cannot elide, due to
READ_ONCE()).
Also improve the associated comments nearby.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
alloc_contig_migrate_range has all the information needed to understand
the latency of big contiguous allocations, for example how many pages
were migrated and how many times they needed to be unmapped from page
tables.
This patch adds a trace event to collect the allocation statistics. In
the field, it has been quite useful for understanding CMA allocation
latency.
[[email protected]: a/trace_mm_alloc_config_migrate_range_info_enabled/trace_mm_alloc_contig_migrate_range_info_enabled]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Richard Chang <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
Cc: Martin Liu <[email protected]>
Cc: "Masami Hiramatsu (Google)" <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
We already have a folio; use it instead of the head page where reasonable.
Saves a couple of calls to compound_head() and eliminates a few
references to page->mapping.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Sidhartha Kumar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
All but one caller already has a folio, so convert
free_page_and_swap_cache() to have a folio and remove the call to
page_folio().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
These pages are all chained together through the lru list, so we know
they're folios. Use the folio APIs to save three hidden calls to
compound_head().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Process the pages in batch-sized quantities instead of all-at-once.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
All callers now use free_unref_folios() so we can delete this function.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
All users have been converted to mem_cgroup_uncharge_folios() so we can
remove this API.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The few folios which can't be moved to the LRU list (because their
refcount dropped to zero) used to be returned to the caller to dispose of.
Make this simpler to call by freeing the folios directly through
free_unref_folios().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Use free_unref_page_batch() to free the folios. This may increase the
number of IPIs from calling try_to_unmap_flush() more often, but that's
going to be very workload-dependent. It may even reduce the number of
IPIs as we now batch-free large folios instead of freeing them one at a
time.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Hugetlb folios still get special treatment, but normal large folios can
now be freed by free_unref_folios(). This should have a reasonable
performance impact, TBD.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Call folio_undo_large_rmappable() if needed. free_unref_page_prepare()
destroys the ability to call folio_order(), so stash the order in
folio->private for the benefit of the second loop.
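A hedged sketch of that stash-and-restore pattern (the helper names are
hypothetical and only illustrate the idea):
```c
/* Remember the order before free_unref_page_prepare() destroys the
 * compound metadata that folio_order() relies on ... */
static inline void folio_stash_order(struct folio *folio)
{
	folio->private = (void *)(unsigned long)folio_order(folio);
}

/* ... and read it back in the second loop. */
static inline unsigned int folio_stashed_order(struct folio *folio)
{
	return (unsigned long)folio->private;
}
```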
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Pass a pointer to the lruvec so we can take advantage of the
folio_lruvec_relock_irqsave(). Adjust the calling convention of
folio_lruvec_relock_irqsave() to suit and add a page_cache_release()
wrapper.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Break up the list of folios into batches here so that the folios are more
likely to be cache hot when doing the rest of the processing.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Instead of putting the interesting folios on a list, delete the
uninteresting one from the folio_batch.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Almost identical to mem_cgroup_uncharge_list(), except it takes a
folio_batch instead of a list_head.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
There's no need to indirect through release_pages() and iterate over this
batch of folios an extra time; we can just use the batch that we have.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Iterate over a folio_batch rather than a linked list. This is easier for
the CPU to prefetch and has a batch count naturally built in so we don't
need to track it. Again, this lowers the maximum lock hold time from
32 folios to 15, but I do not expect this to have a significant effect.
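For reference, the loop shape this moves to is a plain indexed walk over a
small array (sketch; the per-folio work shown is a hypothetical stand-in):
```c
/* Sketch of the loop shape: a folio_batch is a small array plus a count,
 * so the walk is indexed rather than list_for_each_entry() over ->lru. */
static void drop_batch_refs(struct folio_batch *fbatch)
{
	unsigned int i;

	for (i = 0; i < folio_batch_count(fbatch); i++)
		folio_put(fbatch->folios[i]);
}
```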
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Most of its callees are not yet ready to accept a folio, but we know all
of the pages passed in are actually folios because they're linked through
->lru.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Rearrange batched folio freeing", v3.
Other than the obvious "remove calls to compound_head" changes, the
fundamental belief here is that iterating a linked list is much slower
than iterating an array (5-15x slower in my testing). There's also an
associated belief that since we iterate the batch of folios three times,
we do better when the array is small (ie 15 entries) than we do with a
batch that is hundreds of entries long, which only gives us the
opportunity for the first pages to fall out of cache by the time we get to
the end.
It is possible we should increase the size of folio_batch. Hopefully the
bots let us know if this introduces any performance regressions.
This patch (of 3):
By making release_pages() call folios_put(), we can get rid of the calls
to compound_head() for the callers that already know they have folios. We
can also get rid of the lock_batch tracking as we know the size of the
batch is limited by folio_batch. This does reduce the maximum number of
pages for which the lruvec lock is held, from SWAP_CLUSTER_MAX (32) to
PAGEVEC_SIZE (15). I do not expect this to make a significant difference,
but if it does, we can increase PAGEVEC_SIZE to 31.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Previously, we removed the mm from mm_slot and dropped mm_count
if the MMF_DISABLE_THP flag was set. However, we didn't add
the mm back after clearing the MMF_DISABLE_THP flag. Additionally,
we add a check for the MMF_DISABLE_THP flag in hugepage_vma_revalidate().
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 879c6000e191 ("mm/khugepaged: bypassing unnecessary scans with MMF_DISABLE_THP check")
Signed-off-by: Lance Yang <[email protected]>
Suggested-by: Yang Shi <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm: remove total_mapcount()", v2.
Let's remove the remaining user from mm/memfd.c so we can get rid of
total_mapcount().
This patch (of 2):
Both functions are the remaining users of total_mapcount(). Let's get rid
of the calls by converting the code to folios.
As it turns out, the code is unnecessarily complicated, especially:
1) We can query the number of pagecache references for a folio simply via
folio_nr_pages(). This will handle other folio sizes in the future
correctly.
2) The xas_set(xas, page->index + cache_count) call to increment the
iterator for large folios is not required. Remove it.
Further, simplify the XA_CHECK_SCHED check, counting each entry exactly
once.
Memfd pages can be swapped out when using shmem; leave xa_is_value()
checks in place.
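A hedged illustration of point 1); the helper name is hypothetical:
```c
/* Number of page cache references a mapping entry contributes: value
 * entries (swapped-out shmem) contribute none, a folio contributes one
 * per page it covers, whatever its size. */
static long memfd_cache_refs(struct folio *folio)
{
	if (xa_is_value(folio))
		return 0;

	return folio_nr_pages(folio);
}
```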
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Co-developed-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
It is used to test split_huge_page_to_list_to_order for pagecache THPs.
Also add test cases for split_huge_page_to_list_to_order via both debugfs.
[[email protected]: fix issue discovered with NFS]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Tested-by: Aishwarya TCV <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michal Koutny <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Cc: Aishwarya TCV <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
To split a THP to any lower order, we need to reform the THP into folios
of the given order on its subpages and adjust the page refcount based on
the new page order. We also need to reinitialize page_deferred_list after
removing a page from the split_queue; otherwise a subsequent split will
see list corruption when checking page_deferred_list again.
Note: Anonymous order-1 folio is not supported because _deferred_list,
which is used by partially mapped folios, is stored in subpage 2 and an
order-1 folio only has subpage 0 and 1. File-backed order-1 folios are
fine, since they do not use _deferred_list.
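A hedged usage sketch of the new entry point introduced by this series
(the wrapper around it is illustrative):
```c
/* Split a locked large folio down to new_order instead of order 0.
 * Return conventions follow split_huge_page(): 0 on success. */
static int try_split_folio_to(struct folio *folio, unsigned int new_order)
{
	return split_huge_page_to_list_to_order(&folio->page, NULL, new_order);
}
```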
[[email protected]: fixup per discussion with Ryan]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michal Koutny <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add a new_order parameter to set the new page order in page owner. This
prepares for upcoming changes to support splitting a huge page to any
lower order.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michal Koutny <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Set the memcg information for the pages after the split. A new parameter,
new_order, is added to specify the order of the subpages in the new page
(always 0 for now). This prepares for upcoming changes to support
splitting a huge page to any lower order.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michal Koutny <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
We do not have non-power-of-two pages; using nr is error prone if nr is
not a power of two. Use page order instead.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michal Koutny <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
We do not have non-power-of-two pages; using nr is error prone if nr is
not a power of two. Use page order instead.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michal Koutny <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Folios of order 1 have no space to store the deferred list. This is not a
problem for the page cache as file-backed folios are never placed on the
deferred list. All we need to do is prevent the core MM from touching the
deferred list for order 1 folios and remove the code which prevented us
from allocating order 1 folios.
Link: https://lore.kernel.org/linux-mm/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Michal Koutny <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Split a folio to any lower order folios", v5.
File folios support any order and multi-size THP has been upstreamed[1],
so both file and anonymous folios can have order >0. Currently,
split_huge_page() only splits a huge page to order-0 pages, but splitting
to orders higher than 0 might better utilize large folios, if done
properly. In addition, Large Block Size support in XFS would benefit from
it during truncate[2].
This patchset adds support for splitting a large folio to any lower order
folios.
In addition to this implementation of split_huge_page_to_list_to_order(),
a possible optimization could be splitting a large folio into arbitrary
smaller folios instead of a single order. As both Hugh and Ryan pointed
out [3,5], splitting to a single order might not be optimal: an order-9
folio might be better split into 1 order-8, 1 order-7, ..., 1 order-1, and
2 order-0 folios, depending on subsequent folio operations. Leave this as
future work.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/
[4] https://lore.kernel.org/linux-mm/[email protected]/
[5] https://lore.kernel.org/linux-mm/[email protected]/
This patch (of 8):
As multi-size THP support has been added, not all THPs are PMD-mapped;
thus, during a huge page split, there is no need to always split the PMD
mapping in unmap_folio(). Make it conditional.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Koutny <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
While doing MADV_PAGEOUT, the current code clears the PTE young bit so
that vmscan won't see the young flag, allowing the reclamation of
madvised folios to go ahead. We can achieve the same by directly ignoring
references, removing the TLB flush in madvise and the rmap overhead in
vmscan.
Regarding the side effect: in the original code, if a parallel thread
accesses the madvised memory while another thread is doing the madvise,
the folios get a chance to be re-activated by vmscan (though the window
is actually quite small, since the PTEs are checked immediately after
their young bits are cleared). With this patch, they will still be
reclaimed. But doing PAGEOUT and accessing the memory at the same time is
self-defeating, much like a DoS, so we probably don't need to care;
ignoring a new access during that very small window is arguably even
better.
For DAMON's DAMOS_PAGEOUT based on physical address regions, we keep the
behaviour as is, since a physical address might be mapped by multiple
processes. MADV_PAGEOUT based on virtual addresses is actually much more
aggressive on reclamation. To leave paddr's DAMOS_PAGEOUT untouched, we
simply pass ignore_references as false in reclaim_pages().
A microbenchmark (below) shows a 6% reduction in MADV_PAGEOUT latency:
#include <stddef.h>
#include <sys/mman.h>

#define PGSIZE 4096
#define SIZE (512UL * 1024 * 1024)

int main(void)
{
	size_t i;
	volatile long *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	for (i = 0; i < SIZE / sizeof(long); i += PGSIZE / sizeof(long))
		p[i] = 0x11;
	madvise((void *)p, SIZE, MADV_PAGEOUT);
	return 0;
}
            w/o patch                            w/ patch
root@10:~# time ./a.out              root@10:~# time ./a.out
real    0m49.634s                    real    0m46.334s
user    0m0.637s                     user    0m0.648s
sys     0m47.434s                    sys     0m44.265s
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Barry Song <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Address the additional feedback since commit 4e76c8cc3378 ("kasan: add
atomic tests") by removing an explicit cast and fixing the size as well
as the check of the allocation of `a2`.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/[email protected]/T/#u
Fixes: 4e76c8cc3378 ("kasan: add atomic tests")
Signed-off-by: Paul Heidekrüger <[email protected]>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=214055
Reviewed-by: Marco Elver <[email protected]>
Tested-by: Marco Elver <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The current implementation of the mark_victim tracepoint provides only the
process ID (pid) of the victim process. This limitation poses challenges
for userspace tools requiring real-time OOM analysis and intervention.
Although this information is available from the kernel logs, it’s not
the appropriate format to provide OOM notifications. In Android, BPF
programs are used with the mark_victim trace events to notify userspace of
an OOM kill. For consistency, update the trace event to include the same
information about the OOMed victim as the kernel logs.
- UID
  In Android each installed application has a unique UID. Including
  the `uid` assists in correlating OOM events with specific apps.
- Process Name (comm)
  Enables identification of the affected process.
- OOM Score
  Allows userspace to gain additional insight into the relative kill
  priority of the OOM victim. In Android, the oom_score_adj is used to
  categorize app state (foreground, background, etc.), which aids in
  analyzing user-perceptible impacts of OOM events [1].
- Total VM, RSS Stats, and pgtables
  Amount of memory used by the victim that will, potentially, be freed
  up by killing it.
[1] https://cs.android.com/android/platform/superproject/main/+/246dc8fc95b6d93afcba5c6d6c133307abb3ac2e:frameworks/base/services/core/java/com/android/server/am/ProcessList.java;l=188-283
Signed-off-by: Carlos Galo <[email protected]>
Reviewed-by: Steven Rostedt <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Masami Hiramatsu (Google)" <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Hugetlb can now safely handle faults under the VMA lock, so allow it to do
so.
This patch may cause ltp hugemmap10 to "fail". Hugemmap10 tests hugetlb
counters, and expects the counters to remain unchanged on failure to
handle a fault.
In hugetlb_no_page(), vmf_anon_prepare() may bailout with no anon_vma
under the VMA lock after allocating a folio for the hugepage. In
free_huge_folio(), this folio is completely freed on bailout iff there is
a surplus of hugetlb pages. This will remove a folio off the freelist and
decrement the number of hugepages while ltp expects these counters to
remain unchanged on failure.
Originally this could only happen due to OOM failures, but now it may also
occur after we allocate a hugetlb folio without a suitable anon_vma under
the VMA lock. This should only happen for the first freshly allocated
hugepage in this vma.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
hugetlb_no_page() and hugetlb_wp() call anon_vma_prepare(). In
preparation for hugetlb to safely handle faults under the VMA lock, use
vmf_anon_prepare() here instead.
Additionally, passing hugetlb_wp() the vm_fault struct from
hugetlb_fault() works toward cleaning up the hugetlb code and function
stack.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|