We already have memory in the union here that is being wasted in AMD's
case; use it to store the nid.
Putting the nid here further isolates the io_pgtable code from the struct
protection_domain.
Fix up protection_domain_alloc() so that the NID from the device is
provided; at this point dev is never NULL for AMD, so this will now
allocate the first table pointer on the correct NUMA node.
Signed-off-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Vasant Hegde <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Joerg Roedel <[email protected]>
|
|
SCX_TASK_BAL_KEEP is used by balance_one() to tell pick_next_task_scx() to
keep running the current task. It's not really a task property. Replace it
with SCX_RQ_BAL_KEEP which resides in rq->scx.flags and is a better fit for
the usage. Also, the existing clearing rule is unnecessarily strict and
makes it difficult to use with core-sched. Just clear it on entry to
balance_one().
Signed-off-by: Tejun Heo <[email protected]>
|
|
There are no longer any board files that use the platform data for
gpio-davinci. We can remove the header defining it and port the code to
no longer store any context in pdata.
Reviewed-by: Linus Walleij <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bartosz Golaszewski <[email protected]>
|
|
Add an isolate_folio_to_list() helper to try to isolate HugeTLB, no-LRU
movable and LRU folios to a list; it will be reused by
do_migrate_range() from memory hotplug soon. Also drop
mf_isolate_folio(), since we can directly use the new helper in
soft_offline_in_use_page().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Tested-by: Miaohe Lin <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The do_vma_munmap() wrapper existed for callers that didn't have a vma
iterator and needed to check the vma mseal status prior to calling the
underlying munmap(). All callers now use a vma iterator and since the
mseal check has been moved to do_vmi_align_munmap() and the vmas are
aligned, this function can just be called instead.
do_vmi_align_munmap() can no longer be static as ipc/shm is using it and
it is exported via the mm.h header.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Cc: Bert Karwatzki <[email protected]>
Cc: Jeff Xu <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Paul Moore <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm: store zero pages to be swapped out in a bitmap", v8.
As shown in the patch series that introduced the zswap same-filled
optimization [1], 10-20% of the pages stored in zswap are same-filled.
This is also observed across Meta's server fleet. By using VM counters in
swap_writepage (not included in this patch series) it was found that less
than 1% of the same-filled pages to be swapped out are non-zero pages.
For conventional swap setup (without zswap), rather than reading/writing
these pages to flash resulting in increased I/O and flash wear, a bitmap
can be used to mark these pages as zero at write time, and the pages can
be filled at read time if the bit corresponding to the page is set.
When using zswap with swap, this also means that a zswap_entry does not
need to be allocated for zero filled pages resulting in memory savings
which would offset the memory used for the bitmap.
A similar attempt was made earlier in [2] where zswap would only track
zero-filled pages instead of same-filled. This patch series adds the
zero-filled page optimization to swap (hence it can be used even if zswap
is disabled) and removes the same-filled code from zswap (as only 1% of
the same-filled pages are non-zero), simplifying the code.
[1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
[2] https://lore.kernel.org/lkml/[email protected]/
This patch (of 2):
Approximately 10-20% of pages to be swapped out are zero pages [1].
Rather than reading/writing these pages to flash resulting
in increased I/O and flash wear, a bitmap can be used to mark these
pages as zero at write time, and the pages can be filled at
read time if the bit corresponding to the page is set.
With this patch, NVMe writes in Meta server fleet decreased
by almost 10% with conventional swap setup (zswap disabled).
[1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
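For illustration, a minimal C sketch of the zeromap idea follows; the helper
name, the sis->zeromap field and the call sites are assumptions drawn from
this description, not the actual patch:
  /* Sketch only: detect a zero-filled folio at swap-out time. */
  static bool folio_is_zero_filled(struct folio *folio)
  {
          unsigned int i;

          for (i = 0; i < folio_nr_pages(folio); i++) {
                  void *addr = kmap_local_folio(folio, i * PAGE_SIZE);
                  bool zero = !memchr_inv(addr, 0, PAGE_SIZE);

                  kunmap_local(addr);
                  if (!zero)
                          return false;
          }
          return true;
  }

  /* swap-out (hypothetical call site): set the bit and skip the device I/O */
  if (folio_is_zero_filled(folio))
          set_bit(swp_offset(entry), sis->zeromap);   /* assumed bitmap field */
  else
          clear_bit(swp_offset(entry), sis->zeromap);

  /* swap-in (hypothetical call site): if the bit is set, zero-fill the
   * folio instead of reading it from the swap device */
  if (test_bit(swp_offset(entry), sis->zeromap)) {
          folio_zero_range(folio, 0, folio_size(folio));
          folio_mark_uptodate(folio);
  }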
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Usama Arif <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Convert x86 to use PG_arch_2 instead of PG_uncached and remove
PG_uncached.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This flag has similar constraints to PG_owner_priv_1 -- it is ignored by
core code, and is entirely for the use of the code which allocated the
folio. Since the pagecache does not use it, individual filesystems can
use it. The bufferhead code does use it, so filesystems which use the
buffer cache must not use it for another purpose.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This function has no more callers, except folio_has_private(). Combine
the two functions.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
While there are many aliases for this flag, nobody actually uses the
*PageOwnerPriv1() nor folio_*_owner_priv_1() accessors. Remove them.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This flag is now only used on folios, so we can remove all the page
accessors.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
There is only one caller of PageUnevictable() left; convert it to call
folio_test_unevictable() and remove all the page accessors.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This flag is now only used on folios, so we can remove all the page
accessors and reword the comments that refer to them.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This flag is now only used on folios, so we can remove all the page
accessors.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This flag is now only used on folios, so we can remove all the page
accessors.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Simplify the page flags a little".
In the course of our folio conversions, we have made many page flags only
used on folios, so we can now remove the page-based accessors. This
should cut down compile time a little, and prevent new users from cropping
up.
There is more that could be done in this area, but it would produce merge
conflicts, so I'll sit on those patches until the next merge window. We
now have line of sight to removing PG_private_2 and PG_private.
This patch (of 10):
This flag is now only used on folios, so we can remove all the page
accessors.
[[email protected]: fix arch/powerpc/mm/pgtable-frag.c]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
By using a few values in the top byte, users of page_type can store up to
24 bits of additional data in page_type. It also reduces the code size
because (with READ_ONCE() replaced by data_race()) the kernel can check
just a single byte, e.g.:
ffffffff811e3a79: 8b 47 30 mov 0x30(%rdi),%eax
ffffffff811e3a7c: 55 push %rbp
ffffffff811e3a7d: 48 89 e5 mov %rsp,%rbp
ffffffff811e3a80: 25 00 00 00 82 and $0x82000000,%eax
ffffffff811e3a85: 3d 00 00 00 80 cmp $0x80000000,%eax
ffffffff811e3a8a: 74 4d je ffffffff811e3ad9 <folio_mapping+0x69>
becomes:
ffffffff811e3a69: 80 7f 33 f5 cmpb $0xf5,0x33(%rdi)
ffffffff811e3a6d: 55 push %rbp
ffffffff811e3a6e: 48 89 e5 mov %rsp,%rbp
ffffffff811e3a71: 74 4d je ffffffff811e3ac0 <folio_mapping+0x60>
replacing three instructions with one.
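As a rough illustration (not the actual page flag code; the helper and the
shift below are made up), putting the type tag in the top byte means a test
can compare a single byte instead of masking the whole word:
  /* Illustrative only: a one-byte tag stored in bits 31..24 of a 32-bit field. */
  static inline bool word_has_type(unsigned int word, unsigned char tag)
  {
          return (word >> 24) == tag;     /* compiles down to a single cmpb */
  }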
[[email protected]: fix ubsan warnings]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Resolve the awkward "and add one to this opaque constant" test into a
self-documenting inline function.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Readahead support for IOCB_NOWAIT was introduced in commit 2e85abf053b9
("mm: allow read-ahead with IOCB_NOWAIT set"). However, this
implementation broke the semantics of IOCB_NOWAIT by potentially causing
it to wait on I/O during memory reclamation. This behavior was later
modified in commit efa8480a8316 ("fs: RWF_NOWAIT should imply IOCB_NOIO").
To resolve the blocking issue during memory reclamation, we can use
memalloc_noio_{save,restore} to ensure non-blocking behavior. This change
restores the original functionality, allowing preadv2(IOCB_NOWAIT) to
trigger readahead if the file content is not present in the page cache.
While this process may trigger direct memory reclamation, the
__GFP_NORETRY flag is set in the readahead GFP flags, ensuring it won't
block.
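The pattern looks roughly like the following sketch (the call site is
illustrative; memalloc_noio_save()/memalloc_noio_restore() are the existing
scoped-NOIO helpers):
  unsigned int noio_flags;

  if (iocb->ki_flags & IOCB_NOWAIT) {
          noio_flags = memalloc_noio_save();
          /* ... trigger readahead; page allocations in this scope
           * implicitly drop __GFP_IO/__GFP_FS, so they cannot block
           * on I/O during reclaim ... */
          memalloc_noio_restore(noio_flags);
  }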
A use case for this change is when we want to trigger readahead in the
preadv2(2) syscall if the file cache is absent, but without waiting for
certain filesystem locks, like xfs_ilock. A simple example is as follows:
  retry:
          if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
                  /* e.g. fails with EAGAIN: data not yet in the page cache */
                  do_other_work();
                  goto retry;
          }
Link: https://lore.gnuweeb.org/io-uring/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yafang Shao <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
We already force-inline page_fixed_fake_head(), page_is_fake_head() and
PageTail(), however the compiler might decide that _compound_head() is not
worthy to be inlined, because of page_fixed_fake_head().
The result is that, for example, PageAnonExclusive() now might involve a
function call when checking PageHuge(), which performs a
page_folio()->_compound_head() call. This can lead to a slight regression
of the stress-ng.clone benchmark.
This is not super-urgent to fix, but always inlining _compound_head()
seems like the obvious thing to do for this primitive, similar to the
other ones.
This change resolves the slight regression, and a compilation with
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y shows no relevant bloat [2]:
add/remove: 15/14 grow/shrink: 79/87 up/down: 12836/-13917 (-1081)
...
Total: Before=32786363, After=32785282, chg -0.00%
[1] https://lkml.kernel.org/r/[email protected]
[2] https://lore.kernel.org/all/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Fixes: c0bff412e67b ("mm: allow anon exclusive check over hugetlb tail pages")
Signed-off-by: David Hildenbrand <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Tested-by: Yin Fengwei <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add ERR_PTR_PCPU(), PTR_ERR_PCPU() and IS_ERR_PCPU() macros that operate
on pointers in the percpu address space.
These macros remove the need for (__force void *) function argument casts
(to avoid sparse -Wcast-from-as warnings).
The patch will also avoid future build errors due to pointer address space
mismatch with enabled strict percpu address space checks.
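A usage sketch (the allocation below is generic; only the three macros come
from this patch):
  struct foo { int val; };                        /* hypothetical type */
  struct foo __percpu *p = alloc_percpu(struct foo);

  if (!p)
          p = ERR_PTR_PCPU(-ENOMEM);
  ...
  if (IS_ERR_PCPU(p))
          return PTR_ERR_PCPU(p);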
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uros Bizjak <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Properly document that if __GFP_ZERO logic is requested, callers must
ensure that, starting with the initial memory allocation, every subsequent
call to this API for the same memory allocation is flagged with
__GFP_ZERO. Otherwise, it is possible that __GFP_ZERO is not fully
honored by this API.
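Assuming the API in question is krealloc(), the documented rule amounts to
something like the following sketch:
  void *buf;

  /* request __GFP_ZERO consistently, starting with the initial allocation */
  buf = kzalloc(32, GFP_KERNEL);                        /* GFP_KERNEL | __GFP_ZERO */
  ...
  buf = krealloc(buf, 64, GFP_KERNEL | __GFP_ZERO);     /* grown part is zeroed */
  /* If any call in the chain omits __GFP_ZERO, previously unused "spare"
   * bytes may be handed back non-zeroed. */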
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Danilo Krummrich <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
We have some cases left whereby we operate on small folios and still refer
to page->_mapcount. Let's just use folio->_mapcount instead, which
currently still overlays page->_mapcount, so no change.
This change will make it easier to later spot any remaining users of
page->_mapcount that target tail pages.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Use __GFP_COMP for gigantic folios to greatly reduce not only the amount
of code but also the allocation and free time.
LOC (approximately): +60, -240
Allocate and free 500 1GB hugeTLB memory without HVO by:
time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
          Before    After
  Alloc   ~13s      ~10s
  Free    ~15s      <1s
The above magnitude generally holds for multiple x86 and arm64 CPU models.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Reported-by: Frank van der Linden <[email protected]>
Acked-by: Zi Yan <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
With alloc_contig_range() and free_contig_range() supporting large folios,
CMA can allocate and free large folios too, by cma_alloc_folio() and
cma_free_folio().
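A usage sketch follows; the exact signatures are an assumption based on the
changelog, not a copy of the new API:
  struct folio *folio = cma_alloc_folio(cma, HUGETLB_PAGE_ORDER, GFP_KERNEL);

  if (folio) {
          /* ... use the large folio ... */
          cma_free_folio(cma, folio);
  }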
[[email protected]: fix WARN in cma_alloc_folio()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Zi Yan <[email protected]>
Cc: Frank van der Linden <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm/hugetlb: alloc/free gigantic folios", v2.
Using __GFP_COMP for gigantic folios can greatly reduce not only the
amount of code but also the allocation and free time.
Approximate LOC to mm/hugetlb.c: +60, -240
Allocate and free 500 1GB hugeTLB memory without HVO by:
time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
          Before    After
  Alloc   ~13s      ~10s
  Free    ~15s      <1s
The above magnitude generally holds for multiple x86 and arm64 CPU
models.
Perf profile before:
Alloc
- 99.99% alloc_pool_huge_folio
- __alloc_fresh_hugetlb_folio
- 83.23% alloc_contig_pages_noprof
- 47.46% alloc_contig_range_noprof
- 20.96% isolate_freepages_range
16.10% split_page
- 14.10% start_isolate_page_range
- 12.02% undo_isolate_page_range
Free
- update_and_free_pages_bulk
- 87.71% free_contig_range
- 76.02% free_unref_page
- 41.30% free_unref_page_commit
- 32.58% free_pcppages_bulk
- 24.75% __free_one_page
13.96% _raw_spin_trylock
12.27% __update_and_free_hugetlb_folio
Perf profile after:
Alloc
- 99.99% alloc_pool_huge_folio
alloc_gigantic_folio
- alloc_contig_pages_noprof
- 59.15% alloc_contig_range_noprof
- 20.72% start_isolate_page_range
20.64% prep_new_page
- 17.13% undo_isolate_page_range
Free
- update_and_free_pages_bulk
- __folio_put
- __free_pages_ok
7.46% free_tail_page_prepare
- 1.97% free_one_page
1.86% __free_one_page
This patch (of 3):
Support __GFP_COMP in alloc_contig_range(). When the flag is set, upon
success the function returns a large folio prepared by prep_new_page(),
rather than a range of order-0 pages prepared by split_free_pages() (which
is renamed from split_map_pages()).
alloc_contig_range() can be used to allocate folios larger than
MAX_PAGE_ORDER, e.g., gigantic hugeTLB folios. So on the free path,
free_one_page() needs to handle that by split_large_buddy().
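Roughly, the new behaviour can be sketched as follows (illustrative, not the
actual hugetlb call site):
  /* With __GFP_COMP, a successful call leaves one large folio at 'pfn'
   * instead of a run of order-0 pages. */
  if (!alloc_contig_range(pfn, pfn + (1UL << order), MIGRATE_MOVABLE,
                          GFP_KERNEL | __GFP_COMP)) {
          struct folio *folio = pfn_folio(pfn);

          /* folio_order(folio) == order */
  }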
[[email protected]: fix folio_alloc_gigantic_noprof() WARN expression, per Yu Liao]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Zi Yan <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Frank van der Linden <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The ability to observe the demotion and promotion decisions made by the
kernel on a per-cgroup basis is important for monitoring and tuning
containerized workloads on machines equipped with tiered memory.
Different containers in the system may experience drastically different
memory tiering actions that cannot be distinguished from the global
counters alone.
For example, a container running a workload with much hotter memory
accesses will likely see more promotions and fewer demotions, potentially
depriving a colocated container of top tier memory to such an extent that
its performance degrades unacceptably.
For another example, some containers may exhibit longer periods between
data reuse, causing many more numa_hint_faults than numa_pages_migrated.
In this case, tuning hot_threshold_ms may be appropriate, but the signal
can easily be lost if only global counters are available.
In the long term, we hope to introduce per-cgroup control of promotion and
demotion actions to implement memory placement policies in tiering.
This patch set adds seven counters to memory.stat in a cgroup:
numa_pages_migrated, numa_pte_updates, numa_hint_faults, pgdemote_kswapd,
pgdemote_khugepaged, pgdemote_direct and pgpromote_success. pgdemote_*
and pgpromote_success are also available in memory.numa_stat.
count_memcg_events_mm() is added to count multiple event occurrences at
once, and get_mem_cgroup_from_folio() is added because we need to get a
reference to the memcg of a folio before it's migrated to track
numa_pages_migrated. The accounting of PGDEMOTE_* is moved to
shrink_inactive_list() before being changed to per-cgroup.
[[email protected]: add documentation of the memcg counters in cgroup-v2.rst]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kaiyang Zhao <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Wei Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Shmem will support large folio allocation [1] [2] to get better
performance; however, memory reclaim still splits the precious large
folios when trying to swap out shmem, which may lead to memory
fragmentation and fails to take advantage of large folios for shmem.
Moreover, the swap code already supports swapping out large folios
without splitting them, hence this patch set supports large folio swap
out for shmem.
Note the i915_gem_shmem driver still needs folios to be split when
swapping, thus add a new flag 'split_large_folio' to writeback_control
to indicate splitting the large folio.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[[email protected]: shmem_writepage() split folio at EOF before swapout]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: remove the wbc->split_large_folio per Hugh]
Link: https://lkml.kernel.org/r/1236a002daa301b3b9ba73d6c0fab348427cf295.1724833399.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/d80c21abd20e1b0f5ca66b330f074060fb2f082d.1723434324.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "support large folio swap-out and swap-in for shmem", v5.
Shmem will support large folio allocation [1] [2] to get better
performance; however, memory reclaim still splits the precious large
folios when trying to swap out shmem, which may lead to memory
fragmentation and fails to take advantage of large folios for shmem.
Moreover, the swap code already supports swapping out large folios
without splitting them, and the large folio swap-in [3] series is queued
in the mm-unstable branch. Hence this patch set also supports large
folio swap-out and swap-in for shmem.
This patch (of 9):
To support shmem large folio swap operations, add a new parameter to
swap_shmem_alloc() that allows batch SWAP_MAP_SHMEM flag setting for shmem
swap entries.
While we are at it, use folio_nr_pages() to get the number of pages of
the folio as a preparation.
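The assumed shape of the change (illustrative; the parameter name is a
guess based on the description above):
  swp_entry_t entry = folio->swap;
  int nr_pages = folio_nr_pages(folio);

  /* mark all nr_pages swap entries of the large folio with SWAP_MAP_SHMEM
   * in one call, instead of the previous one-entry-at-a-time behaviour */
  swap_shmem_alloc(entry, nr_pages);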
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/99f64115d04b285e009580eb177352c57119ffd0.1723434324.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Reviewed-by: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Daniel Gomez <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The x86 implementation of range-to-target_node lookup (i.e.
phys_to_target_node() and memory_add_physaddr_to_nid()) relies on
numa_memblks.
Since numa_memblks are now part of the generic code, move these functions
from x86 to mm/numa_memblks.c and select CONFIG_NUMA_KEEP_MEMINFO when
CONFIG_NUMA_MEMBLKS=y for dax and cxl.
[[email protected]: fix build]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Reviewed-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Make functions and variables that are exclusively used by numa_memblks
static.
Move numa_nodemask_from_meminfo() before its callers to avoid forward
declaration.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Move most of x86::numa_init() to numa_memblks so that the latter will be
more self-contained.
With this, numa_memblk data structures should not be exposed to the
architecture-specific code.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Move the numa_emulation code from arch/x86 to mm/numa_emulation.c.
This code will be later reused by arch_numa.
No functional changes.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Move the code dealing with the numa_distance array from arch/x86 to
mm/numa_memblks.c.
This code will be later reused by arch_numa.
No functional changes.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Move code dealing with numa_memblks from arch/x86 to mm/ and add Kconfig
options to let x86 select it in its Kconfig.
This code will be later reused by arch_numa.
No functional changes.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Architectures that support NUMA duplicate the code that allocates
NODE_DATA on the node-local memory with slight variations in reporting of
the addresses where the memory was allocated.
Use the x86 version as the basis for the generic alloc_node_data()
function and call this function in the architecture-specific numa
initialization.
Round up node data size to SMP_CACHE_BYTES rather than to PAGE_SIZE like
x86 used to do since the bootmem era when allocation granularity was
PAGE_SIZE anyway.
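The generic helper has roughly the following shape (a sketch with simplified
error handling; the memblock call shown is one reasonable way to do a
node-local allocation, not necessarily the exact one used):
  void __init alloc_node_data(int nid)
  {
          const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
          void *nd;

          /* node-local first; memblock falls back to other nodes if needed,
           * and memblock_alloc_try_nid() returns zeroed memory */
          nd = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES,
                                      MEMBLOCK_LOW_LIMIT,
                                      MEMBLOCK_ALLOC_ACCESSIBLE, nid);
          if (!nd)
                  panic("Cannot allocate %zu bytes for node %d data\n",
                        nd_size, nid);

          node_data[nid] = nd;
  }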
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
There are no users of HAVE_ARCH_NODEDATA_EXTENSION left, so
arch_alloc_nodedata() and arch_refresh_nodedata() are not needed anymore.
Replace the call to arch_alloc_nodedata() in free_area_init() with a new
helper alloc_offline_node_data(), remove arch_refresh_nodedata() and
cleanup include/linux/memory_hotplug.h from the associated ifdefery.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Acked-by: Dan Williams <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Every architecture that supports NUMA defines node_data in the same way:
struct pglist_data *node_data[MAX_NUMNODES];
No reason to keep multiple copies of this definition and its forward
declarations, especially when such forward declaration is the only thing
in include/asm/mmzone.h for many architectures.
Add definition and declaration of node_data to generic code and drop
architecture-specific versions.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Acked-by: Davidlohr Bueso <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Link all full clusters with one full list, and reclaim from it when the
allocation has run out of all usable clusters.
There are many reasons a folio can end up in the swap cache while having
no swap count reference. So the best way to search for such slots is
still by iterating over the swap clusters.
With the list as an LRU, iterating from the oldest cluster and keeping
them rotating is a doable and clean way to free up potentially unused
clusters.
On any allocation failure, try to reclaim and rotate only one cluster.
This is adaptive for high order allocations, which can tolerate
fallback. This avoids latency and gives the full cluster list a fair
chance to get reclaimed. It relieves the pressure on the fallback
order 0 allocation and on later high order allocations.
If the swap device is getting very full, reclaim more aggressively to
ensure no OOM will happen. This ensures an order 0 heavy workload won't
go OOM, as order 0 won't fail if any cluster still has any space.
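Conceptually, the rotation looks like the following sketch (the field and
helper names here are illustrative, not the real swapfile.c symbols):
  struct swap_cluster_info *ci;

  if (!list_empty(&si->full_clusters)) {
          /* the oldest full cluster sits at the head; rotate it to the
           * tail so the list stays LRU ordered, then try to reclaim it */
          ci = list_first_entry(&si->full_clusters,
                                struct swap_cluster_info, list);
          list_move_tail(&ci->list, &si->full_clusters);
          try_to_reclaim_cluster(si, ci);         /* hypothetical helper */
  }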
[[email protected]: fix discard of full cluster]
Link: https://lkml.kernel.org/r/CAMgjq7CWwK75_2Zi5P40K08pk9iqOcuWKL6khu=x4Yg_nXaQag@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Reported-by: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kairui Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This commit implements reclaim during scan for the cluster allocator.
Cluster scanning was unable to reuse SWAP_HAS_CACHE slots, which could
result in a low allocation success rate or early OOM.
So to ensure maximum allocation success rate, integrate reclaiming with
scanning. If a range of suitable swap slots is found but is fragmented
due to HAS_CACHE, just try to reclaim the slots.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Reported-by: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Now the swap cluster allocator arranges the clusters in LRU style, so
the "cold" clusters staying at the head of the nonfull lists are the
ones that were used for allocation a long time ago and are still
partially occupied. So if the allocator can't find enough contiguous
slots to satisfy a high order allocation, it's unlikely that slots will
be freed on them to satisfy the allocation, at least not within a short
period.
As a result, nonfull cluster scanning will waste time repeatedly
scanning the unusable head of the list.
Also, multiple CPUs could contend on the same head cluster of the
nonfull list. Unlike free clusters, which are removed from the list
when any CPU starts using them, a nonfull cluster stays at the head.
So introduce a new list, the frag list; all scanned nonfull clusters
will be moved to this list, both to avoid repeated scanning and to
avoid contention.
The frag list is still used as a fallback for allocations, so if one
CPU fails to allocate one order of slots, it can still steal other
CPUs' clusters. And order 0 will favor the fragmented clusters to
better protect nonfull clusters.
If any slots on a fragmented cluster are freed, move the cluster back
to the nonfull list, indicating it is worth another scan. Compared to
scanning upon freeing a slot, this keeps the scanning lazy and saves
some CPU if there are still other clusters to use.
It may seem unnecessary to keep the fragmented clusters on a list at
all if they can't be used for a specific order allocation. But this
will start to make sense once reclaim during scanning is ready.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Reported-by: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Track the nonfull clusters as well as the empty clusters on lists.
Each order has one nonfull cluster list.
The cluster will remember which order it was used for during new
cluster allocation.
When a cluster has a free entry, add it to the nonfull[order] list.
When the free cluster list is empty, also allocate from the nonfull
list of that order.
This improves the mTHP swap allocation success rate.
There are limitations if the distribution of the numbers of different
mTHP orders changes a lot, e.g. a lot of nonfull clusters are assigned
to order A while at a later time there are a lot of order B allocations
and very few order A allocations. Currently the clusters used by order
A will not be reused by order B unless the cluster is 100% empty.
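The bookkeeping described above can be sketched as follows (field and
constant names are assumptions for illustration):
  struct swap_info_struct {
          ...
          struct list_head free_clusters;              /* completely empty clusters */
          struct list_head nonfull_clusters[SWAP_NR_ORDERS]; /* partly used, per order */
          ...
  };

  /* when a cluster that served 'order' gains a free entry: */
  list_add_tail(&ci->list, &si->nonfull_clusters[ci->order]);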
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Li <[email protected]>
Reported-by: Barry Song <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm: swap: mTHP swap allocator base on swap cluster order",
v5.
This is the short term solution "swap cluster order" listed in my "Swap
Abstraction" discussion slice 8 at the recent LSF/MM conference.
When commit 845982eb264bc "mm: swap: allow storage of all mTHP orders"
was introduced, it only allocated the mTHP swap entries from the new
empty cluster list. It has a fragmentation issue reported by Barry.
https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
The reason is that all the empty clusters have been exhausted while
there are plenty of free swap entries in clusters that are not 100%
free.
Remember the swap allocation order in the cluster. Keep track of the
per-order nonfull cluster lists for later allocation.
This series gives the SSD swap allocation a new code path, separate
from the HDD allocation. The new allocator uses the cluster lists only
and no longer does a global scan of swap_map[] without a lock.
This streamlines the swap allocation for SSD. The code matches the
execution flow much better.
User impact: for users that allocate and free mixed-order mTHP swap
entries, it greatly improves the success rate of mTHP swap allocation
after the initial phase.
It also performs faster when the swapfile is close to full, because the
allocator can get a nonfull cluster from a list rather than scanning a
lot of swap_map entries.
With Barry's mthp test program V2:
Without:
$ ./thp_swap_allocator_test -a
Iteration 1: swpout inc: 32, swpout fallback inc: 192, Fallback percentage: 85.71%
Iteration 2: swpout inc: 0, swpout fallback inc: 231, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 227, Fallback percentage: 100.00%
...
Iteration 98: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 215, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
$ ./thp_swap_allocator_test -a -s
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
$ ./thp_swap_allocator_test -s
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
$ ./thp_swap_allocator_test
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
With: # with all 0.00% filter out
$ ./thp_swap_allocator_test -a | grep -v "0.00%"
$ # all result are 0.00%
$ ./thp_swap_allocator_test -a -s | grep -v "0.00%"
./thp_swap_allocator_test -a -s | grep -v "0.00%"
Iteration 14: swpout inc: 223, swpout fallback inc: 3, Fallback percentage: 1.33%
Iteration 19: swpout inc: 219, swpout fallback inc: 7, Fallback percentage: 3.10%
Iteration 28: swpout inc: 225, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 29: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 34: swpout inc: 220, swpout fallback inc: 8, Fallback percentage: 3.51%
Iteration 35: swpout inc: 222, swpout fallback inc: 11, Fallback percentage: 4.72%
Iteration 38: swpout inc: 217, swpout fallback inc: 4, Fallback percentage: 1.81%
Iteration 40: swpout inc: 222, swpout fallback inc: 6, Fallback percentage: 2.63%
Iteration 42: swpout inc: 221, swpout fallback inc: 2, Fallback percentage: 0.90%
Iteration 43: swpout inc: 215, swpout fallback inc: 7, Fallback percentage: 3.15%
Iteration 47: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
Iteration 49: swpout inc: 217, swpout fallback inc: 1, Fallback percentage: 0.46%
Iteration 52: swpout inc: 221, swpout fallback inc: 8, Fallback percentage: 3.49%
Iteration 56: swpout inc: 224, swpout fallback inc: 4, Fallback percentage: 1.75%
Iteration 58: swpout inc: 214, swpout fallback inc: 5, Fallback percentage: 2.28%
Iteration 62: swpout inc: 220, swpout fallback inc: 3, Fallback percentage: 1.35%
Iteration 64: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 67: swpout inc: 221, swpout fallback inc: 1, Fallback percentage: 0.45%
Iteration 75: swpout inc: 220, swpout fallback inc: 9, Fallback percentage: 3.93%
Iteration 82: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 86: swpout inc: 211, swpout fallback inc: 12, Fallback percentage: 5.38%
Iteration 89: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
Iteration 93: swpout inc: 220, swpout fallback inc: 1, Fallback percentage: 0.45%
Iteration 94: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 96: swpout inc: 221, swpout fallback inc: 6, Fallback percentage: 2.64%
Iteration 98: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 99: swpout inc: 227, swpout fallback inc: 3, Fallback percentage: 1.30%
$ ./thp_swap_allocator_test
./thp_swap_allocator_test
Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%
$ ./thp_swap_allocator_test -s
./thp_swap_allocator_test -s
Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 97, swpout fallback inc: 135, Fallback percentage: 58.19%
Iteration 3: swpout inc: 42, swpout fallback inc: 192, Fallback percentage: 82.05%
Iteration 4: swpout inc: 19, swpout fallback inc: 214, Fallback percentage: 91.85%
Iteration 5: swpout inc: 12, swpout fallback inc: 213, Fallback percentage: 94.67%
Iteration 6: swpout inc: 11, swpout fallback inc: 217, Fallback percentage: 95.18%
Iteration 7: swpout inc: 9, swpout fallback inc: 214, Fallback percentage: 95.96%
Iteration 8: swpout inc: 8, swpout fallback inc: 213, Fallback percentage: 96.38%
Iteration 9: swpout inc: 2, swpout fallback inc: 223, Fallback percentage: 99.11%
Iteration 10: swpout inc: 2, swpout fallback inc: 228, Fallback percentage: 99.13%
Iteration 11: swpout inc: 4, swpout fallback inc: 214, Fallback percentage: 98.17%
Iteration 12: swpout inc: 5, swpout fallback inc: 226, Fallback percentage: 97.84%
Iteration 13: swpout inc: 3, swpout fallback inc: 212, Fallback percentage: 98.60%
Iteration 14: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
Iteration 15: swpout inc: 3, swpout fallback inc: 222, Fallback percentage: 98.67%
Iteration 16: swpout inc: 4, swpout fallback inc: 223, Fallback percentage: 98.24%
=========
Kernel compile under tmpfs with cgroup memory.max = 470M.
12 cores, 24 hyperthreads, 32 jobs. 10 runs for each group.
SSD swap 10 runs average, 20G swap partition:
With:
user 2929.064
system 1479.381 : 1376.89 1398.22 1444.64 1477.39 1479.04 1497.27
1504.47 1531.4 1532.92 1551.57
real 1441.324
Without:
user 2910.872
system 1482.732 : 1440.01 1451.4 1462.01 1467.47 1467.51 1469.3
1470.19 1496.32 1544.1 1559.01
real 1580.822
Two zram swap: zram0 3.0G zram1 20G.
The idea is forcing the zram0 almost full then overflow to zram1:
With:
user 4320.301
system 4272.403 : 4236.24 4262.81 4264.75 4269.13 4269.44 4273.06
4279.85 4285.98 4289.64 4293.13
real 431.759
Without
user 4301.393
system 4387.672 : 4374.47 4378.3 4380.95 4382.84 4383.06 4388.05
4389.76 4397.16 4398.23 4403.9
real 433.979
------ more test results from Kairui ----------
Test with building the Linux kernel using a 4G ZRAM, 1G memory.max limit on top of shmem:
System info: 32 Core AMD Zen2, 64G total memory.
Test 3 times using only 4K pages:
=================================
With:
-----
1838.74user 2411.21system 2:37.86elapsed 2692%CPU (0avgtext+0avgdata 847060maxresident)k
1839.86user 2465.77system 2:39.35elapsed 2701%CPU (0avgtext+0avgdata 847060maxresident)k
1840.26user 2454.68system 2:39.43elapsed 2693%CPU (0avgtext+0avgdata 847060maxresident)k
Summary (~4.6% improvement of system time):
User: 1839.62
System: 2443.89: 2465.77 2454.68 2411.21
Real: 158.88
Without:
--------
1837.99user 2575.95system 2:43.09elapsed 2706%CPU (0avgtext+0avgdata 846520maxresident)k
1838.32user 2555.15system 2:42.52elapsed 2709%CPU (0avgtext+0avgdata 846520maxresident)k
1843.02user 2561.55system 2:43.35elapsed 2702%CPU (0avgtext+0avgdata 846520maxresident)k
Summary:
User: 1839.78
System: 2564.22: 2575.95 2555.15 2561.55
Real: 162.99
Test 5 times using enabled all mTHP pages:
==========================================
With:
-----
1796.44user 2937.33system 2:59.09elapsed 2643%CPU (0avgtext+0avgdata 846936maxresident)k
1802.55user 3002.32system 2:54.68elapsed 2750%CPU (0avgtext+0avgdata 847072maxresident)k
1806.59user 2986.53system 2:55.17elapsed 2736%CPU (0avgtext+0avgdata 847092maxresident)k
1803.27user 2982.40system 2:54.49elapsed 2742%CPU (0avgtext+0avgdata 846796maxresident)k
1807.43user 3036.08system 2:56.06elapsed 2751%CPU (0avgtext+0avgdata 846488maxresident)k
Summary (~8.4% improvement of system time):
User: 1803.25
System: 2988.93: 2937.33 3002.32 2986.53 2982.40 3036.08
Real: 175.90
mTHP swapout status:
/sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout:347721
/sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout_fallback:3110
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/stats/swpout:3365
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/stats/swpout_fallback:8269
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/swpout:24
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/swpout_fallback:3341
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/stats/swpout:145
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/stats/swpout_fallback:5038
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout:322737
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback:36808
/sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout:380455
/sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout_fallback:1010
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/stats/swpout:24973
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/stats/swpout_fallback:13223
/sys/kernel/mm/transparent_hugepage/hugepages-128kB/stats/swpout:197348
/sys/kernel/mm/transparent_hugepage/hugepages-128kB/stats/swpout_fallback:80541
Without:
--------
1794.41user 3151.29system 3:05.97elapsed 2659%CPU (0avgtext+0avgdata 846704maxresident)k
1810.27user 3304.48system 3:05.38elapsed 2759%CPU (0avgtext+0avgdata 846636maxresident)k
1809.84user 3254.85system 3:03.83elapsed 2755%CPU (0avgtext+0avgdata 846952maxresident)k
1813.54user 3259.56system 3:04.28elapsed 2752%CPU (0avgtext+0avgdata 846848maxresident)k
1829.97user 3338.40system 3:07.32elapsed 2759%CPU (0avgtext+0avgdata 847024maxresident)k
Summary:
User: 1811.61
System: 3261.72 : 3151.29 3304.48 3254.85 3259.56 3338.40
Real: 185.356
mTHP swapout status:
hugepages-32kB/stats/swpout:35630
hugepages-32kB/stats/swpout_fallback:1809908
hugepages-512kB/stats/swpout:523
hugepages-512kB/stats/swpout_fallback:55235
hugepages-2048kB/stats/swpout:53
hugepages-2048kB/stats/swpout_fallback:17264
hugepages-1024kB/stats/swpout:85
hugepages-1024kB/stats/swpout_fallback:24979
hugepages-64kB/stats/swpout:30117
hugepages-64kB/stats/swpout_fallback:1825399
hugepages-16kB/stats/swpout:42775
hugepages-16kB/stats/swpout_fallback:1951123
hugepages-256kB/stats/swpout:2326
hugepages-256kB/stats/swpout_fallback:170165
hugepages-128kB/stats/swpout:17925
hugepages-128kB/stats/swpout_fallback:1309757
This patch (of 9):
Previously, the swap cluster code used a cluster index as a pointer to
construct a custom singly linked list type, "swap_cluster_list". The
next-cluster pointer shared storage with cluster->count, which prevented
putting a non-free cluster on a list.
Change the cluster to use the standard doubly linked list instead. This
allows tracking the nonfull clusters in a follow-up patch, which makes it
faster to get to a nonfull cluster of a given order.
Remove the cluster getters/setters for accessing the cluster struct
members. The list operations are protected by swap_info_struct->lock.
Change the cluster code to reference a cluster by
"struct swap_cluster_info *" rather than by index. That is more
consistent with the list manipulation and avoids repeatedly adding the
index to cluster_info. The code is easier to understand.
Remove the "cluster next pointer is NULL" flag; the doubly linked list
handles the empty list well.
The "swap_cluster_info" struct grows by two pointers, but because 512
swap entries share one swap_cluster_info struct, this has very little
impact on the average memory usage per swap entry. For a 1TB swapfile,
the swap cluster data structure grows from 8MB to 24MB.
Other than the list conversion, there is no real functional change in
this patch.
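For illustration, a minimal sketch of the layout change; the field names,
bit widths and the userspace stand-in for list_head are illustrative and
not the kernel's exact definitions:

  /* Minimal userspace stand-in for the kernel's struct list_head. */
  struct list_head { struct list_head *next, *prev; };

  /*
   * Before: the "next cluster" index and the free count share the same
   * bits, so a cluster that still has allocated entries cannot be put
   * on any list.
   */
  struct swap_cluster_info_old {
          unsigned int data:24;   /* next cluster index OR free count */
          unsigned int flags:8;
  };

  /*
   * After: a standard doubly linked list node sits next to the count,
   * so clusters can be kept on free/nonfull lists at any time, at the
   * cost of two extra pointers per 512 swap entries.
   */
  struct swap_cluster_info_new {
          struct list_head list;  /* linked into per-si cluster lists */
          unsigned short count;
          unsigned char flags;
  };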
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Li <[email protected]>
Reported-by: Barry Song <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
DMA ops are a helper for architectures and not for drivers to override
the DMA implementation.
Unfortunately, driver authors keep ignoring this. Make that clearer by
renaming the symbol to ARCH_HAS_DMA_OPS and having the two drivers that
override their dma_ops depend on it. These drivers should probably be
marked broken, but we can give them a bit of a grace period first.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Sakari Ailus <[email protected]> # for IPU6
Acked-by: Robin Murphy <[email protected]>
|
|
- Resolve trivial context conflicts from dl_server clearing being moved
around.
- Add @next to put_prev_task_scx() and @prev to pick_next_task_scx() to
match sched/core.
- Merge sched_class->switch_class() addition from sched_ext with
tip/sched/core changes in __pick_next_task().
- Make pick_next_task_scx() call put_prev_task_scx() to emulate the previous
behavior where sched_class->put_prev_task() was called before
sched_class->pick_next_task().
While this makes sched_ext build and function, the behavior is not in line
with other sched classes. The follow-up patches will address the
discrepancies and remove sched_class->switch_class().
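A rough sketch of the "put before pick" emulation mentioned in the last
point; the signatures and helpers here are simplified stand-ins, not the
real sched_class hooks or sched_ext internals:

  #include <stddef.h>

  struct rq;
  struct task_struct;

  /* Simplified stand-ins for the real sched_ext helpers. */
  static void put_prev_task_scx(struct rq *rq, struct task_struct *prev,
                                struct task_struct *next)
  {
          /* hand @prev back to the BPF scheduler / local queues */
          (void)rq; (void)prev; (void)next;
  }

  static struct task_struct *first_local_task(struct rq *rq)
  {
          (void)rq;
          return NULL;    /* placeholder pick */
  }

  static struct task_struct *pick_next_task_scx(struct rq *rq,
                                                struct task_struct *prev)
  {
          /*
           * The core no longer puts the previous task before calling
           * ->pick_next_task(), so put @prev here first to preserve the
           * old "put before pick" ordering until the follow-up cleanups.
           */
          put_prev_task_scx(rq, prev, NULL /* next not yet known */);
          return first_local_task(rq);
  }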
Signed-off-by: Tejun Heo <[email protected]>
|
|
If we trigger the bus rescan from sysfs, we'll try to lock the PCI rescan
mutex recursively and deadlock - the platform device will be populated and
probed on the same thread that handles the sysfs write.
Add a workqueue to the pwrctl code on which the rescan for controlled PCI
devices is scheduled. While at it, add a new interface for initializing
the pwrctl context, which now assigns the parent device and initializes
the workqueue.
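A hedged sketch of the deferral pattern described above; the structure
and function names are simplified for illustration and are not taken
verbatim from the pwrctl code:

  #include <linux/pci.h>
  #include <linux/workqueue.h>

  struct pwrctl_ctx {                     /* simplified pwrctl context */
          struct pci_bus *bus;            /* bus to rescan once powered up */
          struct work_struct rescan_work;
  };

  static void pwrctl_rescan_fn(struct work_struct *work)
  {
          struct pwrctl_ctx *ctx =
                  container_of(work, struct pwrctl_ctx, rescan_work);

          /* Runs on the workqueue, so the rescan lock is not held here. */
          pci_lock_rescan_remove();
          pci_rescan_bus(ctx->bus);
          pci_unlock_rescan_remove();
  }

  /* New-style init helper: set up the context and the work item. */
  static void pwrctl_ctx_init(struct pwrctl_ctx *ctx, struct pci_bus *bus)
  {
          ctx->bus = bus;
          INIT_WORK(&ctx->rescan_work, pwrctl_rescan_fn);
  }

  /* Probe path: schedule the rescan instead of rescanning inline. */
  static void pwrctl_request_rescan(struct pwrctl_ctx *ctx)
  {
          schedule_work(&ctx->rescan_work);
  }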
Link: https://lore.kernel.org/r/[email protected]
Fixes: 4565d2652a37 ("PCI/pwrctl: Add PCI power control core code")
Reported-by: Konrad Dybcio <[email protected]>
Signed-off-by: Bartosz Golaszewski <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Manivannan Sadhasivam <[email protected]>
|
|
Immutable branch between MFD, IIO and power-supply providing the
register definitions needed for AXP717 support in the axp20x
axp20x_battery and axp20x_usb_power drivers.
Signed-off-by: Sebastian Reichel <[email protected]>
|
|
Changing usb_types type from array to bitmap in the power_supply_desc
struct requires updating power-supply drivers living in different
subsystem, so it is handled via an immutable branch.
Signed-off-by: Sebastian Reichel <[email protected]>
|
|
The usb_types array just holds a list of valid enum power_supply_usb_type
values, which map to 0 - 9. This can easily be represented as a bitmap.
This reduces the size of struct power_supply_desc and further shrinks the
data section, since drivers no longer need to store the array.
This also unifies how usb_types are handled with charge_behaviours,
which allows power_supply_show_usb_type() to be removed.
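A hedged sketch of what the conversion looks like from a driver's point
of view; the driver name and the chosen set of USB types are made up for
illustration:

  #include <linux/bits.h>
  #include <linux/power_supply.h>

  /*
   * Before the change a driver carried an array plus .num_usb_types:
   *
   *   static const enum power_supply_usb_type foo_usb_types[] =
   *           { POWER_SUPPLY_USB_TYPE_SDP, POWER_SUPPLY_USB_TYPE_DCP };
   *   .usb_types = foo_usb_types,
   *   .num_usb_types = ARRAY_SIZE(foo_usb_types),
   *
   * Afterwards the same set is expressed as a bitmap in the desc itself:
   */
  static const struct power_supply_desc foo_usb_desc = {
          .name           = "foo-usb",
          .type           = POWER_SUPPLY_TYPE_USB,
          .usb_types      = BIT(POWER_SUPPLY_USB_TYPE_SDP) |
                            BIT(POWER_SUPPLY_USB_TYPE_DCP) |
                            BIT(POWER_SUPPLY_USB_TYPE_CDP),
  };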
Signed-off-by: Hans de Goede <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sebastian Reichel <[email protected]>
|
|
e01c9797c0eb ("PCI: endpoint: Clean up hardware description for BARs")
added enum pci_epc_bar_type with incomplete kerneldoc. Add the missing
piece.
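For reference, kerneldoc for an enum of this kind looks roughly like the
following; the member descriptions are illustrative and not the actual
patch text:

  /**
   * enum pci_epc_bar_type - configurability of an endpoint BAR
   * @BAR_PROGRAMMABLE: the BAR can be freely sized and configured
   * @BAR_FIXED: the BAR has a fixed size dictated by the hardware
   * @BAR_RESERVED: the BAR must not be used by the EPF driver
   */
  enum pci_epc_bar_type {
          BAR_PROGRAMMABLE = 0,
          BAR_FIXED,
          BAR_RESERVED,
  };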
Signed-off-by: Bjorn Helgaas <[email protected]>
|