compute_subtree_max_size() is unused when building with
DEBUG_AUGMENT_PROPAGATE_CHECK=y:
mm/vmalloc.c:785:1: warning: unused function 'compute_subtree_max_size' [-Wunused-function]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiapeng Chong <[email protected]>
Reported-by: Abaci Robot <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
That extra variable was introduced just to keep the originally
passed gfp_mask, because gfp_mask itself is updated with __GFP_NOWARN
on entry, which broke the error handling messages.
Instead we can keep the original gfp_mask unmodified and pass an extra
__GFP_NOWARN flag together with the gfp_mask as a parameter to the
vm_area_alloc_pages() function. That makes it less confusing.
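A minimal sketch of the resulting call site, under the assumptions
above (simplified; the exact mainline code may differ):
    /* Keep the caller's gfp_mask untouched for the error path; add
     * __GFP_NOWARN only for the internal page allocation. */
    area->nr_pages = vm_area_alloc_pages(gfp_mask | __GFP_NOWARN,
                                         node, page_order, nr_pages,
                                         area->pages);
    if (area->nr_pages != nr_pages)
            warn_alloc(gfp_mask, NULL,
                       "vmalloc error: page allocation failure");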
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Cc: Vasily Averin <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Extend the find_vmap_lowest_match() function with one more parameter,
an "adjust_search_size" boolean, so it is possible to control the
accuracy of the search when a specific alignment is required.
With this patch the search size is adjusted by default, to serve a
request as fast as possible, for performance reasons.
There is one exception though: short ranges where the requested size
corresponds to the passed vstart/vend restriction together with a
specific alignment request. In such a scenario an adjustment would
not lead to a successful allocation.
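A sketch of how the new parameter is consumed (simplified; names as
described above):
    /* Inside find_vmap_lowest_match(): when adjustment is allowed,
     * search for a block big enough to absorb an alignment fixup;
     * otherwise search for the exact requested size. */
    unsigned long length = adjust_search_size ? size + align - 1 : size;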
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Vasily Averin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A caller initiates the drain process from its context once the
drain threshold is reached or passed. There are at least two
drawbacks to doing so:
a) a caller can be a high-prio or RT task. In that case it can
get stuck doing the actual drain of all lazily freed areas.
This is not optimal because such tasks usually are latency
sensitive and control should be returned back as soon as
possible in order to drive such workloads in time. See
96e2db456135 ("mm/vmalloc: rework the drain logic")
b) It is not safe to call vfree() while holding a spinlock due
to the vmap_purge_lock mutex. There was a report about this from
Zeal Robot <[email protected]> here:
https://lore.kernel.org/all/[email protected]
Moving the drain to a separate work context addresses those
issues.
v1->v2:
- Added prefix "_work" to the drain worker function.
v2->v3:
- Removed the drain_vmap_work_in_progress flag. Extra queuing
is expected under heavy load but can be disregarded because
the work will bail out if there is nothing to be done.
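A sketch of the deferred drain, assuming the worker shape described
above (simplified):
    static void drain_vmap_area_work(struct work_struct *work);
    static DECLARE_WORK(drain_vmap_work, drain_vmap_area_work);

    /* In the lazy-free path, once the threshold is crossed, defer
     * the drain to a workqueue instead of doing it in caller
     * context: */
    if (unlikely(nr_lazy > lazy_max_pages()))
            schedule_work(&drain_vmap_work);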
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Cc: Vasily Averin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The forward declaration for lazy_max_pages() is unnecessary. Remove it.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It's only used in sparse.c now, so we can make it static and further
clean up the relevant code.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Using vma_lookup() verifies the address is contained in the found vma.
This results in easier-to-read code.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
RLIMIT_MEMLOCK is now reimplemented on top of ucounts, and since
commit 83c1fd763b32 ("mm,hugetlb: remove mlock ulimit for
SHM_HUGETLB"), the mlock ulimit for SHM_HUGETLB has been removed as
well. So remove this obsolete comment.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
_install_special_mapping() adds the VM_SPECIAL bit VM_DONTEXPAND (and
never attempts to update locked_vm), so it ought to be consistent with
mmap_region() and mlock_fixup(), making sure not to add VM_LOCKED or
VM_LOCKONFAULT. I doubt that this fixes any problem in practice: just
do it for consistency.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use the helper macros min and max to simplify the code logic. Minor
readability improvement.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use the helper function range_in_vma() to check whether address and
address + size are within the vma range. Minor readability improvement.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
__setup() handlers should return 1 if the command line option is handled
and 0 if not (or maybe never return 0; it just pollutes init's
environment). This prevents:
  Unknown kernel command line parameters \
  "BOOT_IMAGE=/boot/bzImage-517rc5 stack_guard_gap=100", will be \
  passed to user space.
  Run /sbin/init as init process
    with arguments:
      /sbin/init
    with environment:
      HOME=/
      TERM=linux
      BOOT_IMAGE=/boot/bzImage-517rc5
      stack_guard_gap=100
Return 1 to indicate that the boot option has been handled.
Note that there is no warning message if someone enters:
stack_guard_gap=anything_invalid
and 'val' and stack_guard_gap are both set to 0 due to the use of
simple_strtoul(). This could be improved by using kstrtoxxx() and
checking for an error.
It appears that having stack_guard_gap == 0 is valid (if unexpected) since
using "stack_guard_gap=0" on the kernel command line does that.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Fixes: 1be7107fbe18e ("mm: larger stack guard gap, between vmas")
Signed-off-by: Randy Dunlap <[email protected]>
Reported-by: Igor Zhbanov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Clean the code up by merging the device private/exclusive swap entry
handling with the rest, then merge the pte clear operation too.
The "struct page *" is defined in multiple places in the function;
move it upward.
free_swap_and_cache() is only useful for the !non_swap_entry() case,
so move it into that condition.
No functional change intended.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently we have a zap_mapping pointer maintained in zap_details;
when it is specified we only want to zap the pages that have the same
mapping as the one the caller specified.
But what we want to do is actually simpler: we want to skip zapping
private (COW-ed) pages in some cases. We can refer to the
unmap_mapping_pages() callers, where we could have passed in different
even_cows values. The other user is unmap_mapping_folio(), where we
always want to skip private pages.
According to Hugh, we used a mapping pointer for historical reason, as
explained here:
https://lore.kernel.org/lkml/[email protected]/
Quoting partly from Hugh:
Which raises the question again of why I did not just use a boolean flag
there originally: aah, I think I've found why. In those days there was a
horrible "optimization", for better performance on some benchmark I guess,
which when you read from /dev/zero into a private mapping, would map the zero
page there (look up read_zero_pagealigned() and zeromap_page_range() if you
dare). So there was another category of page to be skipped along with the
anon COWs, and I didn't want multiple tests in the zap loop, so checking
check_mapping against page->mapping did both. I think nowadays you could do
it by checking for PageAnon page (or genuine swap entry) instead.
This patch replaces the zap_details.zap_mapping pointer with an
even_cows boolean, which we then check against PageAnon.
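A sketch of the resulting check, assuming a helper along these lines
(simplified; the helper name follows this series):
    static inline bool should_zap_page(struct zap_details *details,
                                       struct page *page)
    {
            /* No details given, or even_cows set: zap everything. */
            if (!details || details->even_cows)
                    return true;
            /* E.g. a zero page may carry no page pointer. */
            if (!page)
                    return true;
            /* Otherwise only zap non-anonymous (non-COW-ed) pages. */
            return !PageAnon(page);
    }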
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Suggested-by: Hugh Dickins <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The previous name is against the natural way people think. Invert the
meaning and also the return value. No functional change intended.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Suggested-by: Hugh Dickins <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: Rework zap ptes on swap entries", v5.
Patch 1 should fix a long standing bug for zap_pte_range() on
zap_details usage. The risk is that we could have some swap entries
skipped while we should have zapped them.
Migration entries are not the major concern because file backed memory
is always zapped in the pattern "first time without page lock, then
re-zap with page lock", hence the 2nd zap will always make sure all
migration entries are already recovered.
However there can be issues with real swap entries getting skipped
erroneously. There's a reproducer provided in the commit message of
patch 1 for that.
Patch 2-4 are cleanups that are based on patch 1. After the whole
patchset applied, we should have a very clean view of zap_pte_range().
Only patch 1 needs to be backported to stable if necessary.
This patch (of 4):
The "details" pointer shouldn't be the token to decide whether we should
skip swap entries.
For example, when the callers specified details->zap_mapping==NULL, it
means the user wants to zap all the pages (including COW-ed pages), so
we need to look into swap entries because there can be private COW-ed
pages that were swapped out.
Skipping some swap entries when details is non-NULL may lead to
wrongly leaving behind some of the swap entries we should have zapped.
A reproducer of the problem:
===8<===
#define _GNU_SOURCE         /* See feature_test_macros(7) */
#include <stdio.h>
#include <assert.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>

int page_size;
int shmem_fd;
char *buffer;

void main(void)
{
        int ret;
        char val;

        page_size = getpagesize();
        shmem_fd = memfd_create("test", 0);
        assert(shmem_fd >= 0);

        ret = ftruncate(shmem_fd, page_size * 2);
        assert(ret == 0);

        buffer = mmap(NULL, page_size * 2, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE, shmem_fd, 0);
        assert(buffer != MAP_FAILED);

        /* Write private page, swap it out */
        buffer[page_size] = 1;
        madvise(buffer, page_size * 2, MADV_PAGEOUT);

        /* This should drop private buffer[page_size] already */
        ret = ftruncate(shmem_fd, page_size);
        assert(ret == 0);

        /* Recover the size */
        ret = ftruncate(shmem_fd, page_size * 2);
        assert(ret == 0);

        /* Re-read the data, it should be all zero */
        val = buffer[page_size];
        if (val == 0)
                printf("Good\n");
        else
                printf("BUG\n");
}
===8<===
We don't need to touch the pmd path, because pmd never had an issue
with swap entries. For example, shmem pmd migration will always be
split into pte level, and the same applies to swapping on anonymous
memory.
Add another helper should_zap_cows() so that we can also check whether
we should zap private mappings when there's no page pointer specified.
This patch drops that trick, so we handle swap ptes coherently.
Meanwhile we should do the same check upon migration entries, hwpoison
entries and genuine swap entries too.
To be explicit, we should still remember to keep the private entries
if even_cows==false, and always zap them when even_cows==true.
The issue seems to exist starting from the initial commit of git.
[[email protected]: comment tweaks]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Simplify the code by using flush_dcache_folio().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
userfaultfd calls mcopy_atomic_pte() and __mcopy_atomic(), which do
not do any cache flushing for the target page. The target page will
then be mapped into user space with a different (user) address, which
might alias the kernel address used to copy the data from user space.
Fix this by inserting flush_dcache_page() after copy_from_user()
succeeds.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: b6ebaedb4cb1 ("userfaultfd: avoid mmap_sem read recursion in mcopy_atomic")
Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
Signed-off-by: Muchun Song <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
userfaultfd calls shmem_mfill_atomic_pte(), which does not do any
cache flushing for the target page. The target page will then be
mapped into user space with a different (user) address, which might
alias the kernel address used to copy the data from user space.
Insert flush_dcache_page() in the non-zero-page case, and replace
clear_highpage() with clear_user_highpage(), which already takes care
of the cache maintenance.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 8d1039634206 ("userfaultfd: shmem: add shmem_mfill_zeropage_pte for userfaultfd support")
Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
folio_copy() will copy the data from one page to the target page, and
the target page will then be mapped into a user space address, which
might alias the kernel address used to copy the data from the source
page. There are 2 ways to fix this issue:
1) insert flush_dcache_page() after folio_copy().
2) replace folio_copy() with copy_user_huge_page(), which already
takes care of the cache maintenance.
We chose way 2) since architectures can optimize this situation, and
it also makes backports easier.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 8cc5fcbb5be8 ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY")
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
userfaultfd calls copy_huge_page_from_user(), which does not do any
cache flushing for the target page. The target page will then be
mapped into user space with a different (user) address, which might
alias the kernel address used to copy the data from user space.
Fix this issue by flushing the dcache in copy_huge_page_from_user().
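A sketch of the fix inside the per-subpage copy loop (simplified;
variable names assumed):
    rc = copy_from_user(page_kaddr,
                        usr_src + i * PAGE_SIZE, PAGE_SIZE);
    if (rc)
            break;
    /* Flush the kernel alias so the later user mapping sees the
     * copied data. */
    flush_dcache_page(subpage);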
Link: https://lkml.kernel.org/r/[email protected]
Fixes: fa4d75c1de13 ("userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support")
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The D-cache maintenance inside move_to_new_page() only considers one
page; there is still a D-cache maintenance issue for the tail pages of
a compound page (e.g. THP or HugeTLB).
THP migration is only enabled on x86_64, ARM64 and powerpc, while
powerpc and arm64 need to maintain the consistency between I-Cache and
D-Cache, which depends on flush_dcache_page().
But there are no issues on arm64 and powerpc since they already
consider compound page cache flushing in their icache flush functions.
HugeTLB migration is enabled on arm, arm64, mips, parisc, powerpc,
riscv, s390 and sh; arm has handled the compound page cache flush in
flush_dcache_page(), but most of the others have not.
In theory, the issue exists on many architectures. Fix this, but
without using flush_dcache_folio(), since it is not backportable.
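A sketch of the backportable fix in move_to_new_page() (simplified):
flush every subpage of a compound page rather than only the head page:
    if (likely(!is_zone_device_page(newpage))) {
            int i, nr = compound_nr(newpage);

            for (i = 0; i < nr; i++)
                    flush_dcache_page(newpage + i);
    }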
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Fix some cache flush bugs", v5.
This series focuses on fixing cache maintenance.
This patch (of 7):
flush_cache_range() is only justified if the page is already placed
in the process page table, and here that only happens right after
flush_cache_range(). So using this interface is wrong. And there is
no need to invalidate the cache since the mapping was non-present
before in remove_migration_pmd(). So just remove it.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Lars Persson <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Originally the mmu_gathers were removed in commit 1c3951769621 ("mm:
now that all old mmu_gather code is gone, remove the storage").
However, the openrisc and hexagon architectures were merged around the
same time, so their mmu_gathers were not removed.
This patch removes them from openrisc, hexagon and nds32:
Noticed while cleaning this warning:
arch/openrisc/mm/init.c:41:1: warning: symbol 'mmu_gathers' was not declared. Should it be static?
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Stafford Horne <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Cc: Brian Cain <[email protected]>
Cc: Nick Hu <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Vincent Chen <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Stefan Kristiansson <[email protected]>
Cc: Russell King <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Christophe Leroy <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Each call into pte_mkhuge() is invariably followed by
arch_make_huge_pte(). Instead, arch_make_huge_pte() can accommodate
pte_mkhuge() at its beginning. This updates the generic fallback stub
for arch_make_huge_pte() and the available platform definitions,
making huge pte creation much cleaner and easier to follow.
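The generic fallback stub ends up looking roughly like this (sketch
based on the description above):
    #ifndef arch_make_huge_pte
    static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
                                           vm_flags_t flags)
    {
            return pte_mkhuge(entry);
    }
    #endif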
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Reviewed-by: Christophe Leroy <[email protected]>
Acked-by: Mike Kravetz <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The $(CC) variable used in Makefiles could contain several arguments
such as "ccache gcc". These need to be passed as a single string to
check_cc.sh, otherwise only the first argument will be used as the
compiler command. Without quotes, the $(CC) variable is passed as
distinct arguments which causes the script to fail to build trivial
programs.
Fix this by adding quotes around $(CC) when calling check_cc.sh to pass
the whole string as a single argument to the script even if it has
several words such as "ccache gcc".
Link: https://lkml.kernel.org/r/d0d460d7be0107a69e3c52477761a6fe694c1840.1646991629.git.guillaume.tucker@collabora.com
Fixes: e9886ace222e ("selftests, x86: Rework x86 target architecture detection")
Signed-off-by: Guillaume Tucker <[email protected]>
Tested-by: "kernelci.org bot" <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
At each login the user forces the kernel to create a new terminal and
allocate up to ~1Kb of memory for the tty-related structures.
By default it's allowed to create up to 4096 ptys, with a reserve of
1024 for the initial mount namespace only, and the settings are
controlled by the host admin.
Though this default is not enough for hosters with thousands of
containers per node, so the host admin may be forced to increase it up
to NR_UNIX98_PTY_MAX = 1<<20.
By default a container is restricted by the pty mount_opt.max = 1024,
but an admin inside the container can change it via remount. As a
result, one container can consume almost all allowed ptys and allocate
up to 1Gb of unaccounted memory.
This is not enough per se to trigger OOM on the host, but it allows
the container to significantly exceed its assigned memcg limit and
leads to trouble on an over-committed node.
It makes sense to account for this memory to restrict the host's
memory consumption from inside a memcg-limited container.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vasily Averin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Jiri Slaby <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
memcg_cache_id(), introduced by commit 2633d7a02823 ("slab/slub:
consider a memcg parameter in kmem_create_cache"), is used to index
into the kmem_cache->memcg_params->memcg_caches array. Since
kmem_cache->memcg_params.memcg_caches has been removed by commit
9855609bde03 ("mm: memcg/slab: use a single set of kmem_caches for all
accounted allocations"), the name no longer needs to reflect anything
cache related. Rename it to memcg_kmem_id, which reflects that it is
kmem related.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kari Argillander <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The name list_lru_memcg was occupied before and became free as of the
last commit. Rename list_lru_per_memcg to list_lru_memcg since that
name is briefer.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kari Argillander <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
idr_alloc()'s upper bound is exclusive, so in the current
implementation the maximum memcg ID is 65534 instead of 65535. That
seems to be a bug, so fix it.
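A sketch of the off-by-one (a hypothetical simplification of the
allocation call): idr_alloc() allocates from the half-open range
[start, end), so the upper bound must be passed as max + 1:
    /* before: IDs capped at MEM_CGROUP_ID_MAX - 1 == 65534 */
    ret = idr_alloc(&mem_cgroup_idr, NULL, 1, MEM_CGROUP_ID_MAX,
                    GFP_KERNEL);
    /* after: the full range up to 65535 is usable */
    ret = idr_alloc(&mem_cgroup_idr, NULL, 1, MEM_CGROUP_ID_MAX + 1,
                    GFP_KERNEL);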
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kari Argillander <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There are two idrs used by the memory cgroup: one for the kmem ID,
another for the memory cgroup ID. The maximum ID of both is 64Ki, and
both of them can limit the total number of memory cgroups. Actually,
we can reuse the memory cgroup ID for the kmem ID to simplify the
code.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kari Argillander <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If we run 10k containers in the system, the size of the
list_lru_memcg->lrus array can be ~96KB per list_lru. When we decrease
the number of containers, the size of the array will not be shrunk; it
is not scalable. The xarray is a good choice for this case: we can
save a lot of memory when there are tens of thousands of containers in
the system. Using an xarray also lets us remove the array-resizing
logic, which simplifies the code.
[[email protected]: remove unused local]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kari Argillander <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The purpose of memcg_drain_all_list_lrus() is list_lru reparenting.
It is very similar to memcg_reparent_objcgs(). Rename it to
memcg_reparent_list_lrus() so that the name is more consistent with
memcg_reparent_objcgs().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kari Argillander <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In our server, we found a suspected memory leak problem. The kmalloc-32
consumes more than 6GB of memory. Other kmem_caches consume less than
2GB memory.
After our in-depth analysis, the memory consumption of the kmalloc-32
slab cache turned out to be caused by list_lru_one allocations.
crash> p memcg_nr_cache_ids
memcg_nr_cache_ids = $2 = 24574
memcg_nr_cache_ids is very large and memory consumption of each list_lru
can be calculated with the following formula.
num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
crash> list super_blocks | wc -l
952
Every mount will register 2 list_lrus, one for inodes, another for
dentries. There are 952 super_blocks, so the total memory is 952 * 2 *
3 MB (~5.6GB). But the number of memory cgroups is less than 500, so I
guess more than 12286 containers have been deployed on this machine (I
do not know why there are so many containers; it may be a user's bug
or the user really wants to do that), and memcg_nr_cache_ids has not
been reduced to a suitable value. This wastes a lot of memory.
Now the infrastructure for dynamic list_lru_one allocation is ready, so
remove statically allocated memory code to save memory.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kari Argillander <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It will simplify the code if we move memcg_online_kmem() to
mem_cgroup_css_online(), as we then do not need to set ->kmemcg_id to
-1 to indicate that the memcg is offline. In the next patch,
->kmemcg_id will be used to sync list_lru reparenting, which requires
that ->kmemcg_id not change.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kari Argillander <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The workingset code adds xa_nodes to the shadow_nodes list, so the
allocation of xa_nodes should be done by kmem_cache_alloc_lru(). Use
xas_set_lru() to pass the list_lru we want to insert the xa_node into,
so that the xa_node reclaim context is set up correctly.
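A hypothetical usage sketch of the new hook (simplified; the exact
call site may differ):
    /* Tell the XArray state which list_lru newly allocated xa_nodes
     * should be set up against, so they become reclaimable via that
     * lru. */
    XA_STATE(xas, &mapping->i_pages, index);
    xas_set_lru(&xas, &shadow_nodes);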
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Kari Argillander <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Like the inode cache, the dentry will also be added to its memcg
list_lru. So replace kmem_cache_alloc() with kmem_cache_alloc_lru()
to allocate dentries.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kari Argillander <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() to alloc_inode_sb().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kari Argillander <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() of all filesystems to alloc_inode_sb().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Theodore Ts'o <[email protected]> [ext4]
Acked-by: Roman Gushchin <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kari Argillander <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The allocated inode is supposed to be added to its memcg list_lru,
which should be allocated in advance as well. That can be done by
kmem_cache_alloc_lru(), which allocates the object and the list_lru.
File systems are the main users of it, so introduce alloc_inode_sb()
to allocate file-system-specific inodes and set up the inode reclaim
context properly. File systems are supposed to use alloc_inode_sb()
to allocate inodes.
In later patches, we will convert all users to the new API.
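A sketch of the helper as described (simplified; see the patch for
the exact definition):
    static inline void *alloc_inode_sb(struct super_block *sb,
                                       struct kmem_cache *cache, gfp_t gfp)
    {
            return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
    }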
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kari Argillander <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We currently allocate scope for every memcg to be trackable on every
superblock instantiated in the system, regardless of whether that
superblock is even accessible to that memcg.
These huge memcg counts come from container hosts where memcgs are
confined to just a small subset of the total number of superblocks
instantiated at any given point in time.
For these systems with huge container counts, list_lru does not need
the capability of tracking every memcg on every superblock. What it
comes down to is adding the memcg to the list_lru at the first insert.
So introduce kmem_cache_alloc_lru() to allocate objects and their
list_lru. In a later patch, we will convert all inode and dentry
allocations from kmem_cache_alloc() to kmem_cache_alloc_lru().
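A hypothetical usage sketch: the new API pairs the allocation with
the list_lru the object will eventually be put on, so the per-memcg
lru structures can be set up lazily at first use:
    inode = kmem_cache_alloc_lru(inode_cachep, &sb->s_inode_lru,
                                 GFP_KERNEL);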
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kari Argillander <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Optimize list lru memory consumption", v6.
In our server, we found a suspected memory leak problem. The kmalloc-32
consumes more than 6GB of memory. Other kmem_caches consume less than
2GB memory.
After our in-depth analysis, the memory consumption of the kmalloc-32
slab cache turned out to be caused by list_lru_one allocations.
crash> p memcg_nr_cache_ids
memcg_nr_cache_ids = $2 = 24574
memcg_nr_cache_ids is very large, and the memory consumption of each
list_lru can be calculated with the following formula:
num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
crash> list super_blocks | wc -l
952
Every mount will register 2 list_lrus, one for inodes, another for
dentries. There are 952 super_blocks, so the total memory is 952 * 2 *
3 MB (~5.6GB). But the number of memory cgroups is less than 500, so I
guess more than 12286 memory cgroups have been created on this machine
(I do not know why there are so many cgroups; it may be a user's bug
or the user really wants to do that). Because memcg_nr_cache_ids has
not been reduced to a suitable value, a lot of memory is wasted.
If we want to reduce memcg_nr_cache_ids, we have to *reboot* the
server. This is not what we want.
In order to reduce memcg_nr_cache_ids, I had posted a patchset [1] to do
this. But this did not fundamentally solve the problem.
We currently allocate scope for every memcg to be trackable on every
superblock instantiated in the system, regardless of whether that
superblock is even accessible to that memcg.
These huge memcg counts come from container hosts where memcgs are
confined to just a small subset of the total number of superblocks
instantiated at any given point in time.
For these systems with huge container counts, list_lru does not need
the capability of tracking every memcg on every superblock.
What it comes down to is that the list_lru is only needed for a given
memcg if that memcg is instantiating and freeing objects on a given
list_lru.
As Dave said, "Which makes me think we should be moving more towards
'add the memcg to the list_lru at the first insert' model rather than
'instantiate all at memcg init time just in case'."
This patchset aims to optimize the list_lru memory consumption from
different aspects.
I did a simple test to show the optimization: create 10k memory
cgroups and mount 10k filesystems in the system, then use the free
command to show how much memory the system consumes after this
operation (there are 2 numa nodes in the system).
+-----------------------+------------------------+
| condition             | memory consumption     |
+-----------------------+------------------------+
| without this patchset | 24464 MB               |
| after patch 1         | 21957 MB               |
| after patch 10        |  6895 MB               |
| after patch 12        |  4367 MB               |
+-----------------------+------------------------+
The more the number of nodes, the more obvious the effect.
BTW, there was a recent discussion [2] on the same issue.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
This series not only optimizes the memory usage of list_lru but also
simplifies the code.
This patch (of 16):
The current scheme of maintaining per-node per-memcg lru lists looks like:
    struct list_lru {
        struct list_lru_node *node;            (for each node)
            struct list_lru_memcg *memcg_lrus;
                struct list_lru_one *lru[];    (for each memcg)
    }
By effectively transposing the two-dimensional array of list_lru_one
structures (per-node per-memcg => per-memcg per-node) it's possible to
save some memory and simplify the alloc/dealloc paths. The new scheme
looks like:
    struct list_lru {
        struct list_lru_memcg *mlrus;
            struct list_lru_per_memcg *mlru[]; (for each memcg)
                struct list_lru_one node[0];   (for each node)
    }
Memory savings come not only from 'struct rcu_head' but also from the
pointer arrays used to store pointers to 'struct list_lru_one'. The
array is per node and its size is 8 (a pointer) * num_memcgs, so the
total size of the arrays is 8 * num_nodes * memcg_nr_cache_ids. After
this patch, the size becomes 8 * memcg_nr_cache_ids.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Kari Argillander <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Before the for-each-CPU loop, preemption is disabled so that
drain_local_stock() can be invoked directly instead of scheduling a
worker. Ensuring that drain_local_stock() completed on the local CPU
is not a correctness problem. It _could_ be that the charging path
will be forced to reclaim memory because cached charges are still
waiting for their draining.
Disabling preemption before invoking drain_local_stock() is
problematic on PREEMPT_RT due to the sleeping locks involved. To
ensure that no CPU migrations happen across for_each_online_cpu() it
is enough to use migrate_disable(), which disables migration and keeps
the context preemptible so a sleeping lock can be acquired. A race
with CPU hotplug is not a problem because pcp data is not going away.
In the worst case we just schedule draining of an empty stock.
Use migrate_disable() instead of get_cpu() around the
for_each_online_cpu() loop.
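A sketch of the change around the loop (simplified):
    /* was: curcpu = get_cpu(); ... put_cpu(); */
    migrate_disable();
    curcpu = smp_processor_id();
    for_each_online_cpu(cpu) {
            /* drain directly on curcpu, schedule work elsewhere */
    }
    migrate_enable();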
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Waiman Long <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The members of the per-CPU structure memcg_stock_pcp are protected by
disabling interrupts. This is not working on PREEMPT_RT because it
creates atomic context in which actions are performed which require
preemptible context. One example is obj_cgroup_release().
The IRQ-disable sections can be replaced with local_lock_t which
preserves the explicit disabling of interrupts while keeps the code
preemptible on PREEMPT_RT.
drain_obj_stock() drops a reference on obj_cgroup which leads to an
invocation of obj_cgroup_release() if it is the last object. This in
turn leads to recursive locking of the local_lock_t. To avoid this,
obj_cgroup_release() is invoked outside of the locked section.
obj_cgroup_uncharge_pages() can be invoked with the local_lock_t
acquired and without it. This will lead later to a recursion in
refill_stock(). To avoid the locking recursion provide
obj_cgroup_uncharge_pages_locked() which uses the locked version of
refill_stock().
- Replace disabling interrupts for memcg_stock with a local_lock_t.
- Let drain_obj_stock() return the old struct obj_cgroup which is
passed to obj_cgroup_put() outside of the locked section.
- Provide obj_cgroup_uncharge_pages_locked() which uses the locked
version of refill_stock() to avoid recursive locking in
drain_obj_stock().
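A sketch of the locking scheme, assuming the structure layout
described above (simplified):
    struct memcg_stock_pcp {
            local_lock_t stock_lock;
            /* ... cached charges ... */
    };
    static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
            .stock_lock = INIT_LOCAL_LOCK(stock_lock),
    };

    /* Explicit IRQ-off semantics preserved, yet preemptible on RT: */
    local_lock_irqsave(&memcg_stock.stock_lock, flags);
    /* ... operate on this_cpu_ptr(&memcg_stock) ... */
    local_unlock_irqrestore(&memcg_stock.stock_lock, flags);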
Link: https://lkml.kernel.org/r/20220209014709.GA26885@xsang-OptiPlex-9020
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Reported-by: kernel test robot <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Waiman Long <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Provide the inner part of refill_stock() as __refill_stock() without
disabling interrupts. This eases the integration of local_lock_t where
recursive locking must be avoided.
Open code obj_cgroup_uncharge_pages() in drain_obj_stock() and use
__refill_stock(). The caller of drain_obj_stock() already disables
interrupts.
[[email protected]: patch body around Johannes' diff]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Waiman Long <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The per-CPU counters are modified with the non-atomic modifier. The
consistency is ensured by disabling interrupts for the update. On
non-PREEMPT_RT configurations this works because acquiring a
spinlock_t typed lock with the _irq() suffix disables interrupts. On
PREEMPT_RT configurations the RMW operation can be interrupted.
Another problem is that mem_cgroup_swapout() expects to be invoked
with disabled interrupts because the caller has to acquire a
spinlock_t which is acquired with disabled interrupts. Since
spinlock_t never disables interrupts on PREEMPT_RT, the interrupts are
never disabled at this point.
The code is never called from in_irq() context on PREEMPT_RT,
therefore disabling preemption during the update is sufficient on
PREEMPT_RT. The sections which explicitly disable interrupts can
remain on PREEMPT_RT because the sections remain short and they don't
involve sleeping locks (memcg_check_events() is doing nothing on
PREEMPT_RT).
Disable preemption during update of the per-CPU variables which do not
explicitly disable interrupts.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Waiman Long <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
During the integration of PREEMPT_RT support, the code flow around
memcg_check_events() resulted in `twisted code'. Moving the code
around to avoid that would then lead to an additional local-irq-save
section within memcg_check_events(). While looking better, it adds a
local-irq-save section to a code flow which is usually within a
local-irq-off block on non-PREEMPT_RT configurations.
The threshold event handler is a deprecated memcg v1 feature. Instead
of trying to get it to work under PREEMPT_RT just disable it. There
should be no users on PREEMPT_RT. From that perspective it makes even
less sense to get it to work under PREEMPT_RT while having zero users.
Make memory.soft_limit_in_bytes and cgroup.event_control return
-EOPNOTSUPP on PREEMPT_RT. Make memcg_check_events() empty and make
memcg_write_event_control() return only -EOPNOTSUPP on PREEMPT_RT.
Document that the two knobs are disabled on PREEMPT_RT.
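The shape of the change, sketched below; the signatures follow the
memcg code of that time, but this is not the exact diff:

    #ifdef CONFIG_PREEMPT_RT
    static void memcg_check_events(struct mem_cgroup *memcg, int nid)
    {
    }

    static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
                                             char *buf, size_t nbytes,
                                             loff_t off)
    {
            return -EOPNOTSUPP;
    }
    #else
    /* ... the original v1 threshold/event implementation ... */
    #endif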
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Michal Hocko <[email protected]>
Suggested-by: Michal Koutný <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Waiman Long <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm/memcg: Address PREEMPT_RT problems instead of disabling it", v5.
This series aims to address the memcg related problem on PREEMPT_RT.
I tested them on CONFIG_PREEMPT and CONFIG_PREEMPT_RT with the
tools/testing/selftests/cgroup/* tests and I haven't observed any
regressions (other than the lockdep report that is already there).
This patch (of 6):
The optimisation is based on a micro benchmark where local_irq_save() is
more expensive than a preempt_disable(). There is no evidence that it
is visible in a real-world workload and there are CPUs where the
opposite is true (local_irq_save() is cheaper than preempt_disable()).
Based on micro benchmarks, the optimisation makes sense on PREEMPT_NONE
where preempt_disable() is optimized away. There is no improvement with
PREEMPT_DYNAMIC since the preemption counter is always available.
The optimisation also makes the PREEMPT_RT integration more
complicated since most of the assumptions do not hold on PREEMPT_RT.
Revert the optimisation since it complicates the PREEMPT_RT
integration and the improvement is hardly visible.
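For reference, the reverted optimisation roughly amounted to choosing
between the two protection schemes sketched below; the counter is a
placeholder assumed for illustration:

    #include <linux/irqflags.h>
    #include <linux/percpu.h>
    #include <linux/preempt.h>

    static DEFINE_PER_CPU(unsigned int, stock_pages);       /* placeholder */

    /* Reverted scheme: task-context callers only disabled preemption. */
    static void consume_stock_task(unsigned int nr)
    {
            preempt_disable();
            __this_cpu_sub(stock_pages, nr);
            preempt_enable();
    }

    /* After the revert: one scheme, interrupts off for every caller. */
    static void consume_stock_any(unsigned int nr)
    {
            unsigned long flags;

            local_irq_save(flags);
            __this_cpu_sub(stock_pages, nr);
            local_irq_restore(flags);
    }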
[[email protected]: patch body around Michal's diff]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/YgOGkXXCrD%[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
__setup() handlers should return 1 if the command line option is handled
and 0 if not (or maybe never return 0; it just pollutes init's
environment).
The only reason that this particular __setup handler does not pollute
init's environment is that the setup string contains a '.', as in
"cgroup.memory". This causes init/main.c::unknown_bootoption() to
consider it an "Unused module parameter" and ignore it. (This path is
for parsing loadable-module parameters any time after kernel init.)
Otherwise the string "cgroup.memory=whatever" would be added to init's
environment strings.
Instead of relying on this '.' quirk, just return 1 to indicate that the
boot option has been handled.
Note that there is no warning message if someone enters:
cgroup.memory=anything_invalid
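The fixed pattern looks roughly like this; the parsing body is
elided, while the handler and setup-string names follow the commit:

    #include <linux/init.h>

    static int __init cgroup_memory(char *s)
    {
            /* ... parse "nosocket"/"nokmem" style suboptions ... */
            return 1;       /* handled: keep it out of init's environment */
    }
    __setup("cgroup.memory=", cgroup_memory);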
Link: https://lkml.kernel.org/r/[email protected]
Fixes: f7e1cb6ec51b0 ("mm: memcontrol: account socket memory in unified hierarchy memory controller")
Signed-off-by: Randy Dunlap <[email protected]>
Reported-by: Igor Zhbanov <[email protected]>
Link: lore.kernel.org/r/[email protected]
Reviewed-by: Michal Koutný <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The high limit is used to throttle the workload without invoking the
oom-killer. Recently we tried to use the high limit to right-size our
internal workloads, more specifically to dynamically adjust the
limits of a workload without letting it get oom-killed. However, due
to a limitation in the implementation of high limit enforcement, we
observed that the mechanism fails for some real workloads.
The high limit is enforced on return-to-userspace, i.e. the kernel
lets the usage go over the limit and, when execution returns to
userspace, the high reclaim is triggered and the process can get
throttled as well. However, this mechanism fails for workloads which
do large allocations in a single kernel entry, e.g. applications that
mlock() a large chunk of memory in a single syscall. Such
applications bypass the high limit and can trigger the oom-killer.
To make high limit enforcement more robust, this patch makes the
limit enforcement synchronous only if the accumulated overcharge
becomes larger than MEMCG_CHARGE_BATCH. So, most allocations would
still be throttled on the return-to-userspace path, but only the
extreme allocations which accumulate a large amount of overcharge
without returning to userspace will be throttled synchronously. The
value MEMCG_CHARGE_BATCH is a bit arbitrary, but most other places in
the memcg codebase use this constant, therefore for now use the same
one here.
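A sketch of the described check at the end of the charge path; the
wrapper function is an assumption for illustration, in the kernel the
check sits inline in the charge path:

    #include <linux/gfp.h>
    #include <linux/memcontrol.h>
    #include <linux/sched.h>

    static void maybe_reclaim_over_high(gfp_t gfp_mask)
    {
            /*
             * Reclaim synchronously only for extreme overcharges; smaller
             * overages keep being handled on the return-to-userspace path.
             */
            if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
                !(current->flags & PF_MEMALLOC) &&
                gfpflags_allow_blocking(gfp_mask))
                    mem_cgroup_handle_over_high();
    }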
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: Chris Down <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|