The sun3 MMU variant of m68k uses GFP_KERNEL to allocate a PTE page and
then memset(0) or clear_highpage() to clear it.
This is equivalent to allocating the page with GFP_KERNEL | __GFP_ZERO,
which allows replacing the sun3 implementations of pte_alloc_one() and
pte_alloc_one_kernel() with the generic ones.
The pte_free() and pte_free_kernel() versions are identical to the generic
ones and can be simply dropped.
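For reference, the equivalence the conversion relies on looks roughly like
the sketch below; this is illustrative C in the spirit of the generic
helpers, not the literal sun3 code:

	#include <linux/gfp.h>
	#include <linux/string.h>

	/* old scheme: allocate with GFP_KERNEL, then clear explicitly */
	static unsigned long pte_page_old(void)
	{
		unsigned long page = __get_free_page(GFP_KERNEL);

		if (page)
			memset((void *)page, 0, PAGE_SIZE);
		return page;
	}

	/* new scheme: ask the allocator for an already zeroed page */
	static unsigned long pte_page_new(void)
	{
		return __get_free_page(GFP_KERNEL | __GFP_ZERO);
	}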
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Anton Ivanov <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Creasey <[email protected]>
Cc: Vincent Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The csky implementations of pte_alloc_one(), pte_free_kernel() and
pte_free() are identical to the generic ones except for the lack of
__GFP_ACCOUNT for the user PTE allocations.
Switch csky to use the generic versions of these functions.
The csky implementation of pte_alloc_one_kernel() is not replaced because
it does not clear the allocated page but rather sets each PTE in it to a
non-zero value.
The pte_free_kernel() and pte_free() versions on csky are identical to the
generic ones and can be simply dropped.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Acked-by: Guo Ren <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Anton Ivanov <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Creasey <[email protected]>
Cc: Vincent Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The PTE allocations in arm64 are identical to the generic ones modulo the
GFP flags.
Using the generic pte_alloc_one() functions ensures that the user page
tables are allocated with __GFP_ACCOUNT set.
The arm64 definition of PGALLOC_GFP is removed and replaced with
GFP_PGTABLE_USER for p[gum]d_alloc_one() for the user page tables and
GFP_PGTABLE_KERNEL for the kernel page tables. The KVM memory cache is now
using GFP_PGTABLE_USER.
The mappings created with create_pgd_mapping() are now using
GFP_PGTABLE_KERNEL.
The conversion to the generic version of pte_free_kernel() removes the NULL
check for pte.
The pte_free() version on arm64 is identical to the generic one and
can be simply dropped.
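For context, the GFP flags this conversion relies on boil down to the
following (a sketch of the asm-generic definitions introduced by the
series):

	#define GFP_PGTABLE_KERNEL	(GFP_KERNEL | __GFP_ZERO)
	#define GFP_PGTABLE_USER	(GFP_PGTABLE_KERNEL | __GFP_ACCOUNT)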
[[email protected]: fix a bogus GFP flag in pgd_alloc()]
Link: https://lore.kernel.org/r/[email protected]/
[and fix it more]
Link: https://lore.kernel.org/linux-mm/20190617151252.GF16810@rapoport-lnx/
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Anton Ivanov <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Creasey <[email protected]>
Cc: Vincent Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Replace __get_free_page() and alloc_pages() calls with the generic
__pte_alloc_one_kernel() and __pte_alloc_one().
There is no functional change for the kernel PTE allocation.
The differences for the user PTEs are that clear_pte_table() is now
called after pgtable_page_ctor() and that __GFP_ACCOUNT is added to the
GFP flags.
The conversion to the generic version of pte_free_kernel() removes the NULL
check for pte.
The pte_free() version on arm is identical to the generic one and can be
simply dropped.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Anton Ivanov <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Creasey <[email protected]>
Cc: Vincent Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
alpha allocates PTE pages with __get_free_page() and uses
GFP_KERNEL | __GFP_ZERO for the allocations.
Switch it to the generic version that does exactly the same thing for the
kernel page tables and adds __GFP_ACCOUNT for the user PTEs.
The alpha pte_free() and pte_free_kernel() versions are identical to the
generic ones and can be simply dropped.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Anton Ivanov <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Creasey <[email protected]>
Cc: Vincent Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Most architectures have identical or very similar implementation of
pte_alloc_one_kernel(), pte_alloc_one(), pte_free_kernel() and
pte_free().
Add a generic implementation that can be reused across architectures and
enable its use on x86.
The generic implementation uses
GFP_KERNEL | __GFP_ZERO
for the kernel page tables and
GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT
for the user page tables.
The "base" functions for PTE allocation, namely __pte_alloc_one_kernel()
and __pte_alloc_one() are intended for the architectures that require
additional actions after actual memory allocation or must use non-default
GFP flags.
x86 is switched to use generic pte_alloc_one_kernel(), pte_free_kernel() and
pte_free().
x86 still implements pte_alloc_one() to allow run-time control of GFP
flags required for "userpte" command line option.
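A condensed sketch of the generic helpers described above (simplified;
error paths and comments trimmed, so treat it as illustrative rather than
the exact asm-generic code):

	static inline pte_t *__pte_alloc_one_kernel(struct mm_struct *mm)
	{
		return (pte_t *)__get_free_page(GFP_PGTABLE_KERNEL);
	}

	static inline pgtable_t __pte_alloc_one(struct mm_struct *mm, gfp_t gfp)
	{
		struct page *pte;

		pte = alloc_page(gfp);
		if (!pte)
			return NULL;
		if (!pgtable_page_ctor(pte)) {
			__free_page(pte);
			return NULL;
		}
		return pte;
	}

	static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
	{
		return __pte_alloc_one(mm, GFP_PGTABLE_USER);
	}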
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Anton Ivanov <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Creasey <[email protected]>
Cc: Vincent Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Several mips builds generate the following build warning.
mm/gup.c:1788:13: warning: 'undo_dev_pagemap' defined but not used
The function is declared unconditionally but only called from behind
various ifdefs. Mark it __maybe_unused.
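The __maybe_unused annotation simply tells the compiler that a static
function may legitimately have no callers in some configurations; a
minimal illustration (CONFIG_FOO and the helper are made up for the
example, this is not the mm/gup.c code):

	/* not warned about even when CONFIG_FOO is disabled */
	static int __maybe_unused double_it(int x)
	{
		return x * 2;
	}

	#ifdef CONFIG_FOO
	int use_double_it(int x)
	{
		return double_it(x);
	}
	#endif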
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Guenter Roeck <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If we end up without a PGD or PUD entry backing the gate area, don't BUG
-- just fail gracefully.
It's not entirely implausible that this could happen some day on x86. It
doesn't right now even with an execute-only emulated vsyscall page because
the fixmap shares the PUD, but the core mm code shouldn't rely on that
particular detail to avoid OOPSing.
Link: http://lkml.kernel.org/r/a1d9f4efb75b9d464e59fd6af00104b21c58f6f7.1561610798.git.luto@kernel.org
Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: Jann Horn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Both hugetlb and thp pages sit on pageblocks of the same migration type,
since they are allocated from a free_list[]. Based on this fact, it is
enough to check a single subpage to decide the migration type of the whole
huge page. This saves (2M/4K - 1) loop iterations for pmd_huge on x86, and
similar savings on other archs.
Furthermore, when executing isolate_huge_page(), it avoids taking the
global hugetlb_lock many times, and needlessly removing/adding pages to
the local linked list cma_page_list.
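A sketch of the resulting loop shape in the CMA check path; pages,
nr_pages and cma_page_list are assumed to exist in the surrounding
function, and the details are illustrative:

	unsigned long i, step;

	for (i = 0; i < nr_pages; i += step) {
		struct page *head = compound_head(pages[i]);

		/*
		 * One migratetype check on the head page covers the whole
		 * compound page, so skip the remaining subpages.
		 */
		step = (1UL << compound_order(head)) - (pages[i] - head);
		if (is_migrate_cma_page(head) && PageHuge(head))
			isolate_huge_page(head, &cma_page_list);
	}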
[[email protected]: make `i' and `step' unsigned]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Pingfan Liu <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
All other get_user_page_fast cases mark the page referenced, so do this
here as well.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David Miller <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This applies the overflow fixes from 8fde12ca79aff ("mm: prevent
get_user_pages() from overflowing page refcount") to the powerpc hugepd
code and brings it back in sync with the other GUP cases.
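The protection being carried over is essentially the
try_get_compound_head() pattern from mm/gup.c (sketched from memory, so
treat the exact shape as illustrative):

	static inline struct page *try_get_compound_head(struct page *page, int refs)
	{
		struct page *head = compound_head(page);

		/* bail out instead of letting the refcount overflow */
		if (WARN_ON_ONCE(page_ref_count(head) < 0))
			return NULL;
		if (unlikely(!page_cache_add_speculative(head, refs)))
			return NULL;
		return head;
	}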
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David Miller <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
While only powerpc supports the hugepd case, the code is pretty generic
and I'd like to keep all GUP internals in one place.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David Miller <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We can only deal with FOLL_WRITE and/or FOLL_LONGTERM in
get_user_pages_fast, so reject all other flags.
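The check amounts to a single guard at the top of get_user_pages_fast(),
along these lines:

	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
		return -EINVAL;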
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David Miller <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Always build mm/gup.c so that we don't have to provide separate nommu
stubs. Also merge the get_user_pages_fast and __get_user_pages_fast stubs
used when HAVE_FAST_GUP is not set into the main implementations, which
never call the fast path in that case.
This also ensures the new put_user_pages* helpers are available for nommu,
as those are currently missing, which would create a problem as soon as we
actually grew users for them.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David Miller <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This moves the actually exported functions towards the end of the file,
and reorders some functions to be in more logical blocks as a preparation
for moving various stubs inline into the main functionality using
IS_ENABLED().
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David Miller <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We only support the generic GUP now, so rename the config option to
be more clear, and always use the mm/Kconfig definition of the
symbol and select it from the arch Kconfigs.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Khalid Aziz <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David Miller <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The sparc64 code is mostly equivalent to the generic one, minus various
bugfixes and two arch overrides that this patch adds to pgtable.h.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Khalid Aziz <[email protected]>
Cc: David Miller <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a helper to untag a user pointer. This is needed for ADI support
in get_user_pages_fast.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Khalid Aziz <[email protected]>
Cc: David Miller <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
sparc64 only had pgd_page_vaddr, but not pgd_page.
[[email protected]: fix sparc64 build]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Cc: David Miller <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The sh code is mostly equivalent to the generic one, minus various
bugfixes and two arch overrides that this patch adds to pgtable.h.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David Miller <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
sh only had pud_page_vaddr, but not pud_page.
[[email protected]: sh: stub out pud_page]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Tested-by: Guenter Roeck <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David Miller <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The mips code is mostly equivalent to the generic one, minus various
bugfixes and an arch override for gup_fast_permitted.
Note that this defines ARCH_HAS_PTE_SPECIAL for mips as mips has
pte_special and pte_mkspecial implemented and used in the existing gup
code. They are no-op stubs, though, which makes me a little unsure if this
is really the right thing to do.
Note that this also adds back a missing cpu_has_dc_aliases check for
__get_user_pages_fast, which the old code was only doing for
get_user_pages_fast. This clearly looks like an oversight, as any
condition that makes get_user_pages_fast unsafe also applies to
__get_user_pages_fast.
[[email protected]: MIPS: don't select ARCH_HAS_PTE_SPECIAL]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Tested-by: Guenter Roeck <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David Miller <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The split low/high access is the only non-READ_ONCE version of gup_get_pte
that did show up in the various arch implementations. Lift it to common
code and drop the ifdef-based arch override.
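The lifted helper ends up in common code roughly as follows (a sketch;
CONFIG_GUP_GET_PTE_LOW_HIGH is selected by the architectures that need the
split access):

	#ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
	/*
	 * 32-bit platforms with 64-bit PTEs cannot load the entry atomically,
	 * so read the halves and retry if they changed underneath us.
	 */
	static inline pte_t gup_get_pte(pte_t *ptep)
	{
		pte_t pte;

		do {
			pte.pte_low = ptep->pte_low;
			smp_rmb();
			pte.pte_high = ptep->pte_high;
			smp_rmb();
		} while (unlikely(pte.pte_low != ptep->pte_low));

		return pte;
	}
	#else
	static inline pte_t gup_get_pte(pte_t *ptep)
	{
		return READ_ONCE(*ptep);
	}
	#endif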
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David Miller <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Pass in the already calculated end value instead of recomputing it, and
leave the end > start check in the callers instead of duplicating them in
the arch code.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David Miller <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "switch the remaining architectures to use generic GUP", v4.
A series to switch mips, sh and sparc64 to use the generic GUP code so
that we only have one codebase to touch for further improvements to this
code.
This patch (of 16):
This will allow sparc64, or any future architecture with memory tagging, to
override its tags for get_user_pages and get_user_pages_fast.
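The hook is a no-op unless an architecture overrides it; the fallback
definition is simply:

	#ifndef untagged_addr
	#define untagged_addr(addr) (addr)
	#endif

GUP then applies it to the user-supplied address before walking the page
tables, e.g. start = untagged_addr(start).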
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Khalid Aziz <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: David Miller <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Ralf Baechle <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There are concerns about memory leaks from extensive use of memory cgroups
as each memory cgroup creates its own set of kmem caches. There is a
possibility that the memcg kmem caches may remain even after the memory
cgroups have been offlined. Therefore, it will be useful to show the
status of each of the memcg kmem caches.
This patch introduces a new <debugfs>/memcg_slabinfo file which is
somewhat similar to /proc/slabinfo in format, but lists only information
about kmem caches that have child memcg kmem caches. Information
available in /proc/slabinfo is not repeated in memcg_slabinfo.
A portion of a sample output of the file was:
  # <name>         <css_id[:dead]> <active_objs> <num_objs> <active_slabs> <num_slabs>
  rpc_inode_cache       root             13          51           1            1
  rpc_inode_cache         48              0           0           0            0
  fat_inode_cache       root              1          45           1            1
  fat_inode_cache         41              2          45           1            1
  xfs_inode             root            770         816          24           24
  xfs_inode               92             22          34           1            1
  xfs_inode          88:dead              1          34           1            1
  xfs_inode          89:dead             23          34           1            1
  xfs_inode               85              4          34           1            1
  xfs_inode               84              9          34           1            1
The css id of the memcg is also listed. If a memcg is not online,
the tag ":dead" will be attached as shown above.
[[email protected]: memcg: add ":deact" tag for reparented kmem caches in memcg_slabinfo]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: set the flag in the common code as suggested by Roman]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Waiman Long <[email protected]>
Suggested-by: Shakeel Butt <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Let's reparent non-root kmem_caches on memcg offlining. This allows us to
release the memory cgroup without waiting for the last outstanding kernel
object (e.g. dentry used by another application).
Since the parent cgroup is already charged, everything we need to do is to
splice the list of kmem_caches to the parent's kmem_caches list, swap the
memcg pointer, drop the css refcounter for each kmem_cache and adjust the
parent's css refcounter.
Please, note that kmem_cache->memcg_params.memcg isn't a stable pointer
anymore. It's safe to read it under rcu_read_lock(), cgroup_mutex held,
or any other way that protects the memory cgroup from being released.
We can race with the slab allocation and deallocation paths. It's not a
big problem: parent's charge and slab global stats are always correct, and
we don't care anymore about the child usage and global stats. The child
cgroup is already offline, so we don't use or show it anywhere.
Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE) aren't
used anywhere except count_shadow_nodes(). But even there it won't break
anything: after reparenting "nodes" will be 0 on child level (because
we're already reparenting shrinker lists), and on parent level page stats
always were 0, and this patch won't change anything.
[[email protected]: properly handle kmem_caches reparented to root_mem_cgroup]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Every slab page charged to a non-root memory cgroup has a pointer to the
memory cgroup and holds a reference to it, which protects a non-empty
memory cgroup from being released. At the same time the page has a
pointer to the corresponding kmem_cache, and also holds a reference to the
kmem_cache. And the kmem_cache by itself holds a reference to the cgroup.
So there is clearly some redundancy, which allows us to stop setting the
page->mem_cgroup pointer and rely on getting the memcg pointer indirectly
via the kmem_cache. Further, it will make it easier to change this pointer,
without a need to go over all charged pages.
So let's stop setting page->mem_cgroup pointer for slab pages, and stop
using the css refcounter directly for protecting the memory cgroup from
going away. Instead rely on kmem_cache as an intermediate object.
Make sure that vmstats and shrinker lists are working as previously, as
well as /proc/kpagecgroup interface.
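The indirection described above is roughly the following (a sketch; the
memcg_params.memcg field and is_root_cache() are as used elsewhere in the
series):

	/* resolve the memcg of a slab page via its kmem_cache */
	static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
	{
		struct kmem_cache *s = READ_ONCE(page->slab_cache);

		if (s && !is_root_cache(s))
			return s->memcg_params.memcg;
		return NULL;
	}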
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently each charged slab page holds a reference to the cgroup to which
it's charged. Kmem_caches are held by the memcg and are released all
together with the memory cgroup. It means that none of kmem_caches are
released unless at least one reference to the memcg exists, which is very
far from optimal.
Let's rework it in a way that allows releasing individual kmem_caches as
soon as the cgroup is offline, the kmem_cache is empty and there are no
pending allocations.
To make it possible, let's introduce a new percpu refcounter for non-root
kmem caches. The counter is initialized to the percpu mode, and is
switched to the atomic mode during kmem_cache deactivation. The counter
is bumped for every charged page and also for every running allocation.
So the kmem_cache can't be released unless all allocations complete.
To shut down inactive empty kmem_caches, let's reuse the work queue
previously used for kmem_cache deactivation. Once the reference
counter reaches 0, let's schedule an asynchronous kmem_cache release.
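Mapped onto the percpu_ref API, the life cycle described above looks
roughly like this; the refcnt field and the kmemcg_cache_shutdown() release
callback are names assumed for the illustration:

	#include <linux/percpu-refcount.h>

	/* at kmem_cache creation: start counting in cheap percpu mode */
	static int kmemcg_refcnt_init(struct kmem_cache *s)
	{
		return percpu_ref_init(&s->memcg_params.refcnt,
				       kmemcg_cache_shutdown, 0, GFP_KERNEL);
	}

	/*
	 * At deactivation: switch to atomic mode and drop the initial ref;
	 * once every charged page and in-flight allocation has done its
	 * percpu_ref_put(), the release callback schedules the shutdown.
	 */
	static void kmemcg_refcnt_deactivate(struct kmem_cache *s)
	{
		percpu_ref_kill(&s->memcg_params.refcnt);
	}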
* I used the following simple approach to test the performance
(stolen from another patchset by T. Harding):
time find / -name fname-no-exist
echo 2 > /proc/sys/vm/drop_caches
repeat 10 times
Results:
orig patched
real 0m1.455s real 0m1.355s
user 0m0.206s user 0m0.219s
sys 0m0.855s sys 0m0.807s
real 0m1.487s real 0m1.699s
user 0m0.221s user 0m0.256s
sys 0m0.806s sys 0m0.948s
real 0m1.515s real 0m1.505s
user 0m0.183s user 0m0.215s
sys 0m0.876s sys 0m0.858s
real 0m1.291s real 0m1.380s
user 0m0.193s user 0m0.198s
sys 0m0.843s sys 0m0.786s
real 0m1.364s real 0m1.374s
user 0m0.180s user 0m0.182s
sys 0m0.868s sys 0m0.806s
real 0m1.352s real 0m1.312s
user 0m0.201s user 0m0.212s
sys 0m0.820s sys 0m0.761s
real 0m1.302s real 0m1.349s
user 0m0.205s user 0m0.203s
sys 0m0.803s sys 0m0.792s
real 0m1.334s real 0m1.301s
user 0m0.194s user 0m0.201s
sys 0m0.806s sys 0m0.779s
real 0m1.426s real 0m1.434s
user 0m0.216s user 0m0.181s
sys 0m0.824s sys 0m0.864s
real 0m1.350s real 0m1.295s
user 0m0.200s user 0m0.190s
sys 0m0.842s sys 0m0.811s
So it looks like the difference is not noticeable in this test.
[[email protected]: fix an use-after-free in kmemcg_workfn()]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Qian Cai <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Andrei Vagin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently the memcg_params.dying flag and the corresponding workqueue used
for the asynchronous deactivation of kmem_caches are synchronized using the
slab_mutex.
This makes it impossible to check the flag from irq context, which will be
required in order to implement asynchronous release of kmem_caches.
So let's switch over to the irq-save flavor of the spinlock-based
synchronization.
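With a spinlock, setting and testing the flag becomes safe from any
context, e.g. (a sketch; the lock and helper names are assumed for the
illustration):

	static DEFINE_SPINLOCK(memcg_kmem_wq_lock);

	static void kmemcg_mark_cache_dying(struct kmem_cache *s)
	{
		unsigned long flags;

		spin_lock_irqsave(&memcg_kmem_wq_lock, flags);
		s->memcg_params.dying = true;
		spin_unlock_irqrestore(&memcg_kmem_wq_lock, flags);
	}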
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There is no point in checking the root_cache->memcg_params.dying flag on
kmem_cache creation path. New allocations shouldn't be performed using a
dead root kmem_cache, so no new memcg kmem_cache creation can be scheduled
after the flag is set. And if it was scheduled before,
flush_memcg_workqueue() will wait for it anyway.
So let's drop this check to simplify the code.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently the page accounting code is duplicated in SLAB and SLUB
internals. Let's move it into new (un)charge_slab_page helpers in the
slab_common.c file. These helpers will be responsible for statistics
(global and memcg-aware) and memcg charging. So they are replacing direct
memcg_(un)charge_slab() calls.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Let's separate the page counter modification code out of
__memcg_kmem_uncharge() in a way similar to how
__memcg_kmem_charge() and __memcg_kmem_charge_memcg() work.
This will allow this code to be reused later via a new
memcg_kmem_uncharge_memcg() wrapper, which calls
__memcg_kmem_uncharge_memcg() if the memcg_kmem_enabled()
check passes.
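The resulting wrapper is roughly (a sketch):

	static inline void memcg_kmem_uncharge_memcg(struct page *page, int order,
						     struct mem_cgroup *memcg)
	{
		/* __memcg_kmem_uncharge_memcg() only touches the page counters */
		if (memcg_kmem_enabled())
			__memcg_kmem_uncharge_memcg(memcg, 1 << order);
	}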
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently SLUB uses a work scheduled after an RCU grace period to
deactivate a non-root kmem_cache. This mechanism can be reused for
kmem_caches release, but requires generalization for SLAB case.
Introduce kmemcg_cache_deactivate() function, which calls
allocator-specific __kmem_cache_deactivate() and schedules execution of
__kmem_cache_deactivate_after_rcu() with all necessary locks in a worker
context after an rcu grace period.
Here is the new calling scheme:
  kmemcg_cache_deactivate()
    __kmemcg_cache_deactivate()                  SLAB/SLUB-specific
    kmemcg_rcufn()                               rcu
      kmemcg_workfn()                            work
        __kmemcg_cache_deactivate_after_rcu()    SLAB/SLUB-specific
instead of:
  __kmemcg_cache_deactivate()                    SLAB/SLUB-specific
  slab_deactivate_memcg_cache_rcu_sched()        SLUB-only
    kmemcg_rcufn()                               rcu
      kmemcg_workfn()                            work
        kmemcg_cache_deact_after_rcu()           SLUB-only
For consistency, all allocator-specific functions start with "__".
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The delayed work/rcu deactivation infrastructure of non-root kmem_caches
can also be used for asynchronous release of these objects. Let's get rid
of the word "deactivation" in the corresponding names to make the code look
better after generalization.
It's easier to make the renaming first, so that the generalized code will
look consistent from scratch.
Let's rename struct memcg_cache_params fields:
deact_fn -> work_fn
deact_rcu_head -> rcu_head
deact_work -> work
And RCU/delayed work callbacks in slab common code:
kmemcg_deactivate_rcufn -> kmemcg_rcufn
kmemcg_deactivate_workfn -> kmemcg_workfn
This patch contains no functional changes, only renamings.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: reparent slab memory on cgroup removal", v7.
# Why do we need this?
We've noticed that the number of dying cgroups is steadily growing on most
of our hosts in production. The following investigation revealed an issue
in the userspace memory reclaim code [1], accounting of kernel stacks [2],
and also the main reason: slab objects.
The underlying problem is quite simple: any page charged to a cgroup holds
a reference to it, so the cgroup can't be reclaimed unless all charged
pages are gone. If a slab object is actively used by other cgroups, it
won't be reclaimed, and will prevent the origin cgroup from being
reclaimed.
Slab objects, and first of all the vfs cache, are shared between cgroups
that use the same underlying fs and, what's even more important, they are
shared between multiple generations of the same workload. So if something
runs periodically, each time in a new cgroup (like how systemd works), we
do accumulate multiple dying cgroups.
Strictly speaking pagecache isn't different here, but there is a key
difference: we disable protection and apply some extra pressure on LRUs of
dying cgroups, and these LRUs contain all charged pages. My experiments
show that with the disabled kernel memory accounting the number of dying
cgroups stabilizes at a relatively small number (~100, depends on memory
pressure and cgroup creation rate), and with kernel memory accounting it
grows pretty steadily up to several thousands.
Memory cgroups are quite complex and big objects (mostly due to percpu
stats), so it leads to noticeable memory losses. Memory occupied by dying
cgroups is measured in hundreds of megabytes. I've even seen a host with
more than 100Gb of memory wasted for dying cgroups. It leads to a
degradation of performance with the uptime, and generally limits the usage
of cgroups.
My previous attempt [3] to fix the problem by applying extra pressure on
slab shrinker lists caused regressions with xfs and ext4, and has been
reverted [4]. The following attempts to find the right balance [5, 6]
were not successful.
So instead of trying to find a maybe non-existing balance, let's
reparent accounted slab caches to the parent cgroup on cgroup removal.
# Implementation approach
There is however a significant problem with reparenting of slab memory:
there is no list of charged pages. Some of them are in shrinker lists,
but not all. Introducing a new list is really not an option.
But fortunately there is a way forward: every slab page has a stable
pointer to the corresponding kmem_cache. So the idea is to reparent
kmem_caches instead of slab pages.
It's actually simpler and cheaper, but requires some underlying changes:
1) Make kmem_caches hold a single reference to the memory cgroup,
instead of a separate reference per slab page.
2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
page->kmem_cache->memcg indirection instead. It's used only on
slab page release, so performance overhead shouldn't be a big issue.
3) Introduce a refcounter for non-root slab caches. It's required to
be able to destroy kmem_caches when they become empty and release
the associated memory cgroup.
There is a bonus: currently we release all memcg kmem_caches all together
with the memory cgroup itself. This patchset allows individual
kmem_caches to be released as soon as they become inactive and free.
Some additional implementation details are provided in corresponding
commit messages.
# Results
Below is the average number of dying cgroups on two groups of our
production hosts. They do run some sort of web frontend workload, the
memory pressure is moderate. As we can see, with the kernel memory
reparenting the number stabilizes in the 60s range; however with the original
version it grows almost linearly and doesn't show any signs of plateauing.
The difference in slab and percpu usage between patched and unpatched
versions also grows linearly. In 7 days it exceeded 200Mb.
day            0    1    2    3    4    5    6    7
original      56  362  628  752 1070 1250 1490 1560
patched       23   46   51   55   60   57   67   69
mem diff(Mb)  22   74  123  152  164  182  214  241
# Links
[1]: commit 68600f623d69 ("mm: don't miss the last page because of round-off error")
[2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
[3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
[4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs with a relatively small number of objects")
[5]: https://lkml.org/lkml/2019/1/28/1865
[6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2
This patch (of 10):
Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
rather than in init_memcg_params().
Once kmem_cache holds a reference to the memory cgroup, this will
simplify the refcounting.
For non-root kmem_caches memcg_link_cache() is always called before the
kmem_cache becomes visible to a user, so it's safe.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The current cgroup OOM memory info dump doesn't include all the memory
we are tracking, nor does it give insight into what the VM tried to do
leading up to the OOM. All that useful info is in memory.stat.
Furthermore, the recursive printing for every child cgroup can
generate absurd amounts of data on the console for larger cgroup
trees, and it's not like we provide a per-cgroup breakdown during
global OOM kills.
When an OOM kill is triggered, print one set of recursive memory.stat
items at the level whose limit triggered the OOM condition.
Example output:
stress invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
CPU: 2 PID: 210 Comm: stress Not tainted 5.2.0-rc2-mm1-00247-g47d49835983c #135
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
Call Trace:
dump_stack+0x46/0x60
dump_header+0x4c/0x2d0
oom_kill_process.cold.10+0xb/0x10
out_of_memory+0x200/0x270
? try_to_free_mem_cgroup_pages+0xdf/0x130
mem_cgroup_out_of_memory+0xb7/0xc0
try_charge+0x680/0x6f0
mem_cgroup_try_charge+0xb5/0x160
__add_to_page_cache_locked+0xc6/0x300
? list_lru_destroy+0x80/0x80
add_to_page_cache_lru+0x45/0xc0
pagecache_get_page+0x11b/0x290
filemap_fault+0x458/0x6d0
ext4_filemap_fault+0x27/0x36
__do_fault+0x2f/0xb0
__handle_mm_fault+0x9c5/0x1140
? apic_timer_interrupt+0xa/0x20
handle_mm_fault+0xc5/0x180
__do_page_fault+0x1ab/0x440
? page_fault+0x8/0x30
page_fault+0x1e/0x30
RIP: 0033:0x55c32167fc10
Code: Bad RIP value.
RSP: 002b:00007fff1d031c50 EFLAGS: 00010206
RAX: 000000000dc00000 RBX: 00007fd2db000010 RCX: 00007fd2db000010
RDX: 0000000000000000 RSI: 0000000010001000 RDI: 0000000000000000
RBP: 000055c321680a54 R08: 00000000ffffffff R09: 0000000000000000
R10: 0000000000000022 R11: 0000000000000246 R12: ffffffffffffffff
R13: 0000000000000002 R14: 0000000000001000 R15: 0000000010000000
memory: usage 1024kB, limit 1024kB, failcnt 75131
swap: usage 0kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /foo:
anon 0
file 0
kernel_stack 36864
slab 274432
sock 0
shmem 0
file_mapped 0
file_dirty 0
file_writeback 0
anon_thp 0
inactive_anon 126976
active_anon 0
inactive_file 0
active_file 0
unevictable 0
slab_reclaimable 0
slab_unreclaimable 274432
pgfault 59466
pgmajfault 1617
workingset_refault 2145
workingset_activate 0
workingset_nodereclaim 0
pgrefill 98952
pgscan 200060
pgsteal 59340
pgactivate 40095
pgdeactivate 96787
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 0
thp_collapse_alloc 0
Tasks state (memory values in pages):
[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 200] 0 200 1121 884 53248 29 0 bash
[ 209] 0 209 905 246 45056 19 0 stress
[ 210] 0 210 66442 56 499712 56349 0 stress
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),oom_memcg=/foo,task_memcg=/foo,task=stress,pid=210,uid=0
Memory cgroup out of memory: Killed process 210 (stress) total-vm:265768kB, anon-rss:0kB, file-rss:224kB, shmem-rss:0kB
oom_reaper: reaped process 210 (stress), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[[email protected]: s/kvmalloc/kmalloc/ per Michal]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The memory controller in cgroup v2 exposes memory.events file for each
memcg which shows the number of times events like low, high, max, oom
and oom_kill have happened for the whole tree rooted at that memcg.
Users can also poll or register notification to monitor the changes in
that file. Any event at any level of the tree rooted at memcg will
notify all the listeners along the path till root_mem_cgroup. There are
existing users which depend on this behavior.
However there are users which are only interested in the events
happening at a specific level of the memcg tree and not in the events in
the underlying tree rooted at that memcg. One such use-case is a
centralized resource monitor which can dynamically adjust the limits of
the jobs running on a system. The jobs can create their sub-hierarchy
for their own sub-tasks. The centralized monitor is only interested in
the events at the top level memcgs of the jobs as it can then act and
adjust the limits of the jobs. Using the current memory.events for such a
centralized monitor is very inconvenient. The monitor will keep receiving
events it is not interested in, and to find out whether a received event is
interesting, it has to read the memory.events files of the next level and
compare them with the top level one. So, let's introduce
memory.events.local to the memcg, which shows and notifies for the events
at the memcg level.
Now, do memory.stat and memory.pressure need their local versions? IMHO
no, due to the "no internal processes" constraint of cgroup v2. The
memory.stat file of the top level memcg of a job shows the stats and
vmevents of the whole tree. The local stats or vmevents of the top level
memcg will only change if there is a process running in that memcg but v2
does not allow that. Similarly for memory.pressure there will not be any
process in the internal nodes and thus no chance of local pressure.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Chris Down <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit d46eb14b735b ("fs: fsnotify: account fsnotify metadata to
kmemcg") added remote memcg charging for fanotify and inotify event
objects. The aim was to charge the memory to the listener who is
interested in the events but without triggering the OOM killer.
Otherwise there would be security concerns for the listener.
At the time, oom-kill trigger was not in the charging path. A parallel
work added the oom-kill back to charging path i.e. commit 29ef680ae7c2
("memcg, oom: move out_of_memory back to the charge path"). So to not
trigger the oom-killer in the remote memcg, explicitly add
__GFP_RETRY_MAYFAIL to the fanotify and inotify event allocations.
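Concretely, the event allocations end up passing flags along these lines
(a sketch, shown for the fanotify cache, with the listener's memcg already
active for remote charging):

	event = kmem_cache_alloc(fanotify_event_cachep,
				 GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);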
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: Jan Kara <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Amir Goldstein <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The documentation of __GFP_RETRY_MAYFAIL clearly mentions that the OOM
killer will not be triggered, and indeed the page allocator does not invoke
the OOM killer for such allocations. However we do trigger the memcg OOM
killer for __GFP_RETRY_MAYFAIL. Fix that. This flag will be used later to
not trigger the oom-killer in the charging path for fanotify and inotify
event allocations.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Via commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks"),
after swapoff, the address_space associated with the swap device will be
freed. So swap_address_space() users which touch the address_space need
some kind of mechanism to prevent the address_space from being freed
during accessing.
When mincore processes an unmapped range for swapped shmem pages, it
doesn't hold the lock to prevent swap device from being swapped off. So
the following race is possible:
CPU1                                    CPU2
do_mincore()                            swapoff()
  walk_page_range()
    mincore_unmapped_range()
      __mincore_unmapped_range
        mincore_page
          as = swap_address_space()
          ...                             exit_swap_address_space()
          ...                               kvfree(spaces)
          find_get_page(as)
The address space may be accessed after being freed.
To fix the race, get_swap_device()/put_swap_device() is used to enclose
find_get_page() to check whether the swap entry is valid and prevent the
swap device from being swapped off during the access.
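A condensed sketch of the resulting pattern is below; the wrapper name is
illustrative, upstream the equivalent code sits directly in the mincore path
for swapped-out shmem pages.

#include <linux/swap.h>
#include <linux/pagemap.h>

/*
 * Illustrative wrapper: look up a page in the swap cache only while
 * get_swap_device() pins the swap device, so swapoff() cannot free the
 * swap address_space underneath the lookup.
 */
static struct page *example_lookup_swap_cache(swp_entry_t entry)
{
        struct swap_info_struct *si;
        struct page *page = NULL;

        si = get_swap_device(entry);    /* NULL if the entry went stale */
        if (si) {
                page = find_get_page(swap_address_space(entry),
                                     swp_offset(entry));
                put_swap_device(si);    /* let swapoff() proceed again */
        }
        return page;
}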
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks")
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Andrea Parri <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
swap_extent is used to map swap page offset to backing device's block
offset. For a continuous block range, one swap_extent is used and all
these swap_extents are managed in a linked list.
These swap_extents are used by map_swap_entry() during swap's read and
write path. To find out the backing device's block offset for a page
offset, the swap_extent list will be traversed linearly, with
curr_swap_extent being used as a cache to speed up the search.
This works well as long as there are not many swap_extents or the number
of processes accessing the swap device is small, but when the swap device
has many extents and many processes access it concurrently, it becomes a
problem. On one of our servers, the disk's free space is tight:
$df -h
Filesystem Size Used Avail Use% Mounted on
... ...
/dev/nvme0n1p1 1.8T 1.3T 504G 72% /home/t4
When creating an 80G swapfile there, there are as many as 84656 swap
extents. The end result is that the kernel spends about 30% of its time in
map_swap_entry() and swap throughput is only 70MB/s.
As a comparison, with a smaller swapfile, e.g. a 4G one whose swap_extent
count dropped to 2000, swap throughput is back to 400-500MB/s and
map_swap_entry() is down to about 3%.
One downside of using an rbtree for swap_extent is that 'struct rb_node'
takes 24 bytes while 'struct list_head' takes 16 bytes, i.e. 8 bytes more
for each swap_extent. For a swapfile that has 80k swap_extents, that
means 625KiB more memory consumed.
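For reference, the core of the change is an O(log(n)) rbtree search keyed
on the extent's starting page offset; the sketch below is a condensed
stand-in, with struct and helper names that are illustrative rather than
the real swap code.

#include <linux/types.h>
#include <linux/rbtree.h>

/* Illustrative stand-in for swap_extent: one contiguous run of blocks on
 * the backing device, keyed by its first page offset in the swapfile. */
struct example_swap_extent {
        struct rb_node rb_node;
        pgoff_t start_page;
        pgoff_t nr_pages;
        sector_t start_block;
};

/* O(log(n)) lookup of the extent covering @offset, replacing the linear
 * list walk that curr_swap_extent used to cache. */
static struct example_swap_extent *
example_extent_lookup(struct rb_root *root, pgoff_t offset)
{
        struct rb_node *rb = root->rb_node;

        while (rb) {
                struct example_swap_extent *se;

                se = rb_entry(rb, struct example_swap_extent, rb_node);
                if (offset < se->start_page)
                        rb = rb->rb_left;
                else if (offset >= se->start_page + se->nr_pages)
                        rb = rb->rb_right;
                else
                        return se;      /* offset lies inside this extent */
        }
        return NULL;    /* every in-use swap offset should be covered */
}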
Test:
Since it's not possible to reboot that server, I could not test this patch
directly there. Instead, I tested it on another server with an NVMe disk.
I created a 20G swapfile on an NVMe backed XFS fs. By default, the
filesystem is quite clean and the created swapfile has only 2 extents.
Testing vanilla and this patch shows no obvious performance difference
when the swapfile is not fragmented.
To see the patch's effects, I used some tweaks to manually fragment the
swapfile by breaking the extents at 1M boundaries. This made the swapfile
have 20K extents.
nr_task=4
kernel   swapout(KB/s)   map_swap_entry(perf)   swapin(KB/s)    map_swap_entry(perf)
vanilla  165191          90.77%                 171798          90.21%
patched  858993 +420%    2.16%                  715827 +317%    0.77%

nr_task=8
kernel   swapout(KB/s)   map_swap_entry(perf)   swapin(KB/s)    map_swap_entry(perf)
vanilla  306783          92.19%                 318145          87.76%
patched  954437 +211%    2.35%                  1073741 +237%   1.57%

swapout: the throughput of swap out, in KB/s, higher is better
1st map_swap_entry: cpu cycles percent sampled by perf
swapin: the throughput of swap in, in KB/s, higher is better
2nd map_swap_entry: cpu cycles percent sampled by perf
nr_task=1 doesn't show any difference because curr_swap_extent can
effectively cache the correct swap extent for a single-task workload.
[[email protected]: s/BUG_ON(1)/BUG()/]
Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu
Signed-off-by: Aaron Lu <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
total_swapcache_pages() may race with swapper_spaces[] allocation and
freeing. Previously, this was protected with a swapper_spaces[]-specific
RCU mechanism. To simplify the logic and reduce code complexity, it is
replaced with get/put_swap_device(). The number of code lines is reduced
too. Although not so important, swapoff() performance improves as well
because one synchronize_rcu() call during swapoff() is deleted.
[[email protected]: fix bad swap file entry warning]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Tested-by: Mike Kravetz <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Andrea Parri <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When swapin is performed, after getting the swap entry information from
the page table, the system will swap in the swap entry without any lock
held to prevent the swap device from being swapped off. This may cause a
race like the one below:
CPU 1                                   CPU 2
-----                                   -----
do_swap_page
  swapin_readahead
    __read_swap_cache_async
                                        swapoff
      swapcache_prepare
                                          p->swap_map = NULL
        __swap_duplicate
          p->swap_map[?] /* !!! NULL pointer access */
Because swapoff is usually only done at system shutdown, the race may not
hit many people in practice. But it is still a race that needs to be fixed.
To fix the race, get_swap_device() is added to check whether the specified
swap entry is valid in its swap device. If so, it will keep the swap
entry valid by preventing the swap device from being swapped off, until
put_swap_device() is called.
Because swapoff() is a very rare code path, to make the normal path run as
fast as possible, rcu_read_lock()/unlock() and synchronize_rcu() are used
instead of a reference count to implement get/put_swap_device(). From
get_swap_device() to put_swap_device(), the RCU read side is locked, so
synchronize_rcu() in swapoff() will wait until put_swap_device() is
called.
In addition to the swap_map, cluster_info, etc. data structures in struct
swap_info_struct, the swap cache radix tree will be freed after swapoff,
so this patch fixes the race between swap cache lookup and swapoff too.
Races between some other swap cache usages and swapoff are also fixed by
calling synchronize_rcu() between clearing PageSwapCache() and freeing the
swap cache data structures.
Another possible way to fix this is to use preempt_disable() +
stop_machine() to prevent the swap device from being swapped off while its
data structures are being accessed. The hot-path overhead of both methods
is similar. The advantages of the RCU based method are:
1. stop_machine() may disturb the normal execution code path on other
CPUs, while RCU does not.
2. The file cache uses RCU to protect its radix tree. If a similar
mechanism is used for the swap cache too, it is easier to share code
between them.
3. RCU is already used to protect the swap cache in total_swapcache_pages()
and exit_swap_address_space(). The two mechanisms can be merged to
simplify the logic.
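A simplified sketch of the scheme is below. The real helpers in
mm/swapfile.c also validate the swap offset and warn on bad entries; the
valid-flag check is shown in spirit, and the wrapper names are illustrative.

#include <linux/rcupdate.h>
#include <linux/swap.h>

/*
 * Simplified sketch: the reader enters an RCU read-side critical section
 * and re-checks that the device is still marked valid; swapoff() clears
 * that flag and then waits in synchronize_rcu() before freeing swap_map,
 * cluster_info and the swap cache address_space array.
 */
static struct swap_info_struct *example_get_swap_device(swp_entry_t entry)
{
        struct swap_info_struct *si = swp_swap_info(entry);

        if (!si)
                return NULL;

        rcu_read_lock();
        if (!(si->flags & SWP_VALID)) {         /* swapoff already started */
                rcu_read_unlock();
                return NULL;
        }
        return si;      /* pair with example_put_swap_device() */
}

static void example_put_swap_device(struct swap_info_struct *si)
{
        rcu_read_unlock();
}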
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 235b62176712 ("mm/swap: add cluster lock")
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Andrea Parri <[email protected]>
Not-nacked-by: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Dave Jiang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking
operations") changed when mmap_sem is dropped during filemap page fault
and when returning VM_FAULT_RETRY.
Correct the comment to reflect the change.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Acked-by: Song Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix the callback 9p passes to read_cache_page to actually have the
proper type expected. Casting around function pointers can easily
hide typing bugs, and defeats control flow protection.
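The general shape of such a fix is sketched below (this is not the actual
9p diff): give the callback the exact filler_t signature,
int (*)(void *, struct page *), and recover the typed context inside it
instead of casting at the call site. The example_* names are placeholders.

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/highmem.h>

/* Placeholder for the filesystem's real page-filling logic. */
static int example_fill_page(struct file *filp, struct page *page)
{
        zero_user(page, 0, PAGE_SIZE);  /* stand-in for the real read */
        SetPageUptodate(page);
        unlock_page(page);
        return 0;
}

/*
 * A filler with the exact type read_cache_page() expects, so no
 * function-pointer cast is needed and CFI sees a matching indirect call.
 */
static int example_readpage_filler(void *data, struct page *page)
{
        return example_fill_page(data, page);
}

static struct page *example_read_page(struct address_space *mapping,
                                      struct file *filp, pgoff_t index)
{
        return read_cache_page(mapping, index, example_readpage_filler, filp);
}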
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix the callback jffs2 passes to read_cache_page to actually have the
proper type expected. Casting around function pointers can easily hide
typing bugs, and defeats control flow protection.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We can just pass a NULL filler and do the right thing inside of
do_read_cache_page based on the NULL parameter.
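Roughly, the dispatch inside do_read_cache_page() becomes the branch
sketched below (a condensed illustration shown as a standalone helper,
with error handling elided):

#include <linux/fs.h>
#include <linux/pagemap.h>

/*
 * Condensed sketch: a NULL filler now means "call ->readpage directly",
 * so callers no longer cast ->readpage to filler_t and the indirect call
 * keeps its proper type.
 */
static int example_read_or_fill(struct address_space *mapping,
                                filler_t *filler, void *data,
                                struct page *page)
{
        if (filler)
                return filler(data, page);

        /* With a NULL filler, data is the struct file * (possibly NULL). */
        return mapping->a_ops->readpage(data, page);
}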
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "fix filler_t callback type mismatches", v2.
Casting mapping->a_ops->readpage to filler_t causes an indirect call
type mismatch with Control-Flow Integrity checking. This change fixes
the mismatch in read_cache_page_gfp and read_mapping_page by using a NULL
filler argument as an indication to call ->readpage directly, and by
passing correctly typed callbacks in 9p and jffs2.
This patch (of 4):
Code cleanup.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When debug_pagealloc is enabled, we currently allocate the page_ext
array to mark guard pages with the PAGE_EXT_DEBUG_GUARD flag. Now that
we have the page_type field in struct page, we can use that instead, as
guard pages are neither PageSlab nor mapped to userspace. This reduces
memory overhead when debug_pagealloc is enabled and there are no other
features requiring the page_ext array.
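A condensed sketch of the mechanism follows; the example_* wrappers are
illustrative, while upstream the __SetPageGuard()/__ClearPageGuard()
helpers come from PAGE_TYPE_OPS(Guard, guard) in page-flags.h and are used
from set_page_guard()/clear_page_guard() in the page allocator.

#include <linux/mm.h>
#include <linux/page-flags.h>

/*
 * Illustrative wrappers: the guard marker lives directly in struct page's
 * page_type word, so no page_ext array entry is needed just to remember
 * "this is a debug_pagealloc guard page".
 */
static inline void example_mark_guard_page(struct page *page,
                                           unsigned int order)
{
        INIT_LIST_HEAD(&page->lru);     /* guard pages sit on no freelist */
        __SetPageGuard(page);           /* records the type in page->page_type */
        set_page_private(page, order);  /* remember the guarded order */
}

static inline void example_clear_guard_page(struct page *page)
{
        __ClearPageGuard(page);
        set_page_private(page, 0);
}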
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|