aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-03-22mm: rework swap handling of zap_pte_rangePeter Xu1-15/+6
Clean the code up by merging the device private/exclusive swap entry handling with the rest, then we merge the pte clear operation too. struct* page is defined in multiple places in the function, move it upward. free_swap_and_cache() is only useful for !non_swap_entry() case, put it into the condition. No functional change intended. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Yang Shi <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: change zap_details.zap_mapping into even_cowsPeter Xu1-9/+7
Currently we have a zap_mapping pointer maintained in zap_details, when it is specified we only want to zap the pages that has the same mapping with what the caller has specified. But what we want to do is actually simpler: we want to skip zapping private (COW-ed) pages in some cases. We can refer to unmap_mapping_pages() callers where we could have passed in different even_cows values. The other user is unmap_mapping_folio() where we always want to skip private pages. According to Hugh, we used a mapping pointer for historical reason, as explained here: https://lore.kernel.org/lkml/[email protected]/ Quoting partly from Hugh: Which raises the question again of why I did not just use a boolean flag there originally: aah, I think I've found why. In those days there was a horrible "optimization", for better performance on some benchmark I guess, which when you read from /dev/zero into a private mapping, would map the zero page there (look up read_zero_pagealigned() and zeromap_page_range() if you dare). So there was another category of page to be skipped along with the anon COWs, and I didn't want multiple tests in the zap loop, so checking check_mapping against page->mapping did both. I think nowadays you could do it by checking for PageAnon page (or genuine swap entry) instead. This patch replaces the zap_details.zap_mapping pointer into the even_cows boolean, then we check it against PageAnon. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Suggested-by: Hugh Dickins <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: rename zap_skip_check_mapping() to should_zap_page()Peter Xu1-9/+8
The previous name is against the natural way people think. Invert the meaning and also the return value. No functional change intended. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Suggested-by: Hugh Dickins <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: don't skip swap entry even if zap_details specifiedPeter Xu1-9/+31
Patch series "mm: Rework zap ptes on swap entries", v5. Patch 1 should fix a long standing bug for zap_pte_range() on zap_details usage. The risk is we could have some swap entries skipped while we should have zapped them. Migration entries are not the major concern because file backed memory always zap in the pattern that "first time without page lock, then re-zap with page lock" hence the 2nd zap will always make sure all migration entries are already recovered. However there can be issues with real swap entries got skipped errornoously. There's a reproducer provided in commit message of patch 1 for that. Patch 2-4 are cleanups that are based on patch 1. After the whole patchset applied, we should have a very clean view of zap_pte_range(). Only patch 1 needs to be backported to stable if necessary. This patch (of 4): The "details" pointer shouldn't be the token to decide whether we should skip swap entries. For example, when the callers specified details->zap_mapping==NULL, it means the user wants to zap all the pages (including COWed pages), then we need to look into swap entries because there can be private COWed pages that was swapped out. Skipping some swap entries when details is non-NULL may lead to wrongly leaving some of the swap entries while we should have zapped them. A reproducer of the problem: ===8<=== #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <stdio.h> #include <assert.h> #include <unistd.h> #include <sys/mman.h> #include <sys/types.h> int page_size; int shmem_fd; char *buffer; void main(void) { int ret; char val; page_size = getpagesize(); shmem_fd = memfd_create("test", 0); assert(shmem_fd >= 0); ret = ftruncate(shmem_fd, page_size * 2); assert(ret == 0); buffer = mmap(NULL, page_size * 2, PROT_READ | PROT_WRITE, MAP_PRIVATE, shmem_fd, 0); assert(buffer != MAP_FAILED); /* Write private page, swap it out */ buffer[page_size] = 1; madvise(buffer, page_size * 2, MADV_PAGEOUT); /* This should drop private buffer[page_size] already */ ret = ftruncate(shmem_fd, page_size); assert(ret == 0); /* Recover the size */ ret = ftruncate(shmem_fd, page_size * 2); assert(ret == 0); /* Re-read the data, it should be all zero */ val = buffer[page_size]; if (val == 0) printf("Good\n"); else printf("BUG\n"); } ===8<=== We don't need to touch up the pmd path, because pmd never had a issue with swap entries. For example, shmem pmd migration will always be split into pte level, and same to swapping on anonymous. Add another helper should_zap_cows() so that we can also check whether we should zap private mappings when there's no page pointer specified. This patch drops that trick, so we handle swap ptes coherently. Meanwhile we should do the same check upon migration entry, hwpoison entry and genuine swap entries too. To be explicit, we should still remember to keep the private entries if even_cows==false, and always zap them when even_cows==true. The issue seems to exist starting from the initial commit of git. [[email protected]: comment tweaks] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Peter Xu <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yang Shi <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: replace multiple dcache flush with flush_dcache_folio()Muchun Song1-6/+2
Simplify the code by using flush_dcache_folio(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lars Persson <[email protected]> Cc: Peter Xu <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: userfaultfd: fix missing cache flush in mcopy_atomic_pte() and ↵Muchun Song1-0/+3
__mcopy_atomic() userfaultfd calls mcopy_atomic_pte() and __mcopy_atomic() which do not do any cache flushing for the target page. Then the target page will be mapped to the user space with a different address (user address), which might have an alias issue with the kernel address used to copy the data from the user to. Fix this by insert flush_dcache_page() after copy_from_user() succeeds. Link: https://lkml.kernel.org/r/[email protected] Fixes: b6ebaedb4cb1 ("userfaultfd: avoid mmap_sem read recursion in mcopy_atomic") Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation") Signed-off-by: Muchun Song <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lars Persson <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Peter Xu <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: shmem: fix missing cache flush in shmem_mfill_atomic_pte()Muchun Song1-1/+3
userfaultfd calls shmem_mfill_atomic_pte() which does not do any cache flushing for the target page. Then the target page will be mapped to the user space with a different address (user address), which might have an alias issue with the kernel address used to copy the data from the user to. Insert flush_dcache_page() in non-zero-page case. And replace clear_highpage() with clear_user_highpage() which already considers the cache maintenance. Link: https://lkml.kernel.org/r/[email protected] Fixes: 8d1039634206 ("userfaultfd: shmem: add shmem_mfill_zeropage_pte for userfaultfd support") Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support") Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lars Persson <[email protected]> Cc: Peter Xu <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: hugetlb: fix missing cache flush in hugetlb_mcopy_atomic_pte()Muchun Song1-1/+2
folio_copy() will copy the data from one page to the target page, then the target page will be mapped to the user space address, which might have an alias issue with the kernel address used to copy the data from the page to. There are 2 ways to fix this issue. 1) insert flush_dcache_page() after folio_copy(). 2) replace folio_copy() with copy_user_huge_page() which already considers the cache maintenance. We chose 2) way to fix the issue since architectures can optimize this situation. It is also make backports easier. Link: https://lkml.kernel.org/r/[email protected] Fixes: 8cc5fcbb5be8 ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY") Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lars Persson <[email protected]> Cc: Peter Xu <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: hugetlb: fix missing cache flush in copy_huge_page_from_user()Muchun Song1-0/+2
userfaultfd calls copy_huge_page_from_user() which does not do any cache flushing for the target page. Then the target page will be mapped to the user space with a different address (user address), which might have an alias issue with the kernel address used to copy the data from the user to. Fix this issue by flushing dcache in copy_huge_page_from_user(). Link: https://lkml.kernel.org/r/[email protected] Fixes: fa4d75c1de13 ("userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support") Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lars Persson <[email protected]> Cc: Peter Xu <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: fix missing cache flush for all tail pages of compound pageMuchun Song1-2/+5
The D-cache maintenance inside move_to_new_page() only consider one page, there is still D-cache maintenance issue for tail pages of compound page (e.g. THP or HugeTLB). THP migration is only enabled on x86_64, ARM64 and powerpc, while powerpc and arm64 need to maintain the consistency between I-Cache and D-Cache, which depends on flush_dcache_page() to maintain the consistency between I-Cache and D-Cache. But there is no issues on arm64 and powerpc since they already considers the compound page cache flushing in their icache flush function. HugeTLB migration is enabled on arm, arm64, mips, parisc, powerpc, riscv, s390 and sh, while arm has handled the compound page cache flush in flush_dcache_page(), but most others do not. In theory, the issue exists on many architectures. Fix this by not using flush_dcache_folio() since it is not backportable. Link: https://lkml.kernel.org/r/[email protected] Fixes: 290408d4a250 ("hugetlb: hugepage migration core") Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lars Persson <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Peter Xu <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: thp: fix wrong cache flush in remove_migration_pmd()Muchun Song1-1/+2
Patch series "Fix some cache flush bugs", v5. This series focuses on fixing cache maintenance. This patch (of 7): The flush_cache_range() is supposed to be justified only if the page is already placed in process page table, and that is done right after flush_cache_range(). So using this interface is wrong. And there is no need to invalite cache since it was non-present before in remove_migration_pmd(). So just to remove it. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: David Rientjes <[email protected]> Cc: Lars Persson <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Zi Yan <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Muchun Song <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: remove mmu_gathers storage from remaining architecturesStafford Horne3-5/+0
Originally the mmu_gathers were removed in commit 1c3951769621 ("mm: now that all old mmu_gather code is gone, remove the storage"). However, the openrisc and hexagon architecture were merged around the same time and mmu_gathers was not removed. This patch removes them from openrisc, hexagon and nds32: Noticed while cleaning this warning: arch/openrisc/mm/init.c:41:1: warning: symbol 'mmu_gathers' was not declared. Should it be static? Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stafford Horne <[email protected]> Acked-by: Mike Rapoport <[email protected]> Cc: Brian Cain <[email protected]> Cc: Nick Hu <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Vincent Chen <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Stefan Kristiansson <[email protected]> Cc: Russell King <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Christophe Leroy <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: merge pte_mkhuge() call into arch_make_huge_pte()Anshuman Khandual6-6/+6
Each call into pte_mkhuge() is invariably followed by arch_make_huge_pte(). Instead arch_make_huge_pte() can accommodate pte_mkhuge() at the beginning. This updates generic fallback stub for arch_make_huge_pte() and available platforms definitions. This makes huge pte creation much cleaner and easier to follow. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Reviewed-by: Christophe Leroy <[email protected]> Acked-by: Mike Kravetz <[email protected]> Acked-by: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22selftests, x86: fix how check_cc.sh is being invokedGuillaume Tucker2-6/+6
The $(CC) variable used in Makefiles could contain several arguments such as "ccache gcc". These need to be passed as a single string to check_cc.sh, otherwise only the first argument will be used as the compiler command. Without quotes, the $(CC) variable is passed as distinct arguments which causes the script to fail to build trivial programs. Fix this by adding quotes around $(CC) when calling check_cc.sh to pass the whole string as a single argument to the script even if it has several words such as "ccache gcc". Link: https://lkml.kernel.org/r/d0d460d7be0107a69e3c52477761a6fe694c1840.1646991629.git.guillaume.tucker@collabora.com Fixes: e9886ace222e ("selftests, x86: Rework x86 target architecture detection") Signed-off-by: Guillaume Tucker <[email protected]> Tested-by: "kernelci.org bot" <[email protected]> Reviewed-by: Guenter Roeck <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22memcg: enable accounting for tty-related objectsVasily Averin1-1/+1
At each login the user forces the kernel to create a new terminal and allocate up to ~1Kb memory for the tty-related structures. By default it's allowed to create up to 4096 ptys with 1024 reserve for initial mount namespace only and the settings are controlled by host admin. Though this default is not enough for hosters with thousands of containers per node. Host admin can be forced to increase it up to NR_UNIX98_PTY_MAX = 1<<20. By default container is restricted by pty mount_opt.max = 1024, but admin inside container can change it via remount. As a result, one container can consume almost all allowed ptys and allocate up to 1Gb of unaccounted memory. It is not enough per-se to trigger OOM on host, however anyway, it allows to significantly exceed the assigned memcg limit and leads to troubles on the over-committed node. It makes sense to account for them to restrict the host's memory consumption from inside the memcg-limited container. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vasily Averin <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Jiri Slaby <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: memcontrol: rename memcg_cache_id to memcg_kmem_idMuchun Song2-6/+6
The memcg_cache_id() introduced by commit 2633d7a02823 ("slab/slub: consider a memcg parameter in kmem_create_cache") is used to index in the kmem_cache->memcg_params->memcg_caches array. Since kmem_cache->memcg_params.memcg_caches has been removed by commit 9855609bde03 ("mm: memcg/slab: use a single set of kmem_caches for all accounted allocations"). So the name does not need to reflect cache related. Just rename it to memcg_kmem_id. And it can reflect kmem related. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: list_lru: rename list_lru_per_memcg to list_lru_memcgMuchun Song2-10/+10
The name of list_lru_memcg was occupied before and became free since last commit. Rename list_lru_per_memcg to list_lru_memcg since the name is brief. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: memcontrol: fix cannot alloc the maximum memcg IDMuchun Song1-2/+1
The idr_alloc() does not include @max ID. So in the current implementation, the maximum memcg ID is 65534 instead of 65535. It seems a bug. So fix this. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: memcontrol: reuse memory cgroup ID for kmem IDMuchun Song1-36/+3
There are two idrs being used by memory cgroup, one is for kmem ID, another is for memory cgroup ID. The maximum ID of both is 64Ki. Both of them can limit the total number of memory cgroups. Actually, we can reuse memory cgroup ID for kmem ID to simplify the code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: list_lru: replace linear array with xarrayMuchun Song4-243/+73
If we run 10k containers in the system, the size of the list_lru_memcg->lrus can be ~96KB per list_lru. When we decrease the number containers, the size of the array will not be shrinked. It is not scalable. The xarray is a good choice for this case. We can save a lot of memory when there are tens of thousands continers in the system. If we use xarray, we also can remove the logic code of resizing array, which can simplify the code. [[email protected]: remove unused local] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: list_lru: rename memcg_drain_all_list_lrus to memcg_reparent_list_lrusMuchun Song3-16/+16
The purpose of the memcg_drain_all_list_lrus() is list_lrus reparenting. It is very similar to memcg_reparent_objcgs(). Rename it to memcg_reparent_list_lrus() so that the name can more consistent with memcg_reparent_objcgs(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: list_lru: allocate list_lru_one only when neededMuchun Song3-57/+77
In our server, we found a suspected memory leak problem. The kmalloc-32 consumes more than 6GB of memory. Other kmem_caches consume less than 2GB memory. After our in-depth analysis, the memory consumption of kmalloc-32 slab cache is the cause of list_lru_one allocation. crash> p memcg_nr_cache_ids memcg_nr_cache_ids = $2 = 24574 memcg_nr_cache_ids is very large and memory consumption of each list_lru can be calculated with the following formula. num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32) There are 4 numa nodes in our system, so each list_lru consumes ~3MB. crash> list super_blocks | wc -l 952 Every mount will register 2 list lrus, one is for inode, another is for dentry. There are 952 super_blocks. So the total memory is 952 * 2 * 3 MB (~5.6GB). But the number of memory cgroup is less than 500. So I guess more than 12286 containers have been deployed on this machine (I do not know why there are so many containers, it may be a user's bug or the user really want to do that). And memcg_nr_cache_ids has not been reduced to a suitable value. This can waste a lot of memory. Now the infrastructure for dynamic list_lru_one allocation is ready, so remove statically allocated memory code to save memory. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: memcontrol: move memcg_online_kmem() to mem_cgroup_css_online()Muchun Song1-21/+16
It will simplify the code if moving memcg_online_kmem() to mem_cgroup_css_online() and do not need to set ->kmemcg_id to -1 to indicate the memcg is offline. In the next patch, ->kmemcg_id will be used to sync list lru reparenting which requires not to change ->kmemcg_id. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22xarray: use kmem_cache_alloc_lru to allocate xa_nodeMuchun Song4-8/+18
The workingset will add the xa_node to the shadow_nodes list. So the allocation of xa_node should be done by kmem_cache_alloc_lru(). Using xas_set_lru() to pass the list_lru which we want to insert xa_node into to set up the xa_node reclaim context correctly. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: dcache: use kmem_cache_alloc_lru() to allocate dentryMuchun Song1-1/+2
Like inode cache, the dentry will also be added to its memcg list_lru. So replace kmem_cache_alloc() with kmem_cache_alloc_lru() to allocate dentry. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22f2fs: allocate inode by using alloc_inode_sb()Muchun Song1-2/+6
The inode allocation is supposed to use alloc_inode_sb(), so convert kmem_cache_alloc() to alloc_inode_sb(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22fs: allocate inode by using alloc_inode_sb()Muchun Song57-57/+57
The inode allocation is supposed to use alloc_inode_sb(), so convert kmem_cache_alloc() of all filesystems to alloc_inode_sb(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Acked-by: Theodore Ts'o <[email protected]> [ext4] Acked-by: Roman Gushchin <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22fs: introduce alloc_inode_sb() to allocate filesystems specific inodeMuchun Song3-1/+18
The allocated inode cache is supposed to be added to its memcg list_lru which should be allocated as well in advance. That can be done by kmem_cache_alloc_lru() which allocates object and list_lru. The file systems is main user of it. So introduce alloc_inode_sb() to allocate file system specific inodes and set up the inode reclaim context properly. The file system is supposed to use alloc_inode_sb() to allocate inodes. In later patches, we will convert all users to the new API. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: introduce kmem_cache_alloc_lruMuchun Song9-53/+198
We currently allocate scope for every memcg to be able to tracked on every superblock instantiated in the system, regardless of whether that superblock is even accessible to that memcg. These huge memcg counts come from container hosts where memcgs are confined to just a small subset of the total number of superblocks that instantiated at any given point in time. For these systems with huge container counts, list_lru does not need the capability of tracking every memcg on every superblock. What it comes down to is that adding the memcg to the list_lru at the first insert. So introduce kmem_cache_alloc_lru to allocate objects and its list_lru. In the later patch, we will convert all inode and dentry allocation from kmem_cache_alloc to kmem_cache_alloc_lru. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: list_lru: transpose the array of per-node per-memcg lru listsMuchun Song2-137/+86
Patch series "Optimize list lru memory consumption", v6. In our server, we found a suspected memory leak problem. The kmalloc-32 consumes more than 6GB of memory. Other kmem_caches consume less than 2GB memory. After our in-depth analysis, the memory consumption of kmalloc-32 slab cache is the cause of list_lru_one allocation. crash> p memcg_nr_cache_ids memcg_nr_cache_ids = $2 = 24574 memcg_nr_cache_ids is very large and memory consumption of each list_lru can be calculated with the following formula. num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32) There are 4 numa nodes in our system, so each list_lru consumes ~3MB. crash> list super_blocks | wc -l 952 Every mount will register 2 list lrus, one is for inode, another is for dentry. There are 952 super_blocks. So the total memory is 952 * 2 * 3 MB (~5.6GB). But now the number of memory cgroups is less than 500. So I guess more than 12286 memory cgroups have been created on this machine (I do not know why there are so many cgroups, it may be a user's bug or the user really want to do that). Because memcg_nr_cache_ids has not been reduced to a suitable value. It leads to waste a lot of memory. If we want to reduce memcg_nr_cache_ids, we have to *reboot* the server. This is not what we want. In order to reduce memcg_nr_cache_ids, I had posted a patchset [1] to do this. But this did not fundamentally solve the problem. We currently allocate scope for every memcg to be able to tracked on every superblock instantiated in the system, regardless of whether that superblock is even accessible to that memcg. These huge memcg counts come from container hosts where memcgs are confined to just a small subset of the total number of superblocks that instantiated at any given point in time. For these systems with huge container counts, list_lru does not need the capability of tracking every memcg on every superblock. What it comes down to is that the list_lru is only needed for a given memcg if that memcg is instatiating and freeing objects on a given list_lru. As Dave said, "Which makes me think we should be moving more towards 'add the memcg to the list_lru at the first insert' model rather than 'instantiate all at memcg init time just in case'." This patchset aims to optimize the list lru memory consumption from different aspects. I had done a easy test to show the optimization. I create 10k memory cgroups and mount 10k filesystems in the systems. We use free command to show how many memory does the systems comsumes after this operation (There are 2 numa nodes in the system). +-----------------------+------------------------+ | condition | memory consumption | +-----------------------+------------------------+ | without this patchset | 24464 MB | +-----------------------+------------------------+ | after patch 1 | 21957 MB | <--------+ +-----------------------+------------------------+ | | after patch 10 | 6895 MB | | +-----------------------+------------------------+ | | after patch 12 | 4367 MB | | +-----------------------+------------------------+ | | The more the number of nodes, the more obvious the effect---+ BTW, there was a recent discussion [2] on the same issue. [1] https://lore.kernel.org/all/[email protected]/ [2] https://lore.kernel.org/all/[email protected]/ This series not only optimizes the memory usage of list_lru but also simplifies the code. This patch (of 16): The current scheme of maintaining per-node per-memcg lru lists looks like: struct list_lru { struct list_lru_node *node; (for each node) struct list_lru_memcg *memcg_lrus; struct list_lru_one *lru[]; (for each memcg) } By effectively transposing the two-dimension array of list_lru_one's structures (per-node per-memcg => per-memcg per-node) it's possible to save some memory and simplify alloc/dealloc paths. The new scheme looks like: struct list_lru { struct list_lru_memcg *mlrus; struct list_lru_per_memcg *mlru[]; (for each memcg) struct list_lru_one node[0]; (for each node) } Memory savings are coming from not only 'struct rcu_head' but also some pointer arrays used to store the pointer to 'struct list_lru_one'. The array is per node and its size is 8 (a pointer) * num_memcgs. So the total size of the arrays is 8 * num_nodes * memcg_nr_cache_ids. After this patch, the size becomes 8 * memcg_nr_cache_ids. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Yang Shi <[email protected]> Cc: Alex Shi <[email protected]> Cc: Wei Yang <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Chao Yu <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Theodore Ts'o <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memcg: disable migration instead of preemption in drain_all_stock().Sebastian Andrzej Siewior1-2/+3
Before the for-each-CPU loop, preemption is disabled so that so that drain_local_stock() can be invoked directly instead of scheduling a worker. Ensuring that drain_local_stock() completed on the local CPU is not correctness problem. It _could_ be that the charging path will be forced to reclaim memory because cached charges are still waiting for their draining. Disabling preemption before invoking drain_local_stock() is problematic on PREEMPT_RT due to the sleeping locks involved. To ensure that no CPU migrations happens across for_each_online_cpu() it is enouhg to use migrate_disable() which disables migration and keeps context preemptible to a sleeping lock can be acquired. A race with CPU hotplug is not a problem because pcp data is not going away. In the worst case we just schedule draining of an empty stock. Use migrate_disable() instead of get_cpu() around the for_each_online_cpu() loop. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: kernel test robot <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Koutný <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Waiman Long <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memcg: protect memcg_stock with a local_lock_tSebastian Andrzej Siewior1-21/+38
The members of the per-CPU structure memcg_stock_pcp are protected by disabling interrupts. This is not working on PREEMPT_RT because it creates atomic context in which actions are performed which require preemptible context. One example is obj_cgroup_release(). The IRQ-disable sections can be replaced with local_lock_t which preserves the explicit disabling of interrupts while keeps the code preemptible on PREEMPT_RT. drain_obj_stock() drops a reference on obj_cgroup which leads to an invocat= ion of obj_cgroup_release() if it is the last object. This in turn leads to recursive locking of the local_lock_t. To avoid this, obj_cgroup_release() = is invoked outside of the locked section. obj_cgroup_uncharge_pages() can be invoked with the local_lock_t acquired a= nd without it. This will lead later to a recursion in refill_stock(). To avoid the locking recursion provide obj_cgroup_uncharge_pages_locked() which uses the locked version of refill_stock(). - Replace disabling interrupts for memcg_stock with a local_lock_t. - Let drain_obj_stock() return the old struct obj_cgroup which is passed to obj_cgroup_put() outside of the locked section. - Provide obj_cgroup_uncharge_pages_locked() which uses the locked version of refill_stock() to avoid recursive locking in drain_obj_stock(). Link: https://lkml.kernel.org/r/20220209014709.GA26885@xsang-OptiPlex-9020 Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Reported-by: kernel test robot <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Koutný <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Waiman Long <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memcg: opencode the inner part of obj_cgroup_uncharge_pages() in ↵Johannes Weiner1-6/+18
drain_obj_stock() Provide the inner part of refill_stock() as __refill_stock() without disabling interrupts. This eases the integration of local_lock_t where recursive locking must be avoided. Open code obj_cgroup_uncharge_pages() in drain_obj_stock() and use __refill_stock(). The caller of drain_obj_stock() already disables interrupts. [[email protected]: patch body around Johannes' diff] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: kernel test robot <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Koutný <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Waiman Long <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memcg: protect per-CPU counter by disabling preemption on PREEMPT_RT ↵Sebastian Andrzej Siewior1-1/+55
where needed. The per-CPU counter are modified with the non-atomic modifier. The consistency is ensured by disabling interrupts for the update. On non PREEMPT_RT configuration this works because acquiring a spinlock_t typed lock with the _irq() suffix disables interrupts. On PREEMPT_RT configurations the RMW operation can be interrupted. Another problem is that mem_cgroup_swapout() expects to be invoked with disabled interrupts because the caller has to acquire a spinlock_t which is acquired with disabled interrupts. Since spinlock_t never disables interrupts on PREEMPT_RT the interrupts are never disabled at this point. The code is never called from in_irq() context on PREEMPT_RT therefore disabling preemption during the update is sufficient on PREEMPT_RT. The sections which explicitly disable interrupts can remain on PREEMPT_RT because the sections remain short and they don't involve sleeping locks (memcg_check_events() is doing nothing on PREEMPT_RT). Disable preemption during update of the per-CPU variables which do not explicitly disable interrupts. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Acked-by: Roman Gushchin <[email protected]> Reviewed-by: Shakeel Butt <[email protected] Cc: Johannes Weiner <[email protected]> Cc: kernel test robot <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Koutný <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Waiman Long <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memcg: disable threshold event handlers on PREEMPT_RTSebastian Andrzej Siewior2-2/+14
During the integration of PREEMPT_RT support, the code flow around memcg_check_events() resulted in `twisted code'. Moving the code around and avoiding then would then lead to an additional local-irq-save section within memcg_check_events(). While looking better, it adds a local-irq-save section to code flow which is usually within an local-irq-off block on non-PREEMPT_RT configurations. The threshold event handler is a deprecated memcg v1 feature. Instead of trying to get it to work under PREEMPT_RT just disable it. There should be no users on PREEMPT_RT. From that perspective it makes even less sense to get it to work under PREEMPT_RT while having zero users. Make memory.soft_limit_in_bytes and cgroup.event_control return -EOPNOTSUPP on PREEMPT_RT. Make an empty memcg_check_events() and memcg_write_event_control() which return only -EOPNOTSUPP on PREEMPT_RT. Document that the two knobs are disabled on PREEMPT_RT. Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Michal Hocko <[email protected]> Suggested-by: Michal Koutný <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: kernel test robot <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Waiman Long <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memcg: revert ("mm/memcg: optimize user context object stock access")Michal Hocko1-67/+27
Patch series "mm/memcg: Address PREEMPT_RT problems instead of disabling it", v5. This series aims to address the memcg related problem on PREEMPT_RT. I tested them on CONFIG_PREEMPT and CONFIG_PREEMPT_RT with the tools/testing/selftests/cgroup/* tests and I haven't observed any regressions (other than the lockdep report that is already there). This patch (of 6): The optimisation is based on a micro benchmark where local_irq_save() is more expensive than a preempt_disable(). There is no evidence that it is visible in a real-world workload and there are CPUs where the opposite is true (local_irq_save() is cheaper than preempt_disable()). Based on micro benchmarks, the optimisation makes sense on PREEMPT_NONE where preempt_disable() is optimized away. There is no improvement with PREEMPT_DYNAMIC since the preemption counter is always available. The optimization makes also the PREEMPT_RT integration more complicated since most of the assumption are not true on PREEMPT_RT. Revert the optimisation since it complicates the PREEMPT_RT integration and the improvement is hardly visible. [[email protected]: patch body around Michal's diff] Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/all/YgOGkXXCrD%[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Waiman Long <[email protected]> Cc: kernel test robot <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Koutný <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memcontrol: return 1 from cgroup.memory __setup() handlerRandy Dunlap1-1/+1
__setup() handlers should return 1 if the command line option is handled and 0 if not (or maybe never return 0; it just pollutes init's environment). The only reason that this particular __setup handler does not pollute init's environment is that the setup string contains a '.', as in "cgroup.memory". This causes init/main.c::unknown_boottoption() to consider it to be an "Unused module parameter" and ignore it. (This is for parsing of loadable module parameters any time after kernel init.) Otherwise the string "cgroup.memory=whatever" would be added to init's environment strings. Instead of relying on this '.' quirk, just return 1 to indicate that the boot option has been handled. Note that there is no warning message if someone enters: cgroup.memory=anything_invalid Link: https://lkml.kernel.org/r/[email protected] Fixes: f7e1cb6ec51b0 ("mm: memcontrol: account socket memory in unified hierarchy memory controller") Signed-off-by: Randy Dunlap <[email protected]> Reported-by: Igor Zhbanov <[email protected]> Link: lore.kernel.org/r/[email protected] Reviewed-by: Michal Koutný <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Roman Gushchin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22memcg: synchronously enforce memory.high for large overchargesShakeel Butt1-0/+5
The high limit is used to throttle the workload without invoking the oom-killer. Recently we tried to use the high limit to right size our internal workloads. More specifically dynamically adjusting the limits of the workload without letting the workload get oom-killed. However due to the limitation of the implementation of high limit enforcement, we observed the mechanism fails for some real workloads. The high limit is enforced on return-to-userspace i.e. the kernel let the usage goes over the limit and when the execution returns to userspace, the high reclaim is triggered and the process can get throttled as well. However this mechanism fails for workloads which do large allocations in a single kernel entry e.g. applications that mlock() a large chunk of memory in a single syscall. Such applications bypass the high limit and can trigger the oom-killer. To make high limit enforcement more robust, this patch makes the limit enforcement synchronous only if the accumulated overcharge becomes larger than MEMCG_CHARGE_BATCH. So, most of the allocations would still be throttled on the return-to-userspace path but only the extreme allocations which accumulates large amount of overcharge without returning to the userspace will be throttled synchronously. The value MEMCG_CHARGE_BATCH is a bit arbitrary but most of other places in the memcg codebase uses this constant therefore for now uses the same one. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Acked-by: Chris Down <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22selftests: memcg: test high limit for single entry allocationShakeel Butt3-3/+91
Test the enforcement of memory.high limit for large amount of memory allocation within a single kernel entry. There are valid use-cases where the application can trigger large amount of memory allocation within a single syscall e.g. mlock() or mmap(MAP_POPULATE). Make sure memory.high limit enforcement works for such use-cases. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Chris Down <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22memcg: unify force charging conditionsShakeel Butt1-10/+7
Currently the kernel force charges the allocations which have __GFP_HIGH flag without triggering the memory reclaim. __GFP_HIGH indicates that the caller is high priority and since commit 869712fd3de5 ("mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges") the kernel lets such allocations do force charging. Please note that __GFP_ATOMIC has been replaced by __GFP_HIGH. __GFP_HIGH does not tell if the caller can block or can trigger reclaim. There are separate checks to determine that. So, there is no need to skip reclaiming for __GFP_HIGH allocations. So, handle __GFP_HIGH together with __GFP_NOFAIL which also does force charging. Please note that this is a noop change as there are no __GFP_HIGH allocators in the kernel which also have __GFP_ACCOUNT (or SLAB_ACCOUNT) and does not allow reclaim for now. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Chris Down <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22memcg: refactor mem_cgroup_oomShakeel Butt1-27/+17
Patch series "memcg: robust enforcement of memory.high", v2. Due to the semantics of memory.high enforcement i.e. throttle the workload without oom-kill, we are trying to use it for right sizing the workloads in our production environment. However we observed the mechanism fails for some specific applications which does big chunck of allocations in a single syscall. The reason behind this failure is due to the limitation of the memory.high enforcement's current implementation. This patch series solves this issue by enforcing the memory.high synchronously if the current process has accumulated a large amount of high overcharge. This patch (of 4): The function mem_cgroup_oom returns enum which has four possible values but the caller does not care about such values and only cares if the return value is OOM_SUCCESS or not. So, remove the enum altogether and make mem_cgroup_oom returns a simple bool. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Chris Down <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memcg: retrieve parent memcg from css.parentWei Yang1-3/+1
The parent we get from page_counter is correct, while this is two different hierarchy. Let's retrieve the parent memcg from css.parent just like parent_cs(), blkcg_parent(), etc. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Muchun Song <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Yang Shi <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memcg: mem_cgroup_per_node is already set to 0 on allocationWei Yang1-2/+0
kzalloc_node() would set data to 0, so it's not necessary to set it again. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Muchun Song <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22memcg: add per-memcg total kernel memory statYosry Ahmed3-6/+27
Currently memcg stats show several types of kernel memory: kernel stack, page tables, sock, vmalloc, and slab. However, there are other allocations with __GFP_ACCOUNT (or supersets such as GFP_KERNEL_ACCOUNT) that are not accounted in any of those stats, a few examples are: - various kvm allocations (e.g. allocated pages to create vcpus) - io_uring - tmp_page in pipes during pipe_write() - bpf ringbuffers - unix sockets Keeping track of the total kernel memory is essential for the ease of migration from cgroup v1 to v2 as there are large discrepancies between v1's kmem.usage_in_bytes and the sum of the available kernel memory stats in v2. Adding separate memcg stats for all __GFP_ACCOUNT kernel allocations is an impractical maintenance burden as there a lot of those all over the kernel code, with more use cases likely to show up in the future. Therefore, add a "kernel" memcg stat that is analogous to kmem page counter, with added benefits such as using rstat infrastructure which aggregates stats more efficiently. Additionally, this provides a lighter alternative in case the legacy kmem is deprecated in the future [[email protected]: v2] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yosry Ahmed <[email protected]> Acked-by: Shakeel Butt <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22memcg: replace in_interrupt() with !in_task()Shakeel Butt1-2/+2
Replace the deprecated in_interrupt() with !in_task() because in_interrupt() returns true for BH disabled even if the call happens in the task context. in_task() is the right interface to differentiate task context from NMI, hard IRQ and softirq contexts. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vasily Averin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Roman Gushchin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: shmem: use helper macro __ATTR_RWMiaohe Lin1-2/+1
Use helper macro __ATTR_RW to define shmem_enabled_attr to make code more clear. Minor readability improvement. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22tmpfs: do not allocate pages on readHugh Dickins2-14/+12
Mikulas asked in "Do we still need commit a0ee5ec520ed ('tmpfs: allocate on read when stacked')?" in [1] Lukas noticed this unusual behavior of loop device backed by tmpfs in [2]. Normally, shmem_file_read_iter() copies the ZERO_PAGE when reading holes; but if it looks like it might be a read for "a stacking filesystem", it allocates actual pages to the page cache, and even marks them as dirty. And reads from the loop device do satisfy the test that is used. This oddity was added for an old version of unionfs, to help to limit its usage to the limited size of the tmpfs mount involved; but about the same time as the tmpfs mod went in (2.6.25), unionfs was reworked to proceed differently; and the mod kept just in case others needed it. Do we still need it? I cannot answer with more certainty than "Probably not". It's nasty enough that we really should try to delete it; but if a regression is reported somewhere, then we might have to revert later. It's not quite as simple as just removing the test (as Mikulas did): xfstests generic/013 hung because splice from tmpfs failed on page not up-to-date and page mapping unset. That can be fixed just by marking the ZERO_PAGE as Uptodate, which of course it is: do so in pagecache_init() - it might be useful to others than tmpfs. My intention, though, was to stop using the ZERO_PAGE here altogether: surely iov_iter_zero() is better for this case? Sadly not: it relies on clear_user(), and the x86 clear_user() is slower than its copy_user() [3]. But while we are still using the ZERO_PAGE, let's stop dirtying its struct page cacheline with unnecessary get_page() and put_page(). Link: https://lore.kernel.org/linux-mm/alpine.LRH.2.02.2007210510230.6959@file01.intranet.prod.int.rdu2.redhat.com/ [1] Link: https://lore.kernel.org/linux-mm/20211126075100.gd64odg2bcptiqeb@work/ [2] Link: https://lore.kernel.org/lkml/[email protected]/ [3] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Reported-by: Mikulas Patocka <[email protected]> Reported-by: Lukas Czerner <[email protected]> Acked-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Zdenek Kabelac <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Borislav Petkov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22shmem: mapping_set_exiting() to help mapped resilienceHugh Dickins1-0/+1
When I added page_mapped() resilience in __delete_from_page_cache() for the mapping_exiting() case, I missed that mapping_set_exiting() is done in truncate_inode_pages_final(), which is not actually called for shmem. (Today, it is folio_mapped() resilience in filemap_unaccount_folio().) So the fixup to avoid a memory leak in this case never worked on shmem: add a mapping_set_exiting() in shmem_evict_inode() at last. But this is hardly a candidate for stable, since it's only useful if "Bad page". Link: https://lkml.kernel.org/r/[email protected] Fixes: 06b241f32c71 ("mm: __delete_from_page_cache show Bad page if mapped") Signed-off-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22tmpfs: support for file creation timeXavier Roche2-3/+14
Various filesystems (including ext4) now support file creation time. This adds such support for tmpfs-based filesystems. Note that using shmem_getattr() on other file types than regular requires that shmem_is_huge() check type, to stop incorrect HPAGE_PMD_SIZE blksize. [[email protected]: three tweaks to creation time patch] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/20220314211150.GA123458@xavier-xps Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/20220211213628.GA1919658@xavier-xps Signed-off-by: Xavier Roche <[email protected]> Signed-off-by: Hugh Dickins <[email protected]> Tested-by: Jean Delvare <[email protected]> Tested-by: Sylvain Bellone <[email protected]> Reported-by: Xavier Grand <[email protected]> Reviewed-by: Jean Delvare <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/swap: fix confusing comment in folio_mark_accessedBang Li1-1/+1
For unevictable pages, we don't need mark them. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Bang Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>