Age | Commit message | Author | Files | Lines
2022-11-30 | mm: add early FAULT_FLAG_UNSHARE consistency checks | David Hildenbrand | 3 files | -11/+20
For now, FAULT_FLAG_UNSHARE only applies to anonymous pages, which implies a COW mapping. Let's hide FAULT_FLAG_UNSHARE early if we're not dealing with a COW mapping, such that we treat it like a read fault as documented and don't have to worry about the flag throughout all fault handlers. While at it, centralize the check for mutual exclusion of FAULT_FLAG_UNSHARE and FAULT_FLAG_WRITE and just drop the check that either flag is set in the WP handler. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
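A minimal sketch of what such an early sanitization step can look like (function name and placement are assumed for illustration; the exact hunk is not reproduced here):

	/*
	 * Sketch only: hide FAULT_FLAG_UNSHARE outside of COW mappings and
	 * reject the invalid combination with FAULT_FLAG_WRITE up front.
	 */
	static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma,
					       unsigned int *flags)
	{
		if (*flags & FAULT_FLAG_UNSHARE) {
			/* FAULT_FLAG_UNSHARE and FAULT_FLAG_WRITE are mutually exclusive */
			if (WARN_ON_ONCE(*flags & FAULT_FLAG_WRITE))
				return VM_FAULT_SIGSEGV;
			/* Outside of COW mappings, treat unshare like an ordinary read fault */
			if (!is_cow_mapping(vma->vm_flags))
				*flags &= ~FAULT_FLAG_UNSHARE;
		}
		return 0;
	}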
2022-11-30 | selftests/vm: cow: R/O long-term pinning reliability tests for non-anon pages | David Hildenbrand | 1 file | -1/+27
Let's test whether R/O long-term pinning is reliable for non-anonymous memory: when R/O long-term pinning a page, the expectation is that we break COW early before pinning, such that actual write access via the page tables won't break COW later and end up replacing the R/O-pinned page in the page table. Consequently, R/O long-term pinning in private mappings would only target exclusive anonymous pages.

For now, all tests fail:

  # [RUN] R/O longterm GUP pin ... with shared zeropage
  not ok 151 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP pin ... with memfd
  not ok 152 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP pin ... with tmpfile
  not ok 153 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP pin ... with huge zeropage
  not ok 154 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB)
  not ok 155 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB)
  not ok 156 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP-fast pin ... with shared zeropage
  not ok 157 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP-fast pin ... with memfd
  not ok 158 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP-fast pin ... with tmpfile
  not ok 159 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP-fast pin ... with huge zeropage
  not ok 160 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB)
  not ok 161 Longterm R/O pin is reliable
  # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (1048576 kB)
  not ok 162 Longterm R/O pin is reliable

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | selftests/vm: cow: basic COW tests for non-anonymous pages | David Hildenbrand | 1 file | -1/+337
Let's add basic tests for COW with non-anonymous pages in private mappings: write access should properly trigger COW and result in the private changes not being visible through other page mappings. Especially, add tests for:

  * Zeropage
  * Huge zeropage
  * Ordinary pagecache pages via memfd and tmpfile()
  * Hugetlb pages via memfd

Fortunately, all tests pass. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | selftests/vm: anon_cow: prepare for non-anonymous COW tests | David Hildenbrand | 5 files | -19/+24
Patch series "mm/gup: remove FOLL_FORCE usage from drivers (reliable R/O long-term pinning)". For now, we did not support reliable R/O long-term pinning in COW mappings. That means, if we would trigger R/O long-term pinning in MAP_PRIVATE mapping, we could end up pinning the (R/O-mapped) shared zeropage or a pagecache page. The next write access would trigger a write fault and replace the pinned page by an exclusive anonymous page in the process page table; whatever the process would write to that private page copy would not be visible by the owner of the previous page pin: for example, RDMA could read stale data. The end result is essentially an unexpected and hard-to-debug memory corruption. Some drivers tried working around that limitation by using "FOLL_FORCE|FOLL_WRITE|FOLL_LONGTERM" for R/O long-term pinning for now. FOLL_WRITE would trigger a write fault, if required, and break COW before pinning the page. FOLL_FORCE is required because the VMA might lack write permissions, and drivers wanted to make that working as well, just like one would expect (no write access, but still triggering a write access to break COW). However, that is not a practical solution, because (1) Drivers that don't stick to that undocumented and debatable pattern would still run into that issue. For example, VFIO only uses FOLL_LONGTERM for R/O long-term pinning. (2) Using FOLL_WRITE just to work around a COW mapping + page pinning limitation is unintuitive. FOLL_WRITE would, for example, mark the page softdirty or trigger uffd-wp, even though, there actually isn't going to be any write access. (3) The purpose of FOLL_FORCE is debug access, not access without lack of VMA permissions by arbitrarty drivers. So instead, make R/O long-term pinning work as expected, by breaking COW in a COW mapping early, such that we can remove any FOLL_FORCE usage from drivers and make FOLL_FORCE ptrace-specific (renaming it to FOLL_PTRACE). More details in patch #8. This patch (of 19): Originally, the plan was to have a separate tests for testing COW of non-anonymous (e.g., shared zeropage) pages. Turns out, that we'd need a lot of similar functionality and that there isn't a really good reason to separate it. So let's prepare for non-anon tests by renaming to "cow". Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alex Williamson <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andy Walls <[email protected]> Cc: Anton Ivanov <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bernard Metzler <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Benvenuti <[email protected]> Cc: Christian Gmeiner <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Airlie <[email protected]> Cc: David S. Miller <[email protected]> Cc: Dennis Dalessandro <[email protected]> Cc: "Eric W . Biederman" <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hans Verkuil <[email protected]> Cc: "H. 
Peter Anvin" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Inki Dae <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: James Morris <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Johannes Berg <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kees Cook <[email protected]> Cc: Kentaro Takeda <[email protected]> Cc: Krzysztof Kozlowski <[email protected]> Cc: Kyungmin Park <[email protected]> Cc: Leon Romanovsky <[email protected]> Cc: Leon Romanovsky <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Lucas Stach <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Matt Turner <[email protected]> Cc: Mauro Carvalho Chehab <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Nelson Escobar <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oded Gabbay <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Paul Moore <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Russell King <[email protected]> Cc: Serge Hallyn <[email protected]> Cc: Seung-Woo Kim <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tomasz Figa <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
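As a rough illustration of the driver-side pattern being removed, here is a sketch under the assumption of the pin_user_pages() signature of that time; it is not an excerpt from any particular driver:

	/* Old workaround: force a write fault so COW is broken even for an R/O pin */
	ret = pin_user_pages(addr, nr_pages,
			     FOLL_WRITE | FOLL_FORCE | FOLL_LONGTERM, pages, NULL);

	/* With reliable R/O long-term pinning, a plain R/O long-term pin is enough */
	ret = pin_user_pages(addr, nr_pages, FOLL_LONGTERM, pages, NULL);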
2022-11-30 | mm: Kconfig: make config SECRETMEM visible with EXPERT | Lukas Bulwahn | 1 file | -1/+7
Commit 6a108a14fa35 ("kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT") introduces CONFIG_EXPERT to carry the previous intent of CONFIG_EMBEDDED and just gives that intent a much better name. That has been clearly a good and long overdue renaming, and it is clearly an improvement to the kernel build configuration that has shown to help managing the kernel build configuration in the last decade. However, rather than bravely and radically just deleting CONFIG_EMBEDDED, this commit gives CONFIG_EMBEDDED a new intended semantics, but keeps it open for future contributors to implement that intended semantics: A new CONFIG_EMBEDDED option is added that automatically selects CONFIG_EXPERT when enabled and can be used in the future to isolate options that should only be considered for embedded systems (RISC architectures, SLOB, etc). Since then, this CONFIG_EMBEDDED implicitly had two purposes: - It can make even more options visible beyond what CONFIG_EXPERT makes visible. In other words, it may introduce another level of enabling the visibility of configuration options: always visible, visible with CONFIG_EXPERT and visible with CONFIG_EMBEDDED. - Set certain default values of some configurations differently, following the assumption that configuring a kernel build for an embedded system generally starts with a different set of default values compared to kernel builds for all other kind of systems. Considering the second purpose, note that already probably arguing that a kernel build for an embedded system would choose some values differently is already tricky: the set of embedded systems with Linux kernels is already quite diverse. Many embedded system have powerful CPUs and it would not be clear that all embedded systems just optimize towards one specific aspect, e.g., a smaller kernel image size. So, it is unclear if starting with "one set of default configuration" that is induced by CONFIG_EMBEDDED is a good offer for developers configuring their kernels. Also, the differences of needed user-space features in an embedded system compared to a non-embedded system are probably difficult or even impossible to name in some generic way. So it is not surprising that in the last decade hardly anyone has contributed changes to make something default differently in case of CONFIG_EMBEDDED=y. Currently, in v6.0-rc4, SECRETMEM is the only config switched off if CONFIG_EMBEDDED=y. As long as that is actually the only option that currently is selected or deselected, it is better to just make SECRETMEM configurable at build time by experts using menuconfig instead. Make SECRETMEM configurable when EXPERT is set and otherwise default to yes. Further, SECRETMEM needs ARCH_HAS_SET_DIRECT_MAP. This allows us to remove CONFIG_EMBEDDED in the close future. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Lukas Bulwahn <[email protected]> Acked-by: Mike Rapoport <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Reviewed-by: Masahiro Yamada <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm/gup: remove the restriction on locked with FOLL_LONGTERM | Jason Gunthorpe | 1 file | -82/+27
This restriction was created because FOLL_LONGTERM used to scan the vma list, so it could not tolerate becoming unlocked. That was fixed in commit 52650c8b466b ("mm/gup: remove the vma allocation from gup_longterm_locked()") and the restriction on !vma was removed. However, the locked restriction remained, even though it isn't necessary anymore. Adjust __gup_longterm_locked() so it can handle the mmap_read_lock() becoming unlocked while it is looping for migration. Migration does not require the mmap_read_sem because it is only handling struct pages. If we had to unlock then ensure the whole thing returns unlocked. Remove __get_user_pages_remote() and __gup_longterm_unlocked(). These cases can now just directly call other functions. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jason Gunthorpe <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Alistair Popple <[email protected]> Cc: John Hubbard <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | selftests/damon: fix unnecessary compilation warnings | Rong Tao | 1 file | -0/+9
When testing overflow and overread, there is no need to keep the resulting compilation warnings; we should simply ignore them. The motivation for this patch is to eliminate these warnings: maybe one day we will compile the kernel with "-Werror -Wall", at which point they would turn into compilation errors, so we should fix this in advance. How to reproduce the problem (with gcc-11.3.1):

  $ make -C tools/testing/selftests/
  ...
  warning: `write' reading 4294967295 bytes from a region of size 1 [-Wstringop-overread]
  warning: `read' writing 4294967295 bytes into a region of size 25 overflows the destination [-Wstringop-overflow=]

"-Wno-stringop-overread" is supported at least in gcc-11.1.0. Link: https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=d14c547abd484d3540b692bb8048c4a6efe92c8b Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Rong Tao <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
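One common way to silence such intentional-overflow warnings locally is a GCC diagnostic pragma around the offending calls; whether this patch does exactly that or instead passes -Wno-stringop-overread via the Makefile is an assumption here, so treat the snippet purely as illustration:

	#pragma GCC diagnostic push
	#pragma GCC diagnostic ignored "-Wstringop-overread"
	#pragma GCC diagnostic ignored "-Wstringop-overflow"
		/* intentionally bogus length: the test provokes an overread/overflow */
		write(fd, buf, (size_t)-1);
	#pragma GCC diagnostic pop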
2022-11-30 | hugetlbfs: inode: remove unnecessary (void*) conversions | Li zeming | 1 file | -1/+1
The assignment to the ei pointer does not need an explicit type cast. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Li zeming <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
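For context, a small illustrative example (placeholder names, not the hugetlbfs code itself): in C, a void * converts implicitly to any object pointer type, so the cast adds nothing:

	void *p = kmem_cache_alloc(cache, GFP_KERNEL);

	struct foo *a = (struct foo *)p;	/* redundant cast */
	struct foo *b = p;			/* equivalent, and what the patch prefers */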
2022-11-30 | mm: make drop_caches keep reclaiming on all nodes | Jan Kara | 1 file | -15/+18
Currently, drop_caches reclaims node-by-node, looping on each node until reclaim can no longer make progress. This can, however, leave quite some slab entries (such as filesystem inodes) unreclaimed if objects on, say, node 1 keep objects on node 0 pinned. So move the "loop until no progress" loop out to wrap the node-by-node iteration, so that reclaim is retried on all nodes as long as reclaim on some node still makes progress. This fixes a problem where drop_caches was not reclaiming lots of otherwise perfectly reclaimable inodes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reported-by: You Zhou <[email protected]> Reported-by: Pengfei Xu <[email protected]> Tested-by: Pengfei Xu <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
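A minimal sketch of the reshuffled iteration (helper names follow mm/vmscan.c; the exact termination heuristic shown is illustrative):

	static void drop_slab(void)
	{
		int nid;
		int shift = 0;
		unsigned long freed;

		do {
			freed = 0;
			for_each_online_node(nid) {
				if (fatal_signal_pending(current))
					return;
				freed += drop_slab_node(nid);
			}
			/* retry all nodes while reclaim on any node still makes progress */
		} while ((freed >> shift++) > 1);
	}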
2022-11-30 | mm: anonymous shared memory naming | Pasha Tatashin | 6 files | -30/+57
Since commit 9a10064f5625 ("mm: add a field to store names for private anonymous memory"), a name can be set for private anonymous memory, but not for shared anonymous memory. However, naming shared anonymous memory is just as useful for tracking purposes. Extend the functionality to be able to set names for shared anon. There are two ways to create anonymous shared memory, using memfd or directly via mmap():

  1. fd = memfd_create(...)
     mem = mmap(..., MAP_SHARED, fd, ...)
  2. mem = mmap(..., MAP_SHARED | MAP_ANONYMOUS, -1, ...)

In both cases the anonymous shared memory is created the same way, by mapping an unlinked file on tmpfs. The memfd way allows giving a name to anonymous shared memory, but it is not useful when parts of shared memory require distinct names. Example use case: the VMM maps VM memory as anonymous shared memory (not private, because the VMM is sandboxed and drivers are running in their own processes). However, the VM tells the VMM how parts of the memory are actually used by the guest, how each of the segments should be backed (i.e. 4K pages, 2M pages), and some other information about the segments. The naming allows us to monitor the effective memory footprint for each of these segments from the host without looking inside the guest. Sample output:

  /* Create shared anonymous segment */
  anon_shmem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  /* Name the segment: "MY-NAME" */
  rv = prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
             anon_shmem, SIZE, "MY-NAME");

  cat /proc/<pid>/maps (and smaps):
  7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 [anon_shmem:MY-NAME]

If the segment is not named, the output is:

  7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 /dev/zero (deleted)

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pasha Tatashin <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bagas Sanjaya <[email protected]> Cc: Colin Cross <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Paul Gortmaker <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Vincent Whitchurch <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: xu xin <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: shrinkers: add missing includes for undeclared types | T.J. Mercier | 1 file | -0/+3
The shrinker.h header depends on a user including other headers before it for types used by shrinker.h. Fix this by including the appropriate headers in shrinker.h.

  ./include/linux/shrinker.h:13:9: error: unknown type name `gfp_t'
       13 |         gfp_t gfp_mask;
          |         ^~~~~
  ./include/linux/shrinker.h:71:26: error: field `list' has incomplete type
       71 |         struct list_head list;
          |                          ^~~~
  ./include/linux/shrinker.h:82:9: error: unknown type name `atomic_long_t'
       82 |         atomic_long_t *nr_deferred;
          |

Link: https://lkml.kernel.org/r/[email protected] Fixes: 83aeeada7c69 ("vmscan: use atomic-long for shrinker batching") Fixes: b0d40c92adaf ("superblock: introduce per-sb cache shrinker infrastructure") Signed-off-by: T.J. Mercier <[email protected]> Cc: Al Viro <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | hugetlb: remove duplicate mmu notifications | Mike Kravetz | 1 file | -9/+9
The common hugetlb unmap routine __unmap_hugepage_range performs mmu notification calls. However, in the case where __unmap_hugepage_range is called via __unmap_hugepage_range_final, mmu notification calls are performed earlier in other calling routines. Remove mmu notification calls from __unmap_hugepage_range. Add notification calls to the only other caller: unmap_hugepage_range. unmap_hugepage_range is called for truncation and hole punch, so change notification type from UNMAP to CLEAR as this is more appropriate. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Suggested-by: Peter Xu <[email protected]> Cc: Wei Chen <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm/uffd: sanity check write bit for uffd-wp protected ptes | Peter Xu | 1 file | -1/+17
Let's add a sanity check, under CONFIG_DEBUG_VM, on the write bit whenever we get the chance while walking through the pgtables. It can surface the error earlier, even before the app notices that the data was corrupted on the snapshot. It also helps us identify that this is a wrong pgtable setup, so hopefully great information to have for debugging too. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nadav Amit <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
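A minimal sketch of the kind of check this adds (helper name assumed for illustration; the real patch wires the check into the existing pgtable walkers):

	/* Sketch: a uffd-wp protected PTE must never have the write bit set */
	static inline void check_uffd_wp_pte(pte_t pte)
	{
		/* VM_WARN_ON_ONCE() only fires with CONFIG_DEBUG_VM */
		VM_WARN_ON_ONCE(pte_uffd_wp(pte) && pte_write(pte));
	}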
2022-11-30 | mm/kmemleak.c: fix a comment | Yixuan Cao | 1 file | -1/+1
I noticed a typo in a code comment and I fixed it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yixuan Cao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | docs: admin-guide: cgroup-v1: update description of inactive_file | Jian Wen | 1 file | -1/+2
MADV_FREE pages have been moved into the LRU_INACTIVE_FILE list by commit f7ad2a6cb9f7 ("mm: move MADV_FREE pages into LRU_INACTIVE_FILE list"). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jian Wen <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm/demotion: fix NULL vs IS_ERR checking in memory_tier_init | Miaoqian Lin | 1 file | -1/+1
alloc_memory_type() returns error pointers on error instead of NULL. Use IS_ERR() to check the return value to fix this. Link: https://lkml.kernel.org/r/[email protected] Fixes: 7b88bda3761b ("mm/demotion/dax/kmem: set node's abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE") Signed-off-by: Miaoqian Lin <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Wei Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
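A sketch of the corrected check, based on the description above (the surrounding memory_tier_init() details are not reproduced here):

	default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
	/* alloc_memory_type() returns ERR_PTR() on failure, never NULL */
	if (IS_ERR(default_dram_type))
		panic("%s() failed to allocate default DRAM tier\n", __func__);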
2022-11-30 | migrate: convert migrate_pages() to use folios | Huang Ying | 1 file | -98/+112
Quite straightforward, the page functions are converted to the corresponding folio functions. Same for the comments. THP-specific code is converted to use large folios. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Tested-by: Baolin Wang <[email protected]> Cc: Zi Yan <[email protected]> Cc: Yang Shi <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | migrate: convert unmap_and_move() to use folios | Huang Ying | 1 file | -27/+27
Patch series "migrate: convert migrate_pages()/unmap_and_move() to use folios", v2. The conversion is quite straightforward: just replace the page API with the corresponding folio API. migrate_pages() and unmap_and_move() mostly work with folios (head pages) only. This patch (of 2): Quite straightforward, the page functions are converted to the corresponding folio functions. Same for the comments. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Reviewed-by: Yang Shi <[email protected]> Reviewed-by: Zi Yan <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
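For readers unfamiliar with the conversion pattern, an illustrative (not literal) before/after of the page-to-folio API mapping:

	struct folio *folio = page_folio(page);

	folio_lock(folio);			/* was: lock_page(page)      */
	if (folio_test_writeback(folio))	/* was: PageWriteback(page)  */
		folio_wait_writeback(folio);	/* was: wait_on_page_writeback(page) */
	folio_unlock(folio);			/* was: unlock_page(page)    */
	folio_put(folio);			/* was: put_page(page)       */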
2022-11-30 | Revert "mm: migration: fix the FOLL_GET failure on following huge page" | Baolin Wang | 1 file | -8/+2
Revert commit 831568214883 ("mm: migration: fix the FOLL_GET failure on following huge page"), since after commit 1a6baaa0db73 ("s390/hugetlb: switch to generic version of follow_huge_pud()") and commit 57a196a58421 ("hugetlb: simplify hugetlb handling in follow_page_mask") were merged, now all the following huge page routines can support FOLL_GET operation. Link: https://lkml.kernel.org/r/496786039852aba90ffa68f10d0df3f4236a990b.1667983080.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Acked-by: Haiyue Wang <[email protected]> Cc: Baolin Wang <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm/kfence: remove hung_task cruft | Pavankumar Kondeti | 1 file | -11/+1
Commit fdf756f71271 ("sched: Fix more TASK_state comparisons") makes hung_task no longer monitor TASK_IDLE tasks. The special handling to work around hung_task warnings is not required anymore. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pavankumar Kondeti <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | Docs/ABI/zram: document zram recompress sysfs knobs | Sergey Senozhatsky | 1 file | -0/+14
Document zram re-compression sysfs knobs. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Nitin Gupta <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | zram: add incompressible flag to read_block_state() | Sergey Senozhatsky | 2 files | -6/+11
Add a new flag to the zram block state that shows whether the page is incompressible: that is, none of the algorithms (including secondary ones) could compress it. Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Minchan Kim <[email protected]> Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Alexey Romanov <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Suleiman Souhlal <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | zram: add incompressible writeback | Sergey Senozhatsky | 2 files | -7/+18
Add support for incompressible pages writeback: echo incompressible > /sys/block/zramX/writeback Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Alexey Romanov <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Suleiman Souhlal <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | documentation: add zram recompression documentation | Sergey Senozhatsky | 1 file | -0/+81
Document user-space visible device attributes that are enabled by ZRAM_MULTI_COMP. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Alexey Romanov <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Suleiman Souhlal <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | zram: add algo parameter support to zram_recompress() | Sergey Senozhatsky | 2 files | -9/+46
Recompression iterates through all the registered secondary compression algorithms in order of their priorities, so that we have higher chances of finding an algorithm that compresses a particular page. This, however, may not always be the best approach, and sometimes we may want to limit recompression to only one particular algorithm. For instance, when a higher-priority algorithm uses too much power and the device has a relatively low battery level, we may want to limit recompression to a lower-priority algorithm, which uses less power. Introduce algo= parameter support to the recompression sysfs knob so that user-space can request recompression with a particular algorithm only:

  echo "type=idle algo=zstd" > /sys/block/zramX/recompress

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Alexey Romanov <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Suleiman Souhlal <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | zram: remove redundant checks from zram_recompress() | Sergey Senozhatsky | 1 file | -6/+2
Size class index comparison is powerful enough so we can remove object size comparisons. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Alexey Romanov <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Suleiman Souhlal <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | zram: add size class equals check into recompression | Alexey Romanov | 3 files | -1/+33
It makes no sense for us to recompress the object if it will end up in the same size class: we don't get any memory gain. But, at the same time, we get a CPU time overhead when inserting this object into zspage and decompressing it afterwards. [senozhatsky: rebased and fixed conflicts] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alexey Romanov <[email protected]> Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Suleiman Souhlal <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
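A sketch of the idea; the zsmalloc class-lookup helper name is an assumption based on the series description and may not match the merged code exactly:

	/* Skip recompression if the smaller object still lands in the same size class */
	class_index_old = zs_lookup_class_index(zram->mem_pool, comp_len_old);
	class_index_new = zs_lookup_class_index(zram->mem_pool, comp_len_new);
	if (class_index_new >= class_index_old)
		continue;	/* no zsmalloc memory gain, only CPU overhead */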
2022-11-30 | zram: use IS_ERR_VALUE() to check for zs_malloc() errors | Sergey Senozhatsky | 1 file | -3/+3
Avoid typecasts that are needed for IS_ERR() and use IS_ERR_VALUE() instead. Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Andrew Morton <[email protected]> Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Alexey Romanov <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Suleiman Souhlal <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
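A minimal sketch of the pattern: zs_malloc() returns an unsigned long handle, so the untyped IS_ERR_VALUE() check avoids the casts that IS_ERR() would require (error handling shown is illustrative):

	unsigned long handle;

	handle = zs_malloc(zram->mem_pool, comp_len, GFP_NOIO);
	if (IS_ERR_VALUE(handle))
		return PTR_ERR((void *)handle);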
2022-11-30 | zram: clarify writeback_store() comment | Sergey Senozhatsky | 1 file | -2/+6
Re-phrase writeback BIO error comment. Link: https://lkml.kernel.org/r/[email protected] Reported-by: Andrew Morton <[email protected]> Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Alexey Romanov <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Suleiman Souhlal <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | zram: add recompress flag to read_block_state() | Sergey Senozhatsky | 2 files | -5/+9
Add a new flag to zram block state that shows if the page was recompressed (using alternative compression algorithm). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Alexey Romanov <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Suleiman Souhlal <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | zram: introduce recompress sysfs knob | Sergey Senozhatsky | 3 files | -3/+277
Allow zram to recompress (using secondary compression streams) pages. Re-compression algorithms (we support up to 3 at this stage) are selected via recomp_algorithm:

  echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm

Please read the documentation for more details. We support several recompression modes:

  1) IDLE pages recompression is activated by `idle` mode:
     echo "type=idle" > /sys/block/zram0/recompress
  2) Since there may be many idle pages, user-space may pass a size threshold value (in bytes), and we will recompress only pages of equal or greater size:
     echo "threshold=888" > /sys/block/zram0/recompress
  3) HUGE pages recompression is activated by `huge` mode:
     echo "type=huge" > /sys/block/zram0/recompress
  4) HUGE_IDLE pages recompression is activated by `huge_idle` mode:
     echo "type=huge_idle" > /sys/block/zram0/recompress

[[email protected]: we should always zero out err variable in recompress loop] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Alexey Romanov <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Suleiman Souhlal <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | zram: factor out WB and non-WB zram read functions | Sergey Senozhatsky | 1 file | -23/+49
We will use non-WB variant in ZRAM page recompression path. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Alexey Romanov <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Suleiman Souhlal <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | zram: add recompression algorithm sysfs knob | Sergey Senozhatsky | 1 file | -19/+105
Introduce the recomp_algorithm sysfs knob, which controls the secondary algorithm selection used for recompression. We will support up to 3 secondary compression algorithms, sorted in order of their priority. To select an algorithm, the user has to provide its name and priority:

  echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm
  echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm

During recompression zram iterates through the list of registered secondary algorithms in order of their priorities. We also have a short version for cases when there is only one secondary compression algorithm:

  echo "algo=zstd" > /sys/block/zramX/recomp_algorithm

This will register zstd as the secondary algorithm with priority 1. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Alexey Romanov <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Suleiman Souhlal <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | zram: preparation for multi-zcomp support | Sergey Senozhatsky | 4 files | -32/+80
Patch series "zram: Support multiple compression streams", v5. This series adds support for multiple compression streams. The main idea is that different compression algorithms have different characteristics, and zram may benefit when it uses a combination of algorithms: a default algorithm that is faster but has a lower compression rate, and a secondary algorithm that can achieve a higher compression rate at the price of slower compression/decompression. There are several use-cases for this functionality:

  - huge pages re-compression: zstd or deflate can successfully compress huge pages (~50% of huge pages on my synthetic ChromeOS tests), IOW pages that lzo was not able to compress.
  - idle pages re-compression: idle/cold pages sit in the memory and we may reduce zsmalloc memory usage if we recompress those idle pages.

Userspace has a number of ways to control the behavior and impact of zram recompression: what type of pages should be recompressed, size watermarks, etc. Please refer to the documentation patch. This patch (of 13): The patch turns the compression streams and compressor algorithm name struct zram members into arrays, so that we can have multiple compression streams support (in the next patches). The patch uses a rather explicit API for compressor selection:

  - Get the primary (default) compression stream:
    zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP])
  - Get a secondary compression stream:
    zcomp_stream_get(zram->comps[ZRAM_SECONDARY_COMP])

We use a similar API for compression streams put(). At this point we always have just one compression stream, since CONFIG_ZRAM_MULTI_COMP is not yet defined. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Suleiman Souhlal <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Alexey Romanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: mmu_gather: do not expose delayed_rmap flag | Alexander Gordeev | 2 files | -2/+4
Flag delayed_rmap of 'struct mmu_gather' is rather a private member, but it is still accessed directly. Instead, let the TLB gather code access the flag. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alexander Gordeev <[email protected]> Acked-by: Linus Torvalds <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: delay page_remove_rmap() until after the TLB has been flushed | Linus Torvalds | 4 files | -8/+82
When we remove a page table entry, we are very careful to only free the page after we have flushed the TLB, because other CPUs could still be using the page through stale TLB entries until after the flush. However, we have removed the rmap entry for that page early, which means that functions like folio_mkclean() would end up not serializing with the page table lock because the page had already been made invisible to rmap. And that is a problem, because while the TLB entry exists, we could end up with the following situation: (a) one CPU could come in and clean it, never seeing our mapping of the page (b) another CPU could continue to use the stale and dirty TLB entry and continue to write to said page resulting in a page that has been dirtied, but then marked clean again, all while another CPU might have dirtied it some more. End result: possibly lost dirty data. This extends our current TLB gather infrastructure to optionally track a "should I do a delayed page_remove_rmap() for this page after flushing the TLB". It uses the newly introduced 'encoded page pointer' to do that without having to keep separate data around. Note, this is complicated by a couple of issues: - we want to delay the rmap removal, but not past the page table lock, because that simplifies the memcg accounting - only SMP configurations want to delay TLB flushing, since on UP there are obviously no remote TLBs to worry about, and the page table lock means there are no preemption issues either - s390 has its own mmu_gather model that doesn't delay TLB flushing, and as a result also does not want the delayed rmap. As such, we can treat S390 like the UP case and use a common fallback for the "no delays" case. - we can track an enormous number of pages in our mmu_gather structure, with MAX_GATHER_BATCH_COUNT batches of MAX_TABLE_BATCH pages each, all set up to be approximately 10k pending pages. We do not want to have a huge number of batched pages that we then need to check for delayed rmap handling inside the page table lock. Particularly that last point results in a noteworthy detail, where the normal page batch gathering is limited once we have delayed rmaps pending, in such a way that only the last batch (the so-called "active batch") in the mmu_gather structure can have any delayed entries. NOTE! While the "possibly lost dirty data" sounds catastrophic, for this all to happen you need to have a user thread doing either madvise() with MADV_DONTNEED or a full re-mmap() of the area concurrently with another thread continuing to use said mapping. So arguably this is about user space doing crazy things, but from a VM consistency standpoint it's better if we track the dirty bit properly even when user space goes off the rails. [[email protected]: fix UP build, per Linus] Link: https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Hugh Dickins <[email protected]> Reported-by: Nadav Amit <[email protected]> Tested-by: Nadav Amit <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: mmu_gather: prepare to gather encoded page pointers with flags | Linus Torvalds | 5 files | -19/+19
This is purely a preparatory patch that makes all the data structures ready for encoding flags with the mmu_gather page pointers. The code currently always sets the flag to zero and doesn't use it yet, but now it's tracking the type state along. The next step will be to actually start using it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: teach release_pages() to take an array of encoded page pointers too | Linus Torvalds | 2 files | -6/+31
release_pages() already could take either an array of page pointers, or an array of folio pointers. Expand it to also accept an array of encoded page pointers, which is what both the existing mlock() use and the upcoming mmu_gather use of encoded page pointers wants. Note that release_pages() won't actually use, or react to, any extra encoded bits. Instead, this is very much a case of "I have walked the array of encoded pages and done everything the extra bits tell me to do, now release it all". Also, while the "either page or folio pointers" dual use was handled with a cast of the pointer in "release_folios()", this takes a slightly different approach and uses the "transparent union" attribute to describe the set of arguments to the function: https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html which has been supported by gcc forever, but the kernel hasn't used before. That allows us to avoid using various wrappers with casts, and just use the same function regardless of use. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
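A sketch of the transparent-union argument pattern described above; the member and type names here mirror the description rather than quoting the header verbatim:

	typedef union {
		struct page **pages;
		struct folio **folios;
		struct encoded_page **encoded_pages;
	} release_pages_arg __attribute__ ((__transparent_union__));

	void release_pages(release_pages_arg arg, int nr);

	/* callers can then pass struct page **, struct folio **, or
	 * struct encoded_page ** directly, with no wrappers or casts */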
2022-11-30 | mm: introduce 'encoded' page pointers with embedded extra bits | Linus Torvalds | 1 file | -1/+33
We already have this notion in parts of the MM code (see the mlock code with the LRU_PAGE and NEW_PAGE bits), but I'm going to introduce a new case, and I refuse to do the same thing we've done before where we just put bits in the raw pointer and say it's still a normal pointer. So this introduces a 'struct encoded_page' pointer that cannot be used for anything else than to encode a real page pointer and a couple of extra bits in the low bits. That way the compiler can trivially track the state of the pointer and you just explicitly encode and decode the extra bits. Note that this makes the alignment of 'struct page' explicit even for the case where CONFIG_HAVE_ALIGNED_STRUCT_PAGE is not set. That is entirely redundant in almost all cases, since the page structure already contains several word-sized entries. However, on m68k, the alignment of even 32-bit data is just 16 bits, and as such in theory the alignment of 'struct page' could be too. So let's just make it very very explicit that the alignment needs to be at least 32 bits, giving us a guarantee of two unused low bits in the pointer. Now, in practice, our page struct array is aligned much more than that anyway, even on m68k, and our existing code in mm/mlock.c obviously already depended on that. But since the whole point of this change is to be careful about the type system when hiding extra bits in the pointer, let's also be explicit about the assumptions we make. NOTE! This is being very careful in another way too: it has a build-time assertion that the 'flags' added to the page pointer actually fit in the two bits. That means that this helper must be inlined, and can only be used in contexts where the compiler can statically determine that the value fits in the available bits. [[email protected]: kerneldoc on a forward-declared struct confuses htmldocs] Link: https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Hugh Dickins <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> [s390] Cc: Nadav Amit <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
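A minimal sketch of the encode/decode helpers this introduces, assuming two free low bits thanks to the alignment guarantee discussed above (the constant name is illustrative):

	#define ENCODED_PAGE_BITS	3ul	/* two low bits available for flags */

	struct encoded_page;	/* opaque: never dereferenced directly */

	static __always_inline struct encoded_page *encode_page(struct page *page,
								 unsigned long flags)
	{
		/* flags must be a compile-time constant that fits in the low bits */
		BUILD_BUG_ON(flags > ENCODED_PAGE_BITS);
		return (struct encoded_page *)(flags | (unsigned long)page);
	}

	static inline unsigned long encoded_page_flags(struct encoded_page *page)
	{
		return ENCODED_PAGE_BITS & (unsigned long)page;
	}

	static inline struct page *encoded_page_ptr(struct encoded_page *page)
	{
		return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page);
	}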
2022-11-30 | selftests/vm: anon_cow: add mprotect() optimization tests | David Hildenbrand | 1 file | -3/+46
Let's extend the test to cover the possible mprotect() optimization when removing write-protection. mprotect() must not allow write-access to a COW-shared page by accident. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: remove unused savedwrite infrastructure | David Hildenbrand | 4 files | -133/+5
NUMA hinting no longer uses savedwrite, let's rip it out. ... and while at it, drop __pte_write() and __pmd_write() on ppc64. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite | David Hildenbrand | 5 files | -24/+36
commit b191f9b106ea ("mm: numa: preserve PTE write permissions across a NUMA hinting fault") added remembering write permissions using ordinary pte_write() for PROT_NONE mapped pages to avoid write faults when remapping the page !PROT_NONE on NUMA hinting faults. That commit noted: The patch looks hacky but the alternatives looked worse. The tidest was to rewalk the page tables after a hinting fault but it was more complex than this approach and the performance was worse. It's not generally safe to just mark the page writable during the fault if it's a write fault as it may have been read-only for COW so that approach was discarded. Later, commit 288bc54949fc ("mm/autonuma: let architecture override how the write bit should be stashed in a protnone pte.") introduced a family of savedwrite PTE functions that didn't necessarily improve the whole situation. One confusing thing is that nowadays, if a page is pte_protnone() and pte_savedwrite() then also pte_write() is true. Another source of confusion is that there is only a single pte_mk_savedwrite() call in the kernel. All other write-protection code seems to silently rely on pte_wrprotect(). Ever since PageAnonExclusive was introduced and we started using it in mprotect context via commit 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection"), we do have machinery in place to avoid write faults when changing protection, which is exactly what we want to do here. Let's similarly do what ordinary mprotect() does nowadays when upgrading write permissions and reuse can_change_pte_writable() and can_change_pmd_writable() to detect if we can upgrade PTE permissions to be writable. For anonymous pages there should be absolutely no change: if an anonymous page is not exclusive, it could not have been mapped writable -- because only exclusive anonymous pages can be mapped writable. However, there *might* be a change for writable shared mappings that require writenotify: if they are not dirty, we cannot map them writable. While it might not matter in practice, we'd need a different way to identify whether writenotify is actually required -- and ordinary mprotect would benefit from that as well. Note that we don't optimize for the actual migration case: (1) When migration succeeds the new PTE will not be writable because the source PTE was not writable (protnone); in the future we might just optimize that case similarly by reusing can_change_pte_writable()/can_change_pmd_writable() when removing migration PTEs. (2) When migration fails, we'd have to recalculate the "writable" flag because we temporarily dropped the PT lock; for now keep it simple and set "writable=false". We'll remove all savedwrite leftovers next. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm/mprotect: factor out check whether manual PTE write upgrades are required | David Hildenbrand | 2 files | -15/+18
Let's factor the check out into vma_wants_manual_pte_write_upgrade(), to be reused in NUMA hinting fault context soon. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
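A sketch of what the factored-out helper looks like conceptually; treat this as an approximation, not the exact hunk:

	/*
	 * Can we possibly upgrade individual PTEs to writable manually, even
	 * though the VMA as a whole was not mapped writable upfront?
	 */
	static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
	{
		if (!(vma->vm_flags & VM_WRITE))
			return false;
		/* shared mappings additionally depend on writenotify requirements */
		if (vma->vm_flags & VM_SHARED)
			return vma_wants_writenotify(vma, vma->vm_page_prot);
		return true;
	}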
2022-11-30 | mm/huge_memory: try avoiding write faults when changing PMD protection | David Hildenbrand | 1 file | -2/+36
Let's replicate what we have for PTEs in can_change_pte_writable() also for PMDs. While this might look like a pure performance improvement, we'll use this to get rid of savedwrite handling in do_huge_pmd_numa_page() next. Place do_huge_pmd_numa_page() strategically for that purpose. Note that MM_CP_TRY_CHANGE_WRITABLE is currently only set when we come via mprotect_fixup(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm/mprotect: minor can_change_pte_writable() cleanups | David Hildenbrand | 1 file | -5/+14
We want to replicate this code for handling PMDs soon.

  (1) No need to crash the kernel; warning and rejecting is good enough. As this will no longer get optimized out, drop the pte_write() check: no harm would be done.
  (2) Add a comment explaining why PROT_NONE-mapped pages are excluded.
  (3) Add a comment regarding MAP_SHARED handling and why we rely on the dirty bit in the PTE.

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm/mprotect: allow clean exclusive anon pages to be writable | Nadav Amit | 1 file | -4/+3
Patch series "mm/autonuma: replace savedwrite infrastructure", v2. As discussed in my talk at LPC, we can reuse the same mechanism for deciding whether to map a pte writable when upgrading permissions via mprotect() -- e.g., PROT_READ -> PROT_READ|PROT_WRITE -- to replace the savedwrite infrastructure used for NUMA hinting faults (e.g., PROT_NONE -> PROT_READ|PROT_WRITE). Instead of maintaining previous write permissions for a pte/pmd, we re-determine if the pte/pmd can be writable. The big benefit is that we have a common logic for deciding whether we can map a pte/pmd writable on protection changes. For private mappings, there should be no difference -- from what I understand, that is what autonuma benchmarks care about. I ran autonumabench for v1 on a system with 2 NUMA nodes, 96 GiB each via: perf stat --null --repeat 10 The numa01 benchmark is quite noisy in my environment and I failed to reduce the noise so far. numa01: mm-unstable: 146.88 +- 6.54 seconds time elapsed ( +- 4.45% ) mm-unstable++: 147.45 +- 13.39 seconds time elapsed ( +- 9.08% ) numa02: mm-unstable: 16.0300 +- 0.0624 seconds time elapsed ( +- 0.39% ) mm-unstable++: 16.1281 +- 0.0945 seconds time elapsed ( +- 0.59% ) It is worth noting that for shared writable mappings that require writenotify, we will only avoid write faults if the pte/pmd is dirty (inherited from the older mprotect logic). If we ever care about optimizing that further, we'd need a different mechanism to identify whether the FS still needs to get notified on the next write access. In any case, such an optimization will then not be autonuma-specific, but mprotect() permission upgrades would similarly benefit from it. This patch (of 7): Anonymous pages might have the dirty bit clear, but this should not prevent mprotect from making them writable if they are exclusive. Therefore, skip the test whether the page is dirty in this case. Note that there are already other ways to get a writable PTE mapping an anonymous page that is clean: for example, via MADV_FREE. In an ideal world, we'd have a different indication from the FS whether writenotify is still required. [[email protected]: return directly; update description] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nadav Amit <[email protected]> Signed-off-by: David Hildenbrand <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Peter Xu <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | tools/vm/page_owner: ignore page_owner_sort binary | Rong Tao | 1 file | -0/+1
The page_owner_sort tool was introduced by commit 48c96a368579 ("mm/page_owner: keep track of page owners"), and we should ignore its built binary. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Rong Tao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm,thp,rmap: clean up the end of __split_huge_pmd_locked() | Hugh Dickins | 1 file | -10/+5
It's hard to add a page_add_anon_rmap() into __split_huge_pmd_locked()'s HPAGE_PMD_NR set_pte_at() loop, without wincing at the "freeze" case's HPAGE_PMD_NR page_remove_rmap() loop below it. It's just a mistake to add rmaps in the "freeze" (insert migration entries prior to splitting huge page) case: the pmd_migration case already avoids doing that, so just follow its lead. page_add_ref() versus put_page() likewise. But why is one more put_page() needed in the "freeze" case? Because it's removing the pmd rmap, already removed when pmd_migration (and freeze and pmd_migration are mutually exclusive cases). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: James Houghton <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Zach O'Keefe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped | Hugh Dickins | 5 files | -111/+51
Can the lock_compound_mapcount() bit_spin_lock apparatus be removed now? Yes. Not by atomic64_t or cmpxchg games, those get difficult on 32-bit; but if we slightly abuse subpages_mapcount by additionally demanding that one bit be set there when the compound page is PMD-mapped, then a cascade of two atomic ops is able to maintain the stats without bit_spin_lock. This is harder to reason about than when bit_spin_locked, but I believe safe; and no drift in stats detected when testing. When there are racing removes and adds, of course the sequence of operations is less well- defined; but each operation on subpages_mapcount is atomically good. What might be disastrous, is if subpages_mapcount could ever fleetingly appear negative: but the pte lock (or pmd lock) these rmap functions are called under, ensures that a last remove cannot race ahead of a first add. Continue to make an exception for hugetlb (PageHuge) pages, though that exception can be easily removed by a further commit if necessary: leave subpages_mapcount 0, don't bother with COMPOUND_MAPPED in its case, just carry on checking compound_mapcount too in folio_mapped(), page_mapped(). Evidence is that this way goes slightly faster than the previous implementation in all cases (pmds after ptes now taking around 103ms); and relieves us of worrying about contention on the bit_spin_lock. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: James Houghton <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Zach O'Keefe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm,thp,rmap: subpages_mapcount of PTE-mapped subpages | Hugh Dickins | 5 files | -119/+107
Patch series "mm,thp,rmap: rework the use of subpages_mapcount", v2. This patch (of 3): Following suggestion from Linus, instead of counting every PTE map of a compound page in subpages_mapcount, just count how many of its subpages are PTE-mapped: this yields the exact number needed for NR_ANON_MAPPED and NR_FILE_MAPPED stats, without any need for a locked scan of subpages; and requires updating the count less often. This does then revert total_mapcount() and folio_mapcount() to needing a scan of subpages; but they are inherently racy, and need no locking, so Linus is right that the scans are much better done there. Plus (unlike in 6.1 and previous) subpages_mapcount lets us avoid the scan in the common case of no PTE maps. And page_mapped() and folio_mapped() remain scanless and just as efficient with the new meaning of subpages_mapcount: those are the functions which I most wanted to remove the scan from. The updated page_dup_compound_rmap() is no longer suitable for use by anon THP's __split_huge_pmd_locked(); but page_add_anon_rmap() can be used for that, so long as its VM_BUG_ON_PAGE(!PageLocked) is deleted. Evidence is that this way goes slightly faster than the previous implementation for most cases; but significantly faster in the (now scanless) pmds after ptes case, which started out at 870ms and was brought down to 495ms by the previous series, now takes around 105ms. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: James Houghton <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Zach O'Keefe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>