path: root/mm
2023-06-19  shmem: use ramfs_kill_sb() for kill_sb method of ramfs-based tmpfs  (Roberto Sassu; 1 file, -1/+1)
As the ramfs-based tmpfs uses ramfs_init_fs_context() for the init_fs_context method, which allocates fc->s_fs_info, use ramfs_kill_sb() to free it and avoid a memory leak. Link: https://lkml.kernel.org/r/[email protected] Fixes: c3b1b1cbf002 ("ramfs: add support for "mode=" mount option") Signed-off-by: Roberto Sassu <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Howells <[email protected]> Cc: Al Viro <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
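For illustration, a minimal sketch of the kind of change described, assuming the CONFIG_SHMEM=n tmpfs definition in mm/shmem.c; the struct contents shown are illustrative, not the actual diff:

    /* Sketch: when CONFIG_SHMEM=n, tmpfs is backed by ramfs, so the
     * superblock teardown must go through ramfs_kill_sb(), which also
     * frees the fc->s_fs_info allocated by ramfs_init_fs_context(). */
    static struct file_system_type shmem_fs_type = {
        .name            = "tmpfs",
        .init_fs_context = ramfs_init_fs_context, /* allocates fc->s_fs_info */
        .kill_sb         = ramfs_kill_sb,         /* frees it; was kill_litter_super() */
        .fs_flags        = FS_USERNS_MOUNT,
    };

Pairing the context-allocating init_fs_context method with a kill_sb that knows about that allocation is what closes the leak.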
2023-06-19  mm/mm_init.c: drop 'nid' parameter from check_for_memory()  (Haifeng Xu; 1 file, -4/+4)
The node_id in pgdat has already been set in free_area_init_node(), so use it internally instead of passing a redundant parameter. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Haifeng Xu <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-19  mm/sparse: remove unused parameters in sparse_remove_section()  (Yajun Deng; 2 files, -20/+8)
The parameters ms and map_offset are not used in sparse_remove_section(), so remove them. __remove_section() is only called by __remove_pages(), so remove it as well, and move the WARN_ON_ONCE() into sparse_remove_section(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yajun Deng <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-19  mm/hugetlb: use a folio in hugetlb_fault()  (ZhangPeng; 1 file, -10/+10)
We can replace seven implicit calls to compound_head() with one by using a folio. [[email protected]: update comment, per Sidhartha] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: ZhangPeng <[email protected]> Reviewed-by: Sidhartha Kumar <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Nanyong Sun <[email protected]> Cc: Vishal Moola (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-19  mm/hugetlb: use a folio in hugetlb_wp()  (ZhangPeng; 1 file, -16/+16)
We can replace nine implicit calls to compound_head() with one by using old_folio. The page we get back is always a head page, so we just convert old_page to old_folio. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: ZhangPeng <[email protected]> Suggested-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Muchun Song <[email protected]> Reviewed-by: Sidhartha Kumar <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Nanyong Sun <[email protected]> Cc: Vishal Moola (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-19  mm/hugetlb: use a folio in copy_hugetlb_page_range()  (ZhangPeng; 1 file, -11/+11)
Patch series "Convert several functions in hugetlb.c to use a folio", v2. This patch series converts three functions in hugetlb.c to use a folio, which can remove several implicit calls to compound_head(). This patch (of 3): We can replace five implict calls to compound_head() with one by using pte_folio. The page we get back is always a head page, so we just convert ptepage to pte_folio. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: ZhangPeng <[email protected]> Suggested-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Muchun Song <[email protected]> Reviewed-by: Sidhartha Kumar <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Nanyong Sun <[email protected]> Cc: Vishal Moola (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-19  memory tier: remove unneeded disable_all_demotion_targets() when !CONFIG_MIGRATION  (Miaohe Lin; 1 file, -1/+0)
There's no caller of disable_all_demotion_targets() when CONFIG_MIGRATION is disabled. Remove it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-19  mm: vmscan: mark kswapd_run() and kswapd_stop() __meminit  (Miaohe Lin; 1 file, -2/+2)
Add __meminit to kswapd_run() and kswapd_stop() to ensure they default to __init when memory hotplug is not enabled. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Yu Zhao <[email protected]> Acked-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-19  mm: skip CMA pages when they are not available  (Zhaoyang Huang; 1 file, -1/+21)
This patch fixes unproductive reclaiming of CMA pages by skipping them when they are not available for the current context. It arises from the below OOM issue, which was caused by a large proportion of MIGRATE_CMA pages among free pages.
[ 36.172486] [03-19 10:05:52.172] ActivityManager: page allocation failure: order:0, mode:0xc00(GFP_NOIO), nodemask=(null),cpuset=foreground,mems_allowed=0
[ 36.189447] [03-19 10:05:52.189] DMA32: 0*4kB 447*8kB (C) 217*16kB (C) 124*32kB (C) 136*64kB (C) 70*128kB (C) 22*256kB (C) 3*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 35848kB
[ 36.193125] [03-19 10:05:52.193] Normal: 231*4kB (UMEH) 49*8kB (MEH) 14*16kB (H) 13*32kB (H) 8*64kB (H) 2*128kB (H) 0*256kB 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 3236kB
...
[ 36.234447] [03-19 10:05:52.234] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
[ 36.234455] [03-19 10:05:52.234] cache: ext4_io_end, object size: 64, buffer size: 64, default order: 0, min order: 0
[ 36.234459] [03-19 10:05:52.234] node 0: slabs: 53, objs: 3392, free: 0
This change further decreases the chance of wrong OOMs in the presence of a lot of CMA memory. [[email protected]: changelog addition] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zhaoyang Huang <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: ke.wang <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
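As a sketch of the skip test this describes, assuming it sits in the LRU isolation path of mm/vmscan.c (names and placement are illustrative, not the exact diff):

    /* CMA pageblocks can only satisfy movable allocations: reclaiming
     * them for a non-movable request frees memory the caller still
     * cannot use, so skip such folios during direct reclaim. */
    static bool skip_cma(struct folio *folio, struct scan_control *sc)
    {
        return !current_is_kswapd() &&
               gfp_migratetype(sc->gfp_mask) != MIGRATE_MOVABLE &&
               get_pageblock_migratetype(&folio->page) == MIGRATE_CMA;
    }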
2023-06-19  mm: page_isolation: write proper kerneldoc  (Johannes Weiner; 1 file, -7/+26)
And remove the incorrect header comments. [[email protected]: s/lower/first/, s/upper/last/, per Mike] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-19Revert "mm: vmscan: make global slab shrink lockless"Qi Zheng1-12/+16
This reverts commit f95bdb700bc6bb74e1199b1f5f90c613e152cfa7. Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec test case [1], which is caused by commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). The root cause is that SRCU has to be careful to not frequently check for SRCU read-side critical section exits. Therefore, even if no one is currently in the SRCU read-side critical section, synchronize_srcu() cannot return quickly. That's why unregister_shrinker() has become slower. After discussion, we will try to use the refcount+RCU method [2] proposed by Dave Chinner to continue to re-implement the lockless slab shrink. So revert the shrinker_srcu related changes first. [1]. https://lore.kernel.org/lkml/[email protected]/ [2]. https://lore.kernel.org/lkml/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-lkp/[email protected] Signed-off-by: Qi Zheng <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-19Revert "mm: vmscan: make memcg slab shrink lockless"Qi Zheng1-26/+19
This reverts commit caa05325c9126c77ebf114edce51536a0d0a9a08. Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec test case [1], which is caused by commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). The root cause is that SRCU has to be careful to not frequently check for SRCU read-side critical section exits. Therefore, even if no one is currently in the SRCU read-side critical section, synchronize_srcu() cannot return quickly. That's why unregister_shrinker() has become slower. After discussion, we will try to use the refcount+RCU method [2] proposed by Dave Chinner to continue to re-implement the lockless slab shrink. So revert the shrinker_srcu related changes first. [1]. https://lore.kernel.org/lkml/[email protected]/ [2]. https://lore.kernel.org/lkml/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-lkp/[email protected] Signed-off-by: Qi Zheng <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-19Revert "mm: vmscan: add shrinker_srcu_generation"Qi Zheng1-20/+4
This reverts commit 475733dda5aedba9e086379aafe6b5ffd53e8f5e. Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec test case [1], which is caused by commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). The root cause is that SRCU has to be careful to not frequently check for SRCU read-side critical section exits. Therefore, even if no one is currently in the SRCU read-side critical section, synchronize_srcu() cannot return quickly. That's why unregister_shrinker() has become slower. We will try to use the refcount+RCU method [2] proposed by Dave Chinner to continue to re-implement the lockless slab shrink. So revert the shrinker_srcu related changes first. [1]. https://lore.kernel.org/lkml/[email protected]/ [2]. https://lore.kernel.org/lkml/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-lkp/[email protected] Signed-off-by: Qi Zheng <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-19Revert "mm: shrinkers: make count and scan in shrinker debugfs lockless"Qi Zheng1-8/+17
This reverts commit 20cd1892fcc3efc10a7ac327cc3790494bec46b5. Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec test case [1], which is caused by commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). The root cause is that SRCU has to be careful to not frequently check for SRCU read-side critical section exits. Therefore, even if no one is currently in the SRCU read-side critical section, synchronize_srcu() cannot return quickly. That's why unregister_shrinker() has become slower. We will try to use the refcount+RCU method [2] proposed by Dave Chinner to continue to re-implement the lockless slab shrink. So revert the shrinker_srcu related changes first. [1]. https://lore.kernel.org/lkml/[email protected]/ [2]. https://lore.kernel.org/lkml/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-lkp/[email protected] Signed-off-by: Qi Zheng <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-19Revert "mm: vmscan: hold write lock to reparent shrinker nr_deferred"Qi Zheng1-2/+2
This reverts commit b3cabea3c9153fd42fe5cb851ac58b51ea2b32b8. Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec test case [1], which is caused by commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). The root cause is that SRCU has to be careful to not frequently check for SRCU read-side critical section exits. Therefore, even if no one is currently in the SRCU read-side critical section, synchronize_srcu() cannot return quickly. That's why unregister_shrinker() has become slower. We will try to use the refcount+RCU method [2] proposed by Dave Chinner to continue to re-implement the lockless slab shrink. Because there will be other readers after reverting the shrinker_srcu related changes, it is better to restore holding the read lock to reparent shrinker nr_deferred. [1]. https://lore.kernel.org/lkml/[email protected]/ [2]. https://lore.kernel.org/lkml/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-lkp/[email protected] Signed-off-by: Qi Zheng <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-19Revert "mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()"Qi Zheng1-2/+6
This reverts commit 1643db98d9b314e0a592d152603094fbf7ab906e. Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec test case [1], which is caused by commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). The root cause is that SRCU has to be careful to not frequently check for SRCU read-side critical section exits. Therefore, even if no one is currently in the SRCU read-side critical section, synchronize_srcu() cannot return quickly. That's why unregister_shrinker() has become slower. We will try to use the refcount+RCU method [2] proposed by Dave Chinner to continue to re-implement the lockless slab shrink. So we still need shrinker_rwsem in synchronize_shrinkers() after reverting the shrinker_srcu related changes. [1]. https://lore.kernel.org/lkml/[email protected]/ [2]. https://lore.kernel.org/lkml/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-lkp/[email protected] Signed-off-by: Qi Zheng <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-19Revert "mm: shrinkers: convert shrinker_rwsem to mutex"Qi Zheng2-24/+24
Patch series "revert shrinker_srcu related changes". This patch (of 7): This reverts commit cf2e309ebca7bb0916771839f9b580b06c778530. Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec test case [1], which is caused by commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). The root cause is that SRCU has to be careful to not frequently check for SRCU read-side critical section exits. Therefore, even if no one is currently in the SRCU read-side critical section, synchronize_srcu() cannot return quickly. That's why unregister_shrinker() has become slower. After discussion, we will try to use the refcount+RCU method [2] proposed by Dave Chinner to continue to re-implement the lockless slab shrink. So revert the shrinker_mutex back to shrinker_rwsem first. [1]. https://lore.kernel.org/lkml/[email protected]/ [2]. https://lore.kernel.org/lkml/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-lkp/[email protected] Signed-off-by: Qi Zheng <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yujie Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-19  mm/khugepaged: fix iteration in collapse_file  (David Stevens; 1 file, -1/+0)
Remove an unnecessary call to xas_set(index) when iterating over the target range in collapse_file. The extra call to xas_set reset the xas cursor to the top of the tree, causing the xas_next call on the next iteration to walk the tree to index instead of advancing to index+1. This returned the same page again, which would cause collapse_file to fail because the page is already locked. This bug was hidden when CONFIG_DEBUG_VM was set. When that config was used, the xas_load in a subsequent VM_BUG_ON assert would walk xas from the top of the tree to index, causing the xas_next call on the next loop iteration to advance the cursor as expected. Link: https://lkml.kernel.org/r/[email protected] Fixes: a2e17cc2efc7 ("mm/khugepaged: maintain page cache uptodate flag") Signed-off-by: David Stevens <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jiaqi Yan <[email protected]> Cc: Kirill A . Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Yang Shi <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
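A sketch of the iteration pattern at issue, with the removed call left as a comment (locking elided; names illustrative):

    XA_STATE(xas, &mapping->i_pages, start - 1);
    pgoff_t index;

    for (index = start; index < end; index++) {
        struct page *page = xas_next(&xas);  /* advances the cursor by one */

        /* Removed: xas_set(&xas, index);
         * Re-seeding the cursor here makes the next xas_next() re-walk
         * the tree to 'index' and return the same entry again. */
        if (!page)
            continue;
        /* ... process page ... */
    }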
2023-06-19  memfd: check for non-NULL file_seals in memfd_create() syscall  (Roberto Sassu; 1 file, -3/+6)
Ensure that file_seals is non-NULL before using it in the memfd_create() syscall. One situation in which memfd_file_seals_ptr() could return a NULL pointer is when CONFIG_SHMEM=n, which would oops the kernel. Link: https://lkml.kernel.org/r/[email protected] Fixes: 47b9012ecdc7 ("shmem: add sealing support to hugetlb-backed memfd") Signed-off-by: Roberto Sassu <[email protected]> Cc: Marc-André Lureau <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
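A sketch of the guard, hedged as illustrative rather than the exact diff:

    unsigned int *file_seals = memfd_file_seals_ptr(file);

    /* NULL e.g. for a hugetlb-backed memfd with CONFIG_SHMEM=n;
     * dereferencing unconditionally would oops. */
    if (file_seals)
        *file_seals |= F_SEAL_SEAL;  /* example seal manipulation */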
2023-06-19  mm/vmalloc: do not output a spurious warning when huge vmalloc() fails  (Lorenzo Stoakes; 1 file, -4/+13)
In __vmalloc_area_node() we always warn_alloc() when an allocation performed by vm_area_alloc_pages() fails unless it was due to a pending fatal signal. However, huge page allocations instigated either by vmalloc_huge() or __vmalloc_node_range() (or a caller that invokes this like kvmalloc() or kvmalloc_node()) always fall back to order-0 allocations if the huge page allocation fails. This renders the warning useless and noisy, especially as all callers appear to be aware that this may fall back. This has already resulted in at least one bug report from a user who was confused by this (see link). Therefore, simply update the code to only output this warning for order-0 pages when no fatal signal is pending. Link: https://bugzilla.suse.com/show_bug.cgi?id=1211410 Link: https://lkml.kernel.org/r/[email protected] Fixes: 80b1d8fdfad1 ("mm: vmalloc: correct use of __GFP_NOWARN mask in __vmalloc_area_node()") Signed-off-by: Lorenzo Stoakes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: Baoquan He <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
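A sketch of the narrowed warning, assuming the shape of the failure path in __vmalloc_area_node() (variable names are illustrative):

    /* Huge attempts (page_order > 0) silently fall back to order-0,
     * so only warn when the final, order-0 allocation fails. */
    if (!fatal_signal_pending(current) && page_order == 0)
        warn_alloc(gfp_mask, NULL,
                   "vmalloc error: size %lu, failed to allocate pages",
                   area->nr_pages * PAGE_SIZE);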
2023-06-19  mm/mprotect: fix do_mprotect_pkey() limit check  (Liam R. Howlett; 1 file, -1/+1)
The return of do_mprotect_pkey() can still be incorrectly returned as success if there is a gap that spans to or beyond the end address passed in. Update the check to ensure that the end address has indeed been seen. Link: https://lore.kernel.org/all/CABi2SkXjN+5iFoBhxk71t3cmunTk-s=rB4T7qo0UQRh17s49PQ@mail.gmail.com/ Link: https://lkml.kernel.org/r/[email protected] Fixes: 82f951340f25 ("mm/mprotect: fix do_mprotect_pkey() return on error") Signed-off-by: Liam R. Howlett <[email protected]> Reported-by: Jeff Xu <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-18  mm/mmap: Fix error path in do_vmi_align_munmap()  (Liam R. Howlett; 1 file, -20/+17)
The error unrolling was leaving the VMAs detached in many cases and leaving the locked_vm statistic altered, and skipping the unrolling entirely in the case of the vma tree write failing. Fix the error path by re-attaching the detached VMAs and adding the necessary goto for the failed vma tree write, and fix the locked_vm statistic by only updating after the vma tree write succeeds. Fixes: 763ecb035029 ("mm: remove the vma linked list") Reported-by: Vegard Nossum <[email protected]> Signed-off-by: Liam R. Howlett <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2023-06-18  tcp: Use per-vma locking for receive zerocopy  (Arjun Roy; 1 file, -3/+4)
Per-VMA locking allows us to lock a struct vm_area_struct without taking the process-wide mmap lock in read mode. Consider a process workload where the mmap lock is taken constantly in write mode. In this scenario, all zerocopy receives are periodically blocked during that period of time - though in principle, the memory ranges being used by TCP are not touched by the operations that need the mmap write lock. This results in performance degradation. Now consider another workload where the mmap lock is never taken in write mode, but there are many TCP connections using receive zerocopy that are concurrently receiving. These connections all take the mmap lock in read mode, but this does induce a lot of contention and atomic ops for this process-wide lock. This results in additional CPU overhead caused by contending on the cache line for this lock. However, with per-vma locking, both of these problems can be avoided. As a test, I ran an RPC-style request/response workload with 4KB payloads and receive zerocopy enabled, with 100 simultaneous TCP connections. I measured perf cycles within the find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and without per-vma locking enabled. When using process-wide mmap semaphore read locking, about 1% of measured perf cycles were within this path. With per-VMA locking, this value dropped to about 0.45%. Signed-off-by: Arjun Roy <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
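A sketch of the lookup strategy described, assuming CONFIG_PER_VMA_LOCK; the helper name follows the patch, but treat the body as illustrative (the real code also verifies the VMA actually belongs to TCP before using it):

    static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
                                               unsigned long address,
                                               bool *mmap_locked)
    {
        struct vm_area_struct *vma;

        vma = lock_vma_under_rcu(mm, address);  /* per-VMA read lock */
        if (vma) {
            *mmap_locked = false;
            return vma;
        }
        mmap_read_lock(mm);                     /* contended fallback */
        vma = vma_lookup(mm, address);
        *mmap_locked = true;
        return vma;
    }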
2023-06-12  Merge tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds; 4 files, -12/+28)
Pull misc fixes from Andrew Morton: "19 hotfixes. 14 are cc:stable and the remainder address issues which were introduced during this development cycle or which were considered inappropriate for a backport"
* tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  zswap: do not shrink if cgroup may not zswap
  page cache: fix page_cache_next/prev_miss off by one
  ocfs2: check new file size on fallocate call
  mailmap: add entry for John Keeping
  mm/damon/core: fix divide error in damon_nr_accesses_to_accesses_bp()
  epoll: ep_autoremove_wake_function should use list_del_init_careful
  mm/gup_test: fix ioctl fail for compat task
  nilfs2: reject devices with insufficient block count
  ocfs2: fix use-after-free when unmounting read-only filesystem
  lib/test_vmalloc.c: avoid garbage in page array
  nilfs2: fix possible out-of-bounds segment allocation in resize ioctl
  riscv/purgatory: remove PGO flags
  powerpc/purgatory: remove PGO flags
  x86/purgatory: remove PGO flags
  kexec: support purgatories with .text.hot sections
  mm/uffd: allow vma to merge as much as possible
  mm/uffd: fix vma operation where start addr cuts part of vma
  radix-tree: move declarations to header
  nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()
2023-06-12  zswap: do not shrink if cgroup may not zswap  (Nhat Pham; 1 file, -2/+9)
Before storing a page, zswap first checks if the number of stored pages exceeds the limit specified by memory.zswap.max, for each cgroup in the hierarchy. If this limit is reached or exceeded, then zswap shrinking is triggered and short-circuits the store attempt. However, since the zswap's LRU is not memcg-aware, this can create the following pathological behavior: the cgroup whose zswap limit is 0 will evict pages from other cgroups continually, without lowering its own zswap usage. This means the shrinking will continue until the need for swap ceases or the pool becomes empty. As a result of this, we observe a disproportionate amount of zswap writeback and a perpetually small zswap pool in our experiments, even though the pool limit is never hit. More generally, a cgroup might unnecessarily evict pages from other cgroups before we drive the memcg back below its limit. This patch fixes the issue by rejecting zswap store attempt without shrinking the pool when obj_cgroup_may_zswap() returns false. [[email protected]: fix return of uninitialized value] [[email protected]: s/ENOSPC/ENOMEM/] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: f4840ccfca25 ("zswap: memcg accounting") Signed-off-by: Nhat Pham <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Yosry Ahmed <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
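A sketch of the changed store path (a fragment, assuming it sits in zswap_frontswap_store(); the error-path label is illustrative):

    objcg = get_obj_cgroup_from_page(page);
    if (objcg && !obj_cgroup_may_zswap(objcg)) {
        ret = -ENOMEM;  /* reject the store outright... */
        goto reject;    /* ...instead of jumping to the shrink path,
                         * which would evict other cgroups' pages */
    }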
2023-06-12  page cache: fix page_cache_next/prev_miss off by one  (Mike Kravetz; 1 file, -10/+16)
Ackerley Tng reported an issue with hugetlbfs fallocate here[1]. The issue showed up after the conversion of hugetlb page cache lookup code to use page_cache_next_miss. Code in hugetlb fallocate, userfaultfd and GUP is now using page_cache_next_miss to determine if a page is present in the page cache. The following statement is used:
    present = page_cache_next_miss(mapping, index, 1) != index;
There are two issues with page_cache_next_miss when used in this way.
1) If the passed value for index is equal to the 'wrap-around' value, the same index will always be returned. This wrap-around value is 0, so 0 will be returned even if a page is present at index 0.
2) If there is no gap in the range passed, the last index in the range will be returned. When passed a range of 1 as above, the passed index value will be returned even if the page is present.
The end result is that the statement above will NEVER indicate a page is present in the cache, even if it is. As noted by Ackerley in [1], users can see this by hugetlb fallocate incorrectly returning EEXIST if pages are already present in the file. In addition, hugetlb pages will not be included in core dumps if they need to be brought in via GUP. userfaultfd UFFDIO_COPY also uses this code and will not notice pages already present in the cache. It may try to allocate a new page and potentially return ENOMEM as opposed to EEXIST. Both page_cache_next_miss and page_cache_prev_miss have similar issues. Fix by:
- Check for index equal to the 'wrap-around' value and do not exit early.
- If no gap is found in the range, return an index outside the range.
- Update the function description to say the 'wrap-around' value could be returned if passed as index.
[1] https://lore.kernel.org/linux-mm/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Fixes: d0ce0e47b323 ("mm/hugetlb: convert hugetlb fault paths to use alloc_hugetlb_folio()") Signed-off-by: Mike Kravetz <[email protected]> Reported-by: Ackerley Tng <[email protected]> Reviewed-by: Ackerley Tng <[email protected]> Tested-by: Ackerley Tng <[email protected]> Cc: Erdem Aktas <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vishal Annapurve <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
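The fixed semantics make the presence-test idiom from the changelog work; shown here wrapped in a helper for clarity (the helper itself is illustrative):

    static bool page_present(struct address_space *mapping, pgoff_t index)
    {
        /* After the fix: if the page at 'index' exists, no gap is found
         * in this one-entry range and an index outside the range is
         * returned, so the comparison is true. Before the fix, 'index'
         * itself was returned, making this always false. */
        return page_cache_next_miss(mapping, index, 1) != index;
    }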
2023-06-12  mm/damon/core: fix divide error in damon_nr_accesses_to_accesses_bp()  (Kefeng Wang; 1 file, -0/+2)
If 'aggr_interval' is smaller than 'sample_interval', max_nr_accesses in damon_nr_accesses_to_accesses_bp() becomes zero, which leads to a divide error. Validate the values in damon_set_attrs() to fix it, similar to the checks for the other attrs. Link: https://lkml.kernel.org/r/[email protected] Fixes: 2f5bef5a590b ("mm/damon/core: update monitoring results for new monitoring attributes") Reported-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=841a46899768ec7bec67 Link: https://lore.kernel.org/damon/[email protected]/ Reviewed-by: SeongJae Park <[email protected]> Signed-off-by: Kefeng Wang <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
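A sketch of the validation, assuming it is added to damon_set_attrs() alongside the existing checks (illustrative, not the exact diff):

    /* A sampling interval longer than the aggregation interval would
     * make max_nr_accesses zero and later divide by it. */
    if (attrs->sample_interval > attrs->aggr_interval)
        return -EINVAL;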
2023-06-12  mm/gup_test: fix ioctl fail for compat task  (Haibo Li; 1 file, -0/+1)
When tools/testing/selftests/mm/gup_test.c is compiled as 32-bit and then run on an arm64 kernel, it reports "ioctl: Inappropriate ioctl for device". Fix it by filling in compat_ioctl in gup_test_fops. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Haibo Li <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: AngeloGioacchino Del Regno <[email protected]> Cc: Matthias Brugger <[email protected]> Cc: John Hubbard <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
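A sketch of the fops change; compat_ptr_ioctl is the stock helper for ioctls whose pointer arguments need no 32-bit translation (the surrounding fields are illustrative):

    static const struct file_operations gup_test_fops = {
        .open           = nonseekable_open,
        .unlocked_ioctl = gup_test_ioctl,
        .compat_ioctl   = compat_ptr_ioctl,  /* the added line */
    };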
2023-06-12  block: replace fmode_t with a block-specific type for block open flags  (Christoph Hellwig; 1 file, -1/+1)
The only overlap between the block open flags mapped into the fmode_t and other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl and stop abusing fmode_t. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Jack Wang <[email protected]> [rnbd] Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-06-12  block: use the holder as indication for exclusive opens  (Christoph Hellwig; 1 file, -4/+3)
The current interface for exclusive opens is rather confusing as it requires both the FMODE_EXCL flag and a holder. Remove the need to pass FMODE_EXCL and just key off the exclusive open off a non-NULL holder. For blkdev_put this requires adding the holder argument, which provides better debug checking that only the holder actually releases the hold, but at the same time allows removing the now superfluous mode argument. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Acked-by: Christian Brauner <[email protected]> Acked-by: David Sterba <[email protected]> [btrfs] Acked-by: Jack Wang <[email protected]> [rnbd] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-06-09  mm: sparse: mark populate_section_memmap() static  (Arnd Bergmann; 1 file, -1/+1)
There are two definitions of this function, but the second one lacks the 'static' annotation: mm/sparse.c:704:25: error: no previous prototype for 'populate_section_memmap' [-Werror=missing-prototypes] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Eric Paris <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Simek <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Moore <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Russell King <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Waiman Long <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09  mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings  (Lorenzo Stoakes; 1 file, -0/+101)
Writing to file-backed dirty-tracked mappings via GUP is inherently broken as we cannot rule out folios being cleaned and then a GUP user writing to them again and possibly marking them dirty unexpectedly. This is especially egregious for long-term mappings (as indicated by the use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as we have already done in the slow path. We have access to less information in the fast path as we cannot examine the VMA containing the mapping, however we can determine whether the folio is anonymous or belonging to a whitelisted filesystem - specifically hugetlb and shmem mappings. We take special care to ensure that both the folio and mapping are safe to access when performing these checks and document folio_fast_pin_allowed() accordingly. It's important to note that there are no APIs allowing users to specify FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can always rely on the fact that if we fail to pin on the fast path, the code will fall back to the slow path which can perform the more thorough check. Link: https://lkml.kernel.org/r/a27d39b87ded7f3dad5fd4181edb106393660453.1683235180.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Suggested-by: Kirill A . Shutemov <[email protected]> Suggested-by: Peter Zijlstra <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: Mika Penttilä <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
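A simplified sketch of the whitelist logic; the real folio_fast_pin_allowed() additionally takes great care to read folio->mapping safely under GUP-fast constraints, elided here, and the helper name below is hypothetical:

    static bool longterm_write_pin_allowed(struct folio *folio)
    {
        struct address_space *mapping = folio->mapping;

        if (folio_test_anon(folio))
            return true;    /* anonymous: no dirty tracking */
        if (!mapping)
            return false;   /* truncated or being moved: be conservative */
        if (shmem_mapping(mapping) || folio_test_hugetlb(folio))
            return true;    /* whitelisted: no folio dirty tracking */
        return false;       /* other file-backed mappings: disallow */
    }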
2023-06-09  mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed mappings  (Lorenzo Stoakes; 1 file, -1/+43)
Writing to file-backed mappings which require folio dirty tracking using GUP is a fundamentally broken operation, as kernel write access to GUP mappings does not adhere to the semantics expected by a file system. A GUP caller uses the direct mapping to access the folio, which does not cause write notify to trigger, nor does it enforce that the caller marks the folio dirty. The problem arises when, after an initial write to the folio, writeback results in the folio being cleaned and then the caller, via the GUP interface, writes to the folio again. As a result of the use of this secondary, direct, mapping to the folio no write notify will occur, and if the caller does mark the folio dirty, this will be done unexpectedly. For example, consider the following scenario:
1. A folio is written to via GUP which write-faults the memory, notifying the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty (though it does not have to).
This results in both data being written to a folio without writenotify, and the folio being dirtied unexpectedly (if the caller decides to do so). This issue was first reported by Jan Kara [1] in 2018, where the problem resulted in file system crashes. This is only relevant when the mappings are file-backed and the underlying file system requires folio dirty tracking. File systems which do not, such as shmem or hugetlb, are not at risk and therefore can be written to without issue. Unfortunately this limitation of GUP has been present for some time and requires future rework of the GUP API in order to provide correct write access to such mappings. However, for the time being we introduce this check to prevent the most egregious case of this occurring, use of the FOLL_LONGTERM pin. These mappings are considerably more likely to be written to after folios are cleaned and thus simply must not be permitted to do so. This patch changes only the slow-path GUP functions, a following patch adapts the GUP-fast path along similar lines. [1] https://lore.kernel.org/linux-mm/[email protected]/ Link: https://lkml.kernel.org/r/7282506742d2390c125949c2f9894722750bb68a.1683235180.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <[email protected]> Suggested-by: Jason Gunthorpe <[email protected]> Reviewed-by: John Hubbard <[email protected]> Reviewed-by: Mika Penttilä <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Kirill A . Shutemov <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09  mm/mmap: separate writenotify and dirty tracking logic  (Lorenzo Stoakes; 1 file, -12/+46)
Patch series "mm/gup: disallow GUP writing to file-backed mappings by default", v9. Writing to file-backed mappings which require folio dirty tracking using GUP is a fundamentally broken operation, as kernel write access to GUP mappings do not adhere to the semantics expected by a file system. A GUP caller uses the direct mapping to access the folio, which does not cause write notify to trigger, nor does it enforce that the caller marks the folio dirty. The problem arises when, after an initial write to the folio, writeback results in the folio being cleaned and then the caller, via the GUP interface, writes to the folio again. As a result of the use of this secondary, direct, mapping to the folio no write notify will occur, and if the caller does mark the folio dirty, this will be done so unexpectedly. For example, consider the following scenario:- 1. A folio is written to via GUP which write-faults the memory, notifying the file system and dirtying the folio. 2. Later, writeback is triggered, resulting in the folio being cleaned and the PTE being marked read-only. 3. The GUP caller writes to the folio, as it is mapped read/write via the direct mapping. 4. The GUP caller, now done with the page, unpins it and sets it dirty (though it does not have to). This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As pin_user_pages_fast_only() does not exist, we can rely on a slightly imperfect whitelisting in the PUP-fast case and fall back to the slow case should this fail. This patch (of 3): vma_wants_writenotify() is specifically intended for setting PTE page table flags, accounting for existing page table flag state and whether the underlying filesystem performs dirty tracking for a file-backed mapping. Everything is predicated firstly on whether the mapping is shared writable, as this is the only instance where dirty tracking is pertinent - MAP_PRIVATE mappings will always be CoW'd and unshared, and read-only file-backed shared mappings cannot be written to, even with FOLL_FORCE. All other checks are in line with existing logic, though now separated into checks eplicitily for dirty tracking and those for determining how to set page table flags. We make this change so we can perform checks in the GUP logic to determine which mappings might be problematic when written to. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/0f218370bd49b4e6bbfbb499f7c7b92c26ba1ceb.1683235180.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: John Hubbard <[email protected]> Reviewed-by: Mika Penttilä <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Kirill A . Shutemov <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09  mm/early_ioremap.c: improve the execution efficiency of early_ioremap_setup()  (Liam Ni; 1 file, -5/+3)
Reduce the number of needless loop iterations in early_ioremap_setup() to improve the efficiency of function execution. Link: https://lkml.kernel.org/r/CACZJ9cU6t5sLoDwE6_XOg+UJLpZt4+qHfjYN2bA0s+3y9y6pQQ@mail.gmail.com Signed-off-by: LiamNi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09  memcg: use helper macro FLUSH_TIME  (Miaohe Lin; 1 file, -1/+1)
Use helper macro FLUSH_TIME to indicate the flush time to improve the readability a bit. No functional change intended. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Shakeel Butt <[email protected]> Reviewed-by: Muchun Song <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09  mm: page_alloc: remove unneeded header files  (Miaohe Lin; 1 file, -4/+0)
Remove some unneeded header files. No functional change intended. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09  mm: fix failure to unmap pte on highmem systems  (Ryan Roberts; 1 file, -4/+2)
The loser of a race to service a pte for a device private entry in the swap path previously unlocked the ptl, but failed to unmap the pte. This only affects highmem systems since unmapping a pte is a noop on non-highmem systems. Link: https://lkml.kernel.org/r/[email protected] Fixes: 16ce101db85d ("mm/memory.c: fix race when faulting a device private page") Signed-off-by: Ryan Roberts <[email protected]> Reviewed-by: Zi Yan <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Uladzislau Rezki (Sony) <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
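A sketch of the fix at the losing side of the race (a fragment; the context follows the fault-handler shape described above):

    if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
        /* was: spin_unlock(vmf->ptl) - which, on highmem
         * configurations, leaked the kmap of the pte page */
        pte_unmap_unlock(vmf->pte, vmf->ptl);
        goto out;
    }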
2023-06-09  mm/damon/ops-common: refactor to use {pte|pmd}p_clear_young_notify()  (Ryan Roberts; 1 file, -20/+2)
With the fix in place to atomically test and clear young on ptes and pmds, simplify the code to handle the clearing for both the primary mmu and the mmu notifier with a single API call. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Acked-by: Yu Zhao <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Uladzislau Rezki (Sony) <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09  mm/damon/ops-common: atomically test and clear young on ptes and pmds  (Ryan Roberts; 4 files, -16/+12)
It is racy to non-atomically read a pte, then clear the young bit, then write it back as this could discard dirty information. Further, it is bad practice to directly set a pte entry within a table. Instead clearing young must go through the arch-provided helper, ptep_test_and_clear_young() to ensure it is modified atomically and to give the arch code visibility and allow it to check (and potentially modify) the operation. Link: https://lkml.kernel.org/r/[email protected] Fixes: 3f49584b262c ("mm/damon: implement primitives for the virtual memory address spaces"). Signed-off-by: Ryan Roberts <[email protected]> Reviewed-by: Zi Yan <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Uladzislau Rezki (Sony) <[email protected]> Cc: Yu Zhao <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
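A sketch contrasting the two approaches (the wrapper name is illustrative):

    static void damon_mkold_one_pte(struct vm_area_struct *vma,
                                    unsigned long addr, pte_t *pte)
    {
        /* Racy (removed): read, clear young, write back - a hardware
         * dirty-bit update in between is silently discarded:
         *     *pte = pte_mkold(*pte);
         *
         * Atomic (added): the arch helper clears the accessed bit
         * atomically and gives the architecture visibility: */
        ptep_test_and_clear_young(vma, addr, pte);
    }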
2023-06-09  mm: vmalloc must set pte via arch code  (Ryan Roberts; 1 file, -2/+8)
Patch series "Fixes for pte encapsulation bypasses", v3. A series to improve the encapsulation of pte entries by disallowing non-arch code from directly dereferencing pte_t pointers. This patch (of 4): It is bad practice to directly set pte entries within a pte table. Instead all modifications must go through arch-provided helpers such as set_pte_at() to give the arch code visibility and allow it to check (and potentially modify) the operation. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 3e9a9e256b1e ("mm: add a vmap_pfn function") Signed-off-by: Ryan Roberts <[email protected]> Reviewed-by: Zi Yan <[email protected]> Acked-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09  vmstat: allow_direct_reclaim should use zone_page_state_snapshot  (Marcelo Tosatti; 1 file, -1/+1)
A customer provided evidence indicating that a process was stalled in direct reclaim:
- The process was trapped in throttle_direct_reclaim(). The function wait_event_killable() was called to wait for the condition allow_direct_reclaim(pgdat) to become true for the current node. allow_direct_reclaim(pgdat) examined the number of free pages on the node via zone_page_state(), which just returns the value in zone->vm_stat[NR_FREE_PAGES].
- On node #1, zone->vm_stat[NR_FREE_PAGES] was 0. However, the freelist on this node was not empty.
- This inconsistency in the vmstat value was caused by percpu vmstat on nohz_full cpus. Every increment/decrement of vmstat is performed on a percpu vmstat counter at first, then pooled diffs are cumulated to the zone's vmstat counter in a timely manner. However, on nohz_full cpus (in the case of this customer's system, 48 of 52 cpus) these pooled diffs were not cumulated once the cpu had no event on it, so the cpu started sleeping infinitely. I checked percpu vmstat and found there were in total 69 counts not yet cumulated to the zone's vmstat counter.
- In this situation, kswapd did not help the trapped process. In pgdat_balanced(), zone_watermark_ok_safe() examined the number of free pages on the node via zone_page_state_snapshot(), which checks pending counts on percpu vmstat. Therefore kswapd could know there were 69 free pages correctly. Since zone->_watermark = {8, 20, 32}, kswapd did not work because 69 was greater than 32, the high watermark.
Change allow_direct_reclaim to use zone_page_state_snapshot, which allows a more precise version of the vmstat counters to be used. allow_direct_reclaim will only be called from try_to_free_pages, which is not a hot path. Testing: Due to difficulties accessing the system, it has not been possible for the reproducer to test the patch (however, it's clear from the available data and analysis that it should fix it). Link: https://lkml.kernel.org/r/[email protected] Reviewed-by: Michal Hocko <[email protected]> Reviewed-by: Aaron Tomlin <[email protected]> Signed-off-by: Marcelo Tosatti <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
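The change itself is one call site in allow_direct_reclaim(); a sketch of the loop body (a fragment, names per the surrounding code):

    for (i = 0; i <= ZONE_NORMAL; i++) {
        zone = &pgdat->node_zones[i];
        if (!managed_zone(zone))
            continue;
        pfmemalloc_reserve += min_wmark_pages(zone);
        /* was zone_page_state(); the snapshot variant folds in pending
         * per-cpu diffs that nohz_full cpus may never have flushed */
        free_pages += zone_page_state_snapshot(zone, NR_FREE_PAGES);
    }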
2023-06-09  fs: factor out a direct_write_fallback helper  (Christoph Hellwig; 1 file, -51/+15)
Add a helper dealing with handling the syncing of a buffered write fallback for direct I/O. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Miklos Szeredi <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Cc: Al Viro <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Johannes Thumshirn <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Xiubo Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09  filemap: add a kiocb_invalidate_post_direct_write helper  (Christoph Hellwig; 1 file, -17/+20)
Add a helper to invalidate page cache after a dio write. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Acked-by: Darrick J. Wong <[email protected]> Cc: Al Viro <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Johannes Thumshirn <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Xiubo Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09  filemap: add a kiocb_invalidate_pages helper  (Christoph Hellwig; 1 file, -20/+28)
Factor out a helper that calls filemap_write_and_wait_range and invalidate_inode_pages2_range for the range covered by a write kiocb or returns -EAGAIN if the kiocb is marked as nowait and there would be pages to write or invalidate. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Acked-by: Darrick J. Wong <[email protected]> Cc: Al Viro <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Johannes Thumshirn <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Xiubo Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09  filemap: add a kiocb_write_and_wait helper  (Christoph Hellwig; 1 file, -12/+18)
Factor out a helper that does filemap_write_and_wait_range for the range covered by a read kiocb, or returns -EAGAIN if the kiocb is marked as nowait and there would be pages to write. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Acked-by: Darrick J. Wong <[email protected]> Cc: Al Viro <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Johannes Thumshirn <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Xiubo Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09  filemap: update ki_pos in generic_perform_write  (Christoph Hellwig; 1 file, -4/+4)
All callers of generic_perform_write need to update ki_pos; move it into the common code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Acked-by: Theodore Ts'o <[email protected]> Acked-by: Darrick J. Wong <[email protected]> Cc: Al Viro <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Johannes Thumshirn <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Trond Myklebust <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
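After this change the calling convention is, as a sketch:

    written = generic_perform_write(iocb, from);
    /* iocb->ki_pos is already advanced on success; callers no longer
     * need the old pattern:
     *     if (written > 0)
     *         iocb->ki_pos += written;
     */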
2023-06-09  backing_dev: remove current->backing_dev_info  (Christoph Hellwig; 1 file, -3/+0)
Patch series "cleanup the filemap / direct I/O interaction", v4. This series cleans up some of the generic write helper calling conventions and the page cache writeback / invalidation for direct I/O. This is a spinoff from the no-bufferhead kernel project, for which we'll want to an use iomap based buffered write path in the block layer. This patch (of 12): The last user of current->backing_dev_info disappeared in commit b9b1335e6403 ("remove bdi_congested() and wb_congested() and related functions"). Remove the field and all assignments to it. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Acked-by: Theodore Ts'o <[email protected]> Cc: Al Viro <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Xiubo Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09  mm: zswap: shrink until can accept  (Domenico Cerasuolo; 1 file, -3/+14)
This update addresses an issue with the zswap reclaim mechanism, which hinders the efficient offloading of cold pages to disk, thereby compromising the preservation of the LRU order and consequently diminishing, if not inverting, its performance benefits. The functioning of the zswap shrink worker was found to be inadequate, as shown by a basic benchmark test. For the test, a kernel build was utilized as a reference, with its memory confined to 1G via a cgroup and a 5G swap file provided. The results presented below are averages of three runs without the use of zswap:
real 46m26s
user 35m4s
sys 7m37s
With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G system), the results changed to:
real 56m4s
user 35m13s
sys 8m43s
written_back_pages: 18
reject_reclaim_fail: 0
pool_limit_hit: 1478
Besides the evident regression, one thing to notice from this data is the extremely low number of written_back_pages and pool_limit_hit. The pool_limit_hit counter, which is increased in zswap_frontswap_store when zswap is completely full, doesn't account for a particular scenario: once zswap hits its limit, zswap_pool_reached_full is set to true; with this flag on, zswap_frontswap_store rejects pages if zswap is still above the acceptance threshold. Once we include the rejections due to zswap_pool_reached_full && !zswap_can_accept(), the number goes from 1478 to a significant 21578266. Zswap is stuck in an undesirable state where it rejects pages because it's above the acceptance threshold, yet fails to attempt memory reclamation. This happens because the shrink work is only queued when zswap_frontswap_store detects that it's full and the work itself only reclaims one page per run. This state results in hot pages getting written directly to disk, while cold ones remain in memory, waiting only to be invalidated. The LRU order is completely broken and zswap ends up being just an overhead without providing any benefits. This commit applies 2 changes: a) the shrink worker is set to reclaim pages until the acceptance threshold is met and b) the task is also enqueued when zswap is not full but still above the threshold. Testing this suggested update showed much better numbers:
real 36m37s
user 35m8s
sys 9m32s
written_back_pages: 10459423
reject_reclaim_fail: 12896
pool_limit_hit: 75653
Link: https://lkml.kernel.org/r/[email protected] Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit") Signed-off-by: Domenico Cerasuolo <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Yosry Ahmed <[email protected]> Reviewed-by: Vitaly Wool <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Seth Jennings <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
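A sketch of the reworked worker loop implementing change (a); the reclaim call and retry policy are illustrative of the description above, not the exact diff:

    static void shrink_worker(struct work_struct *w)
    {
        struct zswap_pool *pool = container_of(w, struct zswap_pool,
                                               shrink_work);
        int failures = 0;

        do {    /* keep going until back under the acceptance threshold */
            if (zpool_shrink(pool->zpool, 1, NULL) &&
                ++failures == MAX_RECLAIM_RETRIES)  /* illustrative cap */
                break;
            cond_resched();
        } while (!zswap_can_accept());
        zswap_pool_put(pool);
    }

Change (b) is then a matter of also queueing this work from the store path whenever zswap_can_accept() is false, not only when the pool limit is hit.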
2023-06-09  mm/mm_init.c: move set_pageblock_order() to free_area_init()  (Haifeng Xu; 1 file, -1/+2)
pageblock_order only needs to be set once; there is no need to initialize it in every zone/node. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Haifeng Xu <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>