path: root/mm
Age | Commit message | Author | Files | Lines
2024-03-04hugetlb: pass struct vm_fault through to hugetlb_handle_userfault()Vishal Moola (Oracle)1-29/+9
Now that hugetlb_fault() has a struct vm_fault, have hugetlb_handle_userfault() use it instead of creating one of its own. This lets us reduce the number of arguments passed to hugetlb_handle_userfault() from 7 to 3, cleaning up the code and stack. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-03-04hugetlb: move vm_fault declaration to the top of hugetlb_fault()Vishal Moola (Oracle)1-13/+19
hugetlb_fault() currently defines a vm_fault to pass to the generic handle_userfault() function. We can move this definition to the top of hugetlb_fault() so that it can be used throughout the rest of the hugetlb fault path. This will help cleanup a number of excess variables and function arguments throughout the stack. Also, since vm_fault already has space to store the page offset, use that instead and get rid of idx. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
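A rough sketch of the idea (illustrative only, not the exact patch; the initializer fields are standard struct vm_fault members and the pgoff computation is an assumption):

  vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                           unsigned long address, unsigned int flags)
  {
          struct hstate *h = hstate_vma(vma);
          /* one vm_fault, declared up front, for the whole hugetlb fault path */
          struct vm_fault vmf = {
                  .vma = vma,
                  .address = address & huge_page_mask(h),
                  .real_address = address,
                  .flags = flags,
                  /* replaces the old local 'idx' */
                  .pgoff = vma_hugecache_offset(h, vma, address & huge_page_mask(h)),
          };

          /* ... the rest of the fault path can now pass &vmf around ... */
          return 0;
  }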
2024-03-04mm/memory: change vmf_anon_prepare() to be non-staticVishal Moola (Oracle)2-1/+2
Patch series "Handle hugetlb faults under the VMA lock", v2. It is generally safe to handle hugetlb faults under the VMA lock. The only time this is unsafe is when no anon_vma has been allocated to this vma yet, so we can use vmf_anon_prepare() instead of anon_vma_prepare() to bailout if necessary. This should only happen for the first hugetlb page in the vma. Additionally, this patchset begins to use struct vm_fault within hugetlb_fault(). This works towards cleaning up hugetlb code, and should significantly reduce the number of arguments passed to functions. The last patch in this series may cause ltp hugemmap10 to "fail". This is because vmf_anon_prepare() may bailout with no anon_vma under the VMA lock after allocating a folio for the hugepage. In free_huge_folio(), this folio is completely freed on bailout iff there is a surplus of hugetlb pages. This will remove a folio off the freelist and decrement the number of hugepages while ltp expects these counters to remain unchanged on failure. The rest of the ltp testcases pass. This patch (of 2): In order to handle hugetlb faults under the VMA lock, hugetlb can use vmf_anon_prepare() to ensure we can safely prepare an anon_vma. Change it to be a non-static function so it can be used within hugetlb as well. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-03-04mm/page_alloc: make check_new_page() return boolHao Ge1-3/+3
Make check_new_page() return bool, like check_new_pages(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hao Ge <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
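Illustratively, the converted helper ends up shaped like this (a from-memory sketch, not the verbatim hunk):

  static bool check_new_page(struct page *page)
  {
          if (likely(page_expected_state(page,
                          PAGE_FLAGS_CHECK_AT_PREP | __PG_HWPOISON)))
                  return false;

          check_new_page_bad(page);
          return true;
  }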
2024-03-04mm/util.c: add byte count to __vm_enough_memory failure warningMatthew Cassell1-2/+4
Commit 44b414c8715c5dcf53288 ("mm/util.c: add warning if __vm_enough_memory fails") adds debug information which gives the process id and executable name should __vm_enough_memory() fail. Adding the number of pages to the failure message would benefit application developers and system administrators in debugging overambitious memory requests by providing a point of reference to the amount of memory causing __vm_enough_memory() to fail.

1. Set appropriate kernel tunable to reach code path for failure message:

   # echo 2 > /proc/sys/vm/overcommit_memory

2. Test program to generate failure - requests 1 gibibyte per iteration:

   #include <stdlib.h>
   #include <stdio.h>

   int main(int argc, char **argv)
   {
           for (;;) {
                   if (malloc(1 << 30) == NULL)
                           break;
                   printf("allocated 1 GiB\n");
           }
           return 0;
   }

3. Output:

   Before:
   __vm_enough_memory: pid: 1218, comm: a.out, not enough memory for the allocation

   After:
   __vm_enough_memory: pid: 1137, comm: a.out, bytes: 1073741824, not enough memory for the allocation

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Cassell <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
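A hedged sketch of where the byte count plugs into the warning in __vm_enough_memory() (the variable names and the ratelimiting are assumptions; the message format follows the "After:" line above):

  pr_warn_ratelimited("%s: pid: %d, comm: %s, bytes: %lu, not enough memory for the allocation\n",
                      __func__, current->pid, current->comm,
                      (unsigned long)pages << PAGE_SHIFT);
  return -ENOMEM;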
2024-03-04mm/zswap: change zswap_pool kref to percpu_refChengming Zhou1-15/+33
All zswap entries take a reference on the zswap_pool in zswap_store(), and drop it when freed. Changing this kref to a percpu_ref is better for scalability. Although percpu_ref uses a bit more memory, that should be OK for our use case, since we almost always have only one zswap_pool in use. The performance gain is in the zswap_store/load hotpath. Testing kernel build (32 threads) in tmpfs with memory.max=2GB (zswap shrinker and writeback enabled with one 50GB swapfile, on a 128-CPU x86-64 machine; below is the average of 5 runs):

              mm-unstable    zswap-global-lru
   real       63.20          63.12
   user       1061.75        1062.95
   sys        268.74         264.44

[[email protected]: fix zswap_pools_lock usages after changing to percpu_ref] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Nhat Pham <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Yosry Ahmed <[email protected]> Cc: Chengming Zhou <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
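A minimal sketch of the percpu_ref lifecycle this switches to (illustrative; the release callback name and the surrounding locking are assumptions, not the exact zswap code):

  #include <linux/percpu-refcount.h>

  static void __zswap_pool_empty(struct percpu_ref *ref)
  {
          struct zswap_pool *pool = container_of(ref, struct zswap_pool, ref);

          /* last reference dropped: schedule the pool for destruction */
  }

  /* pool creation */
  percpu_ref_init(&pool->ref, __zswap_pool_empty, PERCPU_REF_ALLOW_REINIT, GFP_KERNEL);

  /* per-entry get/put on the store/load hotpath, now percpu and cheap */
  percpu_ref_get(&pool->ref);
  percpu_ref_put(&pool->ref);

  /* pool retirement: switch to atomic mode and drop the initial reference */
  percpu_ref_kill(&pool->ref);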
2024-03-04mm/zswap: global lru and shrinker shared by all zswap_poolsChengming Zhou1-105/+66
Patch series "mm/zswap: optimize for dynamic zswap_pools", v3. Dynamic pool creation has been supported for a long time, which maybe not used so much in practice. But with the per-memcg lru merged, the current structure of zswap_pool's lru and shrinker become less optimal. In the current structure, each zswap_pool has its own lru, shrinker and shrink_work, but only the latest zswap_pool will be the current used. 1. When memory has pressure, all shrinkers of zswap_pools will try to shrink its lru list, there is no order between them. 2. When zswap limit hit, only the last zswap_pool's shrink_work will try to shrink its own lru, which is inefficient. A more natural way is to have a global zswap lru shared between all zswap_pools, and so is the shrinker. The code becomes much simpler too. Another optimization is changing zswap_pool kref to percpu_ref, which will be taken reference by every zswap entry. So the scalability is better. Testing kernel build (32 threads) in tmpfs with memory.max=2GB. (zswap shrinker and writeback enabled with one 50GB swapfile, on a 128 CPUs x86-64 machine, below is the average of 5 runs) mm-unstable zswap-global-lru real 63.20 63.12 user 1061.75 1062.95 sys 268.74 264.44 This patch (of 3): Dynamic zswap_pool creation may create/reuse to have multiple zswap_pools in a list, only the first will be current used. Each zswap_pool has its own lru and shrinker, which is not necessary and has its problem: 1. When memory has pressure, all shrinker of zswap_pools will try to shrink its own lru, there is no order between them. 2. When zswap limit hit, only the last zswap_pool's shrink_work will try to shrink its lru list. The rationale here was to try and empty the old pool first so that we can completely drop it. However, since we only support exclusive loads now, the LRU ordering should be entirely decided by the order of stores, so the oldest entries on the LRU will naturally be from the oldest pool. Anyway, having a global lru and shrinker shared by all zswap_pools is better and efficient. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Acked-by: Yosry Ahmed <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Nhat Pham <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-03-04mm, mmap: fix vma_merge() case 7 with vma_ops->closeVlastimil Babka1-1/+9
When debugging issues with a workload using SysV shmem, Michal Hocko has come up with a reproducer that shows how a series of mprotect() operations can result in an elevated shm_nattch and thus a leak of the resource. The problem is caused by wrong assumptions in vma_merge() from commit 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in mergeability test"). The shmem vmas have a vma_ops->close callback that decrements shm_nattch, and we remove the vma without calling it. vma_merge() has thus historically avoided merging vmas with vma_ops->close, and commit 714965ca8252 was supposed to keep it that way. It relaxed the checks for vma_ops->close in can_vma_merge_after(), assuming that it is never called on a vma that would be a candidate for removal. However, the vma_merge() code also uses the result of this check in the decision to remove a different vma in merge case 7. A robust solution would be to refactor the vma_merge() code so that the vma_ops->close check is only done for vmas that are actually going to be removed, and not as part of the preliminary checks. That would both solve the existing bug and allow additional merges that the checks currently prevent unnecessarily in some cases. However, to fix the existing bug first with minimized risk, and for easier stable backports, this patch only adds a vma_ops->close check to the buggy case 7 specifically. All other cases of vma removal are covered by the can_vma_merge_before() check that includes the test for vma_ops->close. The reproducer code, adapted from Michal Hocko's code:

  int main(int argc, char *argv[])
  {
          int segment_id;
          size_t segment_size = 20 * PAGE_SIZE;
          char *sh_mem;
          struct shmid_ds shmid_ds;
          key_t key = 0x1234;

          segment_id = shmget(key, segment_size,
                              IPC_CREAT | IPC_EXCL | S_IRUSR | S_IWUSR);
          sh_mem = (char *)shmat(segment_id, NULL, 0);

          mprotect(sh_mem + 2 * PAGE_SIZE, PAGE_SIZE, PROT_NONE);
          mprotect(sh_mem + PAGE_SIZE, PAGE_SIZE, PROT_WRITE);
          mprotect(sh_mem + 2 * PAGE_SIZE, PAGE_SIZE, PROT_WRITE);

          shmdt(sh_mem);

          shmctl(segment_id, IPC_STAT, &shmid_ds);
          printf("nattch after shmdt(): %lu (expected: 0)\n",
                 shmid_ds.shm_nattch);

          if (shmctl(segment_id, IPC_RMID, 0))
                  printf("IPCRM failed %d\n", errno);
          return (shmid_ds.shm_nattch) ? 1 : 0;
  }

Link: https://lkml.kernel.org/r/[email protected] Fixes: 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in mergeability test") Signed-off-by: Vlastimil Babka <[email protected]> Reported-by: Michal Hocko <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
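The gist of the minimal fix, as a hedged sketch (identifier names here are illustrative, not the exact vma_merge() hunk): before removing the middle vma in case 7, bail out if that vma has a ->close callback.

  /* case 7: prev expands over curr's start and curr would be removed */
  if (curr->vm_ops && curr->vm_ops->close)
          return NULL;    /* refuse the merge; removing curr would skip ->close() */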
2024-03-04mm: userfaultfd: fix unexpected change to src_folio when UFFDIO_MOVE failsQi Zheng1-3/+3
After ptep_clear_flush(), if we find that src_folio is pinned we will fail UFFDIO_MOVE and put src_folio back to src_pte entry, but the change to src_folio->{mapping,index} is not restored in this process. This is not what we expected, so fix it. This can cause the rmap for that page to be invalid, possibly resulting in memory corruption. At least swapout+migration would no longer work, because we might fail to locate the mappings of that folio. Link: https://lkml.kernel.org/r/[email protected] Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Qi Zheng <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-03-04mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL ↵Vlastimil Babka3-11/+11
allocations Sven reports an infinite loop in __alloc_pages_slowpath() for costly order __GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO. Such a combination can happen in a suspend/resume context where a GFP_KERNEL allocation can have __GFP_IO masked out via gfp_allowed_mask. Quoting Sven:

1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER) with __GFP_RETRY_MAYFAIL set.

2. page alloc's __alloc_pages_slowpath tries to get a page from the freelist. This fails because there is nothing free of that costly order.

3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim, which bails out because a zone is ready to be compacted; it pretends to have made a single page of progress.

4. page alloc tries to compact, but this always bails out early because __GFP_IO is not set (it's not passed by the snd allocator, and even if it were, we are suspending so the __GFP_IO flag would be cleared anyway).

5. page alloc believes reclaim progress was made (because of the pretense in item 3) and so it checks whether it should retry compaction. The compaction retry logic thinks it should try again, because: a) reclaim is needed because of the early bail-out in item 4, b) a zonelist is suitable for compaction.

6. goto 2. indefinite stall.

(end quote)

The immediate root cause is that the COMPACT_SKIPPED returned from __alloc_pages_direct_compact() (step 4) due to the lack of __GFP_IO is mistaken for an indication of a lack of order-0 pages, and step 5 evaluates that in should_compact_retry() as a reason to retry, before incrementing and limiting the number of retries. There are however other places that wrongly assume that compaction can happen while we lack __GFP_IO. To fix this, introduce gfp_compaction_allowed() to abstract the __GFP_IO evaluation and switch the open-coded test in try_to_compact_pages() to use it. Also use the new helper in:

- compaction_ready(), which will make reclaim not bail out in step 3, so there's at least one attempt to actually reclaim, even if chances are small for a costly order

- in_reclaim_compaction(), which will make should_continue_reclaim() return false so we don't over-reclaim unnecessarily

- __alloc_pages_slowpath(), to set a local variable can_compact, which is then used to avoid retrying reclaim/compaction for costly allocations (step 5) if we can't compact, and also to skip the early compaction attempt that we do in some cases

Link: https://lkml.kernel.org/r/[email protected] Fixes: 3250845d0526 ("Revert "mm, oom: prevent premature OOM killer invocation for high order request"") Signed-off-by: Vlastimil Babka <[email protected]> Reported-by: Sven van Ashbrook <[email protected]> Closes: https://lore.kernel.org/all/CAG-rBihs_xMKb3wrMO1%2B-%2Bp4fowP9oy1pa_OTkfxBzPUVOZF%[email protected]/ Tested-by: Karthikeyan Ramasubramanian <[email protected]> Cc: Brian Geffon <[email protected]> Cc: Curtis Malainey <[email protected]> Cc: Jaroslav Kysela <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Takashi Iwai <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
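For reference, the new helper is essentially a one-line predicate; this is a from-memory sketch rather than the exact include/linux/gfp.h hunk:

  /*
   * Compaction relies on page migration, which may require I/O, so a
   * GFP_NOIO context must not assume compaction can make progress.
   */
  static inline bool gfp_compaction_allowed(gfp_t gfp_mask)
  {
          return IS_ENABLED(CONFIG_COMPACTION) && (gfp_mask & __GFP_IO);
  }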
2024-03-04mm, slab: remove memcg_from_slab_obj()Vlastimil Babka1-5/+0
This empty wrapper exists only for !CONFIG_MEMCG_KMEM and it seems it was never used. Probably a leftover from development of a series. Reviewed-by: Chengming Zhou <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]>
2024-03-01mm, slab: remove the corner case of inc_slabs_node()Chengming Zhou1-11/+2
We already call inc_slabs_node() after kmem_cache_node->node[node] is initialized in early_kmem_cache_node_alloc(), so this special case in inc_slabs_node() can be removed. Then we don't need to consider the existence of kmem_cache_node in inc_slabs_node() anymore. Signed-off-by: Chengming Zhou <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]>
2024-03-01mm/slab: Fix a kmemleak in kmem_cache_destroy()Xiaolei Wang2-6/+8
For kmem caches created early, before slab_sysfs_init() has been called, kmem_cache_destroy() cannot utilize kobj_type::release to release the kmem_cache structure. Therefore, tweak kmem_cache_release() to use slab_kmem_cache_release() for releasing the kmem_cache when slab_state isn't FULL. This fixes memory leaks like the following:

  unreferenced object 0xffff0000c2d87080 (size 128):
    comm "swapper/0", pid 1, jiffies 4294893428
    hex dump (first 32 bytes):
      00 00 00 00 ad 4e ad de ff ff ff ff 6b 6b 6b 6b  .....N......kkkk
      ff ff ff ff ff ff ff ff b8 ab 48 89 00 80 ff ff  ..........H.....
    backtrace (crc 8819d0f6):
      [<ffff80008317a298>] kmemleak_alloc+0xb0/0xc4
      [<ffff8000807e553c>] kmem_cache_alloc_node+0x288/0x3a8
      [<ffff8000807e95f0>] __kmem_cache_create+0x1e4/0x64c
      [<ffff8000807216bc>] kmem_cache_create_usercopy+0x1c4/0x2cc
      [<ffff8000807217e0>] kmem_cache_create+0x1c/0x28
      [<ffff8000819f6278>] arm_v7s_alloc_pgtable+0x1c0/0x6d4
      [<ffff8000819f53a0>] alloc_io_pgtable_ops+0xe8/0x2d0
      [<ffff800084b2d2c4>] arm_v7s_do_selftests+0xe0/0x73c
      [<ffff800080016b68>] do_one_initcall+0x11c/0x7ac
      [<ffff800084a71ddc>] kernel_init_freeable+0x53c/0xbb8
      [<ffff8000831728d8>] kernel_init+0x24/0x144
      [<ffff800080018e98>] ret_from_fork+0x10/0x20

Signed-off-by: Xiaolei Wang <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]>
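The shape of the fix is roughly the following (a sketch assuming the existing sysfs_slab_release()/slab_kmem_cache_release() helpers; not the verbatim mm/slab_common.c code):

  static void kmem_cache_release(struct kmem_cache *s)
  {
          if (slab_state >= FULL) {
                  /* sysfs is up: kobj_type::release frees the cache */
                  sysfs_slab_unlink(s);
                  sysfs_slab_release(s);
          } else {
                  /* too early for sysfs: free the kmem_cache directly */
                  slab_kmem_cache_release(s);
          }
  }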
2024-02-29mm/shmem.c: Use new form of *@param in kernel-docAkira Yokosawa1-2/+2
Use the form of *@param which kernel-doc recognizes now. This resolves the warnings from "make htmldocs" as reported in [1]. Reported-by: Stephen Rothwell <[email protected]> Link: [1] https://lore.kernel.org/r/[email protected]/ Acked-by: Christoph Hellwig <[email protected]> Signed-off-by: Akira Yokosawa <[email protected]> Signed-off-by: Chandan Babu R <[email protected]>
2024-02-27Merge tag 'mm-hotfixes-stable-2024-02-27-14-52' of ↵Linus Torvalds7-102/+56
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "Six hotfixes. Three are cc:stable and the remainder address post-6.7 issues or aren't considered appropriate for backporting"

* tag 'mm-hotfixes-stable-2024-02-27-14-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/debug_vm_pgtable: fix BUG_ON with pud advanced test
  mm: cachestat: fix folio read-after-free in cache walk
  MAINTAINERS: add memory mapping entry with reviewers
  mm/vmscan: fix a bug calling wakeup_kswapd() with a wrong zone index
  kasan: revert eviction of stack traces in generic mode
  stackdepot: use variable size records for non-evictable entries
2024-02-26mm, slab, kasan: replace kasan_never_merge() with SLAB_NO_MERGEVlastimil Babka2-17/+7
The SLAB_KASAN flag prevents merging of caches in some configurations, which is handled in a rather complicated way via kasan_never_merge(). Since we now have a generic SLAB_NO_MERGE flag, we can instead use it for KASAN caches in addition to SLAB_KASAN in those configurations, and simplify the SLAB_NEVER_MERGE handling. Tested-by: Xiongwei Song <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Reviewed-by: Andrey Konovalov <[email protected]> Tested-by: David Rientjes <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]>
2024-02-26mm, slab: use an enum to define SLAB_ cache creation flagsVlastimil Babka1-3/+3
The values of SLAB_ cache creation flags are defined by hand, which is tedious and error-prone. Use an enum to assign the bit number and a __SLAB_FLAG_BIT() macro to #define the final flags. This renumbers the flag values, which is OK as they are only used internally. Also define a __SLAB_FLAG_UNUSED macro to assign value to flags disabled by their respective config options in a unified and sparse-friendly way. Reviewed-and-tested-by: Xiongwei Song <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]>
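A condensed sketch of the scheme (only a few representative flags are shown; which flag sits behind which config option here is illustrative):

  /* bit numbers are assigned by the enum, so they can never collide */
  enum _slab_flag_bits {
          _SLAB_CONSISTENCY_CHECKS,
          _SLAB_RED_ZONE,
          _SLAB_POISON,
          _SLAB_ACCOUNT,
          _SLAB_NO_MERGE,
          _SLAB_FLAGS_LAST_BIT
  };

  #define __SLAB_FLAG_BIT(nr)     ((slab_flags_t __force)(1U << (nr)))
  #define __SLAB_FLAG_UNUSED      ((slab_flags_t __force)0U)

  #define SLAB_CONSISTENCY_CHECKS __SLAB_FLAG_BIT(_SLAB_CONSISTENCY_CHECKS)
  #define SLAB_RED_ZONE           __SLAB_FLAG_BIT(_SLAB_RED_ZONE)
  /* flags whose config option is off resolve to an unused (zero) value */
  #ifdef CONFIG_MEMCG_KMEM
  #define SLAB_ACCOUNT            __SLAB_FLAG_BIT(_SLAB_ACCOUNT)
  #else
  #define SLAB_ACCOUNT            __SLAB_FLAG_UNUSED
  #endif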
2024-02-26mm, slab: deprecate SLAB_MEM_SPREAD flagVlastimil Babka1-1/+0
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed. SLUB instead relies on the page allocator's NUMA policies. Change the flag's value to 0 to free up the value it had, and mark it for full removal once all users are gone. Reported-by: Steven Rostedt <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Reviewed-and-tested-by: Xiongwei Song <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]>
2024-02-25swap: port block device usage to fileChristian Brauner1-11/+11
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-02-24Merge tag 'cxl-fixes-6.8-rc6' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull cxl fixes from Dan Williams:
 "A collection of significant fixes for the CXL subsystem.

  The largest change in this set, that bordered on "new development", is the fix for the fact that the location of the new qos_class attribute did not match the Documentation. The fix ends up deleting more code than it added, and it has a new unit test to backstop basic errors in this interface going forward. So the "red-diff" and unit test saved the "rip it out and try again" response.

  In contrast, the new notification path for firmware reported CXL errors (CXL CPER notifications) has a locking context bug that can not be fixed with a red-diff. Given where the release cycle stands, it is not comfortable to squeeze in that fix in these waning days. So, that receives the "back it out and try again later" treatment.

  There is a regression fix in the code that establishes memory NUMA nodes for platform CXL regions. That has an ack from x86 folks. There are a couple more fixups for Linux to understand (reassemble) CXL regions instantiated by platform firmware. The policy around platforms that do not match host-physical-address with system-physical-address (i.e. systems that have an address translation mechanism between the address range reported in the ACPI CEDT.CFMWS and endpoint decoders) has been softened to abort driver load rather than teardown the memory range (can cause system hangs). Lastly, there is a robustness / regression fix for cases where the driver would previously continue in the face of error, and a fixup for PCI error notification handling.

  Summary:
   - Fix NUMA initialization from ACPI CEDT.CFMWS
   - Fix region assembly failures due to async init order
   - Fix / simplify export of qos_class information
   - Fix cxl_acpi initialization vs single-window-init failures
   - Fix handling of repeated 'pci_channel_io_frozen' notifications
   - Workaround platforms that violate host-physical-address == system-physical address assumptions
   - Defer CXL CPER notification handling to v6.9"

* tag 'cxl-fixes-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl/acpi: Fix load failures due to single window creation failure
  acpi/ghes: Remove CXL CPER notifications
  cxl/pci: Fix disabling memory if DVSEC CXL Range does not match a CFMWS window
  cxl/test: Add support for qos_class checking
  cxl: Fix sysfs export of qos_class for memdev
  cxl: Remove unnecessary type cast in cxl_qos_class_verify()
  cxl: Change 'struct cxl_memdev_state' *_perf_list to single 'struct cxl_dpa_perf'
  cxl/region: Allow out of order assembly of autodiscovered regions
  cxl/region: Handle endpoint decoders in cxl_region_find_decoder()
  x86/numa: Fix the sort compare func used in numa_fill_memblks()
  x86/numa: Fix the address overlap check in numa_fill_memblks()
  cxl/pci: Skip to handle RAS errors if CXL.mem device is detached
2024-02-23writeback: remove a use of write_cache_pages() from do_writepages()Matthew Wilcox (Oracle)1-13/+18
Use the new writeback_iter() directly instead of indirecting through a callback. [[email protected]: ported to the while based iter style] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Dave Chinner <[email protected]> Cc: David Howells <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
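A hedged sketch of what the ->writepage path in do_writepages() can look like once it iterates directly (the function name and the error handling are assumptions about the final shape, not a quote of the patch):

  static int writeback_use_writepage(struct address_space *mapping,
                                     struct writeback_control *wbc)
  {
          struct folio *folio = NULL;
          struct blk_plug plug;
          int err = 0;

          blk_start_plug(&plug);
          while ((folio = writeback_iter(mapping, wbc, folio, &err))) {
                  err = mapping->a_ops->writepage(&folio->page, wbc);
                  if (err == AOP_WRITEPAGE_ACTIVATE) {
                          folio_unlock(folio);
                          err = 0;
                  }
                  mapping_set_error(mapping, err);
          }
          blk_finish_plug(&plug);

          return err;
  }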
2024-02-23writeback: add a writeback iteratorChristoph Hellwig1-78/+114
Refactor the code left in write_cache_pages into an iterator that the file system can call to get the next folio for a writeback operation:

  struct folio *folio = NULL;

  while ((folio = writeback_iter(mapping, wbc, folio, &error))) {
          error = <do per-folio writeback>;
  }

The twist here is that the error value is passed by reference, so that the iterator can restore it when breaking out of the loop. Handling of the magic AOP_WRITEPAGE_ACTIVATE value stays outside the iterator and is just kept in the write_cache_pages legacy wrapper, in preparation for eventually killing it off. Heavily based on a for_each* based iterator from Matthew Wilcox. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Dave Chinner <[email protected]> Cc: David Howells <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23writeback: move the folio_prepare_writeback loop out of write_cache_pages()Matthew Wilcox (Oracle)1-8/+10
Move the loop for should-we-write-this-folio to writeback_get_folio. [[email protected]: fold loop into existing helper instead of a separate one per Jan] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Dave Chinner <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Howells <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23writeback: use the folio_batch queue iteratorMatthew Wilcox (Oracle)1-13/+15
Instead of keeping our own local iterator variable, use the one just added to folio_batch. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Dave Chinner <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Howells <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23writeback: simplify the loops in write_cache_pages()Matthew Wilcox (Oracle)1-39/+36
Collapse the two nested loops into one. This is needed as a step towards turning this into an iterator. Note that this drops the "index <= end" check in the previous outer loop and just relies on filemap_get_folios_tag() to return 0 entries when index > end. This actually has a subtle implication when end == -1, because then the returned index will be -1 as well, and thus if there is a page present on index -1, we could be looping indefinitely. But as the comment in filemap_get_folios_tag() documents this as already broken anyway, we should not worry about it here either. The fix for that would probably be a change to the filemap_get_folios_tag() calling convention. [[email protected]: update the commit log per Jan] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Dave Chinner <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Howells <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23writeback: factor writeback_get_batch() out of write_cache_pages()Matthew Wilcox (Oracle)1-22/+38
This simple helper will be the basis of the writeback iterator. To make this work, we need to remember the current index and end positions in writeback_control. [[email protected]: heavily rebased, add helpers to get the tag and end index, don't keep the end index in struct writeback_control] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Dave Chinner <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Howells <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23writeback: factor folio_prepare_writeback() out of write_cache_pages()Matthew Wilcox (Oracle)1-27/+34
Reduce write_cache_pages() by about 30 lines; much of it is commentary, but it all bundles nicely into an obvious function. [[email protected]: rename should_writeback_folio to folio_prepare_writeback per Jan] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Dave Chinner <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Howells <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23writeback: rework the loop termination condition in write_cache_pagesChristoph Hellwig1-51/+33
Rework the way we deal with the cleanup after the writepage call. First handle the magic AOP_WRITEPAGE_ACTIVATE separately from real error returns to get it out of the way of the actual error handling path. Then split the handling into integrity vs non-integrity branches first, and return early using a goto for the non-integrity early loop condition, to remove the need for the done and done_index local variables, and for assigning the error to ret when we can just return the error directly. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Dave Chinner <[email protected]> Cc: David Howells <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23writeback: only update ->writeback_index for range_cyclic writebackChristoph Hellwig1-10/+14
mapping->writeback_index is only [1] used as the starting point for range_cyclic writeback, so there is no point in updating it for other types of writeback. [1] except for btrfs_defrag_file which does really odd things with mapping->writeback_index. But btrfs doesn't use write_cache_pages at all, so this isn't relevant here. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Dave Chinner <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Howells <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23writeback: also update wbc->nr_to_write on writeback failureChristoph Hellwig1-1/+1
When exiting write_cache_pages early due to a non-integrity write failure, wbc->nr_to_write currently doesn't account for the folio we just failed to write. This doesn't matter because the callers always ignore the value on a failure, but moving the update to common code will allow us to simplify the code, so do it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Dave Chinner <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Howells <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23writeback: fix done_index when hitting the wbc->nr_to_writeChristoph Hellwig1-0/+1
When write_cache_pages finishes writing out a folio, it fails to update done_index to account for the number of pages in the folio just written. That means when range_cyclic writeback is restarted, it will be restarted at this folio instead of after it as it should. Fix that by updating done_index before breaking out of the loop. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Dave Chinner <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Howells <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23writeback: don't call mapping_set_error on AOP_WRITEPAGE_ACTIVATEChristoph Hellwig1-1/+3
Patch series "convert write_cache_pages() to an iterator", v8. This is an evolution of the series Matthew Wilcox originally sent in June 2023, which has changed quite a bit since and now has a while based iterator. This patch (of 14): mapping_set_error should only be called on 0 returns (which it ignores) or a negative error code. writepage_cb ends up being able to call writepage_cb on the magic AOP_WRITEPAGE_ACTIVATE return value from ->writepage which means success but the caller needs to unlock the page. Ignore that and just call mapping_set_error on negative errors. (no fixes tag as this goes back more than 20 years over various renames and refactors so I've given up chasing down the original introduction) Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Brian Foster <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Howells <[email protected]> Cc: Dave Chinner <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23mm/page_alloc: make bad_range() return boolHao Ge1-6/+6
bad_range() can return bool, so let us change it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hao Ge <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23madvise:madvise_cold_or_pageout_pte_range(): allow split while ↵Barry Song1-1/+1
folio_estimated_sharers = 0 The purpose is to stop splitting large folios whose mapcount is 2 or above. Folios with estimated_shares = 0 should still be perfectly good, indeed even better, candidates than those with estimated_shares = 1. Consider a pte-mapped large folio with 16 subpages: if we unmap subpages 1-15, the current code will split the folio and reclaim it when madvise reaches this folio; but if we unmap subpage 0, we will keep this folio and break. This is weird. For pmd-mapped large folios, we can still use "= 1" as the condition, as we have the entire map for it anyway. So this patch doesn't change the condition for pmd-mapped large folios. This also explains why we had been using "= 1" for both pmd-mapped and pte-mapped large folios before commit 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios"): in the past we used the mapcount of the specific subpage, and since the subpage had a pte present, its mapcount wouldn't be 0. The problem can be quite easily reproduced by writing a small program that unmaps the first subpage of a pte-mapped large folio vs. unmapping any subpage other than the first one. Link: https://lkml.kernel.org/r/[email protected] Fixes: 2f406263e3e9 ("madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check") Signed-off-by: Barry Song <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Vishal Moola (Oracle) <[email protected]> Cc: Yin Fengwei <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
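Illustratively, the pte-mapped condition changes from "must have exactly one estimated sharer" to "must not look genuinely shared" (a sketch of the check, not the full madvise_cold_or_pageout_pte_range() hunk):

  /* was: if (folio_estimated_sharers(folio) != 1) break;
   * now estimated sharers == 0 is also allowed to proceed to the split */
  if (folio_estimated_sharers(folio) > 1)
          break;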
2024-02-23mm/swapfile:__swap_duplicate: drop redundant WRITE_ONCE on swap_map for err ↵Barry Song1-1/+2
cases The code is quite hard to read: we are still writing swap_map after errors happen, even though the written value is the same as before.

  has_cache = count & SWAP_HAS_CACHE;
  count &= ~SWAP_HAS_CACHE;
  [snipped]
  WRITE_ONCE(p->swap_map[offset], count | has_cache);

It would be better to entirely drop the WRITE_ONCE, for both performance and readability. [[email protected]: avoid using goto] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Barry Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23shmem: properly report quota mount optionsJan Kara1-0/+18
Report quota options among the set of mount options. This allows proper user visibility into whether quotas are enabled or not. Link: https://lkml.kernel.org/r/[email protected] Fixes: e09764cff44b ("shmem: quota support") Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Carlos Maiolino <[email protected]> Acked-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
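A sketch of the kind of reporting added to shmem_show_options() (the exact checks and option spellings here are assumptions, not a quote of the patch):

  #ifdef CONFIG_TMPFS_QUOTA
          /* report quota mount options so userspace can see quotas are enabled */
          if (sb_has_quota_active(root->d_sb, USRQUOTA))
                  seq_puts(seq, ",usrquota");
          if (sb_has_quota_active(root->d_sb, GRPQUOTA))
                  seq_puts(seq, ",grpquota");
  #endif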
2024-02-23mm/compaction: optimize >0 order folio compaction with free page split.Zi Yan1-5/+30
During migration in a memory compaction, free pages are placed in an array of page lists based on their order. But the desired free page order (i.e., the order of a source page) might not always be present, thus leading to migration failures and premature compaction termination. Split a higher-order free page when the source migration page has a lower order, to increase the migration success rate. Note: merging free pages when a migration fails and a lower order free page is returned via compaction_free() is possible, but it would be too much work. Since the free pages are not buddy pages, it is hard to identify these free pages using the existing PFN-based page merging algorithm. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zi Yan <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Tested-by: Baolin Wang <[email protected]> Tested-by: Yu Zhao <[email protected]> Cc: Adam Manzanares <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Huang Ying <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kemeng Shi <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Vishal Moola (Oracle) <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yin Fengwei <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23mm/compaction: add support for >0 order folio memory compaction.Zi Yan3-63/+83
Before the last commit, memory compaction only migrates order-0 folios and skips >0 order folios. The last commit splits all >0 order folios during compaction. This commit migrates >0 order folios during compaction by keeping isolated free pages at their original size, without splitting them into order-0 pages, and using them directly during the migration process. What is different from the prior implementation:

1. All isolated free pages are kept in a NR_PAGE_ORDERS array of page lists, where each page list stores free pages in the same order.

2. All free pages are not post_alloc_hook() processed nor buddy pages, although their orders are stored in the first page's private like buddy pages.

3. During migration, at new page allocation time (i.e., in compaction_alloc()), free pages are then processed by post_alloc_hook(). When migration fails and a new page is returned (i.e., in compaction_free()), free pages are restored by reversing the post_alloc_hook() operations using the newly added free_pages_prepare_fpi_none().

Step 3 is done for a later optimization, so that splitting and/or merging free pages during compaction becomes easier. Note: without splitting free pages, compaction can end prematurely because migration will return -ENOMEM even if there are free pages. This happens when no order-0 free page exists and compaction_alloc() returns NULL. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zi Yan <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Tested-by: Baolin Wang <[email protected]> Tested-by: Yu Zhao <[email protected]> Cc: Adam Manzanares <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Huang Ying <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kemeng Shi <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Vishal Moola (Oracle) <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yin Fengwei <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23mm/compaction: enable compacting >0 order folios.Zi Yan1-25/+76
migrate_pages() supports >0 order folio migration and during compaction, even if compaction_alloc() cannot provide >0 order free pages, migrate_pages() can split the source page and try to migrate the base pages from the split. It can be a baseline and start point for adding support for compacting >0 order folios. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zi Yan <[email protected]> Suggested-by: Huang Ying <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Tested-by: Baolin Wang <[email protected]> Tested-by: Yu Zhao <[email protected]> Cc: Adam Manzanares <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kemeng Shi <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Vishal Moola (Oracle) <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yin Fengwei <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23mm/page_alloc: remove unused fpi_flags in free_pages_prepare()Zi Yan1-5/+5
Patch series "Enable >0 order folio memory compaction", v7. This patchset enables >0 order folio memory compaction, which is one of the prerequisitions for large folio support[1]. I am aware of that split free pages is necessary for folio migration in compaction, since if >0 order free pages are never split and no order-0 free page is scanned, compaction will end prematurely due to migration returns -ENOMEM. Free page split becomes a must instead of an optimization. lkp ncompare results (on a 8-CPU (Intel Xeon E5-2650 v4 @2.20GHz) 16G VM) for default LRU (-no-mglru) and CONFIG_LRU_GEN are shown at the bottom, copied from V3[4]. In sum, most of vm-scalability applications do not see performance change, and the others see ~4% to ~26% performance boost under default LRU and ~2% to ~6% performance boost under CONFIG_LRU_GEN. Overview === To support >0 order folio compaction, the patchset changes how free pages used for migration are kept during compaction. Free pages used to be split into order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared, page order stored in page->private is zeroed, and page reference is set to 1). Now all free pages are kept in a NR_PAGE_ORDER array of page lists based on their order without post allocation process. When migrate_pages() asks for a new page, one of the free pages, based on the requested page order, is then processed and given out. And THP <2MB would need this feature. [1] https://lore.kernel.org/linux-mm/[email protected]/ [2] https://lore.kernel.org/linux-mm/[email protected]/ [3] https://lore.kernel.org/linux-mm/[email protected]/ [4] https://lore.kernel.org/linux-mm/[email protected]/ [5] https://lore.kernel.org/linux-mm/[email protected]/ [6] https://lore.kernel.org/linux-mm/[email protected]/ [7] https://lore.kernel.org/linux-mm/[email protected]/ This patch (of 4): Commit 0a54864f8dfb ("kasan: remove PG_skip_kasan_poison flag") removes the use of fpi_flags in should_skip_kasan_poison() and fpi_flags is only passed to should_skip_kasan_poison() in free_pages_prepare(). Remove the unused parameter. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zi Yan <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Adam Manzanares <[email protected]> Cc: Baolin Wang <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kemeng Shi <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Vishal Moola (Oracle) <[email protected]> Cc: Yin Fengwei <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23mm/zsmalloc: remove get_zspage_mapping()Chengming Zhou1-24/+4
Actually we seldom use the class_idx returned from get_zspage_mapping(); only zspage->fullness is useful, so just use zspage->fullness directly and remove this helper. Note zspage->fullness is not stable outside pool->lock, so remove the redundant "VM_BUG_ON(fullness != ZS_INUSE_RATIO_0)" in async_free_zspage(), since we already have the same VM_BUG_ON() in __free_zspage(), where it is safe to access zspage->fullness with pool->lock held. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23mm/zsmalloc: remove_zspage() don't need fullness parameterChengming Zhou1-7/+7
We must remove_zspage() from its current fullness list, then use insert_zspage() to update its fullness and insert it into the new fullness list. Obviously, remove_zspage() doesn't need the fullness parameter. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23mm/zsmalloc: remove set_zspage_mapping()Chengming Zhou1-11/+2
Patch series "mm/zsmalloc: some cleanup for get/set_zspage_mapping()". The discussion[1] with Sergey shows there are some cleanup works to do in get/set_zspage_mapping(): - the fullness returned from get_zspage_mapping() is not stable outside pool->lock, this usage pattern is confusing, but should be ok in this free_zspage path. - we seldom use the class_idx returned from get_zspage_mapping(), only free_zspage path use to get its class. - set_zspage_mapping() always set the zspage->class, but it's never changed after zspage allocated. [1] https://lore.kernel.org/all/[email protected]/ This patch (of 3): We only need to update zspage->fullness when insert_zspage(), since zspage->class is never changed after allocated. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23mm: compaction: early termination in compact_nodes()Kefeng Wang1-7/+17
No need to keep trying to compact memory if a fatal signal is pending; allow earlier loop termination in compact_nodes(). The existing fatal_signal_pending() check does make compact_zone() break out of the while loop, but it still enters the next zone/next nid, and some unnecessary functions (e.g., lru_add_drain()) are called. There was no observable benefit from the new test; it was just found by code inspection when refactoring compact_node(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
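A sketch of the earlier termination (illustrative; the real compact_node()/compact_nodes() signatures may differ):

  static int compact_nodes(void)
  {
          int nid;

          /* flush pending LRU updates once, not per node */
          lru_add_drain_all();

          for_each_online_node(nid) {
                  /* bail out of the whole sweep instead of merely breaking
                   * out of compact_zone() and moving on to the next node */
                  if (fatal_signal_pending(current))
                          return -EINTR;
                  compact_node(nid);
          }
          return 0;
  }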
2024-02-23mm: zswap: increase reject_compress_poor but not reject_compress_fail if ↵Barry Song1-14/+13
compression returns ENOSPC We used to rely on the -ENOSPC returned by zpool_malloc() to increase reject_compress_poor. But the code no longer gets there after commit 744e1885922a ("crypto: scomp - fix req->dst buffer overflow"), as the new code will goto out immediately after the special compression case happens, so there is no longer a chance to execute zpool_malloc(). We are incorrectly increasing zswap_reject_compress_fail instead. Thus, we need to fix the counter handling right after compression returns ENOSPC. This patch also centralizes the counter handling for all of compress_poor, compress_fail and alloc_fail. Link: https://lkml.kernel.org/r/[email protected] Fixes: 744e1885922a ("crypto: scomp - fix req->dst buffer overflow") Signed-off-by: Barry Song <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Reviewed-by: Nhat Pham <[email protected]> Acked-by: Yosry Ahmed <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23mm/z3fold: fix the comment for __encode_handle()Zhongkun He1-2/+3
The comment above __encode_handle() saying "Pool lock should be held as this function accesses first_num" is confusing, because first_num is a member of z3fold_header, which is protected by z3fold_header->page_lock. I found the same comment for encode_handle() in zbud.c by accident; there, "Pool lock should be held as this function accesses first|last_chunks" makes sense, because those are members of zbud_header, which does not have any lock of its own, so the pool lock should indeed be held. Z3fold is based on zbud, so the comment probably came from zbud, but here it is wrong, so fix it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zhongkun He <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23mm/zsmalloc: remove unused zspage->isolatedChengming Zhou1-32/+0
zspage->isolated is not used anywhere, so we don't need to maintain it, which requires holding the heavy pool lock to update it; just remove it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23mm/zsmalloc: remove migrate_write_lock_nested()Chengming Zhou1-17/+5
The migrate write lock is to protect the race between zspage migration and zspage objects' map users. We only need to lock out the map users of the src zspage, not the dst zspage, which can safely be mapped by users concurrently, since we only need to do obj_malloc() from the dst zspage. So we can remove the migrate_write_lock_nested() use case. While we are here, clean up __zs_compact() by moving putback_zspage() outside of migrate_write_unlock: since we hold the pool lock, no malloc or free users can come in. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23mm/zsmalloc: fix migrate_write_lock() when !CONFIG_COMPACTIONChengming Zhou1-6/+3
Patch series "mm/zsmalloc: fix and optimize objects/page migration". This series is to fix and optimize the zsmalloc objects/page migration. This patch (of 3): migrate_write_lock() is a empty function when !CONFIG_COMPACTION, in which case zs_compact() can be triggered from shrinker reclaim context. (Maybe it's better to rename it to zs_shrink()?) And zspage map object users rely on this migrate_read_lock() so object won't be migrated elsewhere. Fix it by always implementing the migrate_write_lock() related functions. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23mm/damon/reclaim: implement memory PSI-driven quota self-tuningSeongJae Park1-0/+25
Support the PSI-driven quota self-tuning from DAMON_RECLAIM by introducing yet another parameter, 'quota_mem_pressure_us'. Users can set the desired amount of memory pressure stall time per quota reset interval using the parameter. Then DAMON_RECLAIM monitors the memory pressure stall time, specifically the system-wide memory 'some' PSI value that increased during the given time interval, and self-tunes the quota using the DAMOS core logic. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>