path: root/mm
Age  Commit message  (Author, files, lines -/+)
2021-11-06  mm: don't automatically unregister bdis  (Christoph Hellwig, 1 file, -2/+1)
All BDI users now unregister explicitly. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Miquel Raynal <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Vignesh Raghavendra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06  mm: export bdi_unregister  (Christoph Hellwig, 1 file, -0/+1)
Patch series "simplify bdi unregistration". This series simplifies the BDI code to get rid of the magic auto-unregister feature that hid a recent block layer refcounting bug. This patch (of 5): To wind down the magic auto-unregister semantics we'll need to push this into modular code. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Miquel Raynal <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Vignesh Raghavendra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
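Given the one-line diffstat, the change is presumably just the usual export next to the function definition in mm/backing-dev.c; a minimal sketch:

    EXPORT_SYMBOL(bdi_unregister);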
2021-11-06  mm: stop filemap_read() from grabbing a superfluous page  (David Howells, 1 file, -0/+3)
Under some circumstances, filemap_read() will allocate sufficient pages to read to the end of the file, call readahead/readpages on them and copy the data over - and then it will allocate another page at the EOF and call readpage on that and then ignore it. This is unnecessary and a waste of time and resources. filemap_read() *does* check for this, but only after it has already done the allocation and I/O. Fix this by checking before calling filemap_get_pages() also. Link: https://lkml.kernel.org/r/163472463105.3126792.7056099385135786492.stgit@warthog.procyon.org.uk Link: https://lore.kernel.org/r/160588481358.3465195.16552616179674485179.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/163456863216.2614702.6384850026368833133.stgit@warthog.procyon.org.uk/ Signed-off-by: David Howells <[email protected]> Acked-by: Jeff Layton <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Kent Overstreet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
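A minimal sketch of where such a check can sit in the filemap_read() read loop (illustrative only, not the exact upstream hunk; iocb, iter, pvec and inode are locals that function already uses):

    do {
        cond_resched();

        /* Already at or past EOF: don't allocate and read a page just to ignore it. */
        if (unlikely(iocb->ki_pos >= i_size_read(inode)))
            break;

        error = filemap_get_pages(iocb, iter, &pvec);
        if (error < 0)
            break;
        /* ... copy the returned pages into the iterator as before ... */
    } while (iov_iter_count(iter));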
2021-11-06  mm/page_ext.c: fix a comment  (Yinan Zhang, 1 file, -1/+1)
The preceding conditional is #ifndef CONFIG_SPARSEMEM, so the comment on the matching #else branch should read CONFIG_SPARSEMEM. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yinan Zhang <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06  mm: debug_vm_pgtable: don't use __P000 directly  (Guo Ren, 1 file, -3/+4)
The __Pxxx/__Sxxx macros are only meant for initializing protection_map[]; all other users in the kernel should go through the protection_map array itself. A number of architectures re-initialize protection_map[], e.g. x86 (mem_encrypt), m68k (motorola), mips, arm and sparc, so using __P000 directly is not rigorous. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Guo Ren <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Cc: Gavin Shan <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Gerald Schaefer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06  mm/smaps: use vma->vm_pgoff directly when counting partial swap  (Peter Xu, 1 file, -3/+2)
Since the code is covering the whole vma anyway, use vma->vm_pgoff and vma_pages() directly rather than linear_page_index(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06  kasan: fix tag for large allocations when using CONFIG_SLAB  (Matthew Wilcox (Oracle), 1 file, -1/+1)
If an object is allocated on a tail page of a multi-page slab, kasan will get the wrong tag because page->s_mem is NULL for tail pages. I'm not quite sure what the user-visible effect of this might be. Link: https://lkml.kernel.org/r/[email protected] Fixes: 7f94ffbc4c6a ("kasan: add hooks implementation for tag-based mode") Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Marco Elver <[email protected]> Reviewed-by: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
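The usual shape of such a fix is to resolve the head page before computing the object index; a sketch under that assumption (not necessarily the exact upstream hunk):

    /* A tail page has no s_mem; derive the tag from the head page instead. */
    return (u8)obj_to_index(cache, virt_to_head_page(object), object);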
2021-11-06  kasan: generic: introduce kasan_record_aux_stack_noalloc()  (Marco Elver, 1 file, -2/+12)
Introduce a variant of kasan_record_aux_stack() that does not do any memory allocation through stackdepot. This will permit using it in contexts that cannot allocate any memory. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marco Elver <[email protected]> Tested-by: Shuah Khan <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Reviewed-by: Andrey Konovalov <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: "Gustavo A. R. Silva" <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Taras Madan <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vijayanand Jitta <[email protected]> Cc: Vinayak Menon <[email protected]> Cc: Walter Wu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
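A sketch of how the two entry points can share one implementation, with the existing behaviour kept as the default; __kasan_record_aux_stack() is the assumed internal helper that does the existing bookkeeping and forwards the flag to kasan_save_stack():

    /* Existing behaviour: stack depot may allocate memory. */
    void kasan_record_aux_stack(void *addr)
    {
        __kasan_record_aux_stack(addr, true);
    }

    /* New variant for contexts that must not allocate at all. */
    void kasan_record_aux_stack_noalloc(void *addr)
    {
        __kasan_record_aux_stack(addr, false);
    }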
2021-11-06  kasan: common: provide can_alloc in kasan_save_stack()  (Marco Elver, 3 files, -5/+5)
Add another argument, can_alloc, to kasan_save_stack() which is passed as-is to __stack_depot_save(). No functional change intended. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marco Elver <[email protected]> Tested-by: Shuah Khan <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Reviewed-by: Andrey Konovalov <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: "Gustavo A. R. Silva" <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Taras Madan <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vijayanand Jitta <[email protected]> Cc: Vinayak Menon <[email protected]> Cc: Walter Wu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
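A sketch of the resulting kasan_save_stack(), assuming __stack_depot_save() takes the extra flag as described above:

    static depot_stack_handle_t kasan_save_stack(gfp_t flags, bool can_alloc)
    {
        unsigned long entries[KASAN_STACK_DEPTH];
        unsigned int nr_entries;

        nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
        /* can_alloc is forwarded as-is; callers decide whether depot may allocate. */
        return __stack_depot_save(entries, nr_entries, flags, can_alloc);
    }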
2021-11-06  mm: don't include <linux/dax.h> in <linux/mempolicy.h>  (Christoph Hellwig, 1 file, -0/+1)
Not required at all, and having this causes a huge kernel rebuild as soon as something in dax.h changes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Reviewed-by: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06  mm: disable NUMA_BALANCING_DEFAULT_ENABLED and TRANSPARENT_HUGEPAGE on PREEMPT_RT  (Sebastian Andrzej Siewior, 1 file, -1/+1)
TRANSPARENT_HUGEPAGE: There are potential non-deterministic delays to an RT thread if a critical memory region is not THP-aligned and a non-RT buffer is located in the same hugepage-aligned region. It's also possible for an unrelated thread to migrate pages belonging to an RT task, incurring unexpected page faults due to memory defragmentation even if khugepaged is disabled. Regular HUGEPAGEs are not affected by this and can still be used. NUMA_BALANCING: There is a non-deterministic delay when marking PTEs PROT_NONE to gather NUMA fault samples, increased page faults for regions even if they are mlocked, and non-deterministic delays when migrating pages. [Mel Gorman worded 99% of the commit description]. Link: https://lore.kernel.org/all/[email protected]/ Link: https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Acked-by: Mel Gorman <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06  mm, slub: use prefetchw instead of prefetch  (Hyeonggon Yoo, 1 file, -1/+1)
Commit 0ad9500e16fe ("slub: prefetch next freelist pointer in slab_alloc()") introduced prefetch_freepointer() because when other cpu(s) free objects into a page that the current cpu owns, the freelist link is hot on the cpu(s) which freed the objects and possibly very cold on the current cpu. But if the freelist link chain is hot on the cpu(s) which freed the objects, it's better to invalidate that chain, because those cpus are not going to access it again within a short time. So use prefetchw instead of prefetch. On supported architectures like x86 and arm, prefetchw invalidates other copies of a cache line while prefetching it.
Before: Time: 91.677 Performance counter stats for 'hackbench -g 100 -l 10000': 1462938.07 msec cpu-clock # 15.908 CPUs utilized 18072550 context-switches # 12.354 K/sec 1018814 cpu-migrations # 696.416 /sec 104558 page-faults # 71.471 /sec 1580035699271 cycles # 1.080 GHz (54.51%) 2003670016013 instructions # 1.27 insn per cycle (54.31%) 5702204863 branch-misses (54.28%) 643368500985 cache-references # 439.778 M/sec (54.26%) 18475582235 cache-misses # 2.872 % of all cache refs (54.28%) 642206796636 L1-dcache-loads # 438.984 M/sec (46.87%) 18215813147 L1-dcache-load-misses # 2.84% of all L1-dcache accesses (46.83%) 653842996501 dTLB-loads # 446.938 M/sec (46.63%) 3227179675 dTLB-load-misses # 0.49% of all dTLB cache accesses (46.85%) 537531951350 iTLB-loads # 367.433 M/sec (54.33%) 114750630 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.37%) 630135543177 L1-icache-loads # 430.733 M/sec (46.80%) 22923237620 L1-icache-load-misses # 3.64% of all L1-icache accesses (46.76%) 91.964452802 seconds time elapsed 43.416742000 seconds user 1422.441123000 seconds sys
After: Time: 90.220 Performance counter stats for 'hackbench -g 100 -l 10000': 1437418.48 msec cpu-clock # 15.880 CPUs utilized 17694068 context-switches # 12.310 K/sec 958257 cpu-migrations # 666.651 /sec 100604 page-faults # 69.989 /sec 1583259429428 cycles # 1.101 GHz (54.57%) 2004002484935 instructions # 1.27 insn per cycle (54.37%) 5594202389 branch-misses (54.36%) 643113574524 cache-references # 447.409 M/sec (54.39%) 18233791870 cache-misses # 2.835 % of all cache refs (54.37%) 640205852062 L1-dcache-loads # 445.386 M/sec (46.75%) 17968160377 L1-dcache-load-misses # 2.81% of all L1-dcache accesses (46.79%) 651747432274 dTLB-loads # 453.415 M/sec (46.59%) 3127124271 dTLB-load-misses # 0.48% of all dTLB cache accesses (46.75%) 535395273064 iTLB-loads # 372.470 M/sec (54.38%) 113500056 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.35%) 628871845924 L1-icache-loads # 437.501 M/sec (46.80%) 22585641203 L1-icache-load-misses # 3.59% of all L1-icache accesses (46.79%) 90.514819303 seconds time elapsed 43.877656000 seconds user 1397.176001000 seconds sys
Link: https://lkml.org/lkml/2021/10/8/598 Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hyeonggon Yoo <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
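The change itself is tiny; a sketch of prefetch_freepointer() with the write-intent hint (SLUB keeps the free pointer at object + s->offset):

    static void prefetch_freepointer(const struct kmem_cache *s, void *object)
    {
        /* Prefetch for write: the allocation path is about to update this link. */
        prefetchw(object + s->offset);
    }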
2021-11-06  mm/slub: increase default cpu partial list sizes  (Vlastimil Babka, 1 file, -4/+4)
The defaults are determined based on object size and can go up to 30 for objects smaller than 256 bytes. Before the previous patch changed the accounting, this could have made cpu partial list contain up to 30 pages. After that patch, only up to 2 pages with default allocation order. Very short lists limit the usefulness of the whole concept of cpu partial lists, so this patch aims at a more reasonable default under the new accounting. The defaults are quadrupled, except for object size >= PAGE_SIZE where it's doubled. This makes the lists grow up to 10 pages in practice. A quick test of booting a kernel under virtme with 4GB RAM and 8 vcpus shows the following slab memory usage after boot: Before previous patch (using page->pobjects): Slab: 36732 kB SReclaimable: 14836 kB SUnreclaim: 21896 kB After previous patch (using page->pages): Slab: 34720 kB SReclaimable: 13716 kB SUnreclaim: 21004 kB After this patch (using page->pages, higher defaults): Slab: 35252 kB SReclaimable: 13944 kB SUnreclaim: 21308 kB In the same setup, I also ran 5 times: hackbench -l 16000 -g 16 Differences in time were in the noise, we can compare slub stats as given by slabinfo -r skbuff_head_cache (the other cache heavily used by hackbench, kmalloc-cg-512 looks similar). Negligible stats left out for brevity. Before previous patch (using page->pobjects): Objects: 1408, Memory Total: 401408 Used : 304128 Slab Perf Counter Alloc Free %Al %Fr -------------------------------------------------- Fastpath 469952498 5946606 91 1 Slowpath 42053573 506059465 8 98 Page Alloc 41093 41044 0 0 Add partial 18 21229327 0 4 Remove partial 20039522 36051 3 0 Cpu partial list 4686640 24767229 0 4 RemoteObj/SlabFrozen 16 124027841 0 24 Total 512006071 512006071 Flushes 18 Slab Deactivation Occurrences % ------------------------------------------------- Slab empty 4993 0% Deactivation bypass 24767229 99% Refilled from foreign frees 21972674 88% After previous patch (using page->pages): Objects: 480, Memory Total: 131072 Used : 103680 Slab Perf Counter Alloc Free %Al %Fr -------------------------------------------------- Fastpath 473016294 5405653 92 1 Slowpath 38989777 506600418 7 98 Page Alloc 32717 32701 0 0 Add partial 3 22749164 0 4 Remove partial 11371127 32474 2 0 Cpu partial list 11686226 23090059 2 4 RemoteObj/SlabFrozen 2 67541803 0 13 Total 512006071 512006071 Flushes 3 Slab Deactivation Occurrences % ------------------------------------------------- Slab empty 227 0% Deactivation bypass 23090059 99% Refilled from foreign frees 27585695 119% After this patch (using page->pages, higher defaults): Objects: 896, Memory Total: 229376 Used : 193536 Slab Perf Counter Alloc Free %Al %Fr -------------------------------------------------- Fastpath 473799295 4980278 92 0 Slowpath 38206776 507025793 7 99 Page Alloc 32295 32267 0 0 Add partial 11 23291143 0 4 Remove partial 5815764 31278 1 0 Cpu partial list 18119280 23967320 3 4 RemoteObj/SlabFrozen 10 76974794 0 15 Total 512006071 512006071 Flushes 11 Slab Deactivation Occurrences % ------------------------------------------------- Slab empty 989 0% Deactivation bypass 23967320 99% Refilled from foreign frees 32358473 135% As expected, memory usage dropped significantly with change of accounting, increasing the defaults increased it, but not as much. The number of page allocation/frees dropped significantly with the new accounting, but didn't increase with the higher defaults. 
Interestingly, the number of fastpath allocations increased, as well as allocations from the cpu partial list, even though it's shorter. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Jann Horn <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Roman Gushchin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06  mm, slub: change percpu partial accounting from objects to pages  (Vlastimil Babka, 1 file, -30/+59)
With CONFIG_SLUB_CPU_PARTIAL enabled, SLUB keeps a percpu list of partial slabs that can be promoted to cpu slab when the previous one is depleted, without accessing the shared partial list. A slab can be added to this list by 1) refill of an empty list from get_partial_node() - once we really have to access the shared partial list, we acquire multiple slabs to amortize the cost of locking, and 2) first free to a previously full slab - instead of putting the slab on a shared partial list, we can more cheaply freeze it and put it on the per-cpu list. To control how large a percpu partial list can grow for a kmem cache, set_cpu_partial() calculates a target number of free objects on each cpu's percpu partial list, and this can be also set by the sysfs file cpu_partial. However, the tracking of the actual number of objects is imprecise, in order to limit overhead from cpu X freeing an object to a slab on the percpu partial list of cpu Y. Basically, the percpu partial slabs form a single linked list, and when we add a new slab to the list with current head "oldpage", we set in the struct page of the slab we're adding:
  page->pages = oldpage->pages + 1; // this is precise
  page->pobjects = oldpage->pobjects + (page->objects - page->inuse);
  page->next = oldpage;
Thus the real number of free objects in the slab (objects - inuse) is only determined at the moment of adding the slab to the percpu partial list, and further freeing doesn't update the pobjects counter nor propagate it to the current list head. As Jann reports [1], this can easily lead to large inaccuracies, where the target number of objects (up to 30 by default) can translate to the same number of (empty) slab pages on the list. In case 2) above, we put a slab with 1 free object on the list, thus only increase page->pobjects by 1, even if there are subsequent frees on the same slab. Jann has noticed this in practice and so did we [2] when investigating significant increase of kmemcg usage after switching from SLAB to SLUB. While this is no longer a problem in kmemcg context thanks to the accounting rewrite in 5.9, the memory waste is still not ideal and it's questionable whether it makes sense to perform free object count based control when object counts can so easily become inaccurate. So this patch converts the accounting to be based on number of pages only (which is precise) and removes the page->pobjects field completely. This is also ultimately simpler. To retain the existing set_cpu_partial() heuristic, first calculate the target number of objects as previously, but then convert it to a target number of pages by assuming the pages will be half-filled on average. This assumption might obviously also be inaccurate in practice, but it cannot degrade to the actual number of pages being equal to the target number of objects. We could also skip the intermediate step with the target number of objects and rewrite the heuristic in terms of pages. However we still have the sysfs file cpu_partial which uses number of objects and could break existing users if it suddenly becomes number of pages, so this patch doesn't do that. In practice, after this patch the heuristics limit the size of the percpu partial list up to 2 pages. In case of a reported regression (which would mean some workload has benefited from the previous imprecise object based counting), we can tune the heuristics to get a better compromise within the new scheme, while still avoiding the unexpectedly long percpu partial lists.
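As a rough illustration of the object-to-page conversion described above (hypothetical helper name, extracted only for illustration), assuming slabs on the percpu partial list are half full on average:

    /* Translate the historic object-based target into a page-based limit. */
    static unsigned int cpu_partial_pages(struct kmem_cache *s, unsigned int nr_objects)
    {
        /* Half-filled slabs hold oo_objects(s->oo) / 2 free objects each. */
        return DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));
    }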
[1] https://lore.kernel.org/linux-mm/CAG48ez2Qx5K1Cab-m8BdSibp6wLTip6ro4=-umR7BLsEgjEYzA@mail.gmail.com/ [2] https://lore.kernel.org/all/[email protected]/ ========== Evaluation ========== Mel was kind enough to run v1 through mmtests machinery for netperf (localhost) and hackbench and, for most significant results see below. So there are some apparent regressions, especially with hackbench, which I think ultimately boils down to having shorter percpu partial lists on average and some benchmarks benefiting from longer ones. Monitoring slab usage also indicated less memory usage by slab. Based on that, the following patch will bump the defaults to allow longer percpu partial lists than after this patch. However the goal is certainly not such that we would limit the percpu partial lists to 30 pages just because previously a specific alloc/free pattern could lead to the limit of 30 objects translate to a limit to 30 pages - that would make little sense. This is a correctness patch, and if a workload benefits from larger lists, the sysfs tuning knobs are still there to allow that. Netperf 2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads per socket), 384GB RAM TCP-RR: hmean before 127045.79 after 121092.94 (-4.69%, worse) stddev before 2634.37 after 1254.08 UDP-RR: hmean before 166985.45 after 160668.94 ( -3.78%, worse) stddev before 4059.69 after 1943.63 2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads per socket), 512GB RAM TCP-RR: hmean before 84173.25 after 76914.72 ( -8.62%, worse) UDP-RR: hmean before 93571.12 after 96428.69 ( 3.05%, better) stddev before 23118.54 after 16828.14 2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads per socket), 64GB RAM TCP-RR: hmean before 49984.92 after 48922.27 ( -2.13%, worse) stddev before 6248.15 after 4740.51 UDP-RR: hmean before 61854.31 after 68761.81 ( 11.17%, better) stddev before 4093.54 after 5898.91 other machines - within 2% Hackbench (results before and after the patch, negative % means worse) 2-socket AMD EPYC 7713 (64 cores, 128 threads per core), 256GB RAM hackbench-process-sockets Amean 1 0.5380 0.5583 ( -3.78%) Amean 4 0.7510 0.8150 ( -8.52%) Amean 7 0.7930 0.9533 ( -20.22%) Amean 12 0.7853 1.1313 ( -44.06%) Amean 21 1.1520 1.4993 ( -30.15%) Amean 30 1.6223 1.9237 ( -18.57%) Amean 48 2.6767 2.9903 ( -11.72%) Amean 79 4.0257 5.1150 ( -27.06%) Amean 110 5.5193 7.4720 ( -35.38%) Amean 141 7.2207 9.9840 ( -38.27%) Amean 172 8.4770 12.1963 ( -43.88%) Amean 203 9.6473 14.3137 ( -48.37%) Amean 234 11.3960 18.7917 ( -64.90%) Amean 265 13.9627 22.4607 ( -60.86%) Amean 296 14.9163 26.0483 ( -74.63%) hackbench-thread-sockets Amean 1 0.5597 0.5877 ( -5.00%) Amean 4 0.7913 0.8960 ( -13.23%) Amean 7 0.8190 1.0017 ( -22.30%) Amean 12 0.9560 1.1727 ( -22.66%) Amean 21 1.7587 1.5660 ( 10.96%) Amean 30 2.4477 1.9807 ( 19.08%) Amean 48 3.4573 3.0630 ( 11.41%) Amean 79 4.7903 5.1733 ( -8.00%) Amean 110 6.1370 7.4220 ( -20.94%) Amean 141 7.5777 9.2617 ( -22.22%) Amean 172 9.2280 11.0907 ( -20.18%) Amean 203 10.2793 13.3470 ( -29.84%) Amean 234 11.2410 17.1070 ( -52.18%) Amean 265 12.5970 23.3323 ( -85.22%) Amean 296 17.1540 24.2857 ( -41.57%) 2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads per socket), 384GB RAM hackbench-process-sockets Amean 1 0.5760 0.4793 ( 16.78%) Amean 4 0.9430 0.9707 ( -2.93%) Amean 7 1.5517 1.8843 ( -21.44%) Amean 12 2.4903 2.7267 ( -9.49%) Amean 21 3.9560 4.2877 ( -8.38%) Amean 30 5.4613 5.8343 ( -6.83%) Amean 48 8.5337 9.2937 ( -8.91%) 
Amean 79 14.0670 15.2630 ( -8.50%) Amean 110 19.2253 21.2467 ( -10.51%) Amean 141 23.7557 25.8550 ( -8.84%) Amean 172 28.4407 29.7603 ( -4.64%) Amean 203 33.3407 33.9927 ( -1.96%) Amean 234 38.3633 39.1150 ( -1.96%) Amean 265 43.4420 43.8470 ( -0.93%) Amean 296 48.3680 48.9300 ( -1.16%) hackbench-thread-sockets Amean 1 0.6080 0.6493 ( -6.80%) Amean 4 1.0000 1.0513 ( -5.13%) Amean 7 1.6607 2.0260 ( -22.00%) Amean 12 2.7637 2.9273 ( -5.92%) Amean 21 5.0613 4.5153 ( 10.79%) Amean 30 6.3340 6.1140 ( 3.47%) Amean 48 9.0567 9.5577 ( -5.53%) Amean 79 14.5657 15.7983 ( -8.46%) Amean 110 19.6213 21.6333 ( -10.25%) Amean 141 24.1563 26.2697 ( -8.75%) Amean 172 28.9687 30.2187 ( -4.32%) Amean 203 33.9763 34.6970 ( -2.12%) Amean 234 38.8647 39.3207 ( -1.17%) Amean 265 44.0813 44.1507 ( -0.16%) Amean 296 49.2040 49.4330 ( -0.47%) 2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads per socket), 512GB RAM hackbench-process-sockets Amean 1 0.5027 0.5017 ( 0.20%) Amean 4 1.1053 1.2033 ( -8.87%) Amean 7 1.8760 2.1820 ( -16.31%) Amean 12 2.9053 3.1810 ( -9.49%) Amean 21 4.6777 4.9920 ( -6.72%) Amean 30 6.5180 6.7827 ( -4.06%) Amean 48 10.0710 10.5227 ( -4.48%) Amean 79 16.4250 17.5053 ( -6.58%) Amean 110 22.6203 24.4617 ( -8.14%) Amean 141 28.0967 31.0363 ( -10.46%) Amean 172 34.4030 36.9233 ( -7.33%) Amean 203 40.5933 43.0850 ( -6.14%) Amean 234 46.6477 48.7220 ( -4.45%) Amean 265 53.0530 53.9597 ( -1.71%) Amean 296 59.2760 59.9213 ( -1.09%) hackbench-thread-sockets Amean 1 0.5363 0.5330 ( 0.62%) Amean 4 1.1647 1.2157 ( -4.38%) Amean 7 1.9237 2.2833 ( -18.70%) Amean 12 2.9943 3.3110 ( -10.58%) Amean 21 4.9987 5.1880 ( -3.79%) Amean 30 6.7583 7.0043 ( -3.64%) Amean 48 10.4547 10.8353 ( -3.64%) Amean 79 16.6707 17.6790 ( -6.05%) Amean 110 22.8207 24.4403 ( -7.10%) Amean 141 28.7090 31.0533 ( -8.17%) Amean 172 34.9387 36.8260 ( -5.40%) Amean 203 41.1567 43.0450 ( -4.59%) Amean 234 47.3790 48.5307 ( -2.43%) Amean 265 53.9543 54.6987 ( -1.38%) Amean 296 60.0820 60.2163 ( -0.22%) 1-socket Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz (4 cores, 8 threads), 32 GB RAM hackbench-process-sockets Amean 1 1.4760 1.5773 ( -6.87%) Amean 3 3.9370 4.0910 ( -3.91%) Amean 5 6.6797 6.9357 ( -3.83%) Amean 7 9.3367 9.7150 ( -4.05%) Amean 12 15.7627 16.1400 ( -2.39%) Amean 18 23.5360 23.6890 ( -0.65%) Amean 24 31.0663 31.3137 ( -0.80%) Amean 30 38.7283 39.0037 ( -0.71%) Amean 32 41.3417 41.6097 ( -0.65%) hackbench-thread-sockets Amean 1 1.5250 1.6043 ( -5.20%) Amean 3 4.0897 4.2603 ( -4.17%) Amean 5 6.7760 7.0933 ( -4.68%) Amean 7 9.4817 9.9157 ( -4.58%) Amean 12 15.9610 16.3937 ( -2.71%) Amean 18 23.9543 24.3417 ( -1.62%) Amean 24 31.4400 31.7217 ( -0.90%) Amean 30 39.2457 39.5467 ( -0.77%) Amean 32 41.8267 42.1230 ( -0.71%) 2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads per socket), 64GB RAM hackbench-process-sockets Amean 1 1.0347 1.0880 ( -5.15%) Amean 4 1.7267 1.8527 ( -7.30%) Amean 7 2.6707 2.8110 ( -5.25%) Amean 12 4.1617 4.3383 ( -4.25%) Amean 21 7.0070 7.2600 ( -3.61%) Amean 30 9.9187 10.2397 ( -3.24%) Amean 48 15.6710 16.3923 ( -4.60%) Amean 79 24.7743 26.1247 ( -5.45%) Amean 110 34.3000 35.9307 ( -4.75%) Amean 141 44.2043 44.8010 ( -1.35%) Amean 172 54.2430 54.7260 ( -0.89%) Amean 192 60.6557 60.9777 ( -0.53%) hackbench-thread-sockets Amean 1 1.0610 1.1353 ( -7.01%) Amean 4 1.7543 1.9140 ( -9.10%) Amean 7 2.7840 2.9573 ( -6.23%) Amean 12 4.3813 4.4937 ( -2.56%) Amean 21 7.3460 7.5350 ( -2.57%) Amean 30 10.2313 10.5190 ( -2.81%) Amean 48 15.9700 16.5940 ( -3.91%) Amean 79 
25.3973 26.6637 ( -4.99%) Amean 110 35.1087 36.4797 ( -3.91%) Amean 141 45.8220 46.3053 ( -1.05%) Amean 172 55.4917 55.7320 ( -0.43%) Amean 192 62.7490 62.5410 ( 0.33%) Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Reported-by: Jann Horn <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06  slub: add back check for free nonslab objects  (Kefeng Wang, 1 file, -1/+3)
After commit f227f0faf63b ("slub: fix unreclaimable slab stat for bulk free"), the check for freeing a non-slab page was replaced by VM_BUG_ON_PAGE, which only checks when CONFIG_DEBUG_VM is enabled; since that config may impact performance, it is only meant for debugging. Commit 0937502af7c9 ("slub: Add check for kfree() of non slab objects.") added this check, which should be present in all configurations to catch such invalid frees, since they can indicate real problems, e.g. memory corruption, use-after-free and double free. So replace VM_BUG_ON_PAGE with WARN_ON_ONCE and print the object address to help users debug the issue. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rienjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06  mm/slab.c: remove useless lines in enable_cpucache()  (Shi Lei, 1 file, -3/+0)
These lines are useless, so remove them. Link: https://lkml.kernel.org/r/[email protected] Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations") Signed-off-by: Shi Lei <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-28  mm/damon/core-test: fix wrong expectations for 'damon_split_regions_of()'  (SeongJae Park, 1 file, -2/+2)
The kunit test cases for 'damon_split_regions_of()' expect the number of regions after calling the function to be the same as requested ('nr_sub'). However, the requested number is just an upper limit, because the function randomly decides the size of each sub-region. This fixes the wrong expectation. Link: https://lkml.kernel.org/r/[email protected] Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-28  mm: khugepaged: skip huge page collapse for special files  (Yang Shi, 1 file, -8/+11)
The read-only THP for filesystems will collapse THP for files opened readonly and mapped with VM_EXEC. The intended usecase is to avoid TLB misses for large text segments. But it doesn't restrict the file types so a THP could be collapsed for a non-regular file, for example, block device, if it is opened readonly and mapped with EXEC permission. This may cause bugs, like [1] and [2]. This is definitely not the intended usecase, so just collapse THP for regular files in order to close the attack surface. [[email protected]: fix vm_file check [3]] Link: https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/ [1] Link: https://lore.kernel.org/linux-mm/[email protected]/ [2] Link: https://lkml.kernel.org/r/CAHbLzkqTW9U3VvTu1Ki5v_cLRC9gHW+znBukg_ycergE0JWj-A@mail.gmail.com [3] Link: https://lkml.kernel.org/r/[email protected] Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Hugh Dickins <[email protected]> Signed-off-by: Yang Shi <[email protected]> Reported-by: Hao Sun <[email protected]> Reported-by: [email protected] Cc: Matthew Wilcox <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Song Liu <[email protected]> Cc: Andrea Righi <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
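A sketch of the kind of test this implies in khugepaged's VMA check (illustrative, not the exact hunk); the point is that only regular files opened read-only and mapped with VM_EXEC stay eligible:

    if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
        vma->vm_file && (vm_flags & VM_EXEC)) {
        struct inode *inode = file_inode(vma->vm_file);

        /* Read-only THP: never collapse block devices or other special files. */
        return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
    }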
2021-10-28  mm, thp: bail out early in collapse_file for writeback page  (Rongwei Wang, 1 file, -1/+6)
Currently collapse_file does not explicitly check PG_writeback, instead, page_has_private and try_to_release_page are used to filter writeback pages. This does not work for xfs with blocksize equal to or larger than pagesize, because in such case xfs has no page->private. This makes collapse_file bail out early for writeback page. Otherwise, xfs end_page_writeback will panic as follows. page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:ffff0003f88c86a8 index:0x0 pfn:0x84ef32 aops:xfs_address_space_operations [xfs] ino:30000b7 dentry name:"libtest.so" flags: 0x57fffe0000008027(locked|referenced|uptodate|active|writeback) raw: 57fffe0000008027 ffff80001b48bc28 ffff80001b48bc28 ffff0003f88c86a8 raw: 0000000000000000 0000000000000000 00000000ffffffff ffff0000c3e9a000 page dumped because: VM_BUG_ON_PAGE(((unsigned int) page_ref_count(page) + 127u <= 127u)) page->mem_cgroup:ffff0000c3e9a000 ------------[ cut here ]------------ kernel BUG at include/linux/mm.h:1212! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: BUG: Bad page state in process khugepaged pfn:84ef32 xfs(E) page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:0 index:0x0 pfn:0x84ef32 libcrc32c(E) rfkill(E) aes_ce_blk(E) crypto_simd(E) ... CPU: 25 PID: 0 Comm: swapper/25 Kdump: loaded Tainted: ... pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--) Call trace: end_page_writeback+0x1c0/0x214 iomap_finish_page_writeback+0x13c/0x204 iomap_finish_ioend+0xe8/0x19c iomap_writepage_end_bio+0x38/0x50 bio_endio+0x168/0x1ec blk_update_request+0x278/0x3f0 blk_mq_end_request+0x34/0x15c virtblk_request_done+0x38/0x74 [virtio_blk] blk_done_softirq+0xc4/0x110 __do_softirq+0x128/0x38c __irq_exit_rcu+0x118/0x150 irq_exit+0x1c/0x30 __handle_domain_irq+0x8c/0xf0 gic_handle_irq+0x84/0x108 el1_irq+0xcc/0x180 arch_cpu_idle+0x18/0x40 default_idle_call+0x4c/0x1a0 cpuidle_idle_call+0x168/0x1e0 do_idle+0xb4/0x104 cpu_startup_entry+0x30/0x9c secondary_start_kernel+0x104/0x180 Code: d4210000 b0006161 910c8021 94013f4d (d4210000) ---[ end trace 4a88c6a074082f8c ]--- Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt Link: https://lkml.kernel.org/r/[email protected] Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Rongwei Wang <[email protected]> Signed-off-by: Xu Yu <[email protected]> Suggested-by: Yang Shi <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Yang Shi <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Song Liu <[email protected]> Cc: William Kucharski <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-28  mm/vmalloc: fix numa spreading for large hash tables  (Chen Wandun, 1 file, -6/+9)
Eric Dumazet reported a strange numa spreading info in [1], and found commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced this issue [2]. Digging into the difference before and after this patch, page allocation takes a different path:
before:
  alloc_large_system_hash
    __vmalloc
      __vmalloc_node(..., NUMA_NO_NODE, ...)
        __vmalloc_node_range
          __vmalloc_area_node
            alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
              alloc_pages_current
                alloc_page_interleave /* can be proved by print policy mode */
after:
  alloc_large_system_hash
    __vmalloc
      __vmalloc_node(..., NUMA_NO_NODE, ...)
        __vmalloc_node_range
          __vmalloc_area_node
            alloc_pages_node /* choose nid by numa_mem_id() */
              __alloc_pages_node(nid, ....)
So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"), memory is allocated on the current node instead of being interleaved across nodes. Link: https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/ [1] Link: https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/ [2] Link: https://lkml.kernel.org/r/[email protected] Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") Signed-off-by: Chen Wandun <[email protected]> Reported-by: Eric Dumazet <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Hanjun Guo <[email protected]> Cc: Uladzislau Rezki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
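A sketch of the allocation-side idea (illustrative): only take the node-affine path when a node was actually requested, so NUMA_NO_NODE keeps honouring the task's mempolicy (e.g. interleave):

    if (nid == NUMA_NO_NODE)
        page = alloc_pages(gfp, order);            /* respects mempolicy, can interleave */
    else
        page = alloc_pages_node(nid, gfp, order);  /* caller asked for a specific node */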
2021-10-28  mm/secretmem: avoid letting secretmem_users drop to zero  (Kees Cook, 1 file, -1/+1)
Quoting Dmitry: "refcount_inc() needs to be done before fd_install(). After fd_install() finishes, the fd can be used by userspace and we can have secret data in memory before the refcount_inc(). A straightforward misuse where a user will predict the returned fd in another thread before the syscall returns and will use it to store secret data is somewhat dubious because such a user just shoots themself in the foot. But a more interesting misuse would be to close the predicted fd and decrement the refcount before the corresponding refcount_inc, this way one can briefly drop the refcount to zero while there are other users of secretmem." Move fd_install() after refcount_inc(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/lkml/CACT4Y+b1sW6-Hkn8HQYw_SsT7X3tp-CJNh2ci0wG3ZnQz9jjig@mail.gmail.com Fixes: 9a436f8ff631 ("PM: hibernate: disable when there are active secretmem users") Signed-off-by: Kees Cook <[email protected]> Reported-by: Dmitry Vyukov <[email protected]> Reviewed-by: Dmitry Vyukov <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Jordy Zomer <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
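A minimal sketch of the resulting ordering at the end of the memfd_secret() syscall path (variable names illustrative):

    /* Account the new user before the fd becomes visible (and closable) in userspace. */
    atomic_inc(&secretmem_users);
    fd_install(fd, file);

    return fd;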
2021-10-28  mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap  (Suren Baghdasaryan, 1 file, -11/+12)
Race between process_mrelease and exit_mmap, where free_pgtables is called while __oom_reap_task_mm is in progress, leads to kernel crash during pte_offset_map_lock call. oom-reaper avoids this race by setting MMF_OOM_VICTIM flag and causing exit_mmap to take and release mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock. Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to fix this race, however that would be considered a hack. Fix this race by elevating mm->mm_users and preventing exit_mmap from executing until process_mrelease is finished. Patch slightly refactors the code to adapt for a possible mmget_not_zero failure. This fix has considerable negative impact on process_mrelease performance and will likely need later optimization. Link: https://lkml.kernel.org/r/[email protected] Fixes: 884a7e5964e0 ("mm: introduce process_mrelease system call") Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jann Horn <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Florian Weimer <[email protected]> Cc: Jan Engelhardt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-28  mm: filemap: check if THP has hwpoisoned subpage for PMD page fault  (Yang Shi, 4 files, -1/+28)
When handling shmem page fault the THP with corrupted subpage could be PMD mapped if certain conditions are satisfied. But kernel is supposed to send SIGBUS when trying to map hwpoisoned page. There are two paths which may do PMD map: fault around and regular fault. Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths") the thing was even worse in fault around path. The THP could be PMD mapped as long as the VMA fits regardless what subpage is accessed and corrupted. After this commit as long as head page is not corrupted the THP could be PMD mapped. In the regular fault path the THP could be PMD mapped as long as the corrupted page is not accessed and the VMA fits. This loophole could be fixed by iterating every subpage to check if any of them is hwpoisoned or not, but it is somewhat costly in page fault path. So introduce a new page flag called HasHWPoisoned on the first tail page. It indicates the THP has hwpoisoned subpage(s). It is set if any subpage of THP is found hwpoisoned by memory failure and after the refcount is bumped successfully, then cleared when the THP is freed or split. The soft offline path doesn't need this since soft offline handler just marks a subpage hwpoisoned when the subpage is migrated successfully. But shmem THP didn't get split then migrated at all. Link: https://lkml.kernel.org/r/[email protected] Fixes: 800d8c63b2e9 ("shmem: add huge pages support") Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Suggested-by: Kirill A. Shutemov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
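A sketch of the check this flag enables at PMD-map time (illustrative; the test macro follows the usual PageFoo() naming): refusing the PMD mapping forces PTE mapping, so touching the bad subpage still raises SIGBUS.

    /* Never PMD-map a THP with a hwpoisoned subpage; fall back to PTE mapping. */
    if (unlikely(PageHasHWPoisoned(page)))
        return VM_FAULT_FALLBACK;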
2021-10-28  mm: hwpoison: remove the unnecessary THP check  (Yang Shi, 1 file, -14/+0)
When handling a THP, hwpoison checked whether the THP was in the allocation or free stage, since hwpoison may mistreat it as a hugetlb page. After commit 415c64c1453a ("mm/memory-failure: split thp earlier in memory error handling") the problem has been fixed, so this check is no longer needed; remove it. The side effect of the removal is that hwpoison may report an unsplit THP instead of an unknown error for shmem THP, which does not seem like a big deal. The following patch "mm: filemap: check if THP has hwpoisoned subpage for PMD page fault" depends on this; it fixes shmem THPs with hwpoisoned subpage(s) being wrongly PMD mapped, so this patch needs to be backported to -stable as well. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Suggested-by: Naoya Horiguchi <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-28  memcg: page_alloc: skip bulk allocator for __GFP_ACCOUNT  (Shakeel Butt, 1 file, -0/+4)
Commit 5c1f4e690eec ("mm/vmalloc: switch to bulk allocator in __vmalloc_area_node()") switched to the bulk page allocator for the order-0 allocations backing vmalloc. However, the bulk page allocator does not support __GFP_ACCOUNT allocations, and there are several users of kvmalloc(__GFP_ACCOUNT). For now make __GFP_ACCOUNT allocations bypass the bulk page allocator. In future, if there is a workload that can be significantly improved by the bulk page allocator with __GFP_ACCOUNT support, we can revisit the decision. Link: https://lkml.kernel.org/r/[email protected] Fixes: 5c1f4e690eec ("mm/vmalloc: switch to bulk allocator in __vmalloc_area_node()") Signed-off-by: Shakeel Butt <[email protected]> Reported-by: Vasily Averin <[email protected]> Tested-by: Vasily Averin <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
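One plausible shape of the guard (a sketch, not necessarily the exact placement in the bulk allocator): bail out early when __GFP_ACCOUNT is set so callers fall back to the regular, memcg-aware single-page path.

    /* The bulk allocator does no memcg accounting; __GFP_ACCOUNT callers must fall back. */
    if (unlikely(gfp & __GFP_ACCOUNT))
        goto failed;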
2021-10-25  secretmem: Prevent secretmem_users from wrapping to zero  (Matthew Wilcox (Oracle), 1 file, -0/+2)
Commit 110860541f44 ("mm/secretmem: use refcount_t instead of atomic_t") attempted to fix the problem of secretmem_users wrapping to zero and allowing suspend once again. But it was reverted in commit 87066fdd2e30 ("Revert 'mm/secretmem: use refcount_t instead of atomic_t'") because of the problems it caused - a refcount_t was not semantically the right type to use. Instead prevent secretmem_users from wrapping to zero by forbidding new users if the number of users has wrapped from positive to negative. This stops a long way short of reaching the necessary 4 billion users where it wraps to zero again, so there's no need to be clever with special anti-wrap types or checking the return value from atomic_inc(). Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Jordy Zomer <[email protected]> Cc: Kees Cook <[email protected]>, Cc: James Bottomley <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-24  Revert "mm/secretmem: use refcount_t instead of atomic_t"  (Linus Torvalds, 1 file, -5/+4)
This reverts commit 110860541f443f950c1274f217a1a3e298670a33. Converting the "secretmem_users" counter to a refcount is incorrect, because a refcount is special in zero and can't just be incremented (but a count of users is not, and "no users" is actually perfectly valid and not a sign of a free'd resource). Reported-by: [email protected] Cc: Jordy Zomer <[email protected]> Cc: Kees Cook <[email protected]>, Cc: Jordy Zomer <[email protected]> Cc: James Bottomley <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-21  memblock: exclude MEMBLOCK_NOMAP regions from kmemleak  (Mike Rapoport, 1 file, -0/+3)
Vladimir Zapolskiy reports: Commit a7259df76702 ("memblock: make memblock_find_in_range method private") invokes a kernel panic while running kmemleak on OF platforms with nomaped regions: Unable to handle kernel paging request at virtual address fff000021e00000 [...] scan_block+0x64/0x170 scan_gray_list+0xe8/0x17c kmemleak_scan+0x270/0x514 kmemleak_write+0x34c/0x4ac The memory allocated from memblock is registered with kmemleak, but if it is marked MEMBLOCK_NOMAP it won't have linear map entries so an attempt to scan such areas will fault. Ideally, memblock_mark_nomap() would inform kmemleak to ignore MEMBLOCK_NOMAP memory, but it can be called before kmemleak interfaces operating on physical addresses can use __va() conversion. Make sure that functions that mark allocated memory as MEMBLOCK_NOMAP take care of informing kmemleak to ignore such memory. Link: https://lore.kernel.org/all/[email protected] Link: https://lore.kernel.org/all/[email protected] Fixes: a7259df76702 ("memblock: make memblock_find_in_range method private") Reported-by: Vladimir Zapolskiy <[email protected]> Signed-off-by: Mike Rapoport <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Tested-by: Vladimir Zapolskiy <[email protected]> Tested-by: Qian Cai <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-21  Revert "memblock: exclude NOMAP regions from kmemleak"  (Mike Rapoport, 1 file, -6/+1)
Commit 6e44bd6d34d6 ("memblock: exclude NOMAP regions from kmemleak") breaks boot on EFI systems with kmemleak and VM_DEBUG enabled: efi: Processing EFI memory map: efi: 0x000090000000-0x000091ffffff [Conventional| | | | | | | | | | |WB|WT|WC|UC] efi: 0x000092000000-0x0000928fffff [Runtime Data|RUN| | | | | | | | | |WB|WT|WC|UC] ------------[ cut here ]------------ kernel BUG at mm/kmemleak.c:1140! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 5.15.0-rc6-next-20211019+ #104 pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : kmemleak_free_part_phys+0x64/0x8c lr : kmemleak_free_part_phys+0x38/0x8c sp : ffff800011eafbc0 x29: ffff800011eafbc0 x28: 1fffff7fffb41c0d x27: fffffbfffda0e068 x26: 0000000092000000 x25: 1ffff000023d5f94 x24: ffff800011ed84d0 x23: ffff800011ed84c0 x22: ffff800011ed83d8 x21: 0000000000900000 x20: ffff800011782000 x19: 0000000092000000 x18: ffff800011ee0730 x17: 0000000000000000 x16: 0000000000000000 x15: 1ffff0000233252c x14: ffff800019a905a0 x13: 0000000000000001 x12: ffff7000023d5ed7 x11: 1ffff000023d5ed6 x10: ffff7000023d5ed6 x9 : dfff800000000000 x8 : ffff800011eaf6b7 x7 : 0000000000000001 x6 : ffff800011eaf6b0 x5 : 00008ffffdc2a12a x4 : ffff7000023d5ed7 x3 : 1ffff000023dbf99 x2 : 1ffff000022f0463 x1 : 0000000000000000 x0 : ffffffffffffffff Call trace: kmemleak_free_part_phys+0x64/0x8c memblock_mark_nomap+0x5c/0x78 reserve_regions+0x294/0x33c efi_init+0x2d0/0x490 setup_arch+0x80/0x138 start_kernel+0xa0/0x3ec __primary_switched+0xc0/0xc8 Code: 34000041 97d526e7 f9418e80 36000040 (d4210000) random: get_random_bytes called from print_oops_end_marker+0x34/0x80 with crng_init=0 ---[ end trace 0000000000000000 ]--- The crash happens because kmemleak_free_part_phys() tries to use __va() before memstart_addr is initialized and this triggers a VM_BUG_ON() in arch/arm64/include/asm/memory.h: Revert 6e44bd6d34d6 ("memblock: exclude NOMAP regions from kmemleak"), the issue it is fixing will be fixed differently. Reported-by: Qian Cai <[email protected]> Signed-off-by: Mike Rapoport <[email protected]> Acked-by: Catalin Marinas <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-18  mm/thp: decrease nr_thps in file's mapping on THP split  (Marek Szyprowski, 1 file, -2/+4)
Decrease nr_thps counter in file's mapping to ensure that the page cache won't be dropped excessively on file write access if page has been already split. I've tried a test scenario running a big binary, kernel remaps it with THPs, then force a THP split with /sys/kernel/debug/split_huge_pages. During any further open of that binary with O_RDWR or O_WRITEONLY kernel drops page cache for it, because of non-zero thps counter. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marek Szyprowski <[email protected]> Fixes: 09d91cda0e82 ("mm,thp: avoid writes to file with THP in pagecache") Fixes: 06d3eff62d9d ("mm/thp: fix node page state in split_huge_page_to_list()") Acked-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Yang Shi <[email protected]> Cc: <[email protected]> Cc: Song Liu <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: William Kucharski <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-18  mm, slub: fix incorrect memcg slab count for bulk free  (Miaohe Lin, 1 file, -1/+3)
kmem_cache_free_bulk() will call memcg_slab_free_hook() for all objects when doing bulk free. So we shouldn't call memcg_slab_free_hook() again for bulk free to avoid incorrect memcg slab count. Link: https://lkml.kernel.org/r/[email protected] Fixes: d1b2cf6cb84a ("mm: memcg/slab: uncharge during kmem_cache_free_bulk()") Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Bharata B Rao <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Faiyaz Mohammed <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Kees Cook <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-18  mm, slub: fix potential use-after-free in slab_debugfs_fops  (Miaohe Lin, 1 file, -2/+4)
When sysfs_slab_add failed, we shouldn't call debugfs_slab_add() for s because s will be freed soon. And slab_debugfs_fops will use s later leading to a use-after-free. Link: https://lkml.kernel.org/r/[email protected] Fixes: 64dd68497be7 ("mm: slub: move sysfs slab alloc/free interfaces to debugfs") Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Bharata B Rao <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Faiyaz Mohammed <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Kees Cook <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-18  mm, slub: fix potential memoryleak in kmem_cache_open()  (Miaohe Lin, 1 file, -1/+1)
In error path, the random_seq of slub cache might be leaked. Fix this by using __kmem_cache_release() to release all the relevant resources. Link: https://lkml.kernel.org/r/[email protected] Fixes: 210e7a43fa90 ("mm: SLUB freelist randomization") Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Bharata B Rao <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Faiyaz Mohammed <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Kees Cook <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-18  mm, slub: fix mismatch between reconstructed freelist depth and cnt  (Miaohe Lin, 1 file, -2/+9)
If an object's reuse is delayed, it will be excluded from the reconstructed freelist, but we forgot to adjust cnt accordingly, so there will be a mismatch between the reconstructed freelist depth and cnt. This will lead to free_debug_processing() complaining about the freelist count, or to an incorrect slub inuse count. Link: https://lkml.kernel.org/r/[email protected] Fixes: c3895391df38 ("kasan, slub: fix handling of kasan_slab_free hook") Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Bharata B Rao <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Faiyaz Mohammed <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Kees Cook <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-18  mm, slub: fix two bugs in slab_debug_trace_open()  (Miaohe Lin, 1 file, -1/+7)
Patch series "Fixups for slub". This series contains various bug fixes for slub. We fix memory leaks, a use-after-free, NULL pointer dereferencing and so on in slub. More details can be found in the respective changelogs. This patch (of 5): It's possible that __seq_open_private() will return NULL, so we should check the result before using it to avoid dereferencing a NULL pointer. And in the error paths, we forgot to release the private buffer via seq_release_private(), so memory will leak in these paths. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 64dd68497be7 ("mm: slub: move sysfs slab alloc/free interfaces to debugfs") Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Faiyaz Mohammed <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Kees Cook <[email protected]> Cc: Bharata B Rao <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
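A sketch of both fixes in slab_debug_trace_open() (function and symbol names here follow the changelog and mm/slub.c conventions but are assumptions, not the exact hunk):

    struct loc_track *t = __seq_open_private(filep, &slab_debugfs_sops,
                                             sizeof(struct loc_track));
    if (!t)
        return -ENOMEM;            /* don't dereference a NULL private buffer */

    if (!alloc_loc_track(t, PAGE_SIZE / sizeof(struct location), GFP_KERNEL)) {
        seq_release_private(inode, filep);    /* error path: free the seq_file buffer */
        return -ENOMEM;
    }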
2021-10-18  mm/mempolicy: do not allow illegal MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind()  (Eric Dumazet, 1 file, -11/+5)
syzbot reported access to uninitialized memory in mbind() [1]. The issue came with commit bda420b98505 ("numa balancing: migrate on fault among multiple bound nodes"), which added a new bit in MPOL_MODE_FLAGS but only checked the valid combination (MPOL_F_NUMA_BALANCING can only be used with MPOL_BIND) in do_set_mempolicy(). This patch moves the check into sanitize_mpol_flags() so that it is also used by mbind().
[1] BUG: KMSAN: uninit-value in __mpol_equal+0x567/0x590 mm/mempolicy.c:2260 __mpol_equal+0x567/0x590 mm/mempolicy.c:2260 mpol_equal include/linux/mempolicy.h:105 [inline] vma_merge+0x4a1/0x1e60 mm/mmap.c:1190 mbind_range+0xcc8/0x1e80 mm/mempolicy.c:811 do_mbind+0xf42/0x15f0 mm/mempolicy.c:1333 kernel_mbind mm/mempolicy.c:1483 [inline] __do_sys_mbind mm/mempolicy.c:1490 [inline] __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486 __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae
Uninit was created at: slab_alloc_node mm/slub.c:3221 [inline] slab_alloc mm/slub.c:3230 [inline] kmem_cache_alloc+0x751/0xff0 mm/slub.c:3235 mpol_new mm/mempolicy.c:293 [inline] do_mbind+0x912/0x15f0 mm/mempolicy.c:1289 kernel_mbind mm/mempolicy.c:1483 [inline] __do_sys_mbind mm/mempolicy.c:1490 [inline] __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486 __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae
===================================================== Kernel panic - not syncing: panic_on_kmsan set ... CPU: 0 PID: 15049 Comm: syz-executor.0 Tainted: G B 5.15.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1ff/0x28e lib/dump_stack.c:106 dump_stack+0x25/0x28 lib/dump_stack.c:113 panic+0x44f/0xdeb kernel/panic.c:232 kmsan_report+0x2ee/0x300 mm/kmsan/report.c:186 __msan_warning+0xd7/0x150 mm/kmsan/instrumentation.c:208 __mpol_equal+0x567/0x590 mm/mempolicy.c:2260 mpol_equal include/linux/mempolicy.h:105 [inline] vma_merge+0x4a1/0x1e60 mm/mmap.c:1190 mbind_range+0xcc8/0x1e80 mm/mempolicy.c:811 do_mbind+0xf42/0x15f0 mm/mempolicy.c:1333 kernel_mbind mm/mempolicy.c:1483 [inline] __do_sys_mbind mm/mempolicy.c:1490 [inline] __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486 __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae
Link: https://lkml.kernel.org/r/[email protected] Fixes: bda420b98505 ("numa balancing: migrate on fault among multiple bound nodes") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: syzbot <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
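A sketch of the kind of check that belongs in sanitize_mpol_flags() (illustrative; parameter names assumed), enforcing the MPOL_BIND-only constraint for every caller, including mbind():

    /* MPOL_F_NUMA_BALANCING is only valid together with MPOL_BIND. */
    if (*flags & MPOL_F_NUMA_BALANCING) {
        if (*mode != MPOL_BIND)
            return -EINVAL;
    }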
2021-10-18memblock: check memory total_sizePeng Fan1-1/+1
mem=[X][G|M] is broken on the ARM64 platform: there are cases where type.cnt is 1 but total_size is not 0, because adjacent regions get merged into one. So checking only 'cnt' is not enough; total_size should be used as well, otherwise the bootarg 'mem=[X][G|M]' no longer works. Link: https://lkml.kernel.org/r/[email protected] Fixes: e888fa7bb882 ("memblock: Check memory add/cap ordering") Signed-off-by: Peng Fan <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
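As a rough illustration of the point (not the actual one-line diff in mm/memblock.c), the reliable "has any memory been registered?" test is on total_size, not on the region count:

    /*
     * Hypothetical helper for illustration only: adjacent regions are merged,
     * so memblock.memory.cnt can stay at 1 while memory keeps being added;
     * total_size is the field that actually tracks how much was registered.
     */
    static bool memblock_has_memory(void)
    {
            return memblock.memory.total_size != 0;
    }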
2021-10-18mm/migrate: fix CPUHP state to update node demotion orderHuang Ying1-3/+5
The node demotion order needs to be updated during CPU hotplug, because whether a NUMA node has CPUs may influence the demotion order. The update function should be called during CPU online/offline after node_states[N_CPU] has been updated. That is done in CPUHP_AP_ONLINE_DYN during CPU online and in CPUHP_MM_VMSTAT_DEAD during CPU offline. But in commit 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events"), the function to update the node demotion order is called in CPUHP_AP_ONLINE_DYN during both CPU online and offline. This doesn't satisfy the ordering requirement. For example, with 4 CPUs (P0, P1, P2, P3) in 2 sockets (P0, P1 in S0 and P2, P3 in S1), the demotion order is - S0 -> NUMA_NO_NODE - S1 -> NUMA_NO_NODE. After P2 and P3 are offlined, because S1 has no CPU now, the demotion order should have been changed to - S0 -> S1 - S1 -> NUMA_NO_NODE, but it isn't changed, because the order-updating callback for CPU hotplug doesn't see the new nodemask. Only after P1 is also offlined is the demotion order changed to the expected order above. So in this patch, we add CPUHP_AP_MM_DEMOTION_ONLINE and CPUHP_MM_DEMOTION_DEAD, invoked after CPUHP_AP_ONLINE_DYN and CPUHP_MM_VMSTAT_DEAD during CPU online and offline respectively, and register the update function on them. Link: https://lkml.kernel.org/r/[email protected] Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events") Signed-off-by: "Huang, Ying" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zi Yan <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Wei Xu <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Keith Busch <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
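Registration of the two new states could look roughly like this. The CPUHP_* constants are the ones named in the changelog; the init and callback names (migrate_on_reclaim_init, migration_online_cpu, migration_offline_cpu) and the state strings are assumptions made for the sketch:

    #include <linux/cpuhotplug.h>

    static int __init migrate_on_reclaim_init(void)
    {
            int ret;

            /*
             * CPUHP_MM_DEMOTION_DEAD runs after CPUHP_MM_VMSTAT_DEAD on offline,
             * so node_states[N_CPU] is already up to date when the demotion
             * order is recomputed.
             */
            ret = cpuhp_setup_state_nocalls(CPUHP_MM_DEMOTION_DEAD, "mm/demotion:offline",
                                            NULL, migration_offline_cpu);
            WARN_ON(ret < 0);

            /* CPUHP_AP_MM_DEMOTION_ONLINE runs after CPUHP_AP_ONLINE_DYN on online. */
            ret = cpuhp_setup_state(CPUHP_AP_MM_DEMOTION_ONLINE, "mm/demotion:online",
                                    migration_online_cpu, NULL);
            WARN_ON(ret < 0);
            return 0;
    }
    late_initcall(migrate_on_reclaim_init);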
2021-10-18mm/migrate: add CPU hotplug to demotion #ifdefDave Hansen3-26/+24
Once upon a time, the node demotion updates were driven solely by memory hotplug events. But now, there are handlers for both CPU and memory hotplug. However, the #ifdef around the code checks only memory hotplug. A system that has HOTPLUG_CPU=y but MEMORY_HOTPLUG=n would miss CPU hotplug events. Update the #ifdef around the common code. Add memory and CPU-specific #ifdefs for their handlers. These memory/CPU #ifdefs avoid unused function warnings when their Kconfig option is off. [[email protected]: rework hotplug_memory_notifier() stub] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events") Signed-off-by: Dave Hansen <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Wei Xu <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
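Schematically, the resulting guard structure looks like the following; the function names inside the comments are placeholders for the handlers, not quotations of the patch:

    #if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_HOTPLUG_CPU)
    /* common demotion-order code: node_demotion[], order regeneration, ... */

    #ifdef CONFIG_MEMORY_HOTPLUG
    /* memory hotplug notifier and its registration */
    #endif

    #ifdef CONFIG_HOTPLUG_CPU
    /* CPU hotplug online/offline callbacks and their registration */
    #endif

    #endif /* CONFIG_MEMORY_HOTPLUG || CONFIG_HOTPLUG_CPU */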
2021-10-18mm/migrate: optimize hotplug-time demotion order updatesDave Hansen1-1/+11
Patch series "mm/migrate: 5.15 fixes for automatic demotion", v2. This contains two fixes for the "automatic demotion" code which was merged into 5.15: * Fix memory hotplug performance regression by watching suppressing any real action on irrelevant hotplug events. * Ensure CPU hotplug handler is registered when memory hotplug is disabled. This patch (of 2): == tl;dr == Automatic demotion opted for a simple, lazy approach to handling hotplug events. This noticeably slows down memory hotplug[1]. Optimize away updates to the demotion order when memory hotplug events should have no effect. This has no effect on CPU hotplug. There is no known problem on the CPU side and any work there will be in a separate series. == Background == Automatic demotion is a memory migration strategy to ensure that new allocations have room in faster memory tiers on tiered memory systems. The kernel maintains an array (node_demotion[]) to drive these migrations. The node_demotion[] path is calculated by starting at nodes with CPUs and then "walking" to nodes with memory. Only hotplug events which online or offline a node with memory (N_ONLINE) or CPUs (N_CPU) will actually affect the migration order. == Problem == However, the current code is lazy. It completely regenerates the migration order on *any* CPU or memory hotplug event. The logic was that these events are extremely rare and that the overhead from indiscriminate order regeneration is minimal. Part of the update logic involves a synchronize_rcu(), which is a pretty big hammer. Its overhead was large enough to be detected by some 0day tests that watch memory hotplug performance[1]. == Solution == Add a new helper (node_demotion_topo_changed()) which can differentiate between superfluous and impactful hotplug events. Skip the expensive update operation for superfluous events. == Aside: Locking == It took me a few moments to declare the locking to be safe enough for node_demotion_topo_changed() to work. It all hinges on the memory hotplug lock: During memory hotplug events, 'mem_hotplug_lock' is held for write. This ensures that two memory hotplug events can not be called simultaneously. CPU hotplug has a similar lock (cpuhp_state_mutex) which also provides mutual exclusion between CPU hotplug events. In addition, the demotion code acquire and hold the mem_hotplug_lock for read during its CPU hotplug handlers. This provides mutual exclusion between the demotion memory hotplug callbacks and the CPU hotplug callbacks. This effectively allows treating the migration target generation code to act as if it is single-threaded. 1. https://lore.kernel.org/all/20210905135932.GE15026@xsang-OptiPlex-9020/ Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events") Signed-off-by: Dave Hansen <[email protected]> Reported-by: kernel test robot <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Wei Xu <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dan Williams <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-13memblock: exclude NOMAP regions from kmemleakMike Rapoport1-1/+6
Vladimir Zapolskiy reports: commit a7259df76702 ("memblock: make memblock_find_in_range method private") invokes a kernel panic while running kmemleak on OF platforms with NOMAP regions: Unable to handle kernel paging request at virtual address fff000021e00000 [...] scan_block+0x64/0x170 scan_gray_list+0xe8/0x17c kmemleak_scan+0x270/0x514 kmemleak_write+0x34c/0x4ac Indeed, NOMAP regions don't have linear map entries so an attempt to scan these areas would fault. Prevent such faults by excluding NOMAP regions from kmemleak. Link: https://lore.kernel.org/all/[email protected] Fixes: a7259df76702 ("memblock: make memblock_find_in_range method private") Signed-off-by: Mike Rapoport <[email protected]> Tested-by: Vladimir Zapolskiy <[email protected]>
2021-09-24mm: fix uninitialized use in overcommit_policy_handlerChen Jun1-2/+2
We get an unexpected value of /proc/sys/vm/overcommit_memory after running the following program: int main() { int fd = open("/proc/sys/vm/overcommit_memory", O_RDWR); write(fd, "1", 1); write(fd, "2", 1); close(fd); } write(fd, "2", 1) will pass *ppos = 1 to proc_dointvec_minmax. proc_dointvec_minmax will return 0 without setting new_policy. t.data = &new_policy; ret = proc_dointvec_minmax(&t, write, buffer, lenp, ppos) -->do_proc_dointvec -->__do_proc_dointvec if (write) { if (proc_first_pos_non_zero_ignore(ppos, table)) goto out; sysctl_overcommit_memory = new_policy; so sysctl_overcommit_memory will be set to an uninitialized value. Check whether new_policy has been changed by proc_dointvec_minmax. Link: https://lkml.kernel.org/r/[email protected] Fixes: 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy") Signed-off-by: Chen Jun <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Feng Tang <[email protected]> Reviewed-by: Kefeng Wang <[email protected]> Cc: Rui Xiang <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
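The fix boils down to only committing a value that proc_dointvec_minmax() actually parsed. A hedged sketch of overcommit_policy_handler(); the sentinel-value approach and the elided batch recomputation are assumptions about the implementation, not a quote of the final patch:

    static int overcommit_policy_handler(struct ctl_table *table, int write,
                                         void *buffer, size_t *lenp, loff_t *ppos)
    {
            struct ctl_table t;
            int new_policy = -1;    /* sentinel: stays -1 if nothing was parsed */
            int ret;

            if (write) {
                    t = *table;
                    t.data = &new_policy;
                    ret = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
                    /*
                     * With a non-zero *ppos, proc_dointvec_minmax() returns 0
                     * without touching new_policy; treat that like an error and
                     * leave sysctl_overcommit_memory alone.
                     */
                    if (ret || new_policy == -1)
                            return ret;
                    /* recomputation of vm_committed_as_batch elided from the sketch */
                    sysctl_overcommit_memory = new_policy;
                    return 0;
            }
            return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
    }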
2021-09-24mm/memory_failure: fix the missing pte_unmap() callQi Zheng1-5/+5
The paired pte_unmap() call is missing before dev_pagemap_mapping_shift() returns. So fix it. David says: "I guess this code never runs on 32bit / highmem, that's why we didn't notice so far". [[email protected]: cleanup] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qi Zheng <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
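The pairing rule being restored, shown on a hypothetical helper rather than the actual dev_pagemap_mapping_shift() body:

    /*
     * Hypothetical helper for illustration: pte_offset_map() takes a temporary
     * mapping of the PTE page (a kmap on 32-bit/highmem configs), so every
     * return path must drop it with pte_unmap().
     */
    static unsigned int pte_mapping_shift(pmd_t *pmd, unsigned long address)
    {
            unsigned int shift = 0;
            pte_t *ptep = pte_offset_map(pmd, address);

            if (pte_present(*ptep))
                    shift = PAGE_SHIFT;

            pte_unmap(ptep);        /* the call that was missing before the return */
            return shift;
    }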
2021-09-24mm/debug: sync up latest migrate_reason to migrate_reason_namesWeizhao Ouyang1-0/+1
Sync up MR_DEMOTION to migrate_reason_names and add a comment prompting the two to be kept in sync. Link: https://lkml.kernel.org/r/[email protected] Fixes: 26aa2d199d6f ("mm/migrate: demote pages during reclaim") Signed-off-by: Weizhao Ouyang <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zi Yan <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Mina Almasry <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Wei Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-24mm/debug: sync up MR_CONTIG_RANGE and MR_LONGTERM_PINWeizhao Ouyang1-1/+2
Sync up MR_CONTIG_RANGE and MR_LONGTERM_PIN to migrate_reason_names. Link: https://lkml.kernel.org/r/[email protected] Fixes: 310253514bbf ("mm/migrate: rename migration reason MR_CMA to MR_CONTIG_RANGE") Fixes: d1e153fea2a8 ("mm/gup: migrate pinned pages out of movable zone") Signed-off-by: Weizhao Ouyang <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zi Yan <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Mina Almasry <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Wei Xu <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-24mm: fs: invalidate bh_lrus for only cold pathMinchan Kim1-3/+16
The kernel test robot reported a regression of fio.write_iops[1] with commit 8cc621d2f45d ("mm: fs: invalidate BH LRU during page migration"). Since lru_add_drain() is called frequently, invalidating bh_lrus there could increase the bh_lrus cache miss ratio, which means more IO in the end. This patch moves the bh_lrus invalidation from the hot path (e.g., zap_page_range, pagevec_release) to the cold path (i.e., lru_add_drain_all, lru_cache_disable). Zhengjun Xing confirmed "I test the patch, the regression reduced to -2.9%" [1] https://lore.kernel.org/lkml/20210520083144.GD14190@xsang-OptiPlex-9020/ [2] 8cc621d2f45d, mm: fs: invalidate BH LRU during page migration Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Reported-by: kernel test robot <[email protected]> Reviewed-by: Chris Goldsworthy <[email protected]> Tested-by: "Xing, Zhengjun" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-24mm/shmem.c: fix judgment error in shmem_is_huge()Liu Yuntao1-2/+2
In the case of SHMEM_HUGE_WITHIN_SIZE, the page index is not rounded up correctly. When the page index points to the first page in a huge page, round_up() does not bring it to the end of that huge page, but to the end of the previous one. An example: HPAGE_PMD_NR on my machine is 512 (2 MB huge page size). After allocating a 3000 KB buffer, I access it at location 2050 KB. In shmem_is_huge(), the corresponding index happens to be 512. After being rounded up by HPAGE_PMD_NR, it will still be 512, which is smaller than i_size, so shmem_is_huge() will return true. As a result, my buffer takes an additional huge page, and that shouldn't happen when shmem_enabled is set to within_size. Link: https://lkml.kernel.org/r/[email protected] Fixes: f3f0e1d2150b2b ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Liu Yuntao <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: wuxu.wu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
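The arithmetic is easy to verify in isolation. The small program below reproduces the example above; the "new" line reflects my reading of the fix (rounding up index + 1 rather than index), which is an assumption rather than a quote of the patch:

    #include <stdio.h>

    /* Behaves like the kernel's round_up() for these unsigned long arguments. */
    #define round_up(x, y)  ((((x) - 1) | ((y) - 1)) + 1)

    int main(void)
    {
            unsigned long hpage_pmd_nr = 512;   /* 2 MB huge pages, 4 KB base pages */
            unsigned long i_size_pages = 750;   /* 3000 KB buffer => i_size >> PAGE_SHIFT == 750 */
            unsigned long index = 512;          /* access at 2050 KB => index 512 */

            /* Old check: 512 rounds up to 512 <= 750, so within_size wrongly says "huge". */
            printf("old: %lu (huge? %d)\n", round_up(index, hpage_pmd_nr),
                   round_up(index, hpage_pmd_nr) <= i_size_pages);

            /* Rounding up index + 1 gives 1024 > 750, so the huge page is refused. */
            printf("new: %lu (huge? %d)\n", round_up(index + 1, hpage_pmd_nr),
                   round_up(index + 1, hpage_pmd_nr) <= i_size_pages);
            return 0;
    }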
2021-09-24mm/damon: don't use strnlen() with known-bogus source lengthAdam Borowski1-8/+8
gcc knows the true length too, and rightfully complains. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Adam Borowski <[email protected]> Cc: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-24mm, hwpoison: add is_free_buddy_page() in HWPoisonHandlable()Naoya Horiguchi1-1/+1
Commit fcc00621d88b ("mm/hwpoison: retry with shake_page() for unhandlable pages") changed the return value of __get_hwpoison_page() to retry for transiently unhandlable cases. However, __get_hwpoison_page() currently fails to properly judge buddy pages as handlable, so hard/soft offline for buddy pages always fails as "unhandlable page". This is totally regrettable. So let's add is_free_buddy_page() in HWPoisonHandlable(), so that __get_hwpoison_page() returns different values for buddy pages and unhandlable pages, as intended. Link: https://lkml.kernel.org/r/[email protected] Fixes: fcc00621d88b ("mm/hwpoison: retry with shake_page() for unhandlable pages") Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Yang Shi <[email protected]> Cc: Tony Luck <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Michal Hocko <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
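The resulting predicate is essentially a one-liner. A sketch consistent with the changelog; the other two conditions are what I'd expect the existing HWPoisonHandlable() to contain, so treat them as assumptions:

    /*
     * Sketch: free buddy pages are considered handlable again, so hard/soft
     * offline no longer bails out on them with "unhandlable page".
     */
    static bool HWPoisonHandlable(struct page *page)
    {
            return PageLRU(page) || __PageMovable(page) || is_free_buddy_page(page);
    }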
2021-09-23memcg: flush lruvec stats in the refaultShakeel Butt2-10/+1
Prior to commit 7e1c0d6f5820 ("memcg: switch lruvec stats to rstat") and commit aa48e47e3906 ("memcg: infrastructure to flush memcg stats"), each lruvec memcg stat could be off by (nr_cgroups * nr_cpus * 32) at worst, and for an unbounded amount of time. Commit aa48e47e3906 moved the lruvec stats to the rstat infrastructure and commit 7e1c0d6f5820 bounded the error for all the lruvec stats to (nr_cpus * 32) at worst for at most 2 seconds. More specifically, it decoupled the number of stats and the number of cgroups from the error rate. However, this reduction in error comes at the cost of triggering the slowpath of the stats update more frequently. Previously, in the slowpath the kernel added the stats up the memcg tree. After aa48e47e3906, the kernel triggers an async lruvec stats flush through queue_work(). This caused regression reports from the 0day kernel bot [1] as well as from the Phoronix test suite [2]. We tried two options to fix the regression: 1) Increase the threshold to trigger the slowpath in the lruvec stats update codepath from 32 to 512. 2) Remove the slowpath from the lruvec stats update codepath and instead flush the stats in the page refault codepath. The assumption is that the kernel flushes the stats in a timely manner, so the update tree would be small in the refault codepath and not cause a performance impact. Following are the results of the will-it-scale/page_fault[1|2|3] benchmark on four settings, i.e. (1) 5.15-rc1 as baseline (2) 5.15-rc1 with aa48e47e3906 and 7e1c0d6f5820 reverted (3) 5.15-rc1 with option-1 (4) 5.15-rc1 with option-2. test (1) (2) (3) (4) pg_f1 368563 406277 (10.23%) 399693 (8.44%) 416398 (12.97%) pg_f2 338399 372133 (9.96%) 369180 (9.09%) 381024 (12.59%) pg_f3 500853 575399 (14.88%) 570388 (13.88%) 576083 (15.02%) From the above results, option-2 not only solves the regression but also improves performance for at least these benchmarks. Feng Tang (Intel) ran the aim7 benchmark with these two options and confirms that option-1 reduces the regression but option-2 removes it. Michael Larabel (Phoronix) ran multiple benchmarks with these options and reported the results at [3]; they show that for most benchmarks option-2 removes the regression introduced by commit aa48e47e3906 ("memcg: infrastructure to flush memcg stats"). Based on the experiment results, this patch proposes option-2 as the solution to resolve the regression. Link: https://lore.kernel.org/all/20210726022421.GB21872@xsang-OptiPlex-9020 [1] Link: https://www.phoronix.com/scan.php?page=article&item=linux515-compile-regress [2] Link: https://openbenchmarking.org/result/2109226-DEBU-LINUX5104 [3] Fixes: aa48e47e3906 ("memcg: infrastructure to flush memcg stats") Signed-off-by: Shakeel Butt <[email protected]> Tested-by: Michael Larabel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Feng Tang <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michal Koutný <[email protected]> Cc: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
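Option-2 amounts to flushing the pending rstat deltas at the point where they are consumed. A hedged sketch of the shape of the change: the helper body is paraphrased, and calling it from workingset_refault() is my reading of the description, not a quote of the diff:

    /* Synchronous flush of the memcg/lruvec rstat deltas. */
    void mem_cgroup_flush_stats(void)
    {
            if (!mem_cgroup_disabled())
                    cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
    }

    /*
     * In the refault path, flush before the counters are read, so the update
     * tree is drained where staleness actually matters:
     *
     *      void workingset_refault(struct page *page, void *shadow)
     *      {
     *              ...
     *              mem_cgroup_flush_stats();
     *              rcu_read_lock();
     *              ...
     *      }
     */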