path: root/mm
Age | Commit message | Author | Files | Lines
2023-02-09mmap: convert vma_expand() to use vma iteratorLiam R. Howlett1-5/+4
Use the vma iterator instead of the maple state for type safety and for consistency throughout the mm code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
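For orientation, a minimal sketch of what iteration looks like with the vma iterator API (VMA_ITERATOR() and for_each_vma() are the kernel's own macros; the exact call sites in this patch differ):

        struct vm_area_struct *vma;
        VMA_ITERATOR(vmi, mm, 0);   /* declares and initializes 'vmi' at address 0 */

        for_each_vma(vmi, vma) {
                /* VMAs visited in ascending address order, with type checking */
        }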
2023-02-09mmap: change do_mas_munmap() and do_mas_align_munmap() to use vma iteratorLiam R. Howlett2-45/+38
Start passing the vma iterator through the mm code. This will allow for reuse of the state and cleaner invalidation if necessary. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-09mm/mmap: remove preallocation from do_mas_align_munmap()Liam R. Howlett1-8/+6
In preparation for passing the vma state through split, the pre-allocation that occurs before the split has to be moved to after it. Since the preallocation would then live right next to the store, just call store instead of preallocating. This effectively restores the potential error path of splitting without unmapping, which pre-dates the maple tree. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-09mmap: convert vma_link() to use the vma iteratorLiam R. Howlett1-3/+3
Avoid using the maple tree interface directly. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-09mm/mmap: convert brk to use vma iteratorLiam R. Howlett1-25/+23
Use the vma iterator API for the brk() system call. This will provide type safety at compile time. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-09mm: expand vma iterator interfaceLiam R. Howlett2-0/+82
Add wrappers for the maple tree to the vma iterator. This will provide type safety at compile time. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
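A sketch of the kind of type-safe wrapper this adds on top of the maple tree (close to the real helper in mm.h, but treat it as illustrative):

        struct vma_iterator {
                struct ma_state mas;    /* wraps the maple tree state */
        };

        static inline struct vm_area_struct *
        vma_find(struct vma_iterator *vmi, unsigned long max)
        {
                /* callers get a vm_area_struct pointer instead of a raw void * */
                return mas_find(&vmi->mas, max - 1);
        }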
2023-02-09mm: shrinkers: fix deadlock in shrinker debugfsQi Zheng2-6/+13
debugfs_remove_recursive() is invoked by unregister_shrinker(), which holds the write lock of shrinker_rwsem, and it waits for the handlers of the debugfs files to complete. The handlers also need to take the read lock of shrinker_rwsem to do their work, so the following deadlock can occur:

CPU0                                                          CPU1

debugfs_file_get()
shrinker_debugfs_count_show()/shrinker_debugfs_scan_write()
                                                              unregister_shrinker()
                                                              --> down_write(&shrinker_rwsem);
                                                              debugfs_remove_recursive()
                                                                  // wait for (A)
                                                              --> wait_for_completion();
// wait for (B)
--> down_read_killable(&shrinker_rwsem)
debugfs_file_put() -- (A)
                                                              up_write() -- (B)

The down_read_killable() can be killed, so the deadlock above can be recovered from, but that requires an extra kill action; until then all subsequent shrinker-related operations are blocked, so it's better to fix it. [[email protected]: fix CONFIG_SHRINKER_DEBUG=n stub] Link: https://lkml.kernel.org/r/[email protected] Fixes: 5035ebc644ae ("mm: shrinkers: introduce debugfs interface for memory shrinkers") Signed-off-by: Qi Zheng <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Muchun Song <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
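The shape of the fix, sketched (hedged; the key point is that debugfs_remove_recursive() runs only after shrinker_rwsem is dropped):

        void unregister_shrinker(struct shrinker *shrinker)
        {
                struct dentry *debugfs_entry;

                down_write(&shrinker_rwsem);
                list_del(&shrinker->list);
                debugfs_entry = shrinker_debugfs_remove(shrinker); /* detach, don't free */
                up_write(&shrinker_rwsem);

                /* rwsem released: pending debugfs handlers can now finish */
                debugfs_remove_recursive(debugfs_entry);
        }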
2023-02-09mm: hwpoison: support recovery from ksm_might_need_to_copy()Kefeng Wang3-8/+22
When the kernel copies a page from ksm_might_need_to_copy() but runs into an uncorrectable error, it will crash, since the poisoned page is consumed by the kernel; this is similar to the issue recently fixed by copy-on-write poison recovery. When an error is detected during the page copy, return VM_FAULT_HWPOISON in do_swap_page(), and install a hwpoison entry in unuse_pte() during swapoff, which helps us avoid a system crash. Note, memory failure on a KSM page will be skipped, but memory_failure_queue() is still called to be consistent with the general memory failure process, and KSM page recovery could be supported in the future. [[email protected]: enhance unuse_pte(), fix issue found by lkp] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: update changelog, alter ksm_might_need_to_copy(), restore unlikely() in unuse_pte()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Tony Luck <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
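Illustrative shape of the recovery path in ksm_might_need_to_copy() (a sketch; the actual patch also touches do_swap_page() and unuse_pte()):

        struct page *new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);

        /* machine-check-safe copy instead of a plain copy_user_highpage() */
        if (new_page && copy_mc_user_highpage(new_page, page, addr, vma)) {
                put_page(new_page);
                memory_failure_queue(page_to_pfn(page), 0);
                /* do_swap_page() turns this into VM_FAULT_HWPOISON */
                return ERR_PTR(-EHWPOISON);
        }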
2023-02-09kasan: fix Oops due to missing calls to kasan_arch_is_ready()Christophe Leroy3-1/+21
On powerpc64, a kernel can be built with KASAN whenever it is built with radix MMU support. However, if the CPU doesn't have a radix MMU, KASAN isn't enabled at init and the following Oops is encountered:

[ 0.000000][ T0] KASAN not enabled as it requires radix!
[ 4.484295][ T26] BUG: Unable to handle kernel data access at 0xc00e000000804a04
[ 4.485270][ T26] Faulting instruction address: 0xc00000000062ec6c
[ 4.485748][ T26] Oops: Kernel access of bad area, sig: 11 [#1]
[ 4.485920][ T26] BE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[ 4.486259][ T26] Modules linked in:
[ 4.486637][ T26] CPU: 0 PID: 26 Comm: kworker/u2:2 Not tainted 6.2.0-rc3-02590-gf8a023b0a805 #249
[ 4.486907][ T26] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1200 0xf000005 of:SLOF,HEAD pSeries
[ 4.487445][ T26] Workqueue: eval_map_wq .tracer_init_tracefs_work_func
[ 4.488744][ T26] NIP: c00000000062ec6c LR: c00000000062bb84 CTR: c0000000002ebcd0
[ 4.488867][ T26] REGS: c0000000049175c0 TRAP: 0380 Not tainted (6.2.0-rc3-02590-gf8a023b0a805)
[ 4.489028][ T26] MSR: 8000000002009032 <SF,VEC,EE,ME,IR,DR,RI> CR: 44002808 XER: 00000000
[ 4.489584][ T26] CFAR: c00000000062bb80 IRQMASK: 0
[ 4.489584][ T26] GPR00: c0000000005624d4 c000000004917860 c000000001cfc000 1800000000804a04
[ 4.489584][ T26] GPR04: c0000000003a2650 0000000000000cc0 c00000000000d3d8 c00000000000d3d8
[ 4.489584][ T26] GPR08: c0000000049175b0 a80e000000000000 0000000000000000 0000000017d78400
[ 4.489584][ T26] GPR12: 0000000044002204 c000000003790000 c00000000435003c c0000000043f1c40
[ 4.489584][ T26] GPR16: c0000000043f1c68 c0000000043501a0 c000000002106138 c0000000043f1c08
[ 4.489584][ T26] GPR20: c0000000043f1c10 c0000000043f1c20 c000000004146c40 c000000002fdb7f8
[ 4.489584][ T26] GPR24: c000000002fdb834 c000000003685e00 c000000004025030 c000000003522e90
[ 4.489584][ T26] GPR28: 0000000000000cc0 c0000000003a2650 c000000004025020 c000000004025020
[ 4.491201][ T26] NIP [c00000000062ec6c] .kasan_byte_accessible+0xc/0x20
[ 4.491430][ T26] LR [c00000000062bb84] .__kasan_check_byte+0x24/0x90
[ 4.491767][ T26] Call Trace:
[ 4.491941][ T26] [c000000004917860] [c00000000062ae70] .__kasan_kmalloc+0xc0/0x110 (unreliable)
[ 4.492270][ T26] [c0000000049178f0] [c0000000005624d4] .krealloc+0x54/0x1c0
[ 4.492453][ T26] [c000000004917990] [c0000000003a2650] .create_trace_option_files+0x280/0x530
[ 4.492613][ T26] [c000000004917a90] [c000000002050d90] .tracer_init_tracefs_work_func+0x274/0x2c0
[ 4.492771][ T26] [c000000004917b40] [c0000000001f9948] .process_one_work+0x578/0x9f0
[ 4.492927][ T26] [c000000004917c30] [c0000000001f9ebc] .worker_thread+0xfc/0x950
[ 4.493084][ T26] [c000000004917d60] [c00000000020be84] .kthread+0x1a4/0x1b0
[ 4.493232][ T26] [c000000004917e10] [c00000000000d3d8] .ret_from_kernel_thread+0x58/0x60
[ 4.495642][ T26] Code: 60000000 7cc802a6 38a00000 4bfffc78 60000000 7cc802a6 38a00001 4bfffc68 60000000 3d20a80e 7863e8c2 792907c6 <7c6348ae> 20630007 78630fe0 68630001
[ 4.496704][ T26] ---[ end trace 0000000000000000 ]---

The Oops is due to kasan_byte_accessible() not checking whether KASAN is ready. Add the missing call to kasan_arch_is_ready() and bail out when KASAN is not ready. The same problem is observed with ____kasan_kfree_large(), so fix it the same way.
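Roughly what the fixed helper looks like with the readiness check added:

        bool kasan_byte_accessible(const void *addr)
        {
                s8 shadow_byte;

                /* no shadow memory exists if the arch left KASAN disabled at init */
                if (!kasan_arch_is_ready())
                        return true;

                shadow_byte = READ_ONCE(*(s8 *)kasan_mem_to_shadow(addr));
                return shadow_byte >= 0 && shadow_byte < KASAN_GRANULE_SIZE;
        }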
Also, as KASAN is not available and no shadow area is allocated for the linear memory mapping, there is no point in allocating shadow memory for vmalloc space, as shown below in /sys/kernel/debug/kernel_page_tables:

---[ kasan shadow mem start ]---
0xc00f000000000000-0xc00f00000006ffff  0x00000000040f0000  448K  r w  pte  valid present dirty accessed
0xc00f000000860000-0xc00f00000086ffff  0x000000000ac10000   64K  r w  pte  valid present dirty accessed
0xc00f3ffffffe0000-0xc00f3fffffffffff  0x0000000004d10000  128K  r w  pte  valid present dirty accessed
---[ kasan shadow mem end ]---

So, also verify KASAN readiness before allocating and poisoning shadow memory for vmalloc areas. Link: https://lkml.kernel.org/r/150768c55722311699fdcf8f5379e8256749f47d.1674716617.git.christophe.leroy@csgroup.eu Fixes: 41b7a347bf14 ("powerpc: Book3S 64-bit outline-only KASAN support") Signed-off-by: Christophe Leroy <[email protected]> Reported-by: Nathan Lynch <[email protected]> Suggested-by: Michael Ellerman <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: <[email protected]> [5.19+] Signed-off-by: Andrew Morton <[email protected]>
2023-02-08mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available earlyThomas Gleixner2-12/+15
The memory allocators are available during early boot even in the phase where interrupts are disabled and scheduling is not yet possible. The setup is such that GFP_KERNEL allocations work in this phase without causing might_alloc() splats to be emitted, because the system state is SYSTEM_BOOTING at that point, which prevents the warnings from triggering. Most allocation/free functions use local_irq_save()/restore() or a lock variant of that. But kmem_cache_alloc_bulk() and kmem_cache_free_bulk() use local_[lock]_irq_disable()/enable(), which leads to a lockdep warning when interrupts are enabled during the early boot phase. This went unnoticed so far, as there are no early users of these interfaces. The upcoming conversion of the interrupt descriptor store from radix_tree to maple_tree triggered this warning, as maple_tree uses the bulk interface. Cure this by moving the kmem_cache_alloc/free() bulk variants of SLUB and SLAB to local[_lock]_irq_save()/restore(). There is obviously no reclaim possible and required at this point, so there is no need to expand this coverage further. No functional change. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Hyeonggon Yoo <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]>
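The change pattern, sketched against SLUB's per-cpu slab lock (illustrative, not the full diff):

        /* before: lockdep complains if IRQs are already off during early boot */
        local_lock_irq(&s->cpu_slab->lock);
        /* ... refill objects from the per-cpu slab ... */
        local_unlock_irq(&s->cpu_slab->lock);

        /* after: save/restore preserves the caller's IRQ state */
        local_lock_irqsave(&s->cpu_slab->lock, irqflags);
        /* ... refill objects from the per-cpu slab ... */
        local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);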
2023-02-07Revert "mm: Always release pages to the buddy allocator in memblock_free_late()"Aaron Thompson1-7/+1
This reverts commit 115d9d77bb0f9152c60b6e8646369fa7f6167593. The pages being freed by memblock_free_late() have already been initialized, but if they are in the deferred init range, __free_one_page() might access nearby uninitialized pages when trying to coalesce buddies. This can, for example, trigger this BUG:

  BUG: unable to handle page fault for address: ffffe964c02580c8
  RIP: 0010:__list_del_entry_valid+0x3f/0x70
   <TASK>
   __free_one_page+0x139/0x410
   __free_pages_ok+0x21d/0x450
   memblock_free_late+0x8c/0xb9
   efi_free_boot_services+0x16b/0x25c
   efi_enter_virtual_mode+0x403/0x446
   start_kernel+0x678/0x714
   secondary_startup_64_no_verify+0xd2/0xdb
   </TASK>

A proper fix will be more involved, so revert this change for the time being. Fixes: 115d9d77bb0f ("mm: Always release pages to the buddy allocator in memblock_free_late().") Signed-off-by: Aaron Thompson <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (IBM) <[email protected]>
2023-02-06mm/slub: fix memory leak with using debugfs_lookup()Greg Kroah-Hartman1-1/+1
When calling debugfs_lookup() the result must have dput() called on it, otherwise the memory will leak over time. To make things simpler, just call debugfs_lookup_and_remove() instead which handles all of the logic at once. Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Hyeonggon Yoo <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Hyeonggon Yoo <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]>
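The before/after pattern, sketched (names such as slab_debugfs_root follow mm/slub.c, but treat the snippet as illustrative):

        /* before: debugfs_lookup() takes a reference that is never dput() */
        debugfs_remove(debugfs_lookup(s->name, slab_debugfs_root));

        /* after: lookup, remove and release the reference in one call */
        debugfs_lookup_and_remove(s->name, slab_debugfs_root);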
2023-02-03mm/gup: add folio to list when folio_isolate_lru() succeedsKuan-Ying Lee1-1/+1
When folio_isolate_lru() succeeds it returns 0, and the folio then needs to be added to movable_pages_list. Link: https://lkml.kernel.org/r/[email protected] Fixes: 67e139b02d99 ("mm/gup.c: refactor check_and_migrate_movable_pages()") Signed-off-by: Kuan-Ying Lee <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Cc: Andrew Yang <[email protected]> Cc: Chinwen Chang <[email protected]> Cc: John Hubbard <[email protected]> Cc: Matthias Brugger <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
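The fix, roughly (folio_isolate_lru() returned an int at this point, 0 on success; the list name follows the surrounding gup code):

        /* only a successful isolation (rc == 0) may be queued for migration */
        if (!folio_isolate_lru(folio)) {
                list_add_tail(&folio->lru, movable_page_list);
                node_stat_mod_folio(folio,
                                    NR_ISOLATED_ANON + folio_is_file_lru(folio),
                                    folio_nr_pages(folio));
        }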
2023-02-03Merge tag 'mm-hotfixes-stable-2023-02-02-19-24-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds11-108/+286
Pull misc fixes from Andrew Morton: "25 hotfixes, mainly for MM. 13 are cc:stable"

* tag 'mm-hotfixes-stable-2023-02-02-19-24-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (26 commits)
  mm: memcg: fix NULL pointer in mem_cgroup_track_foreign_dirty_slowpath()
  Kconfig.debug: fix the help description in SCHED_DEBUG
  mm/swapfile: add cond_resched() in get_swap_pages()
  mm: use stack_depot_early_init for kmemleak
  Squashfs: fix handling and sanity checking of xattr_ids count
  sh: define RUNTIME_DISCARD_EXIT
  highmem: round down the address passed to kunmap_flush_on_unmap()
  migrate: hugetlb: check for hugetlb shared PMD in node migration
  mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
  mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups
  Revert "mm: kmemleak: alloc gray object for reserved region with direct map"
  freevxfs: Kconfig: fix spelling
  maple_tree: should get pivots boundary by type
  .mailmap: update e-mail address for Eugen Hristev
  mm, mremap: fix mremap() expanding for vma's with vm_ops->close()
  squashfs: harden sanity check in squashfs_read_xattr_id_table
  ia64: fix build error due to switch case label appearing next to declaration
  mm: multi-gen LRU: fix crash during cgroup migration
  Revert "mm: add nodes= arg to memory.reclaim"
  zsmalloc: fix a race with deferred_handles storing
  ...
2023-02-03swap: use bvec_set_page to initialize bvecsChristoph Hellwig1-6/+2
Use the bvec_set_page helper to initialize bvecs. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
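What the helper replaces, sketched:

        struct bio_vec bv;

        /* before: open-coded field assignments */
        bv.bv_page   = page;
        bv.bv_len    = thp_size(page);
        bv.bv_offset = 0;

        /* after: one call initializes all three fields */
        bvec_set_page(&bv, page, thp_size(page), 0);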
2023-02-03blk-cgroup: store a gendisk to throttle in struct task_structChristoph Hellwig1-1/+1
For the throttle information in struct task_struct, store a gendisk pointer and reference instead of a request_queue pointer and reference. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-02-02fs: convert writepage_t callback to pass a folioMatthew Wilcox (Oracle)1-3/+3
Patch series "Convert writepage_t to use a folio". More folioisation. I split out the mpage work from everything else because it completely dominated the patch, but some implementations I just converted outright. This patch (of 2): We always write back an entire folio, but that's currently passed as the head page. Convert all filesystems that use write_cache_pages() to expect a folio instead of a page. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
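For reference, the callback signature change described above (before/after):

        /* before: called with the head page */
        typedef int (*writepage_t)(struct page *page,
                                   struct writeback_control *wbc, void *data);

        /* after: called with the entire folio */
        typedef int (*writepage_t)(struct folio *folio,
                                   struct writeback_control *wbc, void *data);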
2023-02-02block: remove ->rw_pageChristoph Hellwig2-22/+33
The ->rw_page method is a special-purpose bypass of the usual bio handling path that is limited to single-page reads and writes, and is synchronous, which causes a lot of extra code in the drivers, callers and the block layer. The only remaining user is the MM swap code. Switch that swap code to simply submit a single-vec on-stack bio and synchronously wait on it, based on a newly added QUEUE_FLAG_SYNCHRONOUS flag set by the drivers that currently implement ->rw_page. While this touches one extra cache line and executes extra code, it simplifies the block layer and drivers and ensures that all features are properly supported by all drivers; for example, right now ->rw_page bypassed cgroup writeback entirely. [[email protected]: fix comment typo, per Dan] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Dan Williams <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Keith Busch <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Vishal Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: factor out a swap_writepage_bdev helperChristoph Hellwig1-14/+19
Split the block device case from swap_writepage() into a separate helper, following the abstraction for file-based swap. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Keith Busch <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Vishal Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: remove the __swap_writepage return valueChristoph Hellwig2-15/+10
__swap_writepage always returns 0. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Keith Busch <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Vishal Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: use an on-stack bio for synchronous swapinChristoph Hellwig1-31/+38
Optimize the synchronous swapin case by using an on-stack bio instead of allocating one with bio_alloc. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Keith Busch <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Vishal Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
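The general on-stack-bio pattern (illustrative; the real swapin path open-codes the synchronous wait rather than calling submit_bio_wait(), and 'sis' is assumed to be the swap_info_struct for the device):

        struct bio_vec bv;
        struct bio bio;

        /* no bio_alloc(): the bio and its single vec live on the stack */
        bio_init(&bio, sis->bdev, &bv, 1, REQ_OP_READ);
        bio.bi_iter.bi_sector = swap_page_sector(page);
        __bio_add_page(&bio, page, thp_size(page), 0);
        submit_bio_wait(&bio);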
2023-02-02mm: factor out a swap_readpage_bdev helperChristoph Hellwig1-33/+35
Split the block device case from swap_readpage into a separate helper, following the abstraction for file based swap and frontswap. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Dan Williams <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Keith Busch <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Vishal Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: remove the swap_readpage return valueChristoph Hellwig2-16/+8
swap_readpage always returns 0, and no caller checks the return value. [[email protected]: fix void-returning swap_readpage() stub, per Keith] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Dan Williams <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Keith Busch <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Vishal Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: refactor va_remove_mappingsChristoph Hellwig1-19/+8
Move the VM_FLUSH_RESET_PERMS handling to the caller and rename the function to better describe what it is doing. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Vincenzo Frascino <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: split __vunmapChristoph Hellwig1-43/+41
vunmap only needs to find and free the vmap_area and vm_struct, so open code that there and merge the rest of the code into vfree. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Uladzislau Rezki (Sony) <[email protected]> Cc: Vincenzo Frascino <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: move debug checks from __vunmap to remove_vm_areaChristoph Hellwig1-9/+9
All these checks apply to the free_vm_area interface as well, so move them to the common routine. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Vincenzo Frascino <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: use remove_vm_area in __vunmapChristoph Hellwig1-21/+12
Use the common helper to find and remove a vmap_area instead of open coding it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Vincenzo Frascino <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: move __remove_vm_area out of va_remove_mappingsChristoph Hellwig1-6/+4
__remove_vm_area is the only part of va_remove_mappings that requires a vmap_area. Move the call out to the caller and only pass the vm_struct to va_remove_mappings. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Vincenzo Frascino <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: call vfree instead of __vunmap from delayed_vfree_workChristoph Hellwig1-1/+1
This adds an extra, never-taken in_interrupt() branch, but will allow cutting down the maze of vfree helpers. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Vincenzo Frascino <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: move vmalloc_init and free_work down in vmalloc.cChristoph Hellwig1-53/+51
Move these two functions around a bit to avoid forward declarations. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Vincenzo Frascino <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: remove __vfree_deferredChristoph Hellwig1-26/+17
Fold __vfree_deferred into vfree_atomic, and call vfree_atomic early on from vfree if called from interrupt context so that the extra low-level helper can be avoided. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Vincenzo Frascino <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
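The resulting entry point, roughly:

        void vfree(const void *addr)
        {
                /* interrupt context cannot sleep: defer the free to a workqueue */
                if (unlikely(in_interrupt())) {
                        vfree_atomic(addr);
                        return;
                }
                /* ... process context continues with the synchronous free ... */
        }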
2023-02-02mm: remove __vfreeChristoph Hellwig1-12/+6
__vfree is a subset of vfree that just skips a few checks and is only used by vfree and an error cleanup path. Fold __vfree into vfree and switch the only other caller to call vfree() instead. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Vincenzo Frascino <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: reject vmap with VM_FLUSH_RESET_PERMSChristoph Hellwig1-0/+3
Patch series "cleanup vfree and vunmap". This little series untangles the vfree and vunmap code path a bit. This patch (of 10): VM_FLUSH_RESET_PERMS is just for use with vmalloc as it is tied to freeing the underlying pages. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Vincenzo Frascino <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm, compaction: finish pageblocks on complete migration failureMel Gorman1-8/+22
Commit 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock") addressed an issue where a pageblock selected by fast_find_migrateblock() was ignored. Unfortunately, the same fix resulted in numerous reports of khugepaged or kcompactd stalling for long periods of time or consuming 100% of CPU. Tracing showed a lot of rescanning among a small subset of pageblocks because the conditions for marking a block skip were not met: the scan did not reach the end of the pageblock because enough pages were isolated but none were migrated successfully, so eventually the scanner circles back to the same block. Pageblock skip tracking tries to minimise both latency and excessive scanning, but tracking exactly when a block is fully scanned requires an excessive amount of state. This patch forcibly rescans a pageblock when all isolated pages fail to migrate, even though the failures could be for transient reasons such as the pages being dirty or under writeback. This will sometimes migrate too many pages, but pageblocks will be marked skip and forward progress will be made. The usemem workload from the mmtests configuration workload-usemem-stress-numa-compact was used to stress compaction. The compaction trace events were recorded using a 6.2-rc5 kernel that includes commit 7efc3b726103, and the count of unique ranges was measured. The top 5 ranges were:

 3076 range=(0x10ca00-0x10cc00)
 3076 range=(0x110a00-0x110c00)
 3098 range=(0x13b600-0x13b800)
 3104 range=(0x141c00-0x141e00)
11424 range=(0x11b600-0x11b800)

While this workload is very different from the ones in the bug reports, the same pattern of a small subset of blocks being repeatedly scanned is observed. At one point, *only* the range range=(0x11b600 ~ 0x11b800) was scanned for 2 seconds, and 14 seconds passed between the first migration-related event and the last. With the series applied, including this patch, the top 5 ranges were:

 1 range=(0x11607e-0x116200)
 1 range=(0x116200-0x116278)
 1 range=(0x116278-0x116400)
 1 range=(0x116400-0x116424)
 1 range=(0x116424-0x116600)

Only unique ranges were scanned, and the time between the first and last migration-related events was 0.11 milliseconds. Link: https://lkml.kernel.org/r/[email protected] Fixes: 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock") Signed-off-by: Mel Gorman <[email protected]> Cc: Chuyi Zhou <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Maxim Levitsky <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
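A heavily hedged sketch of the new rule (variable names approximate; the exact condition in the real patch is more involved):

        /* after a migration attempt on the current pageblock */
        if (nr_isolated && !nr_migrated) {
                /*
                 * Everything isolated failed to migrate, possibly for
                 * transient reasons (dirty pages, writeback). Finish scanning
                 * the pageblock so it can be marked skip and the scanner
                 * makes forward progress.
                 */
                cc->finish_pageblock = true;
        }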
2023-02-02mm, compaction: finish scanning the current pageblock if requestedMel Gorman1-0/+7
cc->finish_pageblock is set when the current pageblock should be rescanned but fast_find_migrateblock can select an alternative block. Disable fast_find_migrateblock when the current pageblock scan should be completed. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Cc: Chuyi Zhou <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Maxim Levitsky <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm, compaction: check if a page has been captured before draining PCP pagesMel Gorman1-6/+6
If a page has been captured then draining is unnecessary, so check first for a captured page. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Cc: Chuyi Zhou <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Maxim Levitsky <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm, compaction: rename compact_control->rescan to finish_pageblockMel Gorman2-13/+17
Patch series "Fix excessive CPU usage during compaction". Commit 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock") fixed a problem where pageblocks found by fast_find_migrateblock() were ignored. Unfortunately there were numerous bug reports complaining about high CPU usage and massive stalls once 6.1 was released. Due to the severity, the patch was reverted by Vlastimil as a short-term fix[1] to -stable. The underlying problem for each of the bugs is suspected to be the repeated scanning of the same pageblocks. This series should guarantee forward progress even with commit 7efc3b726103. More information is in the changelog for patch 4. [1] http://lore.kernel.org/r/[email protected] This patch (of 4): The rescan field was not well named albeit accurate at the time. Rename the field to finish_pageblock to indicate that the remainder of the pageblock should be scanned regardless of COMPACT_CLUSTER_MAX. The intent is that pageblocks with transient failures get marked for skipping to avoid revisiting the same pageblock. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Cc: Chuyi Zhou <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Maxim Levitsky <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm/gup.c: fix typo in commentsJongwoo Han1-1/+1
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jongwoo Han <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02kasan: reset page tags properly with samplingAndrey Konovalov1-5/+6
The implementation of page_alloc poisoning sampling assumed that tag_clear_highpage resets page tags for __GFP_ZEROTAGS allocations. However, this is no longer the case since commit 70c248aca9e7 ("mm: kasan: Skip unpoisoning of user pages"). This leads to kernel crashes when MTE-enabled userspace mappings are used with Hardware Tag-Based KASAN enabled. Reset page tags for __GFP_ZEROTAGS allocations in post_alloc_hook(). Also clarify and fix related comments. [[email protected]: update comment] Link: https://lkml.kernel.org/r/5dbd866714b4839069e2d8469ac45b60953db290.1674592780.git.andreyknvl@google.com Link: https://lkml.kernel.org/r/24ea20c1b19c2b4b56cf9f5b354915f8dbccfc77.1674592496.git.andreyknvl@google.com Fixes: 44383cef54c0 ("kasan: allow sampling page_alloc allocations for HW_TAGS") Signed-off-by: Andrey Konovalov <[email protected]> Reported-by: Peter Collingbourne <[email protected]> Tested-by: Peter Collingbourne <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Marco Elver <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm/sparse: fix "unused function 'pgdat_to_phys'" warningMike Rapoport1-1/+1
A W=1 build with clang complains:

mm/sparse.c:347:27: warning: unused function 'pgdat_to_phys' [-Wunused-function]
static inline phys_addr_t pgdat_to_phys(struct pglist_data *pgdat)
                          ^
1 warning generated.

pgdat_to_phys() is only used by functions defined when CONFIG_MEMORY_HOTREMOVE=y. Move pgdat_to_phys() under #ifdef CONFIG_MEMORY_HOTREMOVE to make clang happy. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Reported-by: kernel test robot <[email protected]> Link: https://lore.kernel.org/all/[email protected] Cc: Miles Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm/page_owner: record single timestamp value for high order allocationsHyeonggon Yoo1-1/+2
When allocating a high-order page, a separate allocation timestamp is recorded for each sub-page, resulting in different timestamp values between them. This behavior is not consistent with the behavior when recording the free timestamp, and has caused confusion when analyzing memory dumps. Record a single timestamp for the entire allocation, aligning with the behavior for free timestamps. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hyeonggon Yoo <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
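The shape of the fix, sketched (close to __set_page_owner_handle() in mm/page_owner.c, but treat it as illustrative):

        u64 ts_nsec = local_clock();    /* sampled once per allocation */
        int i;

        for (i = 0; i < (1 << order); i++) {
                struct page_owner *page_owner = get_page_owner(page_ext);

                page_owner->ts_nsec = ts_nsec;  /* same value for every sub-page */
                page_ext = page_ext_next(page_ext);
        }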
2023-02-02mm: memory-failure: bump memory failure stats to pglist_dataJiaqi Yan1-0/+36
Right before memory_failure finishes its handling, accumulate the poisoned page's resolution counters into pglist_data's memory_failure_stats, so as to update the corresponding sysfs entries. Tested:
1) Start an application to allocate memory buffer chunks
2) Convert random memory buffer addresses to physical addresses
3) Inject memory errors using EINJ at chosen physical addresses
4) Access the poisoned memory buffer and recover from SIGBUS
5) Check counter values under /sys/devices/system/node/node*/memory_failure/*
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jiaqi Yan <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Tony Luck <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: memory-failure: add memory failure stats to sysfsJiaqi Yan1-0/+35
Patch series "Introduce per NUMA node memory error statistics", v2.

Background
==========
In the RFC for Kernel Support of Memory Error Detection [1], one advantage of software-based scanning over a hardware patrol scrubber is the ability to make statistics visible to system administrators. The statistics fall into two categories:
* Memory error statistics: for example, how many memory errors are encountered and how many of them are recovered by the kernel. Note these memory errors are non-fatal to the kernel: during machine check exception (MCE) handling, the kernel already classified the MCE's severity as not requiring a panic (but either action-required or action-optional).
* Scanner statistics: for example, how many times the scanner has fully scanned a NUMA node, and how many errors were first detected by the scanner.
The memory error statistics are useful to userspace, are not actually specific to scanner-detected memory errors, and are the focus of this patchset.

Motivation
==========
Memory error stats are important to userspace but insufficient in the kernel today. Datacenter administrators can better monitor a machine's memory health with visible stats. For example, while memory errors are inevitable on servers with 10+ TB of memory, starting server maintenance when there are only 1~2 recovered memory errors could be overreacting; in a cloud production environment, maintenance usually means live-migrating all the workloads running on the server, which usually causes nontrivial disruption to the customer. Providing insight into the scope of memory errors on a system helps determine the appropriate follow-up action. In addition, the kernel's existing memory error stats need to be standardized so that userspace can reliably count on their usefulness. Today the kernel provides the following memory error info to userspace, but each source is insufficient or has disadvantages:
* HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total, but no per-NUMA-node stats
* ras:memory_failure_event: only available after being explicitly enabled
* /dev/mcelog: provides much useful info about the MCEs, but doesn't capture how memory_failure recovered memory MCEs
* kernel logs: userspace needs to process the log text
Exposing memory error stats is also a good start for the in-kernel memory error detector. Today the data sources of memory error stats are either direct memory error consumption or hardware patrol scrubber detection (signaled as either UCNA or SRAO). Once an in-kernel memory scanner is implemented, it will be the main source, as it is usually configured to scan memory DIMMs constantly and faster than a hardware patrol scrubber.

How Implemented
===============
As Naoya pointed out [2], exposing memory error statistics to userspace is useful independent of a software or hardware scanner, so the memory error statistics are implemented independently of the in-kernel memory error detector. The following per-NUMA-node memory error counters are exposed:

/sys/devices/system/node/node${X}/memory_failure/total
/sys/devices/system/node/node${X}/memory_failure/recovered
/sys/devices/system/node/node${X}/memory_failure/ignored
/sys/devices/system/node/node${X}/memory_failure/failed
/sys/devices/system/node/node${X}/memory_failure/delayed

These counters describe how many raw pages are poisoned and, after the attempted recoveries by the kernel, their resolutions: how many are recovered, ignored, failed, or delayed, respectively. This approach is easier to extend for future use cases than /proc/meminfo, trace events, and logs.
The following math holds for the statistics:

* total = recovered + ignored + failed + delayed

These memory error stats are reset during machine boot. The 1st commit introduces these sysfs entries. The 2nd commit populates memory error stats every time memory_failure attempts memory error recovery. The 3rd commit adds documentation for the introduced stats.

[1] https://lore.kernel.org/linux-mm/[email protected]/T/#mc22959244f5388891c523882e61163c6e4d703af
[2] https://lore.kernel.org/linux-mm/[email protected]/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6

This patch (of 3): Today the kernel provides the following memory error info to userspace, but each source has its own disadvantage:
* HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total, but no per-NUMA-node stats
* ras:memory_failure_event: only available after being explicitly enabled
* /dev/mcelog: provides much useful info about the MCEs, but doesn't capture how memory_failure recovered memory MCEs
* kernel logs: userspace needs to process the log text

Expose per-NUMA-node memory error stats as sysfs entries:

/sys/devices/system/node/node${X}/memory_failure/total
/sys/devices/system/node/node${X}/memory_failure/recovered
/sys/devices/system/node/node${X}/memory_failure/ignored
/sys/devices/system/node/node${X}/memory_failure/failed
/sys/devices/system/node/node${X}/memory_failure/delayed

These counters describe how many raw pages are poisoned and, after the attempted recoveries by the kernel, their resolutions: how many are recovered, ignored, failed, or delayed, respectively. The following math holds for the statistics:

* total = recovered + ignored + failed + delayed

Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jiaqi Yan <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Tony Luck <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
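The per-node counter layout this describes, sketched (the real struct hangs off struct pglist_data; field comments are paraphrased):

        struct memory_failure_stats {
                unsigned long total;     /* == recovered + ignored + failed + delayed */
                unsigned long recovered; /* poison contained, no further harm */
                unsigned long ignored;   /* handler intentionally took no action */
                unsigned long failed;    /* recovery was attempted but did not succeed */
                unsigned long delayed;   /* recovery deferred, e.g. for a free page */
        };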
2023-02-02mm: multi-gen LRU: simplify lru_gen_look_around()T.J. Alumbaugh1-50/+23
Update the folio generation in place with or without current->reclaim_state->mm_walk. The LRU lock is held for longer if mm_walk is NULL and the number of folios to update is more than PAGEVEC_SIZE. This causes a measurable regression from LRU lock contention in a microbenchmark, but a tiny regression is not worth the complexity. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: T.J. Alumbaugh <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: multi-gen LRU: improve walk_pmd_range()T.J. Alumbaugh1-20/+20
Improve readability of walk_pmd_range() and walk_pmd_range_locked(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: T.J. Alumbaugh <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: multi-gen LRU: improve lru_gen_exit_memcg()T.J. Alumbaugh1-0/+5
Add warnings and poison ->next. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: T.J. Alumbaugh <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: multi-gen LRU: section for memcg LRUT.J. Alumbaugh2-114/+144
Move memcg LRU code into a dedicated section. Improve the design doc to outline its architecture. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: T.J. Alumbaugh <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: multi-gen LRU: section for Bloom filtersT.J. Alumbaugh1-88/+92
Move Bloom filters code into a dedicated section. Improve the design doc to explain Bloom filter usage and connection between aging and eviction in their use. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: T.J. Alumbaugh <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: multi-gen LRU: section for rmap/PT walk feedbackT.J. Alumbaugh1-0/+4
Add a section for lru_gen_look_around() in the code and the design doc. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: T.J. Alumbaugh <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: multi-gen LRU: section for working set protectionT.J. Alumbaugh1-0/+4
Patch series "mm: multi-gen LRU: improve". This patch series improves a few MGLRU functions, collects related functions, and adds additional documentation. This patch (of 7): Add a section for working set protection in the code and the design doc. The admin doc already contains its usage. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: T.J. Alumbaugh <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>