blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2022-02-26	Merge tag 'fixes-2022-02-26' of ↵	Linus Torvalds	1	-2/+8
	git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock fix from Mike Rapoport: "Use kfree() to release kmalloced memblock regions memblock.{reserved,memory}.regions may be allocated using kmalloc() in memblock_double_array(). Use kfree() to release these kmalloced regions" * tag 'fixes-2022-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: memblock: use kfree() to release kmalloced memblock regions
2022-02-26	mm: fix use-after-free bug when mm->mmap is reused after being freed	Suren Baghdasaryan	1	-0/+1
	oom reaping (__oom_reap_task_mm) relies on a 2 way synchronization with exit_mmap. First it relies on the mmap_lock to exclude from unlock path[1], page tables tear down (free_pgtables) and vma destruction. This alone is not sufficient because mm->mmap is never reset. For historical reasons[2] the lock is taken there is also MMF_OOM_SKIP set for oom victims before. The oom reaper only ever looks at oom victims so the whole scheme works properly but process_mrelease can opearate on any task (with fatal signals pending) which doesn't really imply oom victims. That means that the MMF_OOM_SKIP part of the synchronization doesn't work and it can see a task after the whole address space has been demolished and traverse an already released mm->mmap list. This leads to use after free as properly caught up by KASAN report. Fix the issue by reseting mm->mmap so that MMF_OOM_SKIP synchronization is not needed anymore. The MMF_OOM_SKIP is not removed from exit_mmap yet but it acts mostly as an optimization now. [1] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3") [2] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") [[email protected]: changelog rewrite] Link: https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Fixes: 64591e8605d6 ("mm: protect free_pgtables with mmap_lock write lock in exit_mmap") Signed-off-by: Suren Baghdasaryan <[email protected]> Reported-by: [email protected] Suggested-by: Michal Hocko <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Reviewed-by: Yang Shi <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jann Horn <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Florian Weimer <[email protected]> Cc: Jan Engelhardt <[email protected]> Cc: Tim Murray <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-02-26	hugetlbfs: fix a truncation issue in hugepages parameter	Liu Yuntao	1	-2/+2
	When we specify a large number for node in hugepages parameter, it may be parsed to another number due to truncation in this statement: node = tmp; For example, add following parameter in command line: hugepagesz=1G hugepages=4294967297:5 and kernel will allocate 5 hugepages for node 1 instead of ignoring it. I move the validation check earlier to fix this issue, and slightly simplifies the condition here. Link: https://lkml.kernel.org/r/[email protected] Fixes: b5389086ad7be0 ("hugetlbfs: extend the definition of hugepages parameter to support node allocation") Signed-off-by: Liu Yuntao <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-02-26	mm/hugetlb: fix kernel crash with hugetlb mremap	Aneesh Kumar K.V	1	-4/+3
	This fixes the below crash: kernel BUG at include/linux/mm.h:2373! cpu 0x5d: Vector: 700 (Program Check) at [c00000003c6e76e0] pc: c000000000581a54: pmd_to_page+0x54/0x80 lr: c00000000058d184: move_hugetlb_page_tables+0x4e4/0x5b0 sp: c00000003c6e7980 msr: 9000000000029033 current = 0xc00000003bd8d980 paca = 0xc000200fff610100 irqmask: 0x03 irq_happened: 0x01 pid = 9349, comm = hugepage-mremap kernel BUG at include/linux/mm.h:2373! move_hugetlb_page_tables+0x4e4/0x5b0 (link register) move_hugetlb_page_tables+0x22c/0x5b0 (unreliable) move_page_tables+0xdbc/0x1010 move_vma+0x254/0x5f0 sys_mremap+0x7c0/0x900 system_call_exception+0x160/0x2c0 the kernel can't use huge_pte_offset before it set the pte entry because a page table lookup check for huge PTE bit in the page table to differentiate between a huge pte entry and a pointer to pte page. A huge_pte_alloc won't mark the page table entry huge and hence kernel should not use huge_pte_offset after a huge_pte_alloc. Link: https://lkml.kernel.org/r/[email protected] Fixes: 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed vma") Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Mina Almasry <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-02-20	memblock: use kfree() to release kmalloced memblock regions	Miaohe Lin	1	-2/+8
	memblock.{reserved,memory}.regions may be allocated using kmalloc() in memblock_double_array(). Use kfree() to release these kmalloced regions indicated by memblock_{reserved,memory}_in_slab. Signed-off-by: Miaohe Lin <[email protected]> Fixes: 3010f876500f ("mm: discard memblock data later") Signed-off-by: Mike Rapoport <[email protected]>
2022-02-17	mm: don't try to NUMA-migrate COW pages that have other uses	Linus Torvalds	1	-1/+1
	Oded Gabbay reports that enabling NUMA balancing causes corruption with his Gaudi accelerator test load: "All the details are in the bug, but the bottom line is that somehow, this patch causes corruption when the numa balancing feature is enabled AND we don't use process affinity AND we use GUP to pin pages so our accelerator can DMA to/from system memory. Either disabling numa balancing, using process affinity to bind to specific numa-node or reverting this patch causes the bug to disappear" and Oded bisected the issue to commit 09854ba94c6a ("mm: do_wp_page() simplification"). Now, the NUMA balancing shouldn't actually be changing the writability of a page, and as such shouldn't matter for COW. But it appears it does. Suspicious. However, regardless of that, the condition for enabling NUMA faults in change_pte_range() is nonsensical. It uses "page_mapcount(page)" to decide if a COW page should be NUMA-protected or not, and that makes absolutely no sense. The number of mappings a page has is irrelevant: not only does GUP get a reference to a page as in Oded's case, but the other mappings migth be paged out and the only reference to them would be in the page count. Since we should never try to NUMA-balance a page that we can't move anyway due to other references, just fix the code to use 'page_count()'. Oded confirms that that fixes his issue. Now, this does imply that something in NUMA balancing ends up changing page protections (other than the obvious one of making the page inaccessible to get the NUMA faulting information). Otherwise the COW simplification wouldn't matter - since doing the GUP on the page would make sure it's writable. The cause of that permission change would be good to figure out too, since it clearly results in spurious COW events - but fixing the nonsensical test that just happened to work before is obviously the CorrectThing(tm) to do regardless. Fixes: 09854ba94c6a ("mm: do_wp_page() simplification") Link: https://bugzilla.kernel.org/show_bug.cgi?id=215616 Link: https://lore.kernel.org/all/CAFCwf10eNmwq2wD71xjUhqkvv5+_pJMR1nPug2RqNDcFT4H86Q@mail.gmail.com/ Reported-and-tested-by: Oded Gabbay <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-02-11	kfence: make test case compatible with run time set sample interval	Peng Liu	2	-5/+6
	The parameter kfence_sample_interval can be set via boot parameter and late shell command, which is convenient for automated tests and KFENCE parameter optimization. However, KFENCE test case just uses compile-time CONFIG_KFENCE_SAMPLE_INTERVAL, which will make KFENCE test case not run as users desired. Export kfence_sample_interval, so that KFENCE test case can use run-time-set sample interval. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peng Liu <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Sumit Semwal <[email protected]> Cc: Christian Knig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-02-11	mm: memcg: synchronize objcg lists with a dedicated spinlock	Roman Gushchin	1	-5/+5
	Alexander reported a circular lock dependency revealed by the mmap1 ltp test: LOCKDEP_CIRCULAR (suite: ltp, case: mtest06 (mmap1)) WARNING: possible circular locking dependency detected 5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1 Not tainted ------------------------------------------------------ mmap1/202299 is trying to acquire lock: 00000001892c0188 (css_set_lock){..-.}-{2:2}, at: obj_cgroup_release+0x4a/0xe0 but task is already holding lock: 00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&sighand->siglock){-.-.}-{2:2}: __lock_acquire+0x604/0xbd8 lock_acquire.part.0+0xe2/0x238 lock_acquire+0xb0/0x200 _raw_spin_lock_irqsave+0x6a/0xd8 __lock_task_sighand+0x90/0x190 cgroup_freeze_task+0x2e/0x90 cgroup_migrate_execute+0x11c/0x608 cgroup_update_dfl_csses+0x246/0x270 cgroup_subtree_control_write+0x238/0x518 kernfs_fop_write_iter+0x13e/0x1e0 new_sync_write+0x100/0x190 vfs_write+0x22c/0x2d8 ksys_write+0x6c/0xf8 __do_syscall+0x1da/0x208 system_call+0x82/0xb0 -> #0 (css_set_lock){..-.}-{2:2}: check_prev_add+0xe0/0xed8 validate_chain+0x736/0xb20 __lock_acquire+0x604/0xbd8 lock_acquire.part.0+0xe2/0x238 lock_acquire+0xb0/0x200 _raw_spin_lock_irqsave+0x6a/0xd8 obj_cgroup_release+0x4a/0xe0 percpu_ref_put_many.constprop.0+0x150/0x168 drain_obj_stock+0x94/0xe8 refill_obj_stock+0x94/0x278 obj_cgroup_charge+0x164/0x1d8 kmem_cache_alloc+0xac/0x528 __sigqueue_alloc+0x150/0x308 __send_signal+0x260/0x550 send_signal+0x7e/0x348 force_sig_info_to_task+0x104/0x180 force_sig_fault+0x48/0x58 __do_pgm_check+0x120/0x1f0 pgm_check_handler+0x11e/0x180 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&sighand->siglock); lock(css_set_lock); lock(&sighand->siglock); lock(css_set_lock); * DEADLOCK * 2 locks held by mmap1/202299: #0: 00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180 #1: 00000001892ad560 (rcu_read_lock){....}-{1:2}, at: percpu_ref_put_many.constprop.0+0x0/0x168 stack backtrace: CPU: 15 PID: 202299 Comm: mmap1 Not tainted 5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1 Hardware name: IBM 3906 M04 704 (LPAR) Call Trace: dump_stack_lvl+0x76/0x98 check_noncircular+0x136/0x158 check_prev_add+0xe0/0xed8 validate_chain+0x736/0xb20 __lock_acquire+0x604/0xbd8 lock_acquire.part.0+0xe2/0x238 lock_acquire+0xb0/0x200 _raw_spin_lock_irqsave+0x6a/0xd8 obj_cgroup_release+0x4a/0xe0 percpu_ref_put_many.constprop.0+0x150/0x168 drain_obj_stock+0x94/0xe8 refill_obj_stock+0x94/0x278 obj_cgroup_charge+0x164/0x1d8 kmem_cache_alloc+0xac/0x528 __sigqueue_alloc+0x150/0x308 __send_signal+0x260/0x550 send_signal+0x7e/0x348 force_sig_info_to_task+0x104/0x180 force_sig_fault+0x48/0x58 __do_pgm_check+0x120/0x1f0 pgm_check_handler+0x11e/0x180 INFO: lockdep is turned off. In this example a slab allocation from __send_signal() caused a refilling and draining of a percpu objcg stock, resulted in a releasing of another non-related objcg. Objcg release path requires taking the css_set_lock, which is used to synchronize objcg lists. This can create a circular dependency with the sighandler lock, which is taken with the locked css_set_lock by the freezer code (to freeze a task). In general it seems that using css_set_lock to synchronize objcg lists makes any slab allocations and deallocation with the locked css_set_lock and any intervened locks risky. To fix the problem and make the code more robust let's stop using css_set_lock to synchronize objcg lists and use a new dedicated spinlock instead. Link: https://lkml.kernel.org/r/[email protected] Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API") Signed-off-by: Roman Gushchin <[email protected]> Reported-by: Alexander Egorenkov <[email protected]> Tested-by: Alexander Egorenkov <[email protected]> Reviewed-by: Waiman Long <[email protected]> Acked-by: Tejun Heo <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Reviewed-by: Jeremy Linton <[email protected]> Tested-by: Jeremy Linton <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-02-11	mm: vmscan: remove deadlock due to throttling failing to make progress	Mel Gorman	1	-1/+3
	A soft lockup bug in kcompactd was reported in a private bugzilla with the following visible in dmesg; watchdog: BUG: soft lockup - CPU#33 stuck for 26s! [kcompactd0:479] watchdog: BUG: soft lockup - CPU#33 stuck for 52s! [kcompactd0:479] watchdog: BUG: soft lockup - CPU#33 stuck for 78s! [kcompactd0:479] watchdog: BUG: soft lockup - CPU#33 stuck for 104s! [kcompactd0:479] The machine had 256G of RAM with no swap and an earlier failed allocation indicated that node 0 where kcompactd was run was potentially unreclaimable; Node 0 active_anon:29355112kB inactive_anon:2913528kB active_file:0kB inactive_file:0kB unevictable:64kB isolated(anon):0kB isolated(file):0kB mapped:8kB dirty:0kB writeback:0kB shmem:26780kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 23480320kB writeback_tmp:0kB kernel_stack:2272kB pagetables:24500kB all_unreclaimable? yes Vlastimil Babka investigated a crash dump and found that a task migrating pages was trying to drain PCP lists; PID: 52922 TASK: ffff969f820e5000 CPU: 19 COMMAND: "kworker/u128:3" Call Trace: __schedule schedule schedule_timeout wait_for_completion __flush_work __drain_all_pages __alloc_pages_slowpath.constprop.114 __alloc_pages alloc_migration_target migrate_pages migrate_to_node do_migrate_pages cpuset_migrate_mm_workfn process_one_work worker_thread kthread ret_from_fork This failure is specific to CONFIG_PREEMPT=n builds. The root of the problem is that kcompact0 is not rescheduling on a CPU while a task that has isolated a large number of the pages from the LRU is waiting on kcompact0 to reschedule so the pages can be released. While shrink_inactive_list() only loops once around too_many_isolated, reclaim can continue without rescheduling if sc->skipped_deactivate == 1 which could happen if there was no file LRU and the inactive anon list was not low. Link: https://lkml.kernel.org/r/[email protected] Fixes: d818fca1cac3 ("mm/vmscan: throttle reclaim and compaction when too may pages are isolated") Signed-off-by: Mel Gorman <[email protected]> Debugged-by: Vlastimil Babka <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-02-04	mm/kmemleak: avoid scanning potential huge holes	Lang Yu	1	-6/+7
	When using devm_request_free_mem_region() and devm_memremap_pages() to add ZONE_DEVICE memory, if requested free mem region's end pfn were huge(e.g., 0x400000000), the node_end_pfn() will be also huge (see move_pfn_range_to_zone()). Thus it creates a huge hole between node_start_pfn() and node_end_pfn(). We found on some AMD APUs, amdkfd requested such a free mem region and created a huge hole. In such a case, following code snippet was just doing busy test_bit() looping on the huge hole. for (pfn = start_pfn; pfn < end_pfn; pfn++) { struct page *page = pfn_to_online_page(pfn); if (!page) continue; ... } So we got a soft lockup: watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [bash:1221] CPU: 6 PID: 1221 Comm: bash Not tainted 5.15.0-custom #1 RIP: 0010:pfn_to_online_page+0x5/0xd0 Call Trace: ? kmemleak_scan+0x16a/0x440 kmemleak_write+0x306/0x3a0 ? common_file_perm+0x72/0x170 full_proxy_write+0x5c/0x90 vfs_write+0xb9/0x260 ksys_write+0x67/0xe0 __x64_sys_write+0x1a/0x20 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae I did some tests with the patch. (1) amdgpu module unloaded before the patch: real 0m0.976s user 0m0.000s sys 0m0.968s after the patch: real 0m0.981s user 0m0.000s sys 0m0.973s (2) amdgpu module loaded before the patch: real 0m35.365s user 0m0.000s sys 0m35.354s after the patch: real 0m1.049s user 0m0.000s sys 0m1.042s Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Lang Yu <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Catalin Marinas <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-02-04	mm/page_table_check: check entries at pmd levels	Pasha Tatashin	2	-0/+23
	syzbot detected a case where the page table counters were not properly updated. syzkaller login: ------------[ cut here ]------------ kernel BUG at mm/page_table_check.c:162! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 3099 Comm: pasha Not tainted 5.16.0+ #48 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIO4 RIP: 0010:__page_table_check_zero+0x159/0x1a0 Call Trace: free_pcp_prepare+0x3be/0xaa0 free_unref_page+0x1c/0x650 free_compound_page+0xec/0x130 free_transhuge_page+0x1be/0x260 __put_compound_page+0x90/0xd0 release_pages+0x54c/0x1060 __pagevec_release+0x7c/0x110 shmem_undo_range+0x85e/0x1250 ... The repro involved having a huge page that is split due to uprobe event temporarily replacing one of the pages in the huge page. Later the huge page was combined again, but the counters were off, as the PTE level was not properly updated. Make sure that when PMD is cleared and prior to freeing the level the PTEs are updated. Link: https://lkml.kernel.org/r/[email protected] Fixes: df4e817b7108 ("mm: page table check") Signed-off-by: Pasha Tatashin <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Thelen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Muchun Song <[email protected]> Cc: Paul Turner <[email protected]> Cc: Wei Xu <[email protected]> Cc: Will Deacon <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-02-04	mm/khugepaged: unify collapse pmd clear, flush and free	Pasha Tatashin	1	-16/+18
	Unify the code that flushes, clears pmd entry, and frees the PTE table level into a new function collapse_and_free_pmd(). This cleanup is useful as in the next patch we will add another call to this function to iterate through PTE prior to freeing the level for page table check. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pasha Tatashin <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Thelen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Muchun Song <[email protected]> Cc: Paul Turner <[email protected]> Cc: Wei Xu <[email protected]> Cc: Will Deacon <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-02-04	mm/page_table_check: use unsigned long for page counters and cleanup	Pasha Tatashin	1	-28/+7
	For consistency, use "unsigned long" for all page counters. Also, reduce code duplication by calling __page_table_check__clear() from __page_table_check__set() functions. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pasha Tatashin <[email protected]> Reviewed-by: Wei Xu <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Thelen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Muchun Song <[email protected]> Cc: Paul Turner <[email protected]> Cc: Will Deacon <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-02-04	mm/debug_vm_pgtable: remove pte entry from the page table	Pasha Tatashin	1	-0/+2
	Patch series "page table check fixes and cleanups", v5. This patch (of 4): The pte entry that is used in pte_advanced_tests() is never removed from the page table at the end of the test. The issue is detected by page_table_check, to repro compile kernel with the following configs: CONFIG_DEBUG_VM_PGTABLE=y CONFIG_PAGE_TABLE_CHECK=y CONFIG_PAGE_TABLE_CHECK_ENFORCED=y During the boot the following BUG is printed: debug_vm_pgtable: [debug_vm_pgtable ]: Validating architecture page table helpers ------------[ cut here ]------------ kernel BUG at mm/page_table_check.c:162! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.16.0-11413-g2c271fe77d52 #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 ... The entry should be properly removed from the page table before the page is released to the free list. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: a5c3b9ffb0f4 ("mm/debug_vm_pgtable: add tests validating advanced arch page table helpers") Signed-off-by: Pasha Tatashin <[email protected]> Reviewed-by: Zi Yan <[email protected]> Tested-by: Zi Yan <[email protected]> Acked-by: David Rientjes <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Cc: Paul Turner <[email protected]> Cc: Wei Xu <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Will Deacon <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Muchun Song <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: <[email protected]> [5.9+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-02-04	Revert "mm/page_isolation: unset migratetype directly for non Buddy page"	Chen Wandun	1	-1/+1
	This reverts commit 721fb891ad0b3956d5c168b2931e3e5e4fb7ca40. Commit 721fb891ad0b ("mm/page_isolation: unset migratetype directly for non Buddy page") will result memory that should in buddy disappear by mistake. move_freepages_block moves all pages in pageblock instead of pages indicated by input parameter, so if input pages is not in buddy but other pages in pageblock is in buddy, it will result in page out of control. Link: https://lkml.kernel.org/r/[email protected] Fixes: 721fb891ad0b ("mm/page_isolation: unset migratetype directly for non Buddy page") Signed-off-by: Chen Wandun <[email protected]> Reported-by: "kernelci.org bot" <[email protected]> Acked-by: David Hildenbrand <[email protected]> Tested-by: Dong Aisheng <[email protected]> Tested-by: Francesco Dolcini <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Tested-by: Guenter Roeck <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-02-03	Revert "mm/gup: small refactoring: simplify try_grab_page()"	John Hubbard	1	-5/+30
	This reverts commit 54d516b1d62ff8f17cee2da06e5e4706a0d00b8a That commit did a refactoring that effectively combined fast and slow gup paths (again). And that was again incorrect, for two reasons: a) Fast gup and slow gup get reference counts on pages in different ways and with different goals: see Linus' writeup in commit cd1adf1b63a1 ("Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly""), and b) try_grab_compound_head() also has a specific check for "FOLL_LONGTERM && !is_pinned(page)", that assumes that the caller can fall back to slow gup. This resulted in new failures, as recently report by Will McVicker [1]. But (a) has problems too, even though they may not have been reported yet. So just revert this. Link: https://lore.kernel.org/r/[email protected] [1] Fixes: 54d516b1d62f ("mm/gup: small refactoring: simplify try_grab_page()") Reported-and-tested-by: Will McVicker <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: [email protected] # 5.15 Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-30	memory-failure: fetch compound_head after pgmap_pfn_valid()	Joao Martins	1	-0/+6
	memory_failure_dev_pagemap() at the moment assumes base pages (e.g. dax_lock_page()). For devmap with compound pages fetch the compound_head in case a tail page memory failure is being handled. Currently this is a nop, but in the advent of compound pages in dev_pagemap it allows memory_failure_dev_pagemap() to keep working. Without this fix memory-failure handling (i.e. MCEs on pmem) with device-dax configured namespaces will regress (and crash). Link: https://lkml.kernel.org/r/[email protected] Reported-by: Jane Chu <[email protected]> Signed-off-by: Joao Martins <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-23	Merge tag 'bitmap-5.17-rc1' of git://github.com/norov/linux	Linus Torvalds	1	-19/+16
	Pull bitmap updates from Yury Norov: - introduce for_each_set_bitrange() - use find_first__bit() instead of find_next__bit() where possible - unify for_each_bit() macros * tag 'bitmap-5.17-rc1' of git://github.com/norov/linux: vsprintf: rework bitmap_list_string lib: bitmap: add performance test for bitmap_print_to_pagebuf bitmap: unify find_bit operations mm/percpu: micro-optimize pcpu_is_populated() Replace for_each__bit_from() with for_each__bit() where appropriate find: micro-optimize for_each_{set,clear}_bit() include/linux: move for_each_bit() macros from bitops.h to find.h cpumask: replace cpumask_next_* with cpumask_first_* where appropriate tools: sync tools/bitmap with mother linux all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate cpumask: use find_first_and_bit() lib: add find_first_and_bit() arch: remove GENERIC_FIND_FIRST_BIT entirely include: move find.h from asm_generic to linux bitops: move find_bit_*_le functions from le.h to find.h bitops: protect find_first_{,zero}_bit properly
2022-01-22	Merge branch 'akpm' (patches from Andrew)	Linus Torvalds	13	-1088/+345
	Merge yet more updates from Andrew Morton: "This is the post-linux-next queue. Material which was based on or dependent upon material which was in -next. 69 patches. Subsystems affected by this patch series: mm (migration and zsmalloc), sysctl, proc, and lib" * emailed patches from Andrew Morton <[email protected]>: (69 commits) mm: hide the FRONTSWAP Kconfig symbol frontswap: remove support for multiple ops mm: mark swap_lock and swap_active_head static frontswap: simplify frontswap_register_ops frontswap: remove frontswap_test mm: simplify try_to_unuse frontswap: remove the frontswap exports frontswap: simplify frontswap_init frontswap: remove frontswap_curr_pages frontswap: remove frontswap_shrink frontswap: remove frontswap_tmem_exclusive_gets frontswap: remove frontswap_writethrough mm: remove cleancache lib/stackdepot: always do filter_irq_stacks() in stack_depot_save() lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() proc: remove PDE_DATA() completely fs: proc: store PDE()->data into inode->i_private zsmalloc: replace get_cpu_var with local_lock zsmalloc: replace per zpage lock with pool->migrate_lock locking/rwlocks: introduce write_lock_nested ...
2022-01-22	Merge tag 'folio-5.17a' of git://git.infradead.org/users/willy/pagecache	Linus Torvalds	1	-6/+4
	Pull more folio updates from Matthew Wilcox: "Three small folio patches. One bug fix, one patch pulled forward from the patches destined for 5.18 and then a patch to make use of that functionality" * tag 'folio-5.17a' of git://git.infradead.org/users/willy/pagecache: filemap: Use folio_put_refs() in filemap_free_folio() mm: Add folio_put_refs() pagevec: Initialise folio_batch->percpu_pvec_drained
2022-01-22	mm: hide the FRONTSWAP Kconfig symbol	Christoph Hellwig	1	-15/+3
	Select FRONTSWAP from ZSWAP instead of prompting for it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	frontswap: remove support for multiple ops	Christoph Hellwig	2	-40/+18
	There is only a single instance of frontswap ops in the kernel, so simplify the frontswap code by removing support for multiple operations. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	mm: mark swap_lock and swap_active_head static	Christoph Hellwig	1	-2/+2
	swap_lock and swap_active_head are only used in swapfile.c, so mark them static. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	frontswap: simplify frontswap_register_ops	Christoph Hellwig	1	-41/+0
	Given that frontswap_register_ops must be called from built-in code, there is no need to handle the case of swapfiles coming online before or during it, so delete the code that deals with that case. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	frontswap: remove frontswap_test	Christoph Hellwig	1	-1/+1
	frontswap_test is unused now, remove it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	mm: simplify try_to_unuse	Christoph Hellwig	2	-87/+29
	Remove the unused frontswap and pages_to_unuse arguments, and mark the function static now that the caller in frontswap is gone. [[email protected]: fix shmem_unuse() stub, per Matthew] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Naresh Kamboju <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	frontswap: remove the frontswap exports	Christoph Hellwig	1	-6/+0
	None of the frontswap API is called from modular code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	frontswap: simplify frontswap_init	Christoph Hellwig	2	-3/+3
	Just use IS_ENABLED() and remove the __frontswap_init indirection. Also remove the unused export. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	frontswap: remove frontswap_curr_pages	Christoph Hellwig	1	-28/+0
	frontswap_curr_pages is never called, so remove it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	frontswap: remove frontswap_shrink	Christoph Hellwig	1	-83/+0
	frontswap_shrink is never called, so remove it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	frontswap: remove frontswap_tmem_exclusive_gets	Christoph Hellwig	1	-22/+1
	frontswap_tmem_exclusive_gets is never called, so remove it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	frontswap: remove frontswap_writethrough	Christoph Hellwig	1	-22/+1
	frontswap_writethrough is never called, so remove it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	mm: remove cleancache	Christoph Hellwig	5	-362/+2
	Patch series "remove Xen tmem leftovers". Since the removal of the Xen tmem driver in 2019, the cleancache hooks are entirely unused, as are large parts of frontswap. This series against linux-next (with the folio changes included) removes cleancaches, and cuts down frontswap to the bits actually used by zswap. This patch (of 13): The cleancache subsystem is unused since the removal of Xen tmem driver in commit 814bbf49dcd0 ("xen: remove tmem driver"). [[email protected]: remove now-unreachable code] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	lib/stackdepot: always do filter_irq_stacks() in stack_depot_save()	Marco Elver	1	-1/+0
	The non-interrupt portion of interrupt stack traces before interrupt entry is usually arbitrary. Therefore, saving stack traces of interrupts (that include entries before interrupt entry) to stack depot leads to unbounded stackdepot growth. As such, use of filter_irq_stacks() is a requirement to ensure stackdepot can efficiently deduplicate interrupt stacks. Looking through all current users of stack_depot_save(), none (except KASAN) pass the stack trace through filter_irq_stacks() before passing it on to stack_depot_save(). Rather than adding filter_irq_stacks() to all current users of stack_depot_save(), it became clear that stack_depot_save() should simply do filter_irq_stacks(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marco Elver <[email protected]> Reviewed-by: Alexander Potapenko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Vijayanand Jitta <[email protected]> Cc: "Gustavo A. R. Silva" <[email protected]> Cc: Imran Khan <[email protected]> Cc: Chris Wilson <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Mika Kuoppala <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	lib/stackdepot: allow optional init and stack_table allocation by kvmalloc()	Vlastimil Babka	1	-0/+2
	Currently, enabling CONFIG_STACKDEPOT means its stack_table will be allocated from memblock, even if stack depot ends up not actually used. The default size of stack_table is 4MB on 32-bit, 8MB on 64-bit. This is fine for use-cases such as KASAN which is also a config option and has overhead on its own. But it's an issue for functionality that has to be actually enabled on boot (page_owner) or depends on hardware (GPU drivers) and thus the memory might be wasted. This was raised as an issue [1] when attempting to add stackdepot support for SLUB's debug object tracking functionality. It's common to build kernels with CONFIG_SLUB_DEBUG and enable slub_debug on boot only when needed, or create only specific kmem caches with debugging for testing purposes. It would thus be more efficient if stackdepot's table was allocated only when actually going to be used. This patch thus makes the allocation (and whole stack_depot_init() call) optional: - Add a CONFIG_STACKDEPOT_ALWAYS_INIT flag to keep using the current well-defined point of allocation as part of mem_init(). Make CONFIG_KASAN select this flag. - Other users have to call stack_depot_init() as part of their own init when it's determined that stack depot will actually be used. This may depend on both config and runtime conditions. Convert current users which are page_owner and several in the DRM subsystem. Same will be done for SLUB later. - Because the init might now be called after the boot-time memblock allocation has given all memory to the buddy allocator, change stack_depot_init() to allocate stack_table with kvmalloc() when memblock is no longer available. Also handle allocation failure by disabling stackdepot (could have theoretically happened even with memblock allocation previously), and don't unnecessarily align the memblock allocation to its own size anymore. [1] https://lore.kernel.org/all/CAMuHMdW=eoVzM1Re5FVoEN87nKfiLmM2+Ah7eNu2KXEhCvbZyA@mail.gmail.com/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Dmitry Vyukov <[email protected]> Reviewed-by: Marco Elver <[email protected]> # stackdepot Cc: Marco Elver <[email protected]> Cc: Vijayanand Jitta <[email protected]> Cc: Maarten Lankhorst <[email protected]> Cc: Maxime Ripard <[email protected]> Cc: Thomas Zimmermann <[email protected]> Cc: David Airlie <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Oliver Glitta <[email protected]> Cc: Imran Khan <[email protected]> From: Colin Ian King <[email protected]> Subject: lib/stackdepot: fix spelling mistake and grammar in pr_err message There is a spelling mistake of the work allocation so fix this and re-phrase the message to make it easier to read. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Colin Ian King <[email protected]> Cc: Vlastimil Babka <[email protected]> From: Vlastimil Babka <[email protected]> Subject: lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup On FLATMEM, we call page_ext_init_flatmem_late() just before kmem_cache_init() which means stack_depot_init() (called by page owner init) will not recognize properly it should use kvmalloc() and not memblock_alloc(). memblock_alloc() will also not issue a warning and return a block memory that can be invalid and cause kernel page fault when saving stacks, as reported by the kernel test robot [1]. Fix this by moving page_ext_init_flatmem_late() below kmem_cache_init() so that slab_is_available() is true during stack_depot_init(). SPARSEMEM doesn't have this issue, as it doesn't do page_ext_init_flatmem_late(), but a different page_ext_init() even later in the boot process. Thanks to Mike Rapoport for pointing out the FLATMEM init ordering issue. While at it, also actually resolve a checkpatch warning in stack_depot_init() from DRM CI, which was supposed to be in the original patch already. [1] https://lore.kernel.org/all/20211014085450.GC18719@xsang-OptiPlex-9020/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Reported-by: kernel test robot <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Stephen Rothwell <[email protected]> From: Vlastimil Babka <[email protected]> Subject: lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup3 Due to cd06ab2fd48f ("drm/locking: add backtrace for locking contended locks without backoff") landing recently to -next adding a new stack depot user in drivers/gpu/drm/drm_modeset_lock.c we need to add an appropriate call to stack_depot_init() there as well. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Naresh Kamboju <[email protected]> Cc: Marco Elver <[email protected]> Cc: Vijayanand Jitta <[email protected]> Cc: Maarten Lankhorst <[email protected]> Cc: Maxime Ripard <[email protected]> Cc: Thomas Zimmermann <[email protected]> Cc: David Airlie <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Oliver Glitta <[email protected]> Cc: Imran Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> From: Vlastimil Babka <[email protected]> Subject: lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup4 Due to 4e66934eaadc ("lib: add reference counting tracking infrastructure") landing recently to net-next adding a new stack depot user in lib/ref_tracker.c we need to add an appropriate call to stack_depot_init() there as well. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Cc: Jiri Slab <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	zsmalloc: replace get_cpu_var with local_lock	Mike Galbraith	1	-3/+8
	The usage of get_cpu_var() in zs_map_object() is problematic because it disables preemption and makes it impossible to acquire any sleeping lock on PREEMPT_RT such as a spinlock_t. Replace the get_cpu_var() usage with a local_lock_t which is embedded struct mapping_area. It ensures that the access the struct is synchronized against all users on the same CPU. [minchan: remove the bit_spin_lock part and change the title] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Galbraith <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Minchan Kim <[email protected]> Tested-by: Sebastian Andrzej Siewior <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	zsmalloc: replace per zpage lock with pool->migrate_lock	Minchan Kim	1	-109/+96
	The zsmalloc has used a bit for spin_lock in zpage handle to keep zpage object alive during several operations. However, it causes the problem for PREEMPT_RT as well as introducing too complicated. This patch replaces the bit spin_lock with pool->migrate_lock rwlock. It could make the code simple as well as zsmalloc work under PREEMPT_RT. The drawback is the pool->migrate_lock is bigger granuarity than per zpage lock so the contention would be higher than old when both IO-related operations(i.e., zsmalloc, zsfree, zs_[map\|unmap]) and compaction(page/zpage migration) are going in parallel(*, the migrate_lock is rwlock and IO related functions are all read side lock so there is no contention). However, the write-side is fast enough(dominant overhead is just page copy) so it wouldn't affect much. If the lock granurity becomes more problem later, we could introduce table locks based on handle as a hash value. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Tested-by: Sebastian Andrzej Siewior <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	zsmalloc: remove zspage isolation for migration	Minchan Kim	1	-149/+8
	zspage isolation for migration introduced additional exceptions to be dealt with since the zspage was isolated from class list. The reason why I isolated zspage from class list was to prevent race between obj_malloc and page migration via allocating zpage from the zspage further. However, it couldn't prevent object freeing from zspage so it needed corner case handling. This patch removes the whole mess. Now, we are fine since class->lock and zspage->lock can prevent the race. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Tested-by: Sebastian Andrzej Siewior <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	zsmalloc: move huge compressed obj from page to zspage	Minchan Kim	1	-24/+26
	The flag aims for zspage, not per page. Let's move it to zspage. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Tested-by: Sebastian Andrzej Siewior <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	zsmalloc: introduce obj_allocated	Minchan Kim	1	-17/+16
	The usage pattern for obj_to_head is to check whether the zpage is allocated or not. Thus, introduce obj_allocated. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Tested-by: Sebastian Andrzej Siewior <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	zsmalloc: decouple class actions from zspage works	Minchan Kim	1	-10/+13
	This patch moves class stat update out of obj_malloc since it's not related to zspage operation. This is a preparation to introduce new lock scheme in next patch. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Tested-by: Sebastian Andrzej Siewior <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	zsmalloc: rename zs_stat_type to class_stat_type	Minchan Kim	1	-12/+12
	The stat aims for class stat, not zspage so rename it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Tested-by: Sebastian Andrzej Siewior <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	zsmalloc: introduce some helper functions	Minchan Kim	1	-31/+23
	Patch series "zsmalloc: remove bit_spin_lock", v2. zsmalloc uses bit_spin_lock to minimize space overhead since it's zpage granularity lock. However, it causes zsmalloc non-working under PREEMPT_RT as well as adding too much complication. This patchset tries to replace the bit_spin_lock with per-pool rwlock. It also removes unnecessary zspage isolation logic from class, which was the other part too much complication added into zsmalloc. Last patch changes the get_cpu_var to local_lock to make it work in PREEMPT_RT. This patch (of 9): get_zspage_mapping returns fullness as well as class_idx. However, the fullness is usually not used since it could be stale in some contexts. It causes misleading as well as unnecessary instructions so this patch introduces zspage_class. obj_to_location also produces page and index but we don't need always the index, either so this patch introduces obj_to_page. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Tested-by: Sebastian Andrzej Siewior <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-22	mm/migrate.c: rework migration_entry_wait() to not take a pageref	Alistair Popple	2	-34/+95
	This fixes the FIXME in migrate_vma_check_page(). Before migrating a page migration code will take a reference and check there are no unexpected page references, failing the migration if there are. When a thread faults on a migration entry it will take a temporary reference to the page to wait for the page to become unlocked signifying the migration entry has been removed. This reference is dropped just prior to waiting on the page lock, however the extra reference can cause migration failures so it is desirable to avoid taking it. As migration code already has a reference to the migrating page an extra reference to wait on PG_locked is unnecessary so long as the reference can't be dropped whilst setting up the wait. When faulting on a migration entry the ptl is taken to check the migration entry. Removing a migration entry also requires the ptl, and migration code won't drop its page reference until after the migration entry has been removed. Therefore retaining the ptl of a migration entry is sufficient to ensure the page has a reference. Reworking migration_entry_wait() to hold the ptl until the wait setup is complete means the extra page reference is no longer needed. [[email protected]: v5] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alistair Popple <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-20	Merge branch 'akpm' (patches from Andrew)	Linus Torvalds	5	-38/+138
	Merge more updates from Andrew Morton: "55 patches. Subsystems affected by this patch series: percpu, procfs, sysctl, misc, core-kernel, get_maintainer, lib, checkpatch, binfmt, nilfs2, hfs, fat, adfs, panic, delayacct, kconfig, kcov, and ubsan" * emailed patches from Andrew Morton <[email protected]>: (55 commits) lib: remove redundant assignment to variable ret ubsan: remove CONFIG_UBSAN_OBJECT_SIZE kcov: fix generic Kconfig dependencies if ARCH_WANTS_NO_INSTR lib/Kconfig.debug: make TEST_KMOD depend on PAGE_SIZE_LESS_THAN_256KB btrfs: use generic Kconfig option for 256kB page size limit arch/Kconfig: split PAGE_SIZE_LESS_THAN_256KB from PAGE_SIZE_LESS_THAN_64KB configs: introduce debug.config for CI-like setup delayacct: track delays from memory compact Documentation/accounting/delay-accounting.rst: add thrashing page cache and direct compact delayacct: cleanup flags in struct task_delay_info and functions use it delayacct: fix incomplete disable operation when switch enable to disable delayacct: support swapin delay accounting for swapping without blkio panic: remove oops_id panic: use error_report_end tracepoint on warnings fs/adfs: remove unneeded variable make code cleaner FAT: use io_schedule_timeout() instead of congestion_wait() hfsplus: use struct_group_attr() for memcpy() region nilfs2: remove redundant pointer sbufs fs/binfmt_elf: use PT_LOAD p_align values for static PIE const_structs.checkpatch: add frequently used ops structs ...
2022-01-20	delayacct: track delays from memory compact	wangyong	1	-0/+3
	Delay accounting does not track the delay of memory compact. When there is not enough free memory, tasks can spend a amount of their time waiting for compact. To get the impact of tasks in direct memory compact, measure the delay when allocating memory through memory compact. Also update tools/accounting/getdelays.c: / # ./getdelays_next -di -p 304 print delayacct stats ON printing IO accounting PID 304 CPU count real total virtual total delay total delay average 277 780000000 849039485 18877296 0.068ms IO count delay total delay average 0 0 0ms SWAP count delay total delay average 0 0 0ms RECLAIM count delay total delay average 5 11088812685 2217ms THRASHING count delay total delay average 0 0 0ms COMPACT count delay total delay average 3 72758 0ms watch: read=0, write=0, cancelled_write=0 Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: wangyong <[email protected]> Reviewed-by: Jiang Xuexin <[email protected]> Reviewed-by: Zhang Wenya <[email protected]> Reviewed-by: Yang Yang <[email protected]> Reviewed-by: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-20	delayacct: support swapin delay accounting for swapping without blkio	Yang Yang	2	-4/+3
	Currently delayacct accounts swapin delay only for swapping that cause blkio. If we use zram for swapping, tools/accounting/getdelays can't get any SWAP delay. It's useful to get zram swapin delay information, for example to adjust compress algorithm or /proc/sys/vm/swappiness. Reference to PSI, it accounts any kind of swapping by doing its work in swap_readpage(), no matter whether swapping causes blkio. Let delayacct do the similar work. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Yang <[email protected]> Reported-by: Zeal Robot <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-20	mm: percpu: add generic pcpu_populate_pte() function	Kefeng Wang	1	-5/+71
	With NEED_PER_CPU_PAGE_FIRST_CHUNK enabled, we need a function to populate pte, this patch adds a generic pcpu populate pte function, pcpu_populate_pte(), which is marked __weak and used on most architectures, but it is overridden on x86, which has its own implementation. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Albert Ou <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-20	mm: percpu: add generic pcpu_fc_alloc/free funciton	Kefeng Wang	1	-31/+47
	With the previous patch, we could add a generic pcpu first chunk allocate and free function to cleanup the duplicated definations on each architecture. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Albert Ou <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-20	mm: percpu: add pcpu_fc_cpu_to_node_fn_t typedef	Kefeng Wang	1	-5/+9
	Add pcpu_fc_cpu_to_node_fn_t and pass it into pcpu_fc_alloc_fn_t, pcpu first chunk allocation will call it to alloc memblock on the corresponding node by it, this is prepare for the next patch. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Albert Ou <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>