Age | Commit message (Collapse) | Author | Files | Lines
2022-12-09 | memcg: fix possible use-after-free in memcg_write_event_control() | Tejun Heo | 3 | -3/+14
memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too.

Prior to 347c4a874710 ("memcg: remove cgroup_event->cft"), there was a call to __file_cft() which verified that the specified file is a regular cgroupfs file before further accesses. The cftype pointer returned from __file_cft() was no longer necessary and the commit inadvertently dropped the file type check with it, allowing any file to slip through. With the invariants broken, the d_name and parent accesses can now race against renames and removals of arbitrary files and cause use-after-frees.

Fix the bug by resurrecting the file type check in __file_cft(). Now that cgroupfs is implemented through kernfs, checking the file operations needs to go through a layer of indirection. Instead, let's check the superblock and dentry type.

Link: https://lkml.kernel.org/r/Y5FRm/[email protected] Fixes: 347c4a874710 ("memcg: remove cgroup_event->cft") Signed-off-by: Tejun Heo <[email protected]> Reported-by: Jann Horn <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: <[email protected]> [3.14+] Signed-off-by: Andrew Morton <[email protected]>
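A minimal sketch of the kind of superblock/dentry check described above; the surrounding names (cfile, ret, the error label) are assumptions for illustration, not necessarily the exact code in mm/memcontrol.c:

    /* illustrative sketch only */
    struct dentry *cdentry = cfile.file->f_path.dentry;

    /*
     * The control file must be a regular cgroupfs file. Such a file
     * cannot be renamed or removed before its directory, so d_name and
     * the parent dentry stay stable for the rest of the function.
     */
    if (cdentry->d_sb->s_type != &cgroup_fs_type || !d_is_reg(cdentry)) {
            ret = -EINVAL;
            goto out_put_cfile;
    }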
2022-12-09 | MAINTAINERS: update Muchun Song's email | Muchun Song | 2 | -2/+4
I'm moving to the @linux.dev account. Map my old addresses to my new address. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-12-09 | mm/gup: fix gup_pud_range() for dax | John Starks | 1 | -1/+1
For dax pud, pud_huge() returns true on x86. So the function works as long as hugetlb is configured. However, dax doesn't depend on hugetlb. Commit 414fd080d125 ("mm/gup: fix gup_pmd_range() for dax") fixed devmap-backed huge PMDs, but missed devmap-backed huge PUDs. Fix this as well.

This fixes the below kernel panic:

  general protection fault, probably for non-canonical address 0x69e7c000cc478: 0000 [#1] SMP
  < snip >
  Call Trace:
  <TASK>
   get_user_pages_fast+0x1f/0x40
   iov_iter_get_pages+0xc6/0x3b0
   ? mempool_alloc+0x5d/0x170
   bio_iov_iter_get_pages+0x82/0x4e0
   ? bvec_alloc+0x91/0xc0
   ? bio_alloc_bioset+0x19a/0x2a0
   blkdev_direct_IO+0x282/0x480
   ? __io_complete_rw_common+0xc0/0xc0
   ? filemap_range_has_page+0x82/0xc0
   generic_file_direct_write+0x9d/0x1a0
   ? inode_update_time+0x24/0x30
   __generic_file_write_iter+0xbd/0x1e0
   blkdev_write_iter+0xb4/0x150
   ? io_import_iovec+0x8d/0x340
   io_write+0xf9/0x300
   io_issue_sqe+0x3c3/0x1d30
   ? sysvec_reschedule_ipi+0x6c/0x80
   __io_queue_sqe+0x33/0x240
   ? fget+0x76/0xa0
   io_submit_sqes+0xe6a/0x18d0
   ? __fget_light+0xd1/0x100
   __x64_sys_io_uring_enter+0x199/0x880
   ? __context_tracking_enter+0x1f/0x70
   ? irqentry_exit_to_user_mode+0x24/0x30
   ? irqentry_exit+0x1d/0x30
   ? __context_tracking_exit+0xe/0x70
   do_syscall_64+0x3b/0x90
   entry_SYSCALL_64_after_hwframe+0x61/0xcb
  RIP: 0033:0x7fc97c11a7be
  < snip >
  </TASK>
  ---[ end trace 48b2e0e67debcaeb ]---
  RIP: 0010:internal_get_user_pages_fast+0x340/0x990
  < snip >
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: disabled

Link: https://lkml.kernel.org/r/[email protected] Fixes: 414fd080d125 ("mm/gup: fix gup_pmd_range() for dax") Signed-off-by: John Starks <[email protected]> Signed-off-by: Saurabh Sengar <[email protected]> Cc: Jan Kara <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Dan Williams <[email protected]> Cc: Alistair Popple <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
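A sketch of the shape of the fix, mirroring what 414fd080d125 did at the PMD level; the exact context inside gup_pud_range() is abridged:

    /* treat devmap-backed huge PUDs like hugetlb-backed huge PUDs */
    if (unlikely(pud_huge(pud) || pud_devmap(pud))) {
            if (!gup_huge_pud(pud, pudp, addr, next, flags, pages, nr))
                    return 0;
    }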
2022-12-09 | mmap: fix do_brk_flags() modifying obviously incorrect VMAs | Liam Howlett | 1 | -8/+3
Add more sanity checks to the VMA that do_brk_flags() will expand. Ensure the VMA matches basic merge requirements within the function before calling can_vma_merge_after(). Drop the duplicate checks from vm_brk_flags() since they will be enforced later. The old code would expand file VMAs on brk(), which is functionally wrong and also dangerous in terms of locking because the brk() path isn't designed for file VMAs and therefore doesn't lock the file mapping. Checking can_vma_merge_after() ensures that new anonymous VMAs can't be merged into file VMAs. See https://lore.kernel.org/linux-mm/CAG48ez1tJZTOjS_FjRZhvtDA-STFmdw8PEizPDwMGFd_ui0Nrw@mail.gmail.com/ Link: https://lkml.kernel.org/r/[email protected] Fixes: 2e7ce7d354f2 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()") Signed-off-by: Liam R. Howlett <[email protected]> Suggested-by: Jann Horn <[email protected]> Cc: Jason A. Donenfeld <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-12-09 | mm/swap: fix SWP_PFN_BITS with CONFIG_PHYS_ADDR_T_64BIT on 32bit | David Hildenbrand | 1 | -3/+5
We use "unsigned long" to store a PFN in the kernel and phys_addr_t to store a physical address. On a 64bit system, both are 64bit wide. However, on a 32bit system, the latter might be 64bit wide. This is, for example, the case on x86 with PAE: phys_addr_t and PTEs are 64bit wide, while "unsigned long" only spans 32bit. The current definition of SWP_PFN_BITS without MAX_PHYSMEM_BITS misses that case, and assumes that the maximum PFN is limited by an 32bit phys_addr_t. This implies, that SWP_PFN_BITS will currently only be able to cover 4 GiB - 1 on any 32bit system with 4k page size, which is wrong. Let's rely on the number of bits in phys_addr_t instead, but make sure to not exceed the maximum swap offset, to not make the BUILD_BUG_ON() in is_pfn_swap_entry() unhappy. Note that swp_entry_t is effectively an unsigned long and the maximum swap offset shares that value with the swap type. For example, on an 8 GiB x86 PAE system with a kernel config based on Debian 11.5 (-> CONFIG_FLATMEM=y, CONFIG_X86_PAE=y), we will currently fail removing migration entries (remove_migration_ptes()), because mm/page_vma_mapped.c:check_pte() will fail to identify a PFN match as swp_offset_pfn() wrongly masks off PFN bits. For example, split_huge_page_to_list()->...->remap_page() will leave migration entries in place and continue to unlock the page. Later, when we stumble over these migration entries (e.g., via /proc/self/pagemap), pfn_swap_entry_to_page() will BUG_ON() because these migration entries shouldn't exist anymore and the page was unlocked. [ 33.067591] kernel BUG at include/linux/swapops.h:497! [ 33.067597] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 33.067602] CPU: 3 PID: 742 Comm: cow Tainted: G E 6.1.0-rc8+ #16 [ 33.067605] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014 [ 33.067606] EIP: pagemap_pmd_range+0x644/0x650 [ 33.067612] Code: 00 00 00 00 66 90 89 ce b9 00 f0 ff ff e9 ff fb ff ff 89 d8 31 db e8 48 c6 52 00 e9 23 fb ff ff e8 61 83 56 00 e9 b6 fe ff ff <0f> 0b bf 00 f0 ff ff e9 38 fa ff ff 3e 8d 74 26 00 55 89 e5 57 31 [ 33.067615] EAX: ee394000 EBX: 00000002 ECX: ee394000 EDX: 00000000 [ 33.067617] ESI: c1b0ded4 EDI: 00024a00 EBP: c1b0ddb4 ESP: c1b0dd68 [ 33.067619] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010246 [ 33.067624] CR0: 80050033 CR2: b7a00000 CR3: 01bbbd20 CR4: 00350ef0 [ 33.067625] Call Trace: [ 33.067628] ? madvise_free_pte_range+0x720/0x720 [ 33.067632] ? smaps_pte_range+0x4b0/0x4b0 [ 33.067634] walk_pgd_range+0x325/0x720 [ 33.067637] ? mt_find+0x1d6/0x3a0 [ 33.067641] ? mt_find+0x1d6/0x3a0 [ 33.067643] __walk_page_range+0x164/0x170 [ 33.067646] walk_page_range+0xf9/0x170 [ 33.067648] ? __kmem_cache_alloc_node+0x2a8/0x340 [ 33.067653] pagemap_read+0x124/0x280 [ 33.067658] ? default_llseek+0x101/0x160 [ 33.067662] ? smaps_account+0x1d0/0x1d0 [ 33.067664] vfs_read+0x90/0x290 [ 33.067667] ? do_madvise.part.0+0x24b/0x390 [ 33.067669] ? debug_smp_processor_id+0x12/0x20 [ 33.067673] ksys_pread64+0x58/0x90 [ 33.067675] __ia32_sys_ia32_pread64+0x1b/0x20 [ 33.067680] __do_fast_syscall_32+0x4c/0xc0 [ 33.067683] do_fast_syscall_32+0x29/0x60 [ 33.067686] do_SYSENTER_32+0x15/0x20 [ 33.067689] entry_SYSENTER_32+0x98/0xf1 Decrease the indentation level of SWP_PFN_BITS and SWP_PFN_MASK to keep it readable and consistent. 
[[email protected]: rely on sizeof(phys_addr_t) and min_t() instead] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: use "int" for comparison, as we're only comparing numbers < 64] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 0d206b5d2e0d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry") Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Peter Xu <[email protected]> Reviewed-by: Yang Shi <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
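A sketch of the resulting definition, deriving the PFN width from phys_addr_t and capping it at the maximum swap offset; treat the exact macro names and formatting as an approximation of include/linux/swapops.h:

    #define SWP_PFN_BITS    min_t(int, \
                                  sizeof(phys_addr_t) * BITS_PER_BYTE - PAGE_SHIFT, \
                                  SWP_TYPE_SHIFT)
    #define SWP_PFN_MASK    (BIT(SWP_PFN_BITS) - 1)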
2022-12-09 | tmpfs: fix data loss from failed fallocate | Hugh Dickins | 1 | -0/+11
Fix tmpfs data loss when the fallocate system call is interrupted by a signal, or fails for some other reason. The partial folio handling in shmem_undo_range() forgot to consider this unfalloc case, and was liable to erase or truncate out data which had already been committed earlier. It turns out that none of the partial folio handling there is appropriate for the unfalloc case, which just wants to proceed to removal of whole folios: which find_get_entries() provides, even when partially covered. Original patch by Rui Wang. Link: https://lore.kernel.org/linux-mm/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Fixes: b9a8a4195c7d ("truncate,shmem: Handle truncates that split large folios") Signed-off-by: Hugh Dickins <[email protected]> Reported-by: Guoqi Chen <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Cc: Rui Wang <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Vishal Moola (Oracle) <[email protected]> Cc: <[email protected]> [5.17+] Signed-off-by: Andrew Morton <[email protected]>
2022-12-09 | kselftests: cgroup: update kmem test precision tolerance | Michal Hocko | 1 | -3/+3
1813e51eece0 ("memcg: increase MEMCG_CHARGE_BATCH to 64") has changed the batch size while this test case has been left behind. This has led to a test failure reported by test bot: not ok 2 selftests: cgroup: test_kmem # exit=1 Update the tolerance for the pcp charges to reflect the MEMCG_CHARGE_BATCH change to fix this. [[email protected]: update comments, per Roman] Link: https://lkml.kernel.org/r/[email protected] Fixes: 1813e51eece0a ("memcg: increase MEMCG_CHARGE_BATCH to 64") Signed-off-by: Michal Hocko <[email protected]> Reported-by: kernel test robot <[email protected]> Link: https://lore.kernel.org/oe-lkp/[email protected] Acked-by: Shakeel Butt <[email protected]> Acked-by: Roman Gushchin <[email protected]> Tested-by: Yujie Liu <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Feng Tang <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: "Michal Koutný" <[email protected]> Cc: Muchun Song <[email protected]> Cc: Soheil Hassas Yeganeh <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-12-09 | mm: do not BUG_ON missing brk mapping, because userspace can unmap it | Jason A. Donenfeld | 1 | -2/+1
The following program will trigger the BUG_ON that this patch removes, because the user can munmap() mm->brk:

  #include <sys/syscall.h>
  #include <sys/mman.h>
  #include <assert.h>
  #include <unistd.h>

  static void *brk_now(void)
  {
          return (void *)syscall(SYS_brk, 0);
  }

  static void brk_set(void *b)
  {
          assert(syscall(SYS_brk, b) != -1);
  }

  int main(int argc, char *argv[])
  {
          void *b = brk_now();

          brk_set(b + 4096);
          assert(munmap(b - 4096, 4096 * 2) == 0);
          brk_set(b);
          return 0;
  }

Compile that with musl, since glibc actually uses brk(), and then execute it, and it'll hit this splat:

  kernel BUG at mm/mmap.c:229!
  invalid opcode: 0000 [#1] PREEMPT SMP
  CPU: 12 PID: 1379 Comm: a.out Tainted: G S U 6.1.0-rc7+ #419
  RIP: 0010:__do_sys_brk+0x2fc/0x340
  Code: 00 00 4c 89 ef e8 04 d3 fe ff eb 9a be 01 00 00 00 4c 89 ff e8 35 e0 fe ff e9 6e ff ff ff 4d 89 a7 20>
  RSP: 0018:ffff888140bc7eb0 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: 00000000007e7000 RCX: ffff8881020fe000
  RDX: ffff8881020fe001 RSI: ffff8881955c9b00 RDI: ffff8881955c9b08
  RBP: 0000000000000000 R08: ffff8881955c9b00 R09: 00007ffc77844000
  R10: 0000000000000000 R11: 0000000000000001 R12: 00000000007e8000
  R13: 00000000007e8000 R14: 00000000007e7000 R15: ffff8881020fe000
  FS: 0000000000604298(0000) GS:ffff88901f700000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000603fe0 CR3: 000000015ba9a005 CR4: 0000000000770ee0
  PKRU: 55555554
  Call Trace:
   <TASK>
   do_syscall_64+0x2b/0x50
   entry_SYSCALL_64_after_hwframe+0x46/0xb0
  RIP: 0033:0x400678
  Code: 10 4c 8d 41 08 4c 89 44 24 10 4c 8b 01 8b 4c 24 08 83 f9 2f 77 0a 4c 8d 4c 24 20 4c 01 c9 eb 05 48 8b>
  RSP: 002b:00007ffc77863890 EFLAGS: 00000212 ORIG_RAX: 000000000000000c
  RAX: ffffffffffffffda RBX: 000000000040031b RCX: 0000000000400678
  RDX: 00000000004006a1 RSI: 00000000007e6000 RDI: 00000000007e7000
  RBP: 00007ffc77863900 R08: 0000000000000000 R09: 00000000007e6000
  R10: 00007ffc77863930 R11: 0000000000000212 R12: 00007ffc77863978
  R13: 00007ffc77863988 R14: 0000000000000000 R15: 0000000000000000
   </TASK>

Instead, just return the old brk value if the original mapping has been removed.

[[email protected]: fix changelog, per Liam] Link: https://lkml.kernel.org/r/[email protected] Fixes: 2e7ce7d354f2 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()") Signed-off-by: Jason A. Donenfeld <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Will Deacon <[email protected]> Cc: Jann Horn <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-12-09 | mailmap: update Matti Vaittinen's email address | Matti Vaittinen | 1 | -0/+1
The email backend used by ROHM keeps labeling patches as spam. This can result in missing the patches. Switch my mail address from a company mail to a personal one. Link: https://lkml.kernel.org/r/8f4498b66fedcbded37b3b87e0c516e659f8f583.1669912977.git.mazziesaccount@gmail.com Signed-off-by: Matti Vaittinen <[email protected]> Suggested-by: Krzysztof Kozlowski <[email protected]> Cc: Anup Patel <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Atish Patra <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Ben Widawsky <[email protected]> Cc: Bjorn Andersson <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Colin Ian King <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Qais Yousef <[email protected]> Cc: Vasily Averin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm/memory-failure.c: cleanup in unpoison_memory | Ma Wupeng | 1 | -4/+2
If freeit is true, the value of ret must be zero, so there is no need to check freeit after the unlock_mutex label. We can drop the freeit variable to do this cleanup. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ma Wupeng <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: zhenwei pi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm/thp: re-apply mkdirty for small pages after split | Peter Xu | 1 | -6/+8
We used to have 624a2c94f5b7 (Partly revert "mm/thp: carry over dirty bit when thp splits on pmd") fixing the regression reported by Anatoly Pugachev on sparc64: https://lore.kernel.org/r/[email protected] There we temporarily ignored the dirty bit for small pages.

Then Hev also reported a similar issue on loongarch (the original mail was private, but Anatoly copied the list here): https://lore.kernel.org/r/CADxRZqxqb7f_WhMh=jweZP+ynf_JwGd-0VwbYgp4P+T0-AXosw@mail.gmail.com

Hev pointed out that the issue is having the HW write bit set within pte_mkdirty(), so the split ptes can be written after the split even if they were, e.g., shared by more than one process, causing data corruption. Hev also tried to explain why loongarch sets the HW write bit in mkdirty: https://lore.kernel.org/r/CAHirt9itKO_K_HPboXh5AyJtt16Zf0cD73PtHvM=na39u_ztxA@mail.gmail.com

One way to fix it is what Huacai proposed here for loongarch (then we can re-apply the dirty bit in thp split): https://lore.kernel.org/r/[email protected] We may need a similar thing for sparc64, though.

For now, since we've found the root cause of the dirty bit issue, the simpler solution (which won't lose the dirty bit for small pages) that will work for both is to wr-protect after pte_mkdirty(), so the HW write bit can be persistent after the thp split. Add a comment for the wrprotect, so we will not mess up the ordering later.

With 624a2c94f5b7 (Partly revert "mm/thp: carry over dirty bit when thp splits on pmd") this is not a fix anymore, but it just brings back the dirty bit for the thp split safely, so we re-apply the optimization but in a safe way. Provide a Tested-by credit to Hev too (not the exact same patch but the same outcome) for loongarch.

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Tested-by: Hev <[email protected]> # loongarch Cc: Anatoly Pugachev <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Thorsten Leemhuis <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
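A sketch of the ordering described above for the pte reconstruction in __split_huge_pmd_locked(); the variable names are illustrative:

    if (dirty)
            entry = pte_mkdirty(entry);
    /*
     * Write-protect after pte_mkdirty(): some architectures (e.g.
     * loongarch) set the HW write bit in mkdirty, and it must not
     * survive the split for read-only mappings.
     */
    if (!write)
            entry = pte_wrprotect(entry);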
2022-11-30 | mm: vmscan: use sysfs_emit() instead of scnprintf() | Xu Panda | 1 | -1/+1
Replace open-coded snprintf() with sysfs_emit() to simplify the code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Xu Panda <[email protected]> Signed-off-by: Yang Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
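A generic before/after sketch of this kind of conversion (not the literal hunk from mm/vmscan.c):

    /* before: open-coded, the caller has to pass the sysfs buffer size */
    return snprintf(buf, PAGE_SIZE, "%d\n", value);

    /* after: sysfs_emit() already knows sysfs buffers are PAGE_SIZE */
    return sysfs_emit(buf, "%d\n", value);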
2022-11-30 | s390/mm: use pmd_pgtable_page() helper in __gmap_segment_gaddr() | Anshuman Khandual | 2 | -4/+3
In __gmap_segment_gaddr() pmd level page table page is being extracted from the pmd pointer, similar to pmd_pgtable_page() implementation. This reduces some redundancy by directly using pmd_pgtable_page() instead, though first making it available. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Acked-by: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Heiko Carstens <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm/thp: rename pmd_to_page() as pmd_pgtable_page() | Anshuman Khandual | 1 | -3/+3
The current pmd_to_page(), which derives the page table page containing the pmd address, has a very misleading name: it sounds similar to pmd_page(), which derives the page embedded in a given pmd entry, either a next-level page table page or a mapped huge page. Rename it to pmd_pgtable_page() instead. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
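A short sketch of the distinction the rename is meant to make obvious; pmdp is assumed to be a pmd_t * pointing into a page table:

    /* page table page that contains the pmd entry itself */
    struct page *ptpage = pmd_pgtable_page(pmdp);

    /* page that the pmd entry points to (next-level table or huge page) */
    struct page *target = pmd_page(*pmdp);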
2022-11-30 | zswap: do not allocate from atomic pool | Sergey Senozhatsky | 2 | -2/+9
zswap_frontswap_load() should be called from preemptible context (we even call mutex_lock() there) and it does not look like we need to do a GFP_ATOMIC allocation for the temp buffer. The same applies to zswap_writeback_entry(). Use GFP_KERNEL for temporary buffer allocation in both cases. Link: https://lkml.kernel.org/r/Y3xCTr6ikbtcUr/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Nhat Pham <[email protected]> Signed-off-by: Sergey Senozhatsky <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
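An illustrative before/after for the temporary-buffer allocation; the length expression is an assumption, not the exact zswap code:

    /* before: atomic allocation in a path that may sleep anyway */
    tmp = kmalloc(entry->length, GFP_ATOMIC);

    /* after: the context is preemptible, a sleeping allocation is fine */
    tmp = kmalloc(entry->length, GFP_KERNEL);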
2022-11-30 | documentation/mm: update pmd_present() in arch_pgtable_helpers.rst | Anshuman Khandual | 1 | -1/+1
Although pmd_present() might seem to indicate a valid and mapped pmd entry, in reality it returns true when pmd_page() points to a valid page in memory, regardless of whether the pmd entry is mapped or not. Andrea Arcangeli had earlier explained [1] the required semantics for pmd_present(). This just updates the documentation for pmd_present() as required. [1] https://lore.kernel.org/lkml/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm, compaction: fix fast_isolate_around() to stay within boundaries | NARIBAYASHI Akira | 1 | -13/+5
Depending on the memory configuration, isolate_freepages_block() may scan pages out of the target range and cause a panic. A panic can occur on systems with multiple zones in a single pageblock. The reason it is rare is that it only happens in special configurations. Depending on how many similar systems there are, it may be a good idea to fix this problem for older kernels as well.

The problem is that pfn as argument of fast_isolate_around() could be out of the target range. Therefore we should consider the case where pfn < start_pfn, and also the case where end_pfn < pfn. This problem should have been addressed by commit 6e2b7044c199 ("mm, compaction: make fast_isolate_freepages() stay within zone") but there was an oversight.

Case1: pfn < start_pfn

  <at memory compaction for node Y>
  | node X's zone | node Y's zone
  +-----------------+------------------------------...
   pageblock ^ ^ ^
  +-----------+-----------+-----------+-----------+...
              ^ ^ ^
              ^ ^ end_pfn
              ^ start_pfn = cc->zone->zone_start_pfn
              pfn
              <---------> scanned range by "Scan After"

Case2: end_pfn < pfn

  <at memory compaction for node X>
  | node X's zone | node Y's zone
  +-----------------+------------------------------...
   pageblock ^ ^ ^
  +-----------+-----------+-----------+-----------+...
              ^ ^ ^
              ^ ^ pfn
              ^ end_pfn
              start_pfn
              <---------> scanned range by "Scan Before"

It seems that there is no good reason to skip nr_isolated pages just after given pfn. So let's perform a simple scan from start to end instead of dividing the scan into "Before" and "After".

Link: https://lkml.kernel.org/r/[email protected] Fixes: 6e2b7044c199 ("mm, compaction: make fast_isolate_freepages() stay within zone"). Signed-off-by: NARIBAYASHI Akira <[email protected]> Cc: David Rientjes <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: document /sys/class/bdi/<bdi>/min_ratio_fine knob | Stefan Roesch | 1 | -0/+15
This documents the new /sys/class/bdi/<bdi>/min_ratio_fine knob. [[email protected]: fix htmldocs warnings] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add /sys/class/bdi/<bdi>/min_ratio_fine knob | Stefan Roesch | 1 | -0/+20
This adds the min_ratio_fine knob. The knob specifies the value not as parts of 100 (a percentage), but as parts per million. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add bdi_set_min_ratio_no_scale() function | Stefan Roesch | 2 | -0/+8
This introduces bdi_set_min_ratio_no_scale(). It uses the maximum granularity for the ratio. This function is used by the new sysfs knob min_ratio_fine. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: document /sys/class/bdi/<bdi>/max_ratio_fine knob | Stefan Roesch | 1 | -0/+13
This documents the new /sys/class/bdi/<bdi>/max_ratio_fine knob. [[email protected]: fix htmldocs warnings] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add /sys/class/bdi/<bdi>/max_ratio_fine knob | Stefan Roesch | 1 | -0/+20
This adds the max_ratio_fine knob. The knob specifies the value not as parts of 100 (a percentage), but as parts per million. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add bdi_set_max_ratio_no_scale() function | Stefan Roesch | 2 | -3/+9
This introduces bdi_set_max_ratio_no_scale(). It uses the maximum granularity for the ratio. This function is used by the new sysfs knob max_ratio_fine. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: document /sys/class/bdi/<bdi>/min_bytes knob | Stefan Roesch | 1 | -0/+15
This documents the new /sys/class/bdi/<bdi>/min_bytes knob. [[email protected]: fix htmldocs warnings] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add /sys/class/bdi/<bdi>/min_bytes knob | Stefan Roesch | 1 | -0/+29
bdi has two existing knobs to limit the amount of dirty memory: min_ratio and max_ratio. However, the granularity of the knobs is limited and often it is more convenient to specify limits in terms of bytes. This change adds the min_bytes knob. It does not store the min_bytes value; instead it converts the min_bytes value to a ratio. The value is therefore more an approximation than an absolute value. It also maintains the sum over all the bdi min_ratio values stored in the variable bdi_min_ratio. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add bdi_set_min_bytes() function | Stefan Roesch | 2 | -0/+15
This introduces the bdi_set_min_bytes() function. The min_bytes function does not store the min_bytes value. Instead it converts the min_bytes value into the corresponding ratio value. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: split off __bdi_set_min_ratio() function | Stefan Roesch | 1 | -1/+6
This splits off the __bdi_set_min_ratio() function from the bdi_set_min_ratio() function. The __bdi_set_min_ratio() function will also be called from the bdi_set_min_bytes() function, which will be introduced in the next patch. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add bdi_get_min_bytes() function | Stefan Roesch | 2 | -0/+6
This adds a function to return the specified value for min_bytes. It converts the stored min_ratio of the bdi to the corresponding bytes value. This is an approximation as it is based on the value that is returned by global_dirty_limits(), which can change. The returned value can be different than the value when the min_bytes value was set. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: document /sys/class/bdi/<bdi>/max_bytes knob | Stefan Roesch | 1 | -0/+14
This documents the new /sys/class/bdi/<bdi>/max_bytes knob. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add knob /sys/class/bdi/<bdi>/max_bytes | Stefan Roesch | 1 | -0/+29
This adds the new knob max_bytes to specify a dirty memory limit for the corresponding bdi. The specified bytes value is converted to a ratio. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add bdi_set_max_bytes() function | Stefan Roesch | 2 | -0/+38
This introduces the bdi_set_max_bytes() function. The max_bytes function does not store the max_bytes value. Instead it converts the max_bytes value into the corresponding ratio value. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: split off __bdi_set_max_ratio() function | Stefan Roesch | 1 | -5/+9
This splits off __bdi_set_max_ratio() from bdi_set_max_ratio(). __bdi_set_max_ratio() will also be called from bdi_set_max_bytes(), which will be introduced in the next patch. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add bdi_get_max_bytes() function | Stefan Roesch | 2 | -0/+18
This adds a function to return the specified value for max_bytes. It converts the stored max_ratio of the bdi to the corresponding bytes value. It introduces the bdi_get_bytes helper function to do the conversion. This is an approximation as it is based on the value that is returned by global_dirty_limits(), which can change. The helper function will also be used by the min_bytes bdi knob. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: use part per 1000000 for bdi ratios | Stefan Roesch | 3 | -9/+15
To get finer granularity for ratio calculations, use parts per million instead of percentages. This is especially important if we want to automatically convert byte values to ratios. Otherwise the values that are actually used can be quite different. This is also important for machines with more main memory (1% of 256GB is already 2.5GB). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
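A sketch of the scaling this implies; the BDI_RATIO_SCALE name comes from the series description further down and the numbers in the comment are only arithmetic examples:

    /* ratios kept internally as parts per million: N% -> N * BDI_RATIO_SCALE */
    #define BDI_RATIO_SCALE 10000   /* 100 * 10000 == 1000000 */

    /*
     * Example: on a 256GB machine, 1% is already ~2.5GB, while one part
     * per million is roughly 256KB, which is what byte-derived limits need.
     */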
2022-11-30 | mm: document /sys/class/bdi/<bdi>/strict_limit knob | Stefan Roesch | 1 | -0/+11
This documents the new /sys/class/bdi/<bdi>/strict_limit knob. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add knob /sys/class/bdi/<bdi>/strict_limit | Stefan Roesch | 1 | -0/+29
Add a new knob, /sys/class/bdi/<bdi>/strict_limit. This new knob allows setting/unsetting the BDI_CAP_STRICTLIMIT flag in the bdi capabilities. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add bdi_set_strict_limit() function | Stefan Roesch | 2 | -0/+16
Patch series "mm/block: add bdi sysfs knobs", v4. At meta network block devices (nbd) are used to implement remote block storage. In testing and during production it has been observed that these network block devices can consume a huge portion of the dirty writeback cache and writeback can take a considerable time. To be able to give stricter limits, I'm proposing the following changes: 1) introduce strictlimit knob Currently the max_ratio knob exists to limit the dirty_memory. However this knob only applies once (dirty_ratio + dirty_background_ratio) / 2 has been reached. With the BDI_CAP_STRICTLIMIT flag, the max_ratio can be applied without reaching that limit. This change exposes that knob. This knob can also be useful for NFS, fuse filesystems and USB devices. 2) Use part of 1000000 internal calculation The max_ratio is based on percentage. With the current machine sizes percentage values can be very high (1% of a 256GB main memory is already 2.5GB). This change uses part of 1000000 instead of percentages for the internal calculations. 3) Introduce two new sysfs knobs: min_bytes and max_bytes. Currently all calculations are based on ratio, but for a user it often more convenient to specify a limit in bytes. The new knobs will not store bytes values, instead they will translate the byte value to a corresponding ratio. As the internal values are now part of 1000, the ratio is closer to the specified value. However the value should be more seen as an approximation as it can fluctuate over time. 3) Introduce two new sysfs knobs: min_ratio_fine and max_ratio_fine. The granularity for the existing sysfs bdi knobs min_ratio and max_ratio is based on percentage values. The new sysfs bdi knobs min_ratio_fine and max_ratio_fine allow to specify the ratio as part of 1 million. This patch (of 20): This adds the bdi_set_strict_limit function to be able to set/unset the BDI_CAP_STRICTLIMIT flag. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Chris Mason <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | maple_tree: allow TEST_MAPLE_TREE only when DEBUG_KERNEL is set | Randy Dunlap | 1 | -0/+1
Prevent a kconfig warning that is caused by TEST_MAPLE_TREE by adding a "depends on" clause for TEST_MAPLE_TREE since 'select' does not follow any kconfig dependencies.

  WARNING: unmet direct dependencies detected for DEBUG_MAPLE_TREE
    Depends on [n]: DEBUG_KERNEL [=n]
    Selected by [y]:
    - TEST_MAPLE_TREE [=y] && RUNTIME_TESTING_MENU [=y]

Link: https://lkml.kernel.org/r/[email protected] Fixes: 120b116208a0 ("maple_tree: reorganize testing to restore module testing") Signed-off-by: Randy Dunlap <[email protected]> Reported-by: Geert Uytterhoeven <[email protected]> Reported-by: kernel test robot <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30Revert "kmsan: unpoison @tlb in arch_tlb_gather_mmu()"Alexander Potapenko1-10/+0
This reverts commit ac801e7e252c5588325e3c983c7d4167fc68c024. The patch in question was picked to -mm from the KMSAN v6 patch series (https://lore.kernel.org/linux-mm/[email protected]/) and sneaked into mainline despite its removal from the v7 series (https://lore.kernel.org/linux-mm/[email protected]/) Currently KMSAN does not warn about origin chains hitting the maximum depth, so keeping @tlb poisoned won't result in any inconveniences. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Eric Biggers <[email protected]> Cc: Marco Elver <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | folio-compat: remove try_to_release_page() | Vishal Moola (Oracle) | 2 | -7/+0
There are no more callers of try_to_release_page(), so remove it. This saves 85 bytes of kernel text. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Theodore Ts'o <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | memory-failure: convert truncate_error_page() to use folio | Vishal Moola (Oracle) | 1 | -2/+3
Replace try_to_release_page() with filemap_release_folio(). This change is in preparation for the removal of the try_to_release_page() wrapper. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Theodore Ts'o <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
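A generic sketch of the conversion pattern used across this series; the surrounding error handling is illustrative:

    /* before */
    if (!try_to_release_page(page, GFP_KERNEL))
            return -EBUSY;

    /* after: operate on the folio directly and skip the compat wrapper */
    if (!filemap_release_folio(page_folio(page), GFP_KERNEL))
            return -EBUSY;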
2022-11-30 | khugepage: replace try_to_release_page() with filemap_release_folio() | Vishal Moola (Oracle) | 1 | -11/+12
Replace some calls with their folio equivalents. This change removes 4 calls to compound_head() and is in preparation for the removal of the try_to_release_page() wrapper. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Theodore Ts'o <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | ext4: convert move_extent_per_page() to use folios | Vishal Moola (Oracle) | 1 | -21/+31
Patch series "Removing the try_to_release_page() wrapper", v3. This patchset replaces the remaining calls of try_to_release_page() with the folio equivalent: filemap_release_folio(). This allows us to remove the wrapper. This patch (of 4): Convert move_extent_per_page() to use folios. This change removes 5 calls to compound_head() and is in preparation for the removal of the try_to_release_page() wrapper. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Theodore Ts'o <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm/page_alloc: simplify locking during free_unref_page_list | Mel Gorman | 1 | -16/+9
While freeing a large list, the zone lock will be released and reacquired to avoid long hold times since commit c24ad77d962c ("mm/page_alloc.c: avoid excessive IRQ disabled times in free_unref_page_list()"). As suggested by Vlastimil Babka, the lock release/reacquire logic can be simplified by reusing the logic that acquires a different lock when changing zones. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm/page_alloc: leave IRQs enabled for per-cpu page allocations | Mel Gorman | 1 | -70/+54
The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task allocating from the PCP must not re-enter the allocator from IRQ context. In each instance where IRQ-reentrancy is possible, the lock is acquired using pcp_spin_trylock_irqsave() even though IRQs are disabled and re-entrancy is impossible. Demoting the lock to pcp_spin_lock avoids an IRQ disable/enable in the common case, at the cost of some IRQ allocations taking a slower path. If the PCP lists need to be refilled, the zone lock still needs to disable IRQs, but that will only happen on PCP refill and drain. If an IRQ is raised when a PCP allocation is in progress, the trylock will fail and fall back to using the buddy lists directly. Note that this may not be a universal win if an interrupt-intensive workload also allocates heavily from interrupt context and contends heavily on the zone->lock as a result. [[email protected]: migratetype might be wrong if a PCP was locked] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: reported lockdep issue on IO completion from softirq] [[email protected]: fix list corruption, lock improvements, micro-optimisations] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm/page_alloc: always remove pages from temporary list | Mel Gorman | 1 | -0/+2
Patch series "Leave IRQs enabled for per-cpu page allocations", v3. This patch (of 2): free_unref_page_list() has neglected to remove pages properly from the list of pages to free since forever. It works by coincidence because list_add happened to do the right thing adding the pages to just the PCP lists. However, a later patch added pages to either the PCP list or the zone list but only properly deleted the page from the list in one path leading to list corruption and a subsequent failure. As a preparation patch, always delete the pages from one list properly before adding to another. On its own, this fixes nothing although it adds a fractional amount of overhead but is critical to the next patch. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Reported-by: Hugh Dickins <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | selftests/vm: use memfd for hugepage-mmap test | Peter Xu | 1 | -6/+4
This test was overlooked: it still used a hard-coded mntpoint path when the hugetlb mntpoint was removed in commit 0796c7b8be84. Fix it up so the test can keep running. Link: https://lkml.kernel.org/r/Y3aojfUC2nSwbCzB@x1n Fixes: 0796c7b8be84 ("selftests/vm: drop mnt point for hugetlb in run_vmtests.sh") Signed-off-by: Peter Xu <[email protected]> Reported-by: Joel Savitz <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
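A sketch of how the test can obtain a hugetlb-backed mapping without any mount point, assuming the usual memfd flags; LENGTH and the helper name are illustrative:

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    #define LENGTH (2UL * 1024 * 1024)      /* one 2MB huge page (assumption) */

    static void *map_hugepage(void)
    {
            /* hugetlb memory without any hugetlbfs mount point */
            int fd = memfd_create("hugepage-mmap", MFD_HUGETLB);

            if (fd < 0)
                    return MAP_FAILED;
            return mmap(NULL, LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }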
2022-11-30 | zram: remove unused stats fields | Sergey Senozhatsky | 2 | -4/+0
We don't show num_reads and num_writes since we removed corresponding sysfs nodes in 2017. Block layer stats are exposed via /sys/block/zramX/stat file. However, we still increment those atomic vars and store them in zram stats. Remove leftovers. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Nitin Gupta <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm/migrate.c: stop using 0 as NULL pointer | Yang Li | 1 | -1/+1
mm/migrate.c:1198:24: warning: Using plain integer as NULL pointer Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3080 Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Li <[email protected]> Reported-by: Abaci Robot <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: multi-gen LRU: remove NULL checks on NODE_DATA() | Yu Zhao | 1 | -11/+2
NODE_DATA() is preallocated for all possible nodes after commit 09f49dca570a ("mm: handle uninitialized numa nodes gracefully"). Checking its return value against NULL is now unnecessary. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
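The pattern being removed, shown schematically; nid is assumed to be a valid node id supplied by the caller:

    /* before */
    struct pglist_data *pgdat = NODE_DATA(nid);

    if (!pgdat)
            return;

    /* after: NODE_DATA() is preallocated for all possible nodes since
     * 09f49dca570a, so the NULL check is dead code */
    struct pglist_data *pgdat = NODE_DATA(nid);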