path: root/mm
Age    Commit message    Author    Files    Lines
2022-05-23Merge tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-blockLinus Torvalds4-35/+21
Pull block updates from Jens Axboe: "Here are the core block changes for 5.19. This contains:
- blk-throttle accounting fix (Laibin)
- Series removing redundant assignments (Michal)
- Expose bio cache via the bio_set, so that DM can use it (Mike)
- Finish off the bio allocation interface cleanups by dealing with the weirdest member of the family. bio_kmalloc combines a kmalloc for the bio and bio_vecs with a hidden bio_init call and magic cleanup semantics (Christoph)
- Clean up the block layer API so that APIs consumed by file systems are (almost) only struct block_device based, so that file systems don't have to poke into block layer internals like the request_queue (Christoph)
- Clean up the blk_execute_rq* API (Christoph)
- Clean up various loose ends in the blk-cgroup code to make it easier to follow in preparation of reworking the blkcg assignment for bios (Christoph)
- Fix use-after-free issues in BFQ when processes with merged queues get moved to different cgroups (Jan)
- BFQ fixes (Jan)
- Various fixes and cleanups (Bart, Chengming, Fanjun, Julia, Ming, Wolfgang, me)"
* tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block: (83 commits)
  blk-mq: fix typo in comment
  bfq: Remove bfq_requeue_request_body()
  bfq: Remove superfluous conversion from RQ_BIC()
  bfq: Allow current waker to defend against a tentative one
  bfq: Relax waker detection for shared queues
  blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE()
  blk-throttle: Set BIO_THROTTLED when bio has been throttled
  blk-cgroup: Remove unnecessary rcu_read_lock/unlock()
  blk-cgroup: always terminate io.stat lines
  block, bfq: make bfq_has_work() more accurate
  block, bfq: protect 'bfqd->queued' by 'bfqd->lock'
  block: cleanup the VM accounting in submit_bio
  block: Fix the bio.bi_opf comment
  block: reorder the REQ_ flags
  blk-iocost: combine local_stat and desc_stat to stat
  block: improve the error message from bio_check_eod
  block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone
  block: remove superfluous calls to blkcg_bio_issue_init
  kthread: unexport kthread_blkcg
  blk-cgroup: cleanup blkcg_maybe_throttle_current
  ...
2022-05-23Merge branches 'slab/for-5.19/stackdepot' and 'slab/for-5.19/refactor' into slab/for-linusVlastimil Babka3-64/+105
2022-05-19mm: damon: use HPAGE_PMD_SIZEKefeng Wang3-4/+3
Use HPAGE_PMD_SIZE instead of open coding. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
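A sketch of the substitution (the helper name here is illustrative; the actual DAMON call sites simply use the constant inline):

    #include <linux/huge_mm.h>

    /* Open coded:   1UL << HPAGE_PMD_SHIFT
     * With the macro, the intent (one PMD-sized huge page) is explicit.
     */
    static unsigned long pmd_hugepage_bytes(void)
    {
            return HPAGE_PMD_SIZE;
    }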
2022-05-19mm: fix missing handler for __GFP_NOWARNQi Zheng3-8/+28
We expect no warnings to be issued when we specify __GFP_NOWARN, but currently paths like alloc_pages() and kmalloc() still print some warnings; fix that. Warnings that report usage problems, however, are deliberately left alone: if such a warning is printed, the usage problem itself should be fixed, as in the following case: WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); [[email protected]: v2] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qi Zheng <[email protected]> Cc: Akinobu Mita <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
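To illustrate the expected contract, a minimal sketch (the wrapper name and call site are hypothetical, not taken from the patch):

    #include <linux/gfp.h>

    /* With __GFP_NOWARN the caller opts out of allocation-failure warnings:
     * a failed high-order allocation here should stay silent and simply
     * return NULL for the caller to handle.
     */
    static struct page *try_alloc_quietly(unsigned int order)
    {
            return alloc_pages(GFP_KERNEL | __GFP_NOWARN, order);
    }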
2022-05-19mm/page_alloc: fix tracepoint mm_page_alloc_zone_locked()Wonhyuk Yang1-8/+5
Currently, the tracepoint mm_page_alloc_zone_locked() doesn't show correct information. First, when alloc_flags contains ALLOC_HARDER/ALLOC_CMA, a page can be allocated from MIGRATE_HIGHATOMIC/MIGRATE_CMA. Nevertheless, the tracepoint reports the requested migration type, not MIGRATE_HIGHATOMIC or MIGRATE_CMA. Second, after commit 44042b4498728 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists"), the percpu list can store high-order pages, but the tracepoint decides whether an allocation is a percpu-list refill by comparing the requested order with 0. To handle these problems, make mm_page_alloc_zone_locked() only be called from __rmqueue_smallest() with the correct migration type. With a new argument called percpu_refill, it can show roughly whether the allocation is a refill of the percpu list. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Wonhyuk Yang <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Baik Song An <[email protected]> Cc: Hong Yeon Kim <[email protected]> Cc: Taeung Song <[email protected]> Cc: <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm/page_owner.c: add missing __initdata attributeFanjun Kong1-1/+1
This patch fixes two issues:
1. Add the __initdata attribute, following include/linux/init.h: for initialized data, __initdata should be inserted between the variable name and the equal sign, followed by the value.
2. Fix the following error reported by checkpatch.pl: ERROR: do not initialise statics to false
Special thanks to Muchun Song :) Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Fanjun Kong <[email protected]> Suggested-by: Muchun Song <[email protected]> Reviewed-by: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
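A minimal sketch of the convention being applied (the variable name is illustrative, not the exact one in mm/page_owner.c):

    #include <linux/init.h>
    #include <linux/types.h>

    /* Wrong per checkpatch: a static explicitly initialised to false,
     * with no __initdata placement at all:
     *
     *     static bool feature_enabled = false;
     *
     * Right: __initdata sits after the variable name (before any "="),
     * and statics are left implicitly zero-initialised.
     */
    static bool feature_enabled __initdata;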
2022-05-19tmpfs: fix undefined-behaviour in shmem_reconfigure()Luo Meng1-0/+4
When shmem_reconfigure() calls __percpu_counter_compare(), the second parameter is unsigned long long. But in the definition of __percpu_counter_compare(), the second parameter is s64. So when __percpu_counter_compare() executes abs(count - rhs), UBSAN shows the following warning: ================================================================================ UBSAN: Undefined behaviour in lib/percpu_counter.c:209:6 signed integer overflow: 0 - -9223372036854775808 cannot be represented in type 'long long int' CPU: 1 PID: 9636 Comm: syz-executor.2 Tainted: G ---------r- - 4.18.0 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Call Trace: __dump_stack home/install/linux-rh-3-10/lib/dump_stack.c:77 [inline] dump_stack+0x125/0x1ae home/install/linux-rh-3-10/lib/dump_stack.c:117 ubsan_epilogue+0xe/0x81 home/install/linux-rh-3-10/lib/ubsan.c:159 handle_overflow+0x19d/0x1ec home/install/linux-rh-3-10/lib/ubsan.c:190 __percpu_counter_compare+0x124/0x140 home/install/linux-rh-3-10/lib/percpu_counter.c:209 percpu_counter_compare home/install/linux-rh-3-10/./include/linux/percpu_counter.h:50 [inline] shmem_remount_fs+0x1ce/0x6b0 home/install/linux-rh-3-10/mm/shmem.c:3530 do_remount_sb+0x11b/0x530 home/install/linux-rh-3-10/fs/super.c:888 do_remount home/install/linux-rh-3-10/fs/namespace.c:2344 [inline] do_mount+0xf8d/0x26b0 home/install/linux-rh-3-10/fs/namespace.c:2844 ksys_mount+0xad/0x120 home/install/linux-rh-3-10/fs/namespace.c:3075 __do_sys_mount home/install/linux-rh-3-10/fs/namespace.c:3089 [inline] __se_sys_mount home/install/linux-rh-3-10/fs/namespace.c:3086 [inline] __x64_sys_mount+0xbf/0x160 home/install/linux-rh-3-10/fs/namespace.c:3086 do_syscall_64+0xca/0x5c0 home/install/linux-rh-3-10/arch/x86/entry/common.c:298 entry_SYSCALL_64_after_hwframe+0x6a/0xdf RIP: 0033:0x46b5e9 Code: 5d db fa ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 2b db fa ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f54d5f22c68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 000000000077bf60 RCX: 000000000046b5e9 RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000000 RBP: 000000000077bf60 R08: 0000000020000140 R09: 0000000000000000 R10: 00000000026740a4 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffd1fb1592f R14: 00007f54d5f239c0 R15: 000000000077bf6c ================================================================================ [[email protected]: tweak error message text] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Luo Meng <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Yu Kuai <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
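The arithmetic problem is easy to reproduce outside the kernel; a standalone sketch (illustrative values, not the kernel code) of why abs(count - rhs) overflows when a huge unsigned quantity is squeezed into the s64 parameter:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            unsigned long long user_value = 0x8000000000000000ULL; /* > S64_MAX */
            int64_t rhs = (int64_t)user_value; /* ends up as INT64_MIN */
            int64_t count = 0;

            /* count - rhs would be 0 - INT64_MIN, which cannot be represented
             * in int64_t: exactly the signed overflow UBSAN flags inside
             * __percpu_counter_compare().
             */
            printf("count = %lld, rhs = %lld\n", (long long)count, (long long)rhs);
            return 0;
    }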
2022-05-19mm/mempolicy: fix uninit-value in mpol_rebind_policy()Wang Cheng1-1/+1
mpol_set_nodemask()(mm/mempolicy.c) does not set up nodemask when pol->mode is MPOL_LOCAL. Check pol->mode before access pol->w.cpuset_mems_allowed in mpol_rebind_policy()(mm/mempolicy.c). BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:352 [inline] BUG: KMSAN: uninit-value in mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368 mpol_rebind_policy mm/mempolicy.c:352 [inline] mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368 cpuset_change_task_nodemask kernel/cgroup/cpuset.c:1711 [inline] cpuset_attach+0x787/0x15e0 kernel/cgroup/cpuset.c:2278 cgroup_migrate_execute+0x1023/0x1d20 kernel/cgroup/cgroup.c:2515 cgroup_migrate kernel/cgroup/cgroup.c:2771 [inline] cgroup_attach_task+0x540/0x8b0 kernel/cgroup/cgroup.c:2804 __cgroup1_procs_write+0x5cc/0x7a0 kernel/cgroup/cgroup-v1.c:520 cgroup1_tasks_write+0x94/0xb0 kernel/cgroup/cgroup-v1.c:539 cgroup_file_write+0x4c2/0x9e0 kernel/cgroup/cgroup.c:3852 kernfs_fop_write_iter+0x66a/0x9f0 fs/kernfs/file.c:296 call_write_iter include/linux/fs.h:2162 [inline] new_sync_write fs/read_write.c:503 [inline] vfs_write+0x1318/0x2030 fs/read_write.c:590 ksys_write+0x28b/0x510 fs/read_write.c:643 __do_sys_write fs/read_write.c:655 [inline] __se_sys_write fs/read_write.c:652 [inline] __x64_sys_write+0xdb/0x120 fs/read_write.c:652 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae Uninit was created at: slab_post_alloc_hook mm/slab.h:524 [inline] slab_alloc_node mm/slub.c:3251 [inline] slab_alloc mm/slub.c:3259 [inline] kmem_cache_alloc+0x902/0x11c0 mm/slub.c:3264 mpol_new mm/mempolicy.c:293 [inline] do_set_mempolicy+0x421/0xb70 mm/mempolicy.c:853 kernel_set_mempolicy mm/mempolicy.c:1504 [inline] __do_sys_set_mempolicy mm/mempolicy.c:1510 [inline] __se_sys_set_mempolicy+0x44c/0xb60 mm/mempolicy.c:1507 __x64_sys_set_mempolicy+0xd8/0x110 mm/mempolicy.c:1507 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae KMSAN: uninit-value in mpol_rebind_task (2) https://syzkaller.appspot.com/bug?id=d6eb90f952c2a5de9ea718a1b873c55cb13b59dc This patch seems to fix below bug too. KMSAN: uninit-value in mpol_rebind_mm (2) https://syzkaller.appspot.com/bug?id=f2fecd0d7013f54ec4162f60743a2b28df40926b The uninit-value is pol->w.cpuset_mems_allowed in mpol_rebind_policy(). When syzkaller reproducer runs to the beginning of mpol_new(), mpol_new() mm/mempolicy.c do_mbind() mm/mempolicy.c kernel_mbind() mm/mempolicy.c `mode` is 1(MPOL_PREFERRED), nodes_empty(*nodes) is `true` and `flags` is 0. Then mode = MPOL_LOCAL; ... policy->mode = mode; policy->flags = flags; will be executed. So in mpol_set_nodemask(), mpol_set_nodemask() mm/mempolicy.c do_mbind() kernel_mbind() pol->mode is 4 (MPOL_LOCAL), that `nodemask` in `pol` is not initialized, which will be accessed in mpol_rebind_policy(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Wang Cheng <[email protected]> Reported-by: <[email protected]> Tested-by: <[email protected]> Cc: David Rientjes <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm: don't be stuck to rmap lock on reclaim pathMinchan Kim5-18/+60
The rmap locks (i_mmap_rwsem and anon_vma->root->rwsem) can be contended under memory pressure if processes keep working on their vmas (e.g., fork, mmap, munmap), and that gets the reclaim path stuck. In our real workload traces, we see kswapd waiting on the lock for 300ms+ (worst case, a second), which pushes other processes into direct reclaim, where they were also stuck on the lock. This patch makes the lru aging path use try_lock mode, like shrink_page_list, so the reclaim context keeps working on the next lru pages without being stuck. If it finds the rmap lock contended, it rotates the page back to the head of the lru in both the active and inactive lrus to keep the behavior consistent; this is a basic starting point rather than adding more heuristics. Since this patch introduces a new "contended" field as an out-param along with the try_lock in-param in rmap_walk_control, the structure is no longer immutable when try_lock is set, so remove the const keywords on rmap-related functions. Since rmap walking is already an expensive operation, I doubt the const provided a sizable benefit (and we didn't have it until 5.17). In a heavy app workload on Android, the trace shows the following statistics; it almost removes rmap lock contention from the reclaim path. Martin Liu reported:
Before:
max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function
1632 0 1631 151.542173 31672 209 page_lock_anon_vma_read
601 0 601 145.544681 28817 198 rmap_walk_file
After:
max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function
NaN NaN NaN NaN NaN 0.0 NaN
0 0 0 0.127645 1 12 rmap_walk_file
[[email protected]: add comment, per Matthew] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Michal Hocko <[email protected]> Cc: John Dias <[email protected]> Cc: Tim Murray <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Martin Liu <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19zswap: memcg accountingJohannes Weiner2-15/+227
Applications can currently escape their cgroup memory containment when zswap is enabled. This patch adds per-cgroup tracking and limiting of zswap backend memory to rectify this. The existing cgroup2 memory.stat file is extended to show zswap statistics analogous to what's in meminfo and vmstat. Furthermore, two new control files, memory.zswap.current and memory.zswap.max, are added to allow tuning zswap usage on a per-workload basis. This is important since not all workloads benefit from zswap equally; some even suffer compared to disk swap when memory contents don't compress well. The optimal size of the zswap pool, and the threshold for writeback, also depend on the size of the workload's warm set. The implementation doesn't use a traditional page_counter transaction. zswap is unconventional as a memory consumer in that we only know the amount of memory to charge once expensive compression has occurred. If zswap is disabled or the limit is already exceeded we obviously don't want to compress page upon page only to reject them all. Instead, the limit is checked against current usage, then we compress and charge. This allows some limit overrun, but not enough to matter in practice. [[email protected]: fix for CONFIG_SLOB builds] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: opt out of cgroups v1] Link: https://lkml.kernel.org/r/Yn6it9mBYFA+/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm: zswap: add basic meminfo and vmstat coverageJohannes Weiner2-7/+10
Currently it requires poking at debugfs to figure out the size and population of the zswap cache on a host. There are no counters for reads and writes against the cache. As a result, it's difficult to understand zswap behavior on production systems. Print zswap memory consumption and how many pages are zswapped out in /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm: Kconfig: simplify zswap configurationJohannes Weiner1-30/+25
- CONFIG_ZRAM: Zram is a user-facing feature, whereas zsmalloc is not. Don't make the user chase down a technical dependency like that, just select it in automatically when zram is requested. The CONFIG_CRYPTO dependency is redundant due to more specific deps. - CONFIG_ZPOOL: This is not a user-facing feature. Hide the symbol and have it selected in as needed. - CONFIG_ZSWAP: Select CRYPTO instead of depend. Common pattern. - Make the ZSWAP suboptions and their descriptions (compression, allocation backend) a bit more straight-forward for the user. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm: Kconfig: group swap, slab, hotplug and thp options into submenusJohannes Weiner1-217/+230
There are several clusters of related config options spread throughout the mostly flat MM submenu. Group them together and put specialization options into further subdirectories to make the MM submenu a bit more organized and easier to navigate. [[email protected]: fix kbuild warnings] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix more kbuild warnings] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm: Kconfig: move swap and slab config options to the MM sectionJohannes Weiner1-0/+123
These are currently under General Setup. MM seems like a better fit. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm/swap: fix comment about swap extentMiaohe Lin1-9/+6
Since commit 4efaceb1c5f8 ("mm, swap: use rbtree for swap_extent"), rbtree is used for swap extent. Also curr_swap_extent is removed at that time. Update the corresponding comment. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Alistair Popple <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: NeilBrown <[email protected]> Cc: Peter Xu <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm/swap: fix the comment of get_kernel_pagesMiaohe Lin1-4/+4
If no pages were pinned, 0 is returned in fact. Fix the corresponding comment. [[email protected]: s/nr_pages/nr_segs/ also, per David, reflow comment] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Alistair Popple <[email protected]> Cc: David Howells <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: NeilBrown <[email protected]> Cc: Peter Xu <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm/swap: clean up the comment of find_next_to_unuseMiaohe Lin1-3/+3
Since commit 10a9c496789f ("mm: simplify try_to_unuse"), frontswap parameter is removed. Update the corresponding comment. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Alistair Popple <[email protected]> Cc: David Howells <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: NeilBrown <[email protected]> Cc: Peter Xu <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm/swap: add helper swap_offset_available()Miaohe Lin1-16/+18
Add helper swap_offset_available() to remove some duplicated codes. Minor readability improvement. [[email protected]: s/swap_offset_available/swap_offset_available_and_locked/, per Neil] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Alistair Popple <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: NeilBrown <[email protected]> Cc: Peter Xu <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm/swap: avoid calling swp_swap_info when try to check SWP_STABLE_WRITESMiaohe Lin1-1/+1
Use flags of si directly to check SWP_STABLE_WRITES to avoid possible READ_ONCE and thus save some cpu cycles. [[email protected]: use data_race() on si->flags, per Neil] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Alistair Popple <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: NeilBrown <[email protected]> Cc: Peter Xu <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
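A minimal sketch of the resulting pattern (assuming a swap_info_struct pointer si is already at hand; the helper name is illustrative, not the actual call site):

    #include <linux/swap.h>

    /* si->flags may change concurrently, but this is only a hint, so the
     * racy read is annotated with data_race() rather than re-deriving si
     * through swp_swap_info() or adding a READ_ONCE().
     */
    static bool swap_writes_are_stable(struct swap_info_struct *si)
    {
            return data_race(si->flags & SWP_STABLE_WRITES);
    }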
2022-05-19mm/swap: make page_swapcount and __lru_add_drain_all staticMiaohe Lin2-2/+2
Make page_swapcount and __lru_add_drain_all static. They are only used within the file now. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Alistair Popple <[email protected]> Cc: David Howells <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: NeilBrown <[email protected]> Cc: Peter Xu <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm/swap: remove unneeded p != NULL check in __swap_duplicateMiaohe Lin1-2/+1
If p is NULL, __swap_duplicate will already return -EINVAL. So if we reach here, p must be non-NULL. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Alistair Popple <[email protected]> Cc: David Howells <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: NeilBrown <[email protected]> Cc: Peter Xu <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm/swap: remove buggy cache->nr check in refill_swap_slots_cacheMiaohe Lin1-1/+1
refill_swap_slots_cache is always called when cache->nr is 0. So remove such buggy and confusing check. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Alistair Popple <[email protected]> Cc: David Howells <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: NeilBrown <[email protected]> Cc: Peter Xu <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm/swap: print bad swap offset entry in get_swap_deviceMiaohe Lin1-0/+1
If offset exceeds the si->max, print bad swap offset entry to help debug the unexpected case. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Alistair Popple <[email protected]> Cc: David Howells <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: NeilBrown <[email protected]> Cc: Peter Xu <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm/swap: remove unneeded return value of free_swap_slotMiaohe Lin1-3/+1
The return value of free_swap_slot is always 0 and also ignored now. Remove it to clean up the code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Alistair Popple <[email protected]> Cc: David Howells <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: NeilBrown <[email protected]> Cc: Peter Xu <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm/swap: fold __swap_info_get() into its sole callerMiaohe Lin1-18/+6
Fold __swap_info_get() into its sole caller to make code more clear. Minor readability improvement. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Alistair Popple <[email protected]> Cc: David Howells <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: NeilBrown <[email protected]> Cc: Peter Xu <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm/swap: use helper macro __ATTR_RWMiaohe Lin1-3/+1
Use helper macro __ATTR_RW to define vma_ra_enabled_attr to make code more clear. Minor readability improvement. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Alistair Popple <[email protected]> Cc: David Howells <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: NeilBrown <[email protected]> Cc: Peter Xu <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
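For context, a sketch of what the macro buys (assuming the conventional vma_ra_enabled_show()/vma_ra_enabled_store() handlers, which is the naming __ATTR_RW() relies on):

    #include <linux/kobject.h>
    #include <linux/sysfs.h>

    static ssize_t vma_ra_enabled_show(struct kobject *kobj,
                                       struct kobj_attribute *attr, char *buf);
    static ssize_t vma_ra_enabled_store(struct kobject *kobj,
                                        struct kobj_attribute *attr,
                                        const char *buf, size_t count);

    /* Open coded:
     *     __ATTR(vma_ra_enabled, 0644, vma_ra_enabled_show, vma_ra_enabled_store)
     * __ATTR_RW(name) supplies the 0644 mode and wires up name_show()/name_store()
     * by naming convention.
     */
    static struct kobj_attribute vma_ra_enabled_attr = __ATTR_RW(vma_ra_enabled);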
2022-05-19mm/swap: use helper is_swap_pte() in swap_vma_readaheadMiaohe Lin1-3/+1
Patch series "A few cleanup patches for swap". This series contains a few patches to fix the comment, remove unneeded return value, use some helpers and so on. More details can be found in the respective changelogs. This patch (of 14): Use helper is_swap_pte() to check whether pte is swap entry to make code more clear. Minor readability improvement. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: David Howells <[email protected]> Cc: NeilBrown <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Peter Xu <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm: mmap: register suitable readonly file vmas for khugepagedYang Shi2-4/+7
The readonly FS THP relies on khugepaged to collapse THP for suitable vmas. But the behavior is inconsistent for "always" mode (https://lore.kernel.org/linux-mm/[email protected]/). The "always" mode means THP allocation should be tried all the time and khugepaged should try to collapse THP all the time. Of course the allocation and collapse may fail due to other factors and conditions. Currently file THP may not be collapsed by khugepaged even though all the conditions are met. That does break the semantics of "always" mode. So make sure readonly FS vmas are registered to khugepaged to fix the break. Register suitable vmas in common mmap path, that could cover both readonly FS vmas and shmem vmas, so remove the khugepaged calls in shmem.c. Still need to keep the khugepaged call in vma_merge() since vma_merge() is called in a lot of places, for example, madvise, mprotect, etc. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reported-by: Vlastimil Babka <[email protected]> Acked-by: Vlastmil Babka <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Song Liu <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Zi Yan <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Song Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm: khugepaged: introduce khugepaged_enter_vma() helperYang Shi3-17/+9
The khugepaged_enter_vma_merge() actually does as the same thing as the khugepaged_enter() section called by shmem_mmap(), so consolidate them into one helper and rename it to khugepaged_enter_vma(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Vlastmil Babka <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Song Liu <[email protected]> Cc: Song Liu <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm: khugepaged: make hugepage_vma_check() non-staticYang Shi1-16/+9
The hugepage_vma_check() could be reused by khugepaged_enter() and khugepaged_enter_vma_merge(), but it is static in khugepaged.c. Make it non-static and declare it in khugepaged.h. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Suggested-by: Vlastimil Babka <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Song Liu <[email protected]> Cc: Song Liu <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm: khugepaged: make khugepaged_enter() void functionYang Shi2-13/+9
Most callers of khugepaged_enter() don't care about the return value. Only dup_mmap(), the anonymous THP page fault and MADV_HUGEPAGE handle the error by returning -ENOMEM. Actually it is not harmful for them to ignore the error case either. It also seems like overkill to fail fork() and the page fault early due to a khugepaged_enter() error, and MADV_HUGEPAGE sets the VM_HUGEPAGE flag regardless of the error. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Song Liu <[email protected]> Acked-by: Vlastmil Babka <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Song Liu <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm: thp: only regular file could be THP eligibleYang Shi2-16/+4
Since commit a4aeaa06d45e ("mm: khugepaged: skip huge page collapse for special files"), khugepaged just collapses THP for regular file which is the intended usecase for readonly fs THP. Only show regular file as THP eligible accordingly. And make file_thp_enabled() available for khugepaged too in order to remove duplicate code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Song Liu <[email protected]> Acked-by: Vlastmil Babka <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Song Liu <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm: khugepaged: skip DAX vmaYang Shi1-0/+4
The DAX vma may be seen by khugepaged when the mm has other khugepaged suitable vmas. So khugepaged may try to collapse THP for DAX vma, but it will fail due to page sanity check, for example, page is not on LRU. So it is not harmful, but it is definitely pointless to run khugepaged against DAX vma, so skip it in early check. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Acked-by: Song Liu <[email protected]> Acked-by: Vlastmil Babka <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Song Liu <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19mm: khugepaged: remove redundant check for VM_NO_KHUGEPAGEDYang Shi1-3/+6
The hugepage_vma_check() called by khugepaged_enter_vma_merge() does check VM_NO_KHUGEPAGED. Remove the check from caller and move the check in hugepage_vma_check() up. More checks may be run for VM_NO_KHUGEPAGED vmas, but MADV_HUGEPAGE is definitely not a hot path, so cleaner code does outweigh. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Acked-by: Song Liu <[email protected]> Acked-by: Vlastmil Babka <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Song Liu <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-19random: move randomize_page() into mm where it belongsJason A. Donenfeld1-0/+32
randomize_page is an mm function. It is documented like one. It contains the history of one. It has the naming convention of one. It looks just like another very similar function in mm, randomize_stack_top(). And it has always been maintained and updated by mm people. There is no need for it to be in random.c. In the "which shape does not look like the other ones" test, pointing to randomize_page() is correct. So move randomize_page() into mm/util.c, right next to the similar randomize_stack_top() function. This commit contains no actual code changes. Cc: Andrew Morton <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-05-16mm: usercopy: move the virt_addr_valid() below the is_vmalloc_addr()Yuanzheng Song1-3/+3
The is_kmap_addr() and is_vmalloc_addr() checks in check_heap_object() will not work, because the preceding virt_addr_valid() check excludes the kmap and vmalloc regions, so such addresses bail out before ever reaching those two checks. So let's move the virt_addr_valid() check below the is_vmalloc_addr() one. Signed-off-by: Yuanzheng Song <[email protected]> Fixes: 4e140f59d285 ("mm/usercopy: Check kmap addresses properly") Fixes: 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns") Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
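A rough sketch of the intended ordering (heavily simplified and only an assumption about shape; the real check_heap_object() performs bounds checking at each step rather than returning immediately):

    #include <linux/highmem.h>
    #include <linux/mm.h>

    /* kmap and vmalloc addresses are not virt_addr_valid(), so they must be
     * classified before the linear-map validity check, not after it.
     */
    static void check_heap_object_sketch(const void *ptr, unsigned long n)
    {
            if (is_kmap_addr(ptr))
                    return; /* bounds-check against the kmap page here */

            if (is_vmalloc_addr(ptr))
                    return; /* bounds-check against the vmalloc area here */

            if (!virt_addr_valid(ptr))
                    return; /* not a linear-map address, nothing more to check */

            /* ... continue with slab/folio checks ... */
    }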
2022-05-13mm, compaction: fast_find_migrateblock() should return pfn in the target zoneRei Yamamoto1-0/+2
At present, pages not in the target zone are added to cc->migratepages list in isolate_migratepages_block(). As a result, pages may migrate between nodes unintentionally. This would be a serious problem for older kernels without commit a984226f457f849e ("mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec"), because it can corrupt the lru list by handling pages in list without holding proper lru_lock. Avoid returning a pfn outside the target zone in the case that it is not aligned with a pageblock boundary. Otherwise isolate_migratepages_block() will handle pages not in the target zone. Link: https://lkml.kernel.org/r/[email protected] Fixes: 70b44595eafe ("mm, compaction: use free lists to quickly locate a migration source") Signed-off-by: Rei Yamamoto <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Acked-by: Mel Gorman <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Don Dutile <[email protected]> Cc: Wonhyuk Yang <[email protected]> Cc: Rei Yamamoto <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-13mm/memcontrol: export memcg->watermark via sysfs for v2 memcgGanesan Rajagopal1-0/+13
We run a lot of automated tests when building our software and run into OOM scenarios when the tests run unbounded. v1 memcg exports memcg->watermark as "memory.max_usage_in_bytes" in sysfs. We use this metric to heuristically limit the number of tests that can run in parallel based on per test historical data. This metric is currently not exported for v2 memcg and there is no other easy way of getting this information. getrusage() syscall returns "ru_maxrss" which can be used as an approximation but that's the max RSS of a single child process across all children instead of the aggregated max for all child processes. The only work around is to periodically poll "memory.current" but that's not practical for short-lived one-off cgroups. Hence, expose memcg->watermark as "memory.peak" for v2 memcg. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ganesan Rajagopal <[email protected]> Acked-by: Shakeel Butt <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Roman Gushchin <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-13mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctlMuchun Song2-15/+85
We must add hugetlb_free_vmemmap=on (or "off") to the boot cmdline and reboot the server to enable or disable the feature of optimizing vmemmap pages associated with HugeTLB pages. However, rebooting usually takes a long time. So add a sysctl to enable or disable the feature at runtime without rebooting. Why do we need this? There are 3 use cases.
1) The feature of minimizing overhead of struct page associated with each HugeTLB is disabled by default without passing "hugetlb_free_vmemmap=on" to the boot cmdline. When we (ByteDance) deliver the servers to the users who want to enable this feature, they have to configure grub (change the boot cmdline) and reboot the servers, whereas rebooting usually takes a long time (we have thousands of servers). It's a very bad experience for the users, so we need an approach to enable this feature after the system has booted. This is a use case in our practical environment.
2) Some use cases are that HugeTLB pages are allocated 'on the fly' instead of being pulled from the HugeTLB pool; those workloads would be affected with this feature enabled. Such workloads can be identified by the characteristic that they never explicitly allocate huge pages with 'nr_hugepages' but only set 'nr_overcommit_hugepages' and then let the pages be allocated from the buddy allocator at fault time. We can confirm it is a real use case from the commit 099730d67417. For those workloads, the page fault time could be ~2x slower than before. We suspect those users want to disable this feature if the system has enabled it before and they don't think the memory savings benefit is enough to make up for the performance drop.
3) The workload which wants vmemmap pages to be optimized and the workload which wants to set 'nr_overcommit_hugepages' (and does not want the extra overhead at fault time when the overcommitted pages are allocated from the buddy allocator) may be deployed on the same server. In that case, the user could enable this feature, set 'nr_hugepages' and 'nr_overcommit_hugepages', and then disable the feature; the overcommitted HugeTLB pages will then not encounter the extra overhead at fault time.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Kees Cook <[email protected]> Cc: Iurii Zaikin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-13mm: hugetlb_vmemmap: use kstrtobool for hugetlb_vmemmap param parsingMuchun Song1-5/+5
Use kstrtobool rather than open coding "on" and "off" parsing in mm/hugetlb_vmemmap.c; it is more robust, handling all the usual boolean spellings such as 'Yy1Nn0' or [oO][NnFf] for "on" and "off". Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Iurii Zaikin <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kees Cook <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
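A minimal sketch of the pattern (the handler name is illustrative, not the exact one in mm/hugetlb_vmemmap.c):

    #include <linux/kernel.h>

    /* kstrtobool() accepts "1"/"0", "y"/"n", "Y"/"N", "on"/"off" and similar
     * spellings, so the handler no longer needs its own strcmp(buf, "on")
     * / strcmp(buf, "off") ladder.
     */
    static int demo_param_handler(const char *buf, bool *enabled)
    {
            if (kstrtobool(buf, enabled))
                    return -EINVAL;

            return 0; /* caller acts on the parsed boolean */
    }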
2022-05-13mm: memory_hotplug: override memmap_on_memory when hugetlb_free_vmemmap=onMuchun Song1-6/+26
Optimizing HugeTLB vmemmap pages is not compatible with allocating memmap on hot added memory. If "hugetlb_free_vmemmap=on" and "memory_hotplug.memmap_on_memory" are both passed on the kernel command line, optimizing hugetlb pages takes precedence. However, the global variable memmap_on_memory will still be set to 1, even though we will not try to allocate memmap on hot added memory. Also introduce the mhp_memmap_on_memory() helper to move the definition of "memmap_on_memory" to the scope of CONFIG_MHP_MEMMAP_ON_MEMORY. In the next patch, mhp_memmap_on_memory() will also be exported to be used in hugetlb_vmemmap.c. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Acked-by: Mike Kravetz <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Iurii Zaikin <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kees Cook <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-13mm: hugetlb_vmemmap: disable hugetlb_optimize_vmemmap when struct page crosses page boundariesMuchun Song1-6/+6
Patch series "add hugetlb_optimize_vmemmap sysctl", v11. This series aims to add a hugetlb_optimize_vmemmap sysctl to enable or disable the feature of optimizing vmemmap pages associated with HugeTLB pages. This patch (of 4): If the size of "struct page" is not a power of two and the feature of minimizing overhead of struct page associated with each HugeTLB page is enabled, then the vmemmap pages of HugeTLB will be corrupted after remapping (a panic is about to happen in theory). But this only exists when !CONFIG_MEMCG && !CONFIG_SLUB on x86_64. However, that is not a conventional configuration nowadays, so it is not a real world issue, just the result of a code review. But we cannot prevent anyone from configuring that combination, so hugetlb_optimize_vmemmap should be disabled in this case to fix the issue. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Iurii Zaikin <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kees Cook <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
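A minimal sketch of the kind of guard this implies (hypothetical helper; the real change lives in mm/hugetlb_vmemmap.c and its exact form may differ):

    #include <linux/log2.h>
    #include <linux/mm_types.h>

    /* When sizeof(struct page) is not a power of two, a HugeTLB page's tail
     * struct pages do not line up with page boundaries as a group, so the
     * vmemmap remapping trick cannot be applied safely; report the
     * optimization as unusable.
     */
    static bool vmemmap_optimization_supported(void)
    {
            return is_power_of_2(sizeof(struct page));
    }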
2022-05-13mm: rmap: fix CONT-PTE/PMD size hugetlb issue when unmappingBaolin Wang1-17/+22
On some architectures (like ARM64), it can support CONT-PTE/PMD size hugetlb, which means it can support not only PMD/PUD size hugetlb: 2M and 1G, but also CONT-PTE/PMD size: 64K and 32M if a 4K page size specified. When unmapping a hugetlb page, we will get the relevant page table entry by huge_pte_offset() only once to nuke it. This is correct for PMD or PUD size hugetlb, since they always contain only one pmd entry or pud entry in the page table. However this is incorrect for CONT-PTE and CONT-PMD size hugetlb, since they can contain several continuous pte or pmd entry with same page table attributes, so we will nuke only one pte or pmd entry for this CONT-PTE/PMD size hugetlb page. And now try_to_unmap() is only passed a hugetlb page in the case where the hugetlb page is poisoned. Which means now we will unmap only one pte entry for a CONT-PTE or CONT-PMD size poisoned hugetlb page, and we can still access other subpages of a CONT-PTE or CONT-PMD size poisoned hugetlb page, which will cause serious issues possibly. So we should change to use huge_ptep_clear_flush() to nuke the hugetlb page table to fix this issue, which already considered CONT-PTE and CONT-PMD size hugetlb. We've already used set_huge_swap_pte_at() to set a poisoned swap entry for a poisoned hugetlb page. Meanwhile adding a VM_BUG_ON() to make sure the passed hugetlb page is poisoned in try_to_unmap(). Link: https://lkml.kernel.org/r/0a2e547238cad5bc153a85c3e9658cb9d55f9cac.1652270205.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/730ea4b6d292f32fb10b7a4e87dad49b0eb30474.1652147571.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Reviewed-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: David S. Miller <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: James Bottomley <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Rich Felker <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-13mm: rmap: fix CONT-PTE/PMD size hugetlb issue when migrationBaolin Wang1-6/+18
On some architectures (like ARM64), it can support CONT-PTE/PMD size hugetlb, which means it can support not only PMD/PUD size hugetlb: 2M and 1G, but also CONT-PTE/PMD size: 64K and 32M if a 4K page size specified. When migrating a hugetlb page, we will get the relevant page table entry by huge_pte_offset() only once to nuke it and remap it with a migration pte entry. This is correct for PMD or PUD size hugetlb, since they always contain only one pmd entry or pud entry in the page table. However this is incorrect for CONT-PTE and CONT-PMD size hugetlb, since they can contain several continuous pte or pmd entry with same page table attributes. So we will nuke or remap only one pte or pmd entry for this CONT-PTE/PMD size hugetlb page, which is not expected for hugetlb migration. The problem is we can still continue to modify the subpages' data of a hugetlb page during migrating a hugetlb page, which can cause a serious data consistent issue, since we did not nuke the page table entry and set a migration pte for the subpages of a hugetlb page. To fix this issue, we should change to use huge_ptep_clear_flush() to nuke a hugetlb page table, and remap it with set_huge_pte_at() and set_huge_swap_pte_at() when migrating a hugetlb page, which already considered the CONT-PTE or CONT-PMD size hugetlb. [[email protected]: fix nommu build] [[email protected]: fix build errors for !CONFIG_MMU] Link: https://lkml.kernel.org/r/a4baca670aca637e7198d9ae4543b8873cb224dc.1652270205.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/ea5abf529f0997b5430961012bfda6166c1efc8c.1652147571.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Reviewed-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: David S. Miller <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: James Bottomley <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Rich Felker <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-13zsmalloc: fix races between asynchronous zspage free and page migrationSultan Alsawaf1-4/+33
The asynchronous zspage free worker tries to lock a zspage's entire page list without defending against page migration. Since pages which haven't yet been locked can concurrently migrate off the zspage page list while lock_zspage() churns away, lock_zspage() can suffer from a few different lethal races. It can lock a page which no longer belongs to the zspage and unsafely dereference page_private(), it can unsafely dereference a torn pointer to the next page (since there's a data race), and it can observe a spurious NULL pointer to the next page and thus not lock all of the zspage's pages (since a single page migration will reconstruct the entire page list, and create_page_chain() unconditionally zeroes out each list pointer in the process). Fix the races by using migrate_read_lock() in lock_zspage() to synchronize with page migration. Link: https://lkml.kernel.org/r/[email protected] Fixes: 77ff465799c602 ("zsmalloc: zs_page_migrate: skip unnecessary loops but not return -EBUSY if zspage is not inuse") Signed-off-by: Sultan Alsawaf <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-13Revert "mm/cma.c: remove redundant cma_mutex lock"Dong Aisheng1-1/+3
This reverts commit a4efc174b382fcdb which introduced a regression: when there are multiple processes allocating dma memory in parallel by calling dma_alloc_coherent(), it may sometimes fail as follows:
Error log: cma: cma_alloc: linux,cma: alloc failed, req-size: 148 pages, ret: -16 cma: number of available pages: 3@125+20@172+12@236+4@380+32@736+17@2287+23@2473+20@36076+99@40477+108@40852+44@41108+20@41196+108@41364+108@41620+ 108@42900+108@43156+483@44061+1763@45341+1440@47712+20@49324+20@49388+5076@49452+2304@55040+35@58141+20@58220+20@58284+ 7188@58348+84@66220+7276@66452+227@74525+6371@75549=> 33161 free of 81920 total pages
When the issue happened, we saw there were still 33161 pages (129M) of free CMA memory and plenty of free slots for 148 pages in the CMA bitmap that we wanted to allocate from. When dumping memory info, we found that there was also ~342M normal memory, but only 1352K CMA memory left in the buddy system, while a lot of pageblocks were isolated.
Memory info log: Normal free:351096kB min:30000kB low:37500kB high:45000kB reserved_highatomic:0KB active_anon:98060kB inactive_anon:98948kB active_file:60864kB inactive_file:31776kB unevictable:0kB writepending:0kB present:1048576kB managed:1018328kB mlocked:0kB bounce:0kB free_pcp:220kB local_pcp:192kB free_cma:1352kB lowmem_reserve[]: 0 0 0 Normal: 78*4kB (UECI) 1772*8kB (UMECI) 1335*16kB (UMECI) 360*32kB (UMECI) 65*64kB (UMCI) 36*128kB (UMECI) 16*256kB (UMCI) 6*512kB (EI) 8*1024kB (UEI) 4*2048kB (MI) 8*4096kB (EI) 8*8192kB (UI) 3*16384kB (EI) 8*32768kB (M) = 489288kB
The root cause of this issue is that since commit a4efc174b382 ("mm/cma.c: remove redundant cma_mutex lock"), CMA supports concurrent memory allocation. It's possible that the memory range process A is trying to alloc has already been isolated by the allocation of process B during memory migration. The problem here is that the memory range isolated during one allocation by start_isolate_page_range() could be much bigger than the real size we want to alloc, because the range is aligned to MAX_ORDER_NR_PAGES. Taking an ARMv7 platform with 1G memory as an example, when MAX_ORDER_NR_PAGES is big (e.g. 32M with max_order 14) and CMA memory is relatively small (e.g. 128M), there are only 4 MAX_ORDER slots, so it's very easy for all CMA memory to have already been isolated by other processes when one tries to allocate memory using dma_alloc_coherent(). Since the current CMA code only scans the whole available CMA memory once, dma_alloc_coherent() may easily fail due to contention with other processes. This patch simply falls back to the original method of using cma_mutex to make alloc_contig_range() run sequentially to avoid the issue.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/all/[email protected]/ Fixes: a4efc174b382 ("mm/cma.c: remove redundant cma_mutex lock") Signed-off-by: Dong Aisheng <[email protected]> Acked-by: Minchan Kim <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Lecopzer Chen <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> [5.11+] Signed-off-by: Andrew Morton <[email protected]>
2022-05-13Merge tag 'mm-hotfixes-stable-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds4-16/+18
Pull misc fixes from Andrew Morton: "Seven MM fixes, three of which address issues added in the most recent merge window, four of which are cc:stable. Three non-MM fixes, none very serious" [ And yes, that's a real pull request from Andrew, not me creating a branch from emailed patches. Woo-hoo! ]
* tag 'mm-hotfixes-stable-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: add a mailing list for DAMON development
  selftests: vm: Makefile: rename TARGETS to VMTARGETS
  mm/kfence: reset PG_slab and memcg_data before freeing __kfence_pool
  mailmap: add entry for [email protected]
  arm[64]/memremap: don't abuse pfn_valid() to ensure presence of linear map
  procfs: prevent unprivileged processes accessing fdinfo dir
  mm: mremap: fix sign for EFAULT error return value
  mm/hwpoison: use pr_err() instead of dump_page() in get_any_page()
  mm/huge_memory: do not overkill when splitting huge_zero_page
  Revert "mm/memory-failure.c: skip huge_zero_page in memory_failure()"
2022-05-13mm/memory-failure.c: simplify num_poisoned_pages_inc/deczhenwei pi1-8/+3
Originally, do num_poisoned_pages_inc() in memory failure routine, use num_poisoned_pages_dec() to rollback the number if filtered/ cancelled. Suggested by Naoya, do num_poisoned_pages_inc() only in action_result(), this make this clear and simple. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: zhenwei pi <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-13mm/hwpoison: disable hwpoison filter during removingzhenwei pi1-0/+1
hwpoison filter is enabled by hwpoison-inject module, after removing this module, hwpoison filter still works. What is worse, user can not find the debugfs entries to know this. Disable the hwpoison filter during removing hwpoison-inject module. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: zhenwei pi <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-13mm/memory-failure.c: add hwpoison_filter for soft offlinezhenwei pi1-2/+14
hwpoison_filter is missing in the soft offline path, this leads an issue: after enabling the corrupt filter, the user process still has a chance to inject hwpoison fault by madvise(addr, len, MADV_SOFT_OFFLINE) at PFN which is expected to reject. Also do a minor change in comment of memory_failure(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: zhenwei pi <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>