blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2024-09-03	mm: memory_hotplug: remove head variable in do_migrate_range()	Kefeng Wang	1	-8/+14
	Patch series "mm: memory_hotplug: improve do_migrate_range()", v3. Unify hwpoisoned page handling and isolation of HugeTLB/LRU/non-LRU movable page, also convert to use folios in do_migrate_range(). This patch (of 5): Directly use a folio for HugeTLB and THP when calculate the next pfn, then remove unused head variable. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Jonathan Cameron <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/damon/tests: add .kunitconfig file for DAMON kunit tests	SeongJae Park	1	-0/+22
	'--kunitconfig' option of 'kunit.py run' supports '.kunitconfig' file name convention. Add the file for DAMON kunit tests for more convenient kunit run. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: David Gow <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/damon: move kunit tests to tests/ subdirectory with _kunit suffix	SeongJae Park	8	-4/+4
	There was a discussion about better places for kunit test code[1] and test file name suffix[2]. Folowwing the conclusion, move kunit tests for DAMON to mm/damon/tests/ subdirectory and rename those. [1] https://lore.kernel.org/CABVgOS=pUdWb6NDHszuwb1HYws4a1-b1UmN=i8U_ED7HbDT0mg@mail.gmail.com [2] https://lore.kernel.org/CABVgOSmKwPq7JEpHfS6sbOwsR0B-DBDk_JP-ZD9s9ZizvpUjbQ@mail.gmail.com Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: David Gow <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/damon/dbgfs-test: skip dbgfs_set_init_regions() test if PADDR is not ↵	SeongJae Park	1	-0/+5
	registered The test depends on registration of DAMON_OPS_PADDR. It would be registered only when CONFIG_DAMON_PADDR is set. DAMON core kunit tests do fake ops registration for such case. However, the functions for such fake ops registration is not available to DAMON debugfs interface. Just skip the test in the case. Link: https://lkml.kernel.org/r/[email protected] Fixes: 999b9467974f ("mm/damon/dbgfs-test: fix is_target_id() change") Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: David Gow <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/damon/dbgfs-test: skip dbgfs_set_targets() test if PADDR is not registered	SeongJae Park	1	-0/+5
	The test depends on registration of DAMON_OPS_PADDR. It would be registered only when CONFIG_DAMON_PADDR is set. DAMON core kunit tests do fake ops registration for such case. However, the functions for such fake ops registration is not available to DAMON debugfs interface. Just skip the test in the case. Link: https://lkml.kernel.org/r/[email protected] Fixes: 999b9467974f ("mm/damon/dbgfs-test: fix is_target_id() change") Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: David Gow <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/damon/core-test: fix damon_test_ops_registration() for DAMON_VADDR unset case	SeongJae Park	1	-1/+16
	DAMON core kunit test can be executed without CONFIG_DAMON_VADDR. In the case, vaddr DAMON ops is not registered. Meanwhile, ops registration kunit test assumes the vaddr ops is registered. Check and handle the case by registrering fake vaddr ops inside the test code. Link: https://lkml.kernel.org/r/[email protected] Fixes: 4f540f5ab4f2 ("mm/damon/core-test: add a kunit test case for ops registration") Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: David Gow <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/damon/core-test: test only vaddr case on ops registration test	SeongJae Park	1	-6/+2
	DAMON ops registration kunit test tests both vaddr and paddr use cases in parts of the whole test cases. Basically testing only one ops use case is enough. Do the test with only vaddr use case. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: David Gow <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	selftests/damon: add execute permissions to test scripts	SeongJae Park	9	-0/+0
	Some test scripts are missing executable permissions. It causes warnings that make the test output unnecessarily verbose. Add executable permissions. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: David Gow <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	selftests/damon: cleanup __pycache__/ with 'make clean'	SeongJae Park	1	-0/+2
	Python-based tests creates __pycache__/ directory. Remove it with 'make clean' by defining it as EXTRA_CLEAN. Link: https://lkml.kernel.org/r/[email protected] Fixes: b5906f5f7359 ("selftests/damon: add a test for update_schemes_tried_regions sysfs command") Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: David Gow <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	selftests/damon: add access_memory_even to .gitignore	SeongJae Park	1	-0/+1
	Patch series "misc fixups for DAMON {self,kunit} tests". This patchset is for minor fixups of DAMON selftests and kunit tests. First three patches make DAMON selftests more cleanly maintained (patches 1 and 2) without unnecessary warnings (patch 3). Following six patches remove unnecessary test case (patch 4), handle configs combinations that can make tests fail (patches 5-7), reorganize the test files following the new guideline (patch 8), and add reference kunitconfig for DAMON kunit tests (patch 9). This patch (of 9): DAMON selftests build access_memory_even, but its not on the .gitignore list. Add it to make 'git status' output cleaner. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: c94df805c774 ("selftests/damon: implement a program for even-numbered memory regions access") Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: David Gow <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	sched/numa: Fix the vma scan starving issue	Yujie Liu	1	-0/+9
	Problem statement: Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the Numa vma scan overhead has been reduced a lot. Meanwhile, the reducing of the vma scan might create less Numa page fault information. The insufficient information makes it harder for the Numa balancer to make decision. Later, commit b7a5b537c55c08 ("sched/numa: Complete scanning of partial VMAs regardless of PID activity") and commit 84db47ca7146d7 ("sched/numa: Fix mm numa_scan_seq based unconditional scan") are found to bring back part of the performance. Recently when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system, a long duration of remote Numa node read was observed by PMU events: A few cores having ~500MB/s remote memory access for ~20 seconds. It causes high core-to-core variance and performance penalty. After the investigation, it is found that many vmas are skipped due to the active PID check. According to the trace events, in most cases, vma_is_accessed() returns false because the history access info stored in pids_active array has been cleared. Proposal: The main idea is to adjust vma_is_accessed() to let it return true easier. Thus compare the diff between mm->numa_scan_seq and vma->numab_state->prev_scan_seq. If the diff has exceeded the threshold, scan the vma. This patch especially helps the cases where there are small number of threads, like the process-based SPECcpu. Without this patch, if the SPECcpu process access the vma at the beginning, then sleeps for a long time, the pid_active array will be cleared. A a result, if this process is woken up again, it never has a chance to set prot_none anymore. Because only the first 2 times of access is granted for vma scan: (current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2 to be worse, no other threads within the task can help set the prot_none. This causes information lost. Raghavendra helped test current patch and got the positive result on the AMD platform: autonumabench NUMA01 base patched Amean syst-NUMA01 194.05 ( 0.00%) 165.11 * 14.92%* Amean elsp-NUMA01 324.86 ( 0.00%) 315.58 * 2.86%* Duration User 380345.36 368252.04 Duration System 1358.89 1156.23 Duration Elapsed 2277.45 2213.25 autonumabench NUMA02 Amean syst-NUMA02 1.12 ( 0.00%) 1.09 * 2.93%* Amean elsp-NUMA02 3.50 ( 0.00%) 3.56 * -1.84%* Duration User 1513.23 1575.48 Duration System 8.33 8.13 Duration Elapsed 28.59 29.71 kernbench Amean user-256 22935.42 ( 0.00%) 22535.19 * 1.75%* Amean syst-256 7284.16 ( 0.00%) 7608.72 * -4.46%* Amean elsp-256 159.01 ( 0.00%) 158.17 * 0.53%* Duration User 68816.41 67615.74 Duration System 21873.94 22848.08 Duration Elapsed 506.66 504.55 Intel 256 CPUs/2 Sockets: autonuma benchmark also shows improvements: v6.10-rc5 v6.10-rc5 +patch Amean syst-NUMA01 245.85 ( 0.00%) 230.84 * 6.11%* Amean syst-NUMA01_THREADLOCAL 205.27 ( 0.00%) 191.86 * 6.53%* Amean syst-NUMA02 18.57 ( 0.00%) 18.09 * 2.58%* Amean syst-NUMA02_SMT 2.63 ( 0.00%) 2.54 * 3.47%* Amean elsp-NUMA01 517.17 ( 0.00%) 526.34 * -1.77%* Amean elsp-NUMA01_THREADLOCAL 99.92 ( 0.00%) 100.59 * -0.67%* Amean elsp-NUMA02 15.81 ( 0.00%) 15.72 * 0.59%* Amean elsp-NUMA02_SMT 13.23 ( 0.00%) 12.89 * 2.53%* v6.10-rc5 v6.10-rc5 +patch Duration User 1064010.16 1075416.23 Duration System 3307.64 3104.66 Duration Elapsed 4537.54 4604.73 The SPECcpu remote node access issue disappears with the patch applied. Link: https://lkml.kernel.org/r/[email protected] Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic") Signed-off-by: Chen Yu <[email protected]> Co-developed-by: Chen Yu <[email protected]> Signed-off-by: Yujie Liu <[email protected]> Reported-by: Xiaoping Zhou <[email protected]> Reviewed-and-tested-by: Raghavendra K T <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: "Chen, Tim C" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Vincent Guittot <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	memory tier: fix deadlock warning while onlining pages	Yanfei Xu	1	-1/+2
	commit 823430c8e9d9 ("memory tier: consolidate the initialization of memory tiers") introduces a locking change that use guard(mutex) to instead of mutex_lock/unlock() for memory_tier_lock. It unexpectedly expanded the locked region to include the hotplug_memory_notifier(), as a result, it triggers an locking dependency detected of ABBA deadlock. Exclude hotplug_memory_notifier() from the locked region to fixing it. The deadlock scenario is that when a memory online event occurs, the execution of memory notifier will access the read lock of the memory_chain.rwsem, then the reigistration of the memory notifier in memory_tier_init() acquires the write lock of the memory_chain.rwsem while holding memory_tier_lock. Then the memory online event continues to invoke the memory hotplug callback registered by memory_tier_init(). Since this callback tries to acquire the memory_tier_lock, a deadlock occurs. In fact, this deadlock can't happen because memory_tier_init() always executes before memory online events happen due to the subsys_initcall() has an higher priority than module_init(). [ 133.491106] WARNING: possible circular locking dependency detected [ 133.493656] 6.11.0-rc2+ #146 Tainted: G O N [ 133.504290] ------------------------------------------------------ [ 133.515194] (udev-worker)/1133 is trying to acquire lock: [ 133.525715] ffffffff87044e28 (memory_tier_lock){+.+.}-{3:3}, at: memtier_hotplug_callback+0x383/0x4b0 [ 133.536449] [ 133.536449] but task is already holding lock: [ 133.549847] ffffffff875d3310 ((memory_chain).rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x60/0xb0 [ 133.556781] [ 133.556781] which lock already depends on the new lock. [ 133.556781] [ 133.569957] [ 133.569957] the existing dependency chain (in reverse order) is: [ 133.577618] [ 133.577618] -> #1 ((memory_chain).rwsem){++++}-{3:3}: [ 133.584997] down_write+0x97/0x210 [ 133.588647] blocking_notifier_chain_register+0x71/0xd0 [ 133.592537] register_memory_notifier+0x26/0x30 [ 133.596314] memory_tier_init+0x187/0x300 [ 133.599864] do_one_initcall+0x117/0x5d0 [ 133.603399] kernel_init_freeable+0xab0/0xeb0 [ 133.606986] kernel_init+0x28/0x2f0 [ 133.610312] ret_from_fork+0x59/0x90 [ 133.613652] ret_from_fork_asm+0x1a/0x30 [ 133.617012] [ 133.617012] -> #0 (memory_tier_lock){+.+.}-{3:3}: [ 133.623390] __lock_acquire+0x2efd/0x5c60 [ 133.626730] lock_acquire+0x1ce/0x580 [ 133.629757] __mutex_lock+0x15c/0x1490 [ 133.632731] mutex_lock_nested+0x1f/0x30 [ 133.635717] memtier_hotplug_callback+0x383/0x4b0 [ 133.638748] notifier_call_chain+0xbf/0x370 [ 133.641647] blocking_notifier_call_chain+0x76/0xb0 [ 133.644636] memory_notify+0x2e/0x40 [ 133.647427] online_pages+0x597/0x720 [ 133.650246] memory_subsys_online+0x4f6/0x7f0 [ 133.653107] device_online+0x141/0x1d0 [ 133.655831] online_memory_block+0x4d/0x60 [ 133.658616] walk_memory_blocks+0xc0/0x120 [ 133.661419] add_memory_resource+0x51d/0x6c0 [ 133.664202] add_memory_driver_managed+0xf5/0x180 [ 133.667060] dev_dax_kmem_probe+0x7f7/0xb40 [kmem] [ 133.669949] dax_bus_probe+0x147/0x230 [ 133.672687] really_probe+0x27f/0xac0 [ 133.675463] __driver_probe_device+0x1f3/0x460 [ 133.678493] driver_probe_device+0x56/0x1b0 [ 133.681366] __driver_attach+0x277/0x570 [ 133.684149] bus_for_each_dev+0x145/0x1e0 [ 133.686937] driver_attach+0x49/0x60 [ 133.689673] bus_add_driver+0x2f3/0x6b0 [ 133.692421] driver_register+0x170/0x4b0 [ 133.695118] __dax_driver_register+0x141/0x1b0 [ 133.697910] dax_kmem_init+0x54/0xff0 [kmem] [ 133.700794] do_one_initcall+0x117/0x5d0 [ 133.703455] do_init_module+0x277/0x750 [ 133.706054] load_module+0x5d1d/0x74f0 [ 133.708602] init_module_from_file+0x12c/0x1a0 [ 133.711234] idempotent_init_module+0x3f1/0x690 [ 133.713937] __x64_sys_finit_module+0x10e/0x1a0 [ 133.716492] x64_sys_call+0x184d/0x20d0 [ 133.719053] do_syscall_64+0x6d/0x140 [ 133.721537] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 133.724239] [ 133.724239] other info that might help us debug this: [ 133.724239] [ 133.730832] Possible unsafe locking scenario: [ 133.730832] [ 133.735298] CPU0 CPU1 [ 133.737759] ---- ---- [ 133.740165] rlock((memory_chain).rwsem); [ 133.742623] lock(memory_tier_lock); [ 133.745357] lock((memory_chain).rwsem); [ 133.748141] lock(memory_tier_lock); [ 133.750489] [ 133.750489] * DEADLOCK * [ 133.750489] [ 133.756742] 6 locks held by (udev-worker)/1133: [ 133.759179] #0: ffff888207be6158 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x26c/0x570 [ 133.762299] #1: ffffffff875b5868 (device_hotplug_lock){+.+.}-{3:3}, at: lock_device_hotplug+0x20/0x30 [ 133.765565] #2: ffff88820cf6a108 (&dev->mutex){....}-{3:3}, at: device_online+0x2f/0x1d0 [ 133.768978] #3: ffffffff86d08ff0 (cpu_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x17/0x30 [ 133.772312] #4: ffffffff8702dfb0 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x23/0x30 [ 133.775544] #5: ffffffff875d3310 ((memory_chain).rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x60/0xb0 [ 133.779113] [ 133.779113] stack backtrace: [ 133.783728] CPU: 5 UID: 0 PID: 1133 Comm: (udev-worker) Tainted: G O N 6.11.0-rc2+ #146 [ 133.787220] Tainted: [O]=OOT_MODULE, [N]=TEST [ 133.789948] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 133.793291] Call Trace: [ 133.795826] <TASK> [ 133.798284] dump_stack_lvl+0xea/0x150 [ 133.801025] dump_stack+0x19/0x20 [ 133.803609] print_circular_bug+0x477/0x740 [ 133.806341] check_noncircular+0x2f4/0x3e0 [ 133.809056] ? __pfx_check_noncircular+0x10/0x10 [ 133.811866] ? __pfx_lockdep_lock+0x10/0x10 [ 133.814670] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30 [ 133.817610] __lock_acquire+0x2efd/0x5c60 [ 133.820339] ? __pfx___lock_acquire+0x10/0x10 [ 133.823128] ? __dax_driver_register+0x141/0x1b0 [ 133.825926] ? do_one_initcall+0x117/0x5d0 [ 133.828648] lock_acquire+0x1ce/0x580 [ 133.831349] ? memtier_hotplug_callback+0x383/0x4b0 [ 133.834293] ? __pfx_lock_acquire+0x10/0x10 [ 133.837134] __mutex_lock+0x15c/0x1490 [ 133.839829] ? memtier_hotplug_callback+0x383/0x4b0 [ 133.842753] ? memtier_hotplug_callback+0x383/0x4b0 [ 133.845602] ? __this_cpu_preempt_check+0x21/0x30 [ 133.848438] ? __pfx___mutex_lock+0x10/0x10 [ 133.851200] ? __pfx_lock_acquire+0x10/0x10 [ 133.853935] ? global_dirty_limits+0xc0/0x160 [ 133.856699] ? __sanitizer_cov_trace_switch+0x58/0xa0 [ 133.859564] mutex_lock_nested+0x1f/0x30 [ 133.862251] ? mutex_lock_nested+0x1f/0x30 [ 133.864964] memtier_hotplug_callback+0x383/0x4b0 [ 133.867752] notifier_call_chain+0xbf/0x370 [ 133.870550] ? writeback_set_ratelimit+0xe8/0x160 [ 133.873372] blocking_notifier_call_chain+0x76/0xb0 [ 133.876311] memory_notify+0x2e/0x40 [ 133.879013] online_pages+0x597/0x720 [ 133.881686] ? irqentry_exit+0x3e/0xa0 [ 133.884397] ? __pfx_online_pages+0x10/0x10 [ 133.887244] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30 [ 133.890299] ? mhp_init_memmap_on_memory+0x7a/0x1c0 [ 133.893203] memory_subsys_online+0x4f6/0x7f0 [ 133.896099] ? __pfx_memory_subsys_online+0x10/0x10 [ 133.899039] ? xa_load+0x16d/0x2e0 [ 133.901667] ? __pfx_xa_load+0x10/0x10 [ 133.904366] ? __pfx_memory_subsys_online+0x10/0x10 [ 133.907218] device_online+0x141/0x1d0 [ 133.909845] online_memory_block+0x4d/0x60 [ 133.912494] walk_memory_blocks+0xc0/0x120 [ 133.915104] ? __pfx_online_memory_block+0x10/0x10 [ 133.917776] add_memory_resource+0x51d/0x6c0 [ 133.920404] ? __pfx_add_memory_resource+0x10/0x10 [ 133.923104] ? _raw_write_unlock+0x31/0x60 [ 133.925781] ? register_memory_resource+0x119/0x180 [ 133.928450] add_memory_driver_managed+0xf5/0x180 [ 133.931036] dev_dax_kmem_probe+0x7f7/0xb40 [kmem] [ 133.933665] ? __pfx_dev_dax_kmem_probe+0x10/0x10 [kmem] [ 133.936332] ? __pfx___up_read+0x10/0x10 [ 133.938878] dax_bus_probe+0x147/0x230 [ 133.941332] ? __pfx_dax_bus_probe+0x10/0x10 [ 133.943954] really_probe+0x27f/0xac0 [ 133.946387] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30 [ 133.949106] __driver_probe_device+0x1f3/0x460 [ 133.951704] ? parse_option_str+0x149/0x190 [ 133.954241] driver_probe_device+0x56/0x1b0 [ 133.956749] __driver_attach+0x277/0x570 [ 133.959228] ? __pfx___driver_attach+0x10/0x10 [ 133.961776] bus_for_each_dev+0x145/0x1e0 [ 133.964367] ? __pfx_bus_for_each_dev+0x10/0x10 [ 133.967019] ? __kasan_check_read+0x15/0x20 [ 133.969543] ? _raw_spin_unlock+0x31/0x60 [ 133.972132] driver_attach+0x49/0x60 [ 133.974536] bus_add_driver+0x2f3/0x6b0 [ 133.977044] driver_register+0x170/0x4b0 [ 133.979480] __dax_driver_register+0x141/0x1b0 [ 133.982126] ? __pfx_dax_kmem_init+0x10/0x10 [kmem] [ 133.984724] dax_kmem_init+0x54/0xff0 [kmem] [ 133.987284] ? __pfx_dax_kmem_init+0x10/0x10 [kmem] [ 133.989965] do_one_initcall+0x117/0x5d0 [ 133.992506] ? __pfx_do_one_initcall+0x10/0x10 [ 133.995185] ? __kasan_kmalloc+0x88/0xa0 [ 133.997748] ? kasan_poison+0x3e/0x60 [ 134.000288] ? kasan_unpoison+0x2c/0x60 [ 134.002762] ? kasan_poison+0x3e/0x60 [ 134.005202] ? __asan_register_globals+0x62/0x80 [ 134.007753] ? __pfx_dax_kmem_init+0x10/0x10 [kmem] [ 134.010439] do_init_module+0x277/0x750 [ 134.012953] load_module+0x5d1d/0x74f0 [ 134.015406] ? __pfx_load_module+0x10/0x10 [ 134.017887] ? __pfx_ima_post_read_file+0x10/0x10 [ 134.020470] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30 [ 134.023127] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 [ 134.025767] ? security_kernel_post_read_file+0xa2/0xd0 [ 134.028429] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 [ 134.031162] ? kernel_read_file+0x503/0x820 [ 134.033645] ? __pfx_kernel_read_file+0x10/0x10 [ 134.036232] ? __pfx___lock_acquire+0x10/0x10 [ 134.038766] init_module_from_file+0x12c/0x1a0 [ 134.041291] ? init_module_from_file+0x12c/0x1a0 [ 134.043936] ? __pfx_init_module_from_file+0x10/0x10 [ 134.046516] ? __this_cpu_preempt_check+0x21/0x30 [ 134.049091] ? __kasan_check_read+0x15/0x20 [ 134.051551] ? do_raw_spin_unlock+0x60/0x210 [ 134.054077] idempotent_init_module+0x3f1/0x690 [ 134.056643] ? __pfx_idempotent_init_module+0x10/0x10 [ 134.059318] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 [ 134.061995] ? __fget_light+0x17d/0x210 [ 134.064428] __x64_sys_finit_module+0x10e/0x1a0 [ 134.066976] x64_sys_call+0x184d/0x20d0 [ 134.069405] do_syscall_64+0x6d/0x140 [ 134.071926] entry_SYSCALL_64_after_hwframe+0x76/0x7e [[email protected]: add mutex_lock/unlock() pair back] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 823430c8e9d9 ("memory tier: consolidate the initialization of memory tiers") Signed-off-by: Yanfei Xu <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: Ho-Ren (Jack) Chuang <[email protected]> Cc: Jonathan Cameron <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm: vmalloc: refactor vm_area_alloc_pages() function	Uladzislau Rezki (Sony)	1	-20/+17
	The aim is to simplify and making the vm_area_alloc_pages() function less confusing as it became more clogged nowadays: - eliminate a "bulk_gfp" variable and do not overwrite a gfp flag for bulk allocator; - drop __GFP_NOFAIL flag for high-order-page requests on upper layer. It becomes less spread between levels when it comes to __GFP_NOFAIL allocations; - add a comment about a fallback path if high-order attempt is unsuccessful because for such cases __GFP_NOFAIL is dropped; - fix a typo in a commit message. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Baoquan He <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm: rework vm_ops->close() handling on VMA merge	Lorenzo Stoakes	2	-59/+164
	In commit 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in mergeability test") we relaxed the VMA merge rules for VMAs possessing a vm_ops->close() hook, permitting this operation in instances where we wouldn't delete the VMA as part of the merge operation. This was later corrected in commit fc0c8f9089c2 ("mm, mmap: fix vma_merge() case 7 with vma_ops->close") to account for a subtle case that the previous commit had not taken into account. In both instances, we first rely on is_mergeable_vma() to determine whether we might be dealing with a VMA that might be removed, taking advantage of the fact that a 'previous' VMA will never be deleted, only VMAs that follow it. The second patch corrects the instance where a merge of the previous VMA into a subsequent one did not correctly check whether the subsequent VMA had a vm_ops->close() handler. Both changes prevent merge cases that are actually permissible (for instance a merge of a VMA into a following VMA with a vm_ops->close(), but with no previous VMA, which would result in the next VMA being extended, not deleted). In addition, both changes fail to consider the case where a VMA that would otherwise be merged with the previous and next VMA might have vm_ops->close(), on the assumption that for this to be the case, all three would have to have the same vma->vm_file to be mergeable and thus the same vm_ops. And in addition both changes operate at 50,000 feet, trying to guess whether a VMA will be deleted. As we have majorly refactored the VMA merge operation and de-duplicated code to the point where we know precisely where deletions will occur, this patch removes the aforementioned checks altogether and instead explicitly checks whether a VMA will be deleted. In cases where a reduced merge is still possible (where we merge both previous and next VMA but the next VMA has a vm_ops->close hook, meaning we could just merge the previous and current VMA), we do so, otherwise the merge is not permitted. We take advantage of our userland testing to assert that this functions correctly - replacing the previous limited vm_ops->close() tests with tests for every single case where we delete a VMA. We also update all testing for both new and modified VMAs to set vma->vm_ops->close() in every single instance where this would not prevent the merge, to assert that we never do so. Link: https://lkml.kernel.org/r/9f96b8cfeef3d14afabddac3d6144afdfbef2e22.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Mark Brown <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm: refactor vma_merge() into modify-only vma_merge_existing_range()	Lorenzo Stoakes	2	-253/+264
	The existing vma_merge() function is no longer required to handle what were previously referred to as cases 1-3 (i.e. the merging of a new VMA), as this is now handled by vma_merge_new_vma(). Additionally, simplify the convoluted control flow of the original, maintaining identical logic only expressed more clearly and doing away with a complicated set of cases, rather logically examining each possible outcome - merging of both the previous and subsequent VMA, merging of the previous VMA and merging of the subsequent VMA alone. We now utilise the previously implemented commit_merge() function to share logic with vma_expand() de-duplicating code and providing less surface area for bugs and confusion. In order to do so, we adjust this function to accept parameters specific to merging existing ranges. Link: https://lkml.kernel.org/r/2cf6016b7bfcc4965fc3cde10827560c42e4f12c.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Mark Brown <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm: introduce commit_merge(), abstracting final commit of merge	Lorenzo Stoakes	1	-12/+27
	Pull the part of vma_expand() which actually commits the merge operation, that is inserts it into the maple tree and sets the VMA's vma->vm_start and vma->vm_end parameters, into its own function. We implement only the parts needed for vma_expand() which now as a result of previous work is also the means by which new VMA ranges are merged. The next commit in the series will implement merging of existing ranges which will extend commit_merge() to accommodate this case and result in all merges using this common code. Link: https://lkml.kernel.org/r/7b985a20dfa549e3c370cd274d732b64c44f6dbd.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Mark Brown <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm: make vma_prepare() and friends static and internal to vma.c	Lorenzo Stoakes	2	-185/+158
	Now we have abstracted merge behaviour for new VMA ranges, we are able to render vma_prepare(), init_vma_prep(), vma_complete(), can_vma_merge_before() and can_vma_merge_after() static and internal to vma.c. These are internal implementation details of kernel VMA manipulation and merging mechanisms and thus should not be exposed. This also renders the functions userland testable. Link: https://lkml.kernel.org/r/7f7f1c34ce10405a6aab2714c505af3cf41b7851.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Mark Brown <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm: avoid using vma_merge() for new VMAs	Lorenzo Stoakes	5	-100/+279
	Abstract vma_merge_new_vma() to use vma_merge_struct and rename the resultant function vma_merge_new_range() to be clear what the purpose of this function is - a new VMA is desired in the specified range, and we wish to see if it is possible to 'merge' surrounding VMAs into this range rather than having to allocate a new VMA. Note that this function uses vma_extend() exclusively, so adopts its requirement that the iterator point at or before the gap. We add an assert to this effect. This is as opposed to vma_merge_existing_range(), which will be introduced in a subsequent commit, and provide the same functionality for cases in which we are modifying an existing VMA. In mmap_region() and do_brk_flags() we open code scenarios where we prefer to use vma_expand() rather than invoke a full vma_merge() operation. Abstract this logic and eliminate all of the open-coding, and also use the same logic for all cases where we add new VMAs to, rather than ultimately use vma_merge(), rather use vma_expand(). Doing so removes duplication and simplifies VMA merging in all such cases, laying the ground for us to eliminate the merging of new VMAs in vma_merge() altogether. Also add the ability for the vmg to track state, and able to report errors, allowing for us to differentiate a failed merge from an inability to allocate memory in callers. This makes it far easier to understand what is happening in these cases avoiding confusion, bugs and allowing for future optimisation. Also introduce vma_iter_next_rewind() to allow for retrieval of the next, and (optionally) the prev VMA, rewinding to the start of the previous gap. Introduce are_anon_vmas_compatible() to abstract individual VMA anon_vma comparison for the case of merging on both sides where the anon_vma of the VMA being merged maybe compatible with prev and next, but prev and next's anon_vma's may not be compatible with each other. Finally also introduce can_vma_merge_left() / can_vma_merge_right() to check adjacent VMA compatibility and that they are indeed adjacent. Link: https://lkml.kernel.org/r/49d37c0769b6b9dc03b27fe4d059173832556392.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Tested-by: Mark Brown <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm: abstract vma_expand() to use vma_merge_struct	Lorenzo Stoakes	4	-35/+27
	The purpose of the vmg is to thread merge state through functions and avoid egregious parameter lists. We expand this to vma_expand(), which is used for a number of merge cases. Accordingly, adjust its callers, mmap_region() and relocate_vma_down(), to use a vmg. An added purpose of this change is the ability in a future commit to perform all new VMA range merging using vma_expand(). Link: https://lkml.kernel.org/r/4bc8c9dbc9ca52452ef8e587b28fe555854ceb38.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Mark Brown <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm: remove duplicated open-coded VMA policy check	Lorenzo Stoakes	2	-10/+7
	Both can_vma_merge_before() and can_vma_merge_after() are invoked after checking for compatible VMA NUMA policy, we can simply move this to is_mergeable_vma() and abstract this altogether. In mmap_region() we set vmg->policy to NULL, so the policy comparisons checked in can_vma_merge_before() and can_vma_merge_after() are exactly equivalent to !vma_policy(vmg.next) and !vma_policy(vmg.prev). Equally, in do_brk_flags(), vmg->policy is NULL, so the can_vma_merge_after() is checking !vma_policy(vma), as we set vmg.prev to vma. In vma_merge(), we compare prev and next policies with vmg->policy before checking can_vma_merge_after() and can_vma_merge_before() respectively, which this patch causes to be checked in precisely the same way. This therefore maintains precisely the same logic as before, only now abstracted into is_mergeable_vma(). Link: https://lkml.kernel.org/r/0dbff286d9c4988333bc6f4ff3734cb95dd5410a.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Mark Brown <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm: introduce vma_merge_struct and abstract vma_merge(),vma_modify()	Lorenzo Stoakes	4	-207/+246
	Rather than passing around huge numbers of parameters to numerous helper functions, abstract them into a single struct that we thread through the operation, the vma_merge_struct ('vmg'). Adjust vma_merge() and vma_modify() to accept this parameter, as well as predicate functions can_vma_merge_before(), can_vma_merge_after(), and the vma_modify_...() helper functions. Also introduce VMG_STATE() and VMG_VMA_STATE() helper macros to allow for easy vmg declaration. We additionally remove the requirement that vma_merge() is passed a VMA object representing the candidate new VMA. Previously it used this to obtain the mm_struct, file and anon_vma properties of the proposed range (a rather confusing state of affairs), which are now provided by the vmg directly. We also remove the pgoff calculation previously performed vma_modify(), and instead calculate this in VMG_VMA_STATE() via the vma_pgoff_offset() helper. Link: https://lkml.kernel.org/r/a955aad09d81329f6fbeb636b2dd10cde7b73dab.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Mark Brown <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	tools: add VMA merge tests	Lorenzo Stoakes	2	-10/+1317
	Add a variety of VMA merge unit tests to assert that the behaviour of VMA merge is correct at an abstract level and VMAs are merged or not merged as expected. These are intentionally added _before_ we start refactoring vma_merge() in order that we can continually assert correctness throughout the rest of the series. In order to reduce churn going forward, we backport the vma_merge_struct data type to the test code which we introduce and use in a future commit, and add wrappers around the merge new and existing VMA cases. Link: https://lkml.kernel.org/r/1c7a0b43cfad2c511a6b1b52f3507696478ff51a.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Mark Brown <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	tools: improve vma test Makefile	Lorenzo Stoakes	1	-2/+4
	Patch series "mm: remove vma_merge()", v3. The infamous vma_merge() function has been the cause of a great deal of pain, bugs and confusion for a very long time. It is subtle, contains many corner cases, tries to do far too much and is as a result very fragile. The fact that the function requires there to be a numbering system to cover each possible eventuality with references to each in the many branches of its implementation as to which case you are looking at speaks to all this. Some of this complexity is inherent - unfortunately there is no getting away from the need to figure out precisely how to execute the merge, whether we need to remove VMAs, whether it is safe to do so, what constitutes a mergeable VMA and so on. However, a lot of the complexity is not inherent but instead a product of the function's 'organic' development. Liam has gone to great lengths to improve the situation as a part of his maple tree implementation, greatly improving the readability of the code, and Vlastimil and myself have additionally gone to lengths to try to improve things further. However, with the availability of userland VMA testing, it now becomes possible to perform a rather more significant refactoring while maintaining confidence in its correct operation. An attempt was previously made by Vlastimil [0] to eliminate vma_merge(), however it was rather - brutal - and an astute reader might refer to the date of that patch for insight as to its intent. This series instead divides merge operations into two natural kinds - merges which occur when a NEW vma is being added to the address space, and merges which occur when a vma is being MODIFIED. Happily, the vma_expand() function introduced by Liam, which has the capacity for also deleting a subsequent VMA, covers each of the NEW vma cases. By abstracting the actual final commit of changes to a VMA to its own function, commit_merge() and writing a wrapper around vma_expand() for new VMA cases vma_merge_new_range(), we can avoid having to use vma_merge() for these instances altogether. By doing so we are also able to then de-duplicate all existing merge logic in mmap_region() and do_brk_flags() and have everything invoke this new function, so we universally take the same approach to merging new VMAs. Having done so, we can then completely rework vma_merge() into vma_merge_existing_range() and use this for the instances where a merge is proposed for a region of an existing VMA. This eliminates vma_merge() and its numbered cases and instead divides things into logical cases - merge both, merge left, merge right (the latter 2 being either partial or full merges). The code is heavily annotated with ASCII diagrams and greatly simplified in comparison to the existing vma_merge() function. Having made this change, we take the opportunity to address an issue with merging VMAs possessing a vm_ops->close() hook - commit 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in mergeability test") and commit fc0c8f9089c2 ("mm, mmap: fix vma_merge() case 7 with vma_ops->close") make efforts to relax how we handle these, making assumptions about which VMAs might end up deleted (and thus, if possessing a vm_ops->close() hook, cannot be). This refactor means we do not need to guess, so instead explicitly only disallow merge in instances where a VMA with a vm_ops->close() hook would be deleted (and try a smaller merge in cases where this is possible). In addition to these changes, we introduce a new vma_merge_struct abstraction to allow VMA merge state to be threaded through the operation neatly. There is heavy unit testing provided for all merge functionality, added prior to the refactoring, allowing for before/after testing. The vm_ops->close() change also introduces exhaustive testing to demonstrate that this functions as expected, and in addition to this the reproduction code from commit fc0c8f9089c2 ("mm, mmap: fix vma_merge() case 7 with vma_ops->close") was tested and confirmed passing. [0]:https://lore.kernel.org/linux-mm/[email protected]/ This patch (of 10): Have vma.o depend on its source dependencies explicitly, as previously these were simply being ignored as existing object files were up to date. This now correctly re-triggers the build if mm/ source is changed as well as local source code. Also set clean as a phony rule. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/e3ea58f08364ae5432c9a074de0195a7c7e0b04a.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Mark Brown <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/vma.h: optimise vma_munmap_struct	Liam R. Howlett	1	-3/+4
	The vma_munmap_struct has a hole of 4 bytes and pushes the struct to three cachelines. Relocating the three booleans upwards allows for the struct to only use two cachelines (as reported by pahole on amd64). Before: struct vma_munmap_struct { struct vma_iterator * vmi; /* 0 8 / struct vm_area_struct vma; /* 8 8 / struct vm_area_struct prev; /* 16 8 / struct vm_area_struct next; /* 24 8 / struct list_head uf; /* 32 8 / long unsigned int start; / 40 8 / long unsigned int end; / 48 8 / long unsigned int unmap_start; / 56 8 / / --- cacheline 1 boundary (64 bytes) --- / long unsigned int unmap_end; / 64 8 / int vma_count; / 72 4 / / XXX 4 bytes hole, try to pack / long unsigned int nr_pages; / 80 8 / long unsigned int locked_vm; / 88 8 / long unsigned int nr_accounted; / 96 8 / long unsigned int exec_vm; / 104 8 / long unsigned int stack_vm; / 112 8 / long unsigned int data_vm; / 120 8 / / --- cacheline 2 boundary (128 bytes) --- / bool unlock; / 128 1 / bool clear_ptes; / 129 1 / bool closed_vm_ops; / 130 1 / / size: 136, cachelines: 3, members: 19 / / sum members: 127, holes: 1, sum holes: 4 / / padding: 5 / / last cacheline: 8 bytes / }; After: struct vma_munmap_struct { struct vma_iterator vmi; /* 0 8 / struct vm_area_struct vma; /* 8 8 / struct vm_area_struct prev; /* 16 8 / struct vm_area_struct next; /* 24 8 / struct list_head uf; /* 32 8 / long unsigned int start; / 40 8 / long unsigned int end; / 48 8 / long unsigned int unmap_start; / 56 8 / / --- cacheline 1 boundary (64 bytes) --- / long unsigned int unmap_end; / 64 8 / int vma_count; / 72 4 / bool unlock; / 76 1 / bool clear_ptes; / 77 1 / bool closed_vm_ops; / 78 1 / / XXX 1 byte hole, try to pack / long unsigned int nr_pages; / 80 8 / long unsigned int locked_vm; / 88 8 / long unsigned int nr_accounted; / 96 8 / long unsigned int exec_vm; / 104 8 / long unsigned int stack_vm; / 112 8 / long unsigned int data_vm; / 120 8 / / size: 128, cachelines: 2, members: 19 / / sum members: 127, holes: 1, sum holes: 1 */ }; Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/vma: drop incorrect comment from vms_gather_munmap_vmas()	Liam R. Howlett	1	-6/+1
	The comment has been outdated since 6b73cff239e52 ("mm: change munmap splitting order and move_vma()"). The move_vma() was altered to fix the fragile state of the accounting since then. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm: move may_expand_vm() check in mmap_region()	Liam R. Howlett	3	-35/+4
	The may_expand_vm() check requires the count of the pages within the munmap range. Since this is needed for accounting and obtained later, the reodering of ma_expand_vm() to later in the call stack, after the vma munmap struct (vms) is initialised and the gather stage is potentially run, will allow for a single loop over the vmas. The gather sage does not commit any work and so everything can be undone in the case of a failure. The MAP_FIXED page count is available after the vms_gather_munmap_vmas() call, so use it instead of looping over the vmas twice. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	ipc/shm, mm: drop do_vma_munmap()	Liam R. Howlett	5	-43/+20
	The do_vma_munmap() wrapper existed for callers that didn't have a vma iterator and needed to check the vma mseal status prior to calling the underlying munmap(). All callers now use a vma iterator and since the mseal check has been moved to do_vmi_align_munmap() and the vmas are aligned, this function can just be called instead. do_vmi_align_munmap() can no longer be static as ipc/shm is using it and it is exported via the mm.h header. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/mmap: use vms accounted pages in mmap_region()	Liam R. Howlett	1	-2/+3
	Change from nr_pages variable to vms.nr_accounted for the charged pages calculation. This is necessary for a future patch. This also avoids checking security_vm_enough_memory_mm() if the amount of memory won't change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Acked-by: Paul Moore <[email protected]> [LSM] Cc: Kees Cook <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/mmap: use PHYS_PFN in mmap_region()	Liam R. Howlett	1	-5/+5
	Instead of shifting the length by PAGE_SIZE, use PHYS_PFN. Also use the existing local variable everywhere instead of some of the time. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm: change failure of MAP_FIXED to restoring the gap on failure	Liam R. Howlett	3	-26/+61
	Prior to call_mmap(), the vmas that will be replaced need to clear the way for what may happen in the call_mmap(). This clean up work includes clearing the ptes and calling the close() vm_ops. Some users do more setup than can be restored by calling the vm_ops open() function. It is safer to store the gap in the vma tree in these cases. That is to say that the failure scenario that existed before the MAP_FIXED gap exposure is restored as it is safer than trying to undo a partial mapping. Since abort_munmap_vmas() is only reattaching vmas with this change, the function is renamed to reattach_vmas(). There is also a secondary failure that may occur if there is not enough memory to store the gap. In this case, the vmas are reattached and resources freed. If the system cannot complete the call_mmap() and fails to allocate with GFP_KERNEL, then the system will print a warning about the failure. [[email protected]: fix off-by-one error in vms_abort_munmap_vmas()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/mmap: avoid zeroing vma tree in mmap_region()	Liam R. Howlett	3	-48/+85
	Instead of zeroing the vma tree and then overwriting the area, let the area be overwritten and then clean up the gathered vmas using vms_complete_munmap_vmas(). To ensure locking is downgraded correctly, the mm is set regardless of MAP_FIXED or not (NULL vma). If a driver is mapping over an existing vma, then clear the ptes before the call_mmap() invocation. This is done using the vms_clean_up_area() helper. If there is a close vm_ops, that must also be called to ensure any cleanup is done before mapping over the area. This also means that calling open has been added to the abort of an unmap operation, for now. Since vm_ops->open() and vm_ops->close() are not always undo each other (state cleanup may exist in ->close() that is lost forever), the code cannot be left in this way, but that change has been isolated to another commit to make this point very obvious for traceability. Temporarily keep track of the number of pages that will be removed and reduce the charged amount. This also drops the validate_mm() call in the vma_expand() function. It is necessary to drop the validate as it would fail since the mm map_count would be incorrect during a vma expansion, prior to the cleanup from vms_complete_munmap_vmas(). Clean up the error handing of the vms_gather_munmap_vmas() by calling the verification within the function. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm: clean up unmap_region() argument list	Liam R. Howlett	3	-15/+11
	With the only caller to unmap_region() being the error path of mmap_region(), the argument list can be significantly reduced. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/vma: track start and end for munmap in vma_munmap_struct	Liam R. Howlett	2	-7/+31
	Set the start and end address for munmap when the prev and next are gathered. This is needed to avoid incorrect addresses being used during the vms_complete_munmap_vmas() function if the prev/next vma are expanded. Add a new helper vms_complete_pte_clear(), which is needed later and will avoid growing the argument list to unmap_region() beyond the 9 it already has. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/mmap: reposition vma iterator in mmap_region()	Liam R. Howlett	3	-33/+39
	Instead of moving (or leaving) the vma iterator pointing at the previous vma, leave it pointing at the insert location. Pointing the vma iterator at the insert location allows for a cleaner walk of the vma tree for MAP_FIXED and the no expansion cases. The vma_prev() call in the case of merging the previous vma is equivalent to vma_iter_prev_range(), since the vma iterator will be pointing to the location just before the previous vma. This change needs to export abort_munmap_vmas() from mm/vma. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/vma: support vma == NULL in init_vma_munmap()	Liam R. Howlett	2	-4/+9
	Adding support for a NULL vma means the init_vma_munmap() can be initialized for a less error-prone process when calling vms_complete_munmap_vmas() later on. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/vma: expand mmap_region() munmap call	Liam R. Howlett	3	-33/+57
	Open code the do_vmi_align_munmap() call so that it can be broken up later in the series. This requires exposing a few more vma operations. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/vma: inline munmap operation in mmap_region()	Liam R. Howlett	1	-6/+9
	mmap_region is already passed sanitized addr and len, so change the call to do_vmi_munmap() to do_vmi_align_munmap() and inline the other checks. The inlining of the function and checks is an intermediate step in the series so future patches are easier to follow. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/vma: extract validate_mm() from vma_complete()	Liam R. Howlett	2	-1/+5
	vma_complete() will need to be called during an unsafe time to call validate_mm(). Extract the call in all places now so that only one location can be modified in the next change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/vma: change munmap to use vma_munmap_struct() for accounting and ↵	Liam R. Howlett	2	-40/+49
	surrounding vmas Clean up the code by changing the munmap operation to use a structure for the accounting and munmap variables. Since remove_mt() is only called in one location and the contents will be reduced to almost nothing. The remains of the function can be added to vms_complete_munmap_vmas(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/vma: introduce vma_munmap_struct for use in munmap operations	Liam R. Howlett	2	-66/+90
	Use a structure to pass along all the necessary information and counters involved in removing vmas from the mm_struct. Update vmi_ function names to vms_ to indicate the first argument type change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/vma: extract the gathering of vmas from do_vmi_align_munmap()	Liam R. Howlett	1	-33/+62
	Create vmi_gather_munmap_vmas() to handle the gathering of vmas into a detached maple tree for removal later. Part of the gathering is the splitting of vmas that span the boundary. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/vma: introduce vmi_complete_munmap_vmas()	Liam R. Howlett	1	-25/+55
	Extract all necessary operations that need to be completed after the vma maple tree is updated from a munmap() operation. Extracting this makes the later patch in the series easier to understand. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/vma: introduce abort_munmap_vmas()	Liam R. Howlett	1	-5/+17
	Extract clean up of failed munmap() operations from do_vmi_align_munmap(). This simplifies later patches in the series. It is worth noting that the mas_for_each() loop now has a different upper limit. This should not change the number of vmas visited for reattaching to the main vma tree (mm_mt), as all vmas are reattached in both scenarios. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm/vma: correctly position vma_iterator in __split_vma()	Liam R. Howlett	1	-1/+4
	Patch series "Avoid MAP_FIXED gap exposure", v8. It is now possible to walk the vma tree using the rcu read locks and is beneficial to do so to reduce lock contention. Doing so while a MAP_FIXED mapping is executing means that a reader may see a gap in the vma tree that should never logically exist - and does not when using the mmap lock in read mode. The temporal gap exists because mmap_region() calls munmap() prior to installing the new mapping. This patch set stops rcu readers from seeing the temporal gap by splitting up the munmap() function into two parts. The first part prepares the vma tree for modifications by doing the necessary splits and tracks the vmas marked for removal in a side tree. The second part completes the munmapping of the vmas after the vma tree has been overwritten (either by a MAP_FIXED replacement vma or by a NULL in the munmap() case). Please note that rcu walkers will still be able to see a temporary state of split vmas that may be in the process of being removed, but the temporal gap will not be exposed. vma_start_write() are called on both parts of the split vma, so this state is detectable. If existing vmas have a vm_ops->close(), then they will be called prior to mapping the new vmas (and ptes are cleared out). Without calling ->close(), hugetlbfs tests fail (hugemmap06 specifically) due to resources still being marked as 'busy'. Unfortunately, calling the corresponding ->open() may not restore the state of the vmas, so it is safer to keep the existing failure scenario where a gap is inserted and never replaced. The failure scenario is in its own patch (0015) for traceability. This patch (of 21): The vma iterator may be left pointing to the newly created vma. This happens when inserting the new vma at the end of the old vma (!new_below). The incorrect position in the vma iterator is not exposed currently since the vma iterator is repositioned in the munmap path and is not reused in any of the other paths. This has limited impact in the current code, but is required for future changes. Link: https://lkml.kernel.org/r/[email protected] Fixes: b2b3b886738f ("mm: don't use __vma_adjust() in __split_vma()") Signed-off-by: Liam R. Howlett <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Bert Karwatzki <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Brown <[email protected]> Cc: Paul Moore <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	Documentation/cgroup-v2: clarify that zswap.writeback is ignored if zswap is ↵	Mike Yuan	1	-0/+2
	disabled As discussed in [1], zswap-related settings natually lose their effect when zswap is disabled, specifically zswap.writeback here. Be explicit about this behavior. [1] https://lore.kernel.org/linux-kernel/CAKEwX=Mhbwhh-=xxCU-RjMXS_n=RpV3Gtznb2m_3JgL+jzz++g@mail.gmail.com/ [[email protected]: fix/simplify text] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Yuan <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Koutný <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	selftests: test_zswap: add test for hierarchical zswap.writeback	Mike Yuan	1	-21/+54
	Ensure that zswap.writeback check goes up the cgroup tree, i.e. is hierarchical. Create a subcgroup which has zswap.writeback set to 1, and the upper hierarchy's restrictions shall apply. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Yuan <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Koutný <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm: remove code to handle same filled pages	Usama Arif	1	-77/+8
	With an earlier commit to handle zero-filled pages in swap directly, and with only 1% of the same-filled pages being non-zero, zswap no longer needs to handle same-filled pages and can just work on compressed pages. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Usama Arif <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Acked-by: Yosry Ahmed <[email protected]> Reviewed-by: Nhat Pham <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Andi Kleen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm: store zero pages to be swapped out in a bitmap	Usama Arif	3	-6/+151
	Patch series "mm: store zero pages to be swapped out in a bitmap", v8. As shown in the patch series that introduced the zswap same-filled optimization [1], 10-20% of the pages stored in zswap are same-filled. This is also observed across Meta's server fleet. By using VM counters in swap_writepage (not included in this patchseries) it was found that less than 1% of the same-filled pages to be swapped out are non-zero pages. For conventional swap setup (without zswap), rather than reading/writing these pages to flash resulting in increased I/O and flash wear, a bitmap can be used to mark these pages as zero at write time, and the pages can be filled at read time if the bit corresponding to the page is set. When using zswap with swap, this also means that a zswap_entry does not need to be allocated for zero filled pages resulting in memory savings which would offset the memory used for the bitmap. A similar attempt was made earlier in [2] where zswap would only track zero-filled pages instead of same-filled. This patchseries adds zero-filled pages optimization to swap (hence it can be used even if zswap is disabled) and removes the same-filled code from zswap (as only 1% of the same-filled pages are non-zero), simplifying code. [1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/ [2] https://lore.kernel.org/lkml/[email protected]/ This patch (of 2): Approximately 10-20% of pages to be swapped out are zero pages [1]. Rather than reading/writing these pages to flash resulting in increased I/O and flash wear, a bitmap can be used to mark these pages as zero at write time, and the pages can be filled at read time if the bit corresponding to the page is set. With this patch, NVMe writes in Meta server fleet decreased by almost 10% with conventional swap setup (zswap disabled). [1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/ Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Usama Arif <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Reviewed-by: Yosry Ahmed <[email protected]> Reviewed-by: Nhat Pham <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Usama Arif <[email protected]> Cc: Andi Kleen <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm:page_alloc: fix the NULL ac->nodemask in __alloc_pages_slowpath()	Zhongkun He	2	-0/+11
	should_reclaim_retry() is not ALLOC_CPUSET aware and that means that it considers reclaimability of NUMA nodes which are outside of the cpuset. If other nodes have a lot of reclaimable memory then should_reclaim_retry would instruct page allocator to retry even though there is no memory reclaimable on the cpuset nodemask. This is not really a huge problem because the number of retries without any reclaim progress is bound but it could be certainly improved. This is a cold path so this shouldn't really have a measurable impact on performance on most workloads. 1.Test step and the machines. ------------ root@vm:/sys/fs/cgroup/test# numactl -H \| grep size node 0 size: 9477 MB node 1 size: 10079 MB node 2 size: 10079 MB node 3 size: 10078 MB root@vm:/sys/fs/cgroup/test# cat cpuset.mems 2 root@vm:/sys/fs/cgroup/test# stress --vm 1 --vm-bytes 12g --vm-keep stress: info: [33430] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd stress: FAIL: [33430] (425) <-- worker 33431 got signal 9 stress: WARN: [33430] (427) now reaping child worker processes stress: FAIL: [33430] (461) failed run completed in 2s 2. reclaim_retry_zone info: We can only alloc pages from node=2, but the reclaim_retry_zone is node=0 and return true. root@vm:/sys/kernel/debug/tracing# cat trace stress-33431 [001] ..... 13223.617311: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=1 wmark_check=1 stress-33431 [001] ..... 13223.617682: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=2 wmark_check=1 stress-33431 [001] ..... 13223.618103: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=3 wmark_check=1 stress-33431 [001] ..... 13223.618454: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=4 wmark_check=1 stress-33431 [001] ..... 13223.618770: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=5 wmark_check=1 stress-33431 [001] ..... 13223.619150: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=6 wmark_check=1 stress-33431 [001] ..... 13223.619510: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=7 wmark_check=1 stress-33431 [001] ..... 13223.619850: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=8 wmark_check=1 stress-33431 [001] ..... 13223.620171: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=9 wmark_check=1 stress-33431 [001] ..... 13223.620533: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=10 wmark_check=1 stress-33431 [001] ..... 13223.620894: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=11 wmark_check=1 stress-33431 [001] ..... 13223.621224: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=12 wmark_check=1 stress-33431 [001] ..... 13223.621551: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=13 wmark_check=1 stress-33431 [001] ..... 13223.621847: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=14 wmark_check=1 stress-33431 [001] ..... 13223.622200: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=15 wmark_check=1 stress-33431 [001] ..... 13223.622580: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=16 wmark_check=1 With this patch, we can check the right node and get less retry in __alloc_pages_slowpath() because there is nothing to do. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zhongkun He <[email protected]> Suggested-by: Michal Hocko <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03	mm: swapfile: fix SSD detection with swapfile on btrfs	Johannes Weiner	1	-79/+86
	We've been noticing a trend of significant lock contention in the swap subsystem as core counts have been increasing in our fleet. It turns out that our swapfiles on btrfs on flash were in fact using the old swap code for rotational storage. This turns out to be a detection issue in the swapon sequence: btrfs sets si->bdev during swap activation, which currently happens after swapon's SSD detection and cluster setup. Thus, none of the SSD optimizations and cluster lock splitting are enabled for btrfs swap. Rearrange the swapon sequence so that filesystem activation happens before determining swap behavior based on the backing device. Afterwards, the nonrotational drive is detected correctly: - Adding 2097148k swap on /mnt/swapfile. Priority:-3 extents:1 across:2097148k + Adding 2097148k swap on /mnt/swapfile. Priority:-3 extents:1 across:2097148k SS Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>