pte_page() will never return a NULL page, as discussed in thread [1], so
remove the redundant page validation to fix the following Smatch static
checker warning:
mm/damon/vaddr.c:405 damon_hugetlb_mkold()
warn: 'page' can't be NULL.
[1] https://lore.kernel.org/linux-mm/20220106091200.GA14564@kili/
Link: https://lkml.kernel.org/r/6d32f7d201b8970d53f51b6c5717d472aed2987c.1642386715.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Souptick Joarder <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
DAMON asks each monitoring target ('struct damon_target') to have one
'unsigned long' integer called 'id', which should be unique among the
targets of the same monitoring context. Its meaning is, however, totally
up to the monitoring primitives that are registered to the monitoring
context. For example, the virtual address spaces monitoring primitives
treat the id as a 'struct pid' pointer.
This makes the code flexible, but ugly, not well-documented, and
type-unsafe[1]. Also, identification of each target can be done via its
index. For this reason, this commit removes the concept and uses a clear
type definition. For now, only a 'struct pid' pointer is used for the
virtual address spaces monitoring. If DAMON is extended in the future so
that we need to put another identifier field in the struct, we will use a
union for such primitives-dependent fields and document which primitives
are using which type.
[1] https://lore.kernel.org/linux-mm/[email protected]/
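For illustration, the direction of the change can be sketched roughly as
follows (a hedged sketch; the actual 'struct damon_target' in the kernel
may have more or differently named fields):

struct pid;

struct damon_target {
	struct pid *pid;		/* replaces the old 'unsigned long id' */
	unsigned int nr_regions;
	struct list_head regions_list;
	struct list_head list;
};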
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The damon_set_targets() function is defined in the core for general use
cases, but it is called only from dbgfs. Also, because the function is
for general use cases, dbgfs does additional handling of the pid-type
target id case. To make the situation simpler, this commit moves the
function into dbgfs and makes it handle the pid-type case on its own.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A previous commit made the init_regions debugfs file use the target index
instead of the target id for specifying the target of the init regions.
This commit updates the usage document to reflect the change.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Remove the type-unclear target id concept".
DAMON asks each monitoring target ('struct damon_target') to have one
'unsigned long' integer called 'id', which should be unique among the
targets of the same monitoring context. Its meaning is, however, totally
up to the monitoring primitives that are registered to the monitoring
context. For example, the virtual address spaces monitoring primitives
treat the id as a 'struct pid' pointer.
This makes the code flexible but ugly, not well-documented, and
type-unsafe[1]. Also, identification of each target can be done via its
index. For this reason, this patchset removes the concept and uses a
clear type definition.
[1] https://lore.kernel.org/linux-mm/[email protected]/
This patch (of 4):
The target id is an 'unsigned long' value, which can be interpreted
differently by each monitoring primitive. For example, it means 'struct
pid *' for the virtual address spaces monitoring, while it means nothing
but an integer to be displayed to debugfs interface users for the physical
address space monitoring. It's flexible but makes the code ugly and
type-unsafe[1].
To prepare for the eventual removal of the concept, this commit removes a
use case of the concept in the 'init_regions' debugfs file handling. In
detail, this commit replaces use of the id with the index of each target
in the context's targets list.
[1] https://lore.kernel.org/linux-mm/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The local variable ret is always 0. Remove it to make the code tighter.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Allow the use of a deferrable timer, which does not force CPU wake-ups
when the system is idle. A consequence is that the sample interval
becomes very unpredictable, to the point that it is not guaranteed that
the KFENCE KUnit test still passes.
Nevertheless, on power-constrained systems this may be preferable, so
let's give the user the option should they accept the above trade-off.
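A rough sketch of what the option boils down to (the config symbol and
names here are illustrative, not necessarily the exact ones from the
patch): the sampling work is declared deferrable so an idle CPU is not
woken up just to toggle the allocation gate.

static void toggle_allocation_gate(struct work_struct *work);

#ifdef CONFIG_KFENCE_DEFERRABLE
static DECLARE_DEFERRABLE_WORK(kfence_timer, toggle_allocation_gate);
#else
static DECLARE_DELAYED_WORK(kfence_timer, toggle_allocation_gate);
#endif

static void schedule_next_sample(unsigned long interval_ms)
{
	/* a deferrable work item does not force a wake-up of an idle CPU */
	queue_delayed_work(system_unbound_wq, &kfence_timer,
			   msecs_to_jiffies(interval_ms));
}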
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When CONFIG_KFENCE_NUM_OBJECTS is set to a big number, the kfence
KUnit test case test_gfpzero eats up nearly all of the CPU's resources and
an RCU stall is reported, as in the following log cut from a physical
server.
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 68-....: (14422 ticks this GP) idle=6ce/1/0x4000000000000002
softirq=592/592 fqs=7500 (t=15004 jiffies g=10677 q=20019)
Task dump for CPU 68:
task:kunit_try_catch state:R running task
stack: 0 pid: 9728 ppid: 2 flags:0x0000020a
Call trace:
dump_backtrace+0x0/0x1e4
show_stack+0x20/0x2c
sched_show_task+0x148/0x170
...
rcu_sched_clock_irq+0x70/0x180
update_process_times+0x68/0xb0
tick_sched_handle+0x38/0x74
...
gic_handle_irq+0x78/0x2c0
el1_irq+0xb8/0x140
kfree+0xd8/0x53c
test_alloc+0x264/0x310 [kfence_test]
test_gfpzero+0xf4/0x840 [kfence_test]
kunit_try_run_case+0x48/0x20c
kunit_generic_run_threadfn_adapter+0x28/0x34
kthread+0x108/0x13c
ret_from_fork+0x10/0x18
To avoid rcu_stall and unacceptable latency, a schedule point is
added to test_gfpzero.
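The fix boils down to a scheduling point in the long-running loop; a
simplified sketch (test_alloc(), test_free(), ALLOCATE_ANY and the
surrounding 'test'/'size' variables come from the kfence test and are
shown here only for illustration):

	char *buf;
	int i;

	for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) {
		buf = test_alloc(test, size, GFP_KERNEL | __GFP_ZERO, ALLOCATE_ANY);
		test_free(buf);
		cond_resched();	/* let RCU and other tasks make progress */
	}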
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Liu <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Tested-by: Brendan Higgins <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Wang Kefeng <[email protected]>
Cc: Daniel Latypov <[email protected]>
Cc: David Gow <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In the function kunit_test_timeout(), "300 * MSEC_PER_SEC" is declared to
represent 5 minutes. However, that is wrong on arm64, whose default HZ is
250, and in some other configurations. Use msecs_to_jiffies() to fix
this so that kunit_test_timeout() works as intended.
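A minimal sketch of the fix (assuming the value is consumed by
wait_for_completion_timeout(), which expects jiffies):

static unsigned long kunit_test_timeout(void)
{
	/* 300 * MSEC_PER_SEC is a millisecond value; convert it to jiffies */
	return msecs_to_jiffies(300 * MSEC_PER_SEC);	/* 5 minutes */
}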
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 5f3e06208920 ("kunit: test: add support for test abort")
Signed-off-by: Peng Liu <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Daniel Latypov <[email protected]>
Reviewed-by: Brendan Higgins <[email protected]>
Tested-by: Brendan Higgins <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Wang Kefeng <[email protected]>
Cc: David Gow <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "kunit: fix a UAF bug and do some optimization", v2.
This series fixes a UAF (use-after-free) that occurs when running the
kfence test case test_gfpzero, which is time-costly. The UAF bug can
easily be triggered by setting CONFIG_KFENCE_NUM_OBJECTS = 65535.
Furthermore, some optimization of the kunit tests has been done.
This patch (of 3):
KUnit creates a new thread to run the actual test case, and the main
process waits for the completion of the actual test thread until it times
out. The variable 'struct kunit test' is local to the function
kunit_try_catch_run and is used in the test case thread.
kunit_try_catch_run frees 'struct kunit test' when KUnit times out, but
the actual test case is still running, so a UAF bug is triggered.
The above problem has been observed both on a physical machine and on a
QEMU platform when running the kfence KUnit tests. It can be triggered
by setting CONFIG_KFENCE_NUM_OBJECTS = 65535. Under this setting, the
test case test_gfpzero takes hours and KUnit times out.
The following shows the panic log.
BUG: unable to handle page fault for address: ffffffff82d882e9
Call Trace:
kunit_log_append+0x58/0xd0
...
test_alloc.constprop.0.cold+0x6b/0x8a [kfence_test]
test_gfpzero.cold+0x61/0x8ab [kfence_test]
kunit_try_run_case+0x4c/0x70
kunit_generic_run_threadfn_adapter+0x11/0x20
kthread+0x166/0x190
ret_from_fork+0x22/0x30
Kernel panic - not syncing: Fatal exception
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
To solve this problem, the test case thread should be stopped when the
KUnit framework times out. The stop signal is sent in
kunit_try_catch_run(), and test_gfpzero handles it.
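Roughly, the idea looks like the sketch below (details such as the exact
struct fields differ from the real patch): keep a reference on the test
thread, stop it on timeout so it can no longer touch the freed 'struct
kunit', and have long-running tests such as test_gfpzero poll
kthread_should_stop() in their loops.

	time_remaining = wait_for_completion_timeout(&try_completion,
						     kunit_test_timeout());
	if (time_remaining == 0) {
		kunit_err(test, "try timed out\n");
		try_catch->try_result = -ETIMEDOUT;
		kthread_stop(task_struct);	/* ask the test thread to exit */
	}
	put_task_struct(task_struct);

	/* ... and in a long-running test loop: */
	if (kthread_should_stop())
		return;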
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Liu <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Brendan Higgins <[email protected]>
Tested-by: Brendan Higgins <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Wang Kefeng <[email protected]>
Cc: Daniel Latypov <[email protected]>
Cc: David Gow <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Allow enabling KFENCE after system startup by allocating its pool via the
page allocator. This provides the flexibility to enable KFENCE even if it
wasn't enabled at boot time.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Tianchen Ding <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Tested-by: Peng Liu <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "provide the flexibility to enable KFENCE", v3.
If CONFIG_CONTIG_ALLOC is not supported, we fall back to
alloc_pages_exact(). Allocating pages this way is limited by MAX_ORDER
(default 11), so allocating the kfence pool after system startup with a
large KFENCE_NUM_OBJECTS is not supported.
When handling failures in kfence_init_pool_late(), we pair
free_pages_exact() with alloc_pages_exact() for compatibility, though it
actually does the same as free_contig_range().
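A simplified sketch of the allocation path described above (error handling
and the actual pool initialization are omitted; the helper name is
illustrative):

static void *kfence_alloc_pool_late(void)
{
#ifdef CONFIG_CONTIG_ALLOC
	struct page *pages;

	pages = alloc_contig_pages(KFENCE_POOL_SIZE / PAGE_SIZE, GFP_KERNEL,
				   first_online_node, NULL);
	return pages ? page_to_virt(pages) : NULL;
#else
	/* bounded by MAX_ORDER, hence the KFENCE_NUM_OBJECTS limitation */
	return alloc_pages_exact(KFENCE_POOL_SIZE, GFP_KERNEL);
#endif
}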
This patch (of 2):
Once KFENCE is disabled by:
echo 0 > /sys/module/kfence/parameters/sample_interval
it can never be re-enabled until the next reboot.
Allow re-enabling it by writing a positive number to sample_interval.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Tianchen Ding <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
mm/Makefile already has:
obj-$(CONFIG_KFENCE) += kfence/
so 'obj-$(CONFIG_KFENCE) :=' is not needed in mm/kfence/Makefile; delete
it.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: tangmeng <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use strtobool() rather than open-coding "on" and "off" parsing.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove unnecessary done label to simplify the code.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Some users of kmap() add an offset to the kmap() address to be used
during the mapping.
When converting to kmap_local_page(), the base address does not need to
be stored because any address within the page can be used in
kunmap_local(). However, this was not clear from the documentation and
caused some questions.[1]
Document that any address in the page can be used in kunmap_local() to
clarify this for future users.
[1] https://lore.kernel.org/lkml/[email protected]/
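A minimal usage sketch of the documented behaviour ('page', 'offset',
'buf' and 'len' are assumed to exist in the caller):

	void *vaddr = kmap_local_page(page);
	void *p = vaddr + offset;

	memcpy(buf, p, len);
	kunmap_local(p);	/* any address within the page is accepted */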
[[email protected]: updates per Christoph]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ira Weiny <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The mm/ directory can almost fully be built with W=1, which would help
in local development. One remaining issue is a missing prototype for
early_memremap_pgprot_adjust().
Thus add a declaration for this function. Use mm/internal.h instead of
asm/early_ioremap.h to avoid missing type definitions and unnecessary
exposure.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
__setup() handlers should return 1 if the command line option is handled
and 0 if not (or maybe never return 0; it just pollutes init's
environment). This prevents:
Unknown kernel command line parameters \
"BOOT_IMAGE=/boot/bzImage-517rc5 hardened_usercopy=off", will be \
passed to user space.
Run /sbin/init as init process
with arguments:
/sbin/init
with environment:
HOME=/
TERM=linux
BOOT_IMAGE=/boot/bzImage-517rc5
hardened_usercopy=off
or
hardened_usercopy=on
but when "hardened_usercopy=foo" is used, there is no Unknown kernel
command line parameter.
Return 1 to indicate that the boot option has been handled.
Print a warning if strtobool() returns an error on the option string,
but do not mark this as an unknown command line option and do not cause
init's environment to be polluted with this string.
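The resulting pattern looks roughly like this (variable names are
illustrative; the real handler in mm/usercopy.c may differ in detail):

static bool enable_checks __initdata = true;

static int __init parse_hardened_usercopy(char *str)
{
	if (strtobool(str, &enable_checks))
		pr_warn("Invalid option string for hardened_usercopy: '%s'\n",
			str);
	/* always return 1 so the option is not passed to init's environment */
	return 1;
}
__setup("hardened_usercopy=", parse_hardened_usercopy);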
Link: https://lkml.kernel.org/r/[email protected]
Link: lore.kernel.org/r/[email protected]
Fixes: b5cb15d9372ab ("usercopy: Allow boot cmdline disabling of hardening")
Signed-off-by: Randy Dunlap <[email protected]>
Reported-by: Igor Zhbanov <[email protected]>
Acked-by: Chris von Recklinghausen <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
While building a small config with CONFIG_CC_OPTIMISE_FOR_SIZE, I ended
up with more than 50 copies of the following function in vmlinux because
GCC doesn't honor the 'inline' keyword:
c00243bc <copy_overflow>:
c00243bc: 94 21 ff f0 stwu r1,-16(r1)
c00243c0: 7c 85 23 78 mr r5,r4
c00243c4: 7c 64 1b 78 mr r4,r3
c00243c8: 3c 60 c0 62 lis r3,-16286
c00243cc: 7c 08 02 a6 mflr r0
c00243d0: 38 63 5e e5 addi r3,r3,24293
c00243d4: 90 01 00 14 stw r0,20(r1)
c00243d8: 4b ff 82 45 bl c001c61c <__warn_printk>
c00243dc: 0f e0 00 00 twui r0,0
c00243e0: 80 01 00 14 lwz r0,20(r1)
c00243e4: 38 21 00 10 addi r1,r1,16
c00243e8: 7c 08 03 a6 mtlr r0
c00243ec: 4e 80 00 20 blr
With -Winline, GCC reports:
/include/linux/thread_info.h:212:20: warning: inlining failed in call to 'copy_overflow': call is unlikely and code size would grow [-Winline]
copy_overflow() is an unconditional warning called by check_copy_size()
on an error path.
check_copy_size() has to remain inlined in order to benefit from constant
folding, but copy_overflow() is not worth inlining.
Uninline the warning when CONFIG_BUG is selected.
When CONFIG_BUG is not selected, WARN() does nothing, so skip it.
This reduces the size of vmlinux by almost 4 kbytes.
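A rough sketch of the approach (not the exact diff; the location of the
out-of-line definition is illustrative):

#ifdef CONFIG_BUG
void copy_overflow(int size, unsigned long count);	/* one out-of-line copy */
#else
static inline void copy_overflow(int size, unsigned long count)
{
	/* WARN() expands to nothing without CONFIG_BUG, so do nothing */
}
#endif

/* in a C file, e.g. lib/usercopy.c */
void copy_overflow(int size, unsigned long count)
{
	WARN(1, "Buffer overflow detected (%d < %lu)!\n", size, count);
}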
Link: https://lkml.kernel.org/r/e1723b9cfa924bcefcd41f69d0025b38e4c9364e.1644819985.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <[email protected]>
Cc: David Laight <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Users of usercopy_warn() were removed by commit 53944f171a89 ("mm:
remove HARDENED_USERCOPY_FALLBACK").
Remove it.
Link: https://lkml.kernel.org/r/5f26643fc70b05f8455b60b99c30c17d635fa640.1644231910.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Reviewed-by: Stephen Kitt <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Zswap has the ability to efficiently store same-value filled pages, which
can be turned on and off using the "same_filled_pages_enabled"
parameter.
However, there is currently no way to enable just this (lightweight)
functionality while not making use of the whole compressed page storage
machinery.
Add a "non_same_filled_pages_enabled" parameter which allows disabling
handling of pages that aren't same-value filled. This way, zswap can be
run in a lightweight, same-value-filled-pages-only mode.
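A minimal sketch, assuming the new knob mirrors the existing
same_filled_pages_enabled module parameter (the variable name and the
exact reject path are illustrative):

static bool zswap_non_same_filled_pages_enabled = true;
module_param_named(non_same_filled_pages_enabled,
		   zswap_non_same_filled_pages_enabled, bool, 0644);

	/* in the store path, after the same-value filled check: */
	if (!zswap_non_same_filled_pages_enabled) {
		ret = -EINVAL;
		goto reject;	/* only same-value filled pages are handled */
	}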
Link: https://lkml.kernel.org/r/7dbafa963e8bab43608189abbe2067f4b9287831.1641247624.git.maciej.szmigiero@oracle.com
Signed-off-by: Maciej S. Szmigiero <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Vitaly Wool <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
PageDoubleMap is maintained differently for anon and for shmem+file: the
shmem+file one was never cleared, because a safe place to do so could
not be found; so it would blight future use of the cached hugepage until
evicted.
See https://lore.kernel.org/lkml/[email protected]/
But page_add_file_rmap() does provide a safe place to do so (though later
than one might wish): allowing testing to return to an initial state
without a damaging drop_caches.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9a73f61bdb8a ("thp, mlock: do not mlock PTE-mapped file huge pages")
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Abhishek reported that after patch [1], hotplug operations are taking
roughly double the expected time. [2]
The reason is that the CPU callbacks that migrate_on_reclaim_init() sets
always call set_migration_target_nodes() whenever a CPU is brought
up/down.
But we only care about NUMA nodes going from having CPUs to becoming
cpuless, and vice versa, as that influences the demotion_target order.
We already have two CPU callbacks (vmstat_cpu_online() and
vmstat_cpu_dead()) that check exactly that, so get rid of the CPU
callbacks in migrate_on_reclaim_init() and only call
set_migration_target_nodes() from vmstat_cpu_{dead,online}() whenever a
NUMA node changes its N_CPU state.
[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[[email protected]: add feedback from Huang Ying]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 884a6e5d1f93b ("mm/migrate: update node demotion order on hotplug events")
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Tested-by: Baolin Wang <[email protected]>
Reported-by: Abhishek Goel <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Abhishek Goel <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Let's make it clearer at which places we actually add and remove memory
blocks -- streamlining the terminology -- and highlight which memory
blocks start out online and which start out offline.
* rename add_memory_block -> add_boot_memory_block
* rename init_memory_block -> add_memory_block
* rename unregister_memory -> remove_memory_block
* rename register_memory -> __add_memory_block
* add add_hotplug_memory_block
* mark add_boot_memory_block with __init (suggested by Oscar)
__add_memory_block() is a pure helper for add_memory_block(), remove
the somewhat obvious comment.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
test_pages_in_a_zone() is just another nasty PFN walker that can easily
stumble over ZONE_DEVICE memory ranges falling into the same memory block
as ordinary system RAM: the memmap of parts of these ranges might possibly
be uninitialized. In fact, we observed (on an older kernel) with UBSAN:
UBSAN: Undefined behaviour in ./include/linux/mm.h:1133:50
index 7 is out of range for type 'zone [5]'
CPU: 121 PID: 35603 Comm: read_all Kdump: loaded Tainted: [...]
Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.12.2 11/15/2019
Call Trace:
dump_stack+0x9a/0xf0
ubsan_epilogue+0x9/0x7a
__ubsan_handle_out_of_bounds+0x13a/0x181
test_pages_in_a_zone+0x3c4/0x500
show_valid_zones+0x1fa/0x380
dev_attr_show+0x43/0xb0
sysfs_kf_seq_show+0x1c5/0x440
seq_read+0x49d/0x1190
vfs_read+0xff/0x300
ksys_read+0xb8/0x170
do_syscall_64+0xa5/0x4b0
entry_SYSCALL_64_after_hwframe+0x6a/0xdf
RIP: 0033:0x7f01f4439b52
We seem to stumble over a memmap that contains a garbage zone id. While
we could try inserting pfn_to_online_page() calls, it will just make
memory offlining slower, because we use test_pages_in_a_zone() to make
sure we're offlining pages that all belong to the same zone.
Let's just get rid of this PFN walker and determine the single zone of a
memory block -- if any -- for early memory blocks during boot. For memory
onlining, we know the single zone already. Let's avoid any additional
memmap scanning and just rely on the zone information available during
boot.
For memory hot(un)plug, we only really care about memory blocks that:
* span a single zone (and, thereby, a single node)
* are completely System RAM (IOW, no holes, no ZONE_DEVICE)
If one of these conditions is not met, we reject memory offlining.
Hotplugged memory blocks (starting out offline) always meet both
conditions.
There are three scenarios to handle:
(1) Memory hot(un)plug
A memory block with zone == NULL cannot be offlined, corresponding to
our previous test_pages_in_a_zone() check.
After successful memory onlining/offlining, we simply set the zone
accordingly.
* Memory onlining: set the zone we just used for onlining
* Memory offlining: set zone = NULL
So a hotplugged memory block starts with zone = NULL. Once memory
onlining is done, we set the proper zone.
(2) Boot memory with !CONFIG_NUMA
We know that there is just a single pgdat, so we simply scan all zones
of that pgdat for an intersection with our memory block PFN range when
adding the memory block. If more than one zone intersects (e.g., DMA and
DMA32 on x86 for the first memory block) we set zone = NULL and
consequently mimic what test_pages_in_a_zone() used to do.
(3) Boot memory with CONFIG_NUMA
At the point in time we create the memory block devices during boot, we
don't know yet which nodes *actually* span a memory block. While we could
scan all zones of all nodes for intersections, overlapping nodes complicate
the situation and scanning all nodes is possibly expensive. But that
problem has already been solved by the code that sets the node of a memory
block and creates the link in sysfs --
do_register_memory_block_under_node().
So, we hook into the code that sets the node id for a memory block. If
we already have a different node id set for the memory block, we know
that multiple nodes *actually* have PFNs falling into our memory block:
we set zone = NULL and consequently mimic what test_pages_in_a_zone() used
to do. If there is no node id set, we do the same as (2) for the given
node.
Note that the call order in driver_init() is:
-> memory_dev_init(): create memory block devices
-> node_dev_init(): link memory block devices to the node and set the
node id
So in summary, we detect if there is a single zone responsible for this
memory block and we consequently store the zone in that case in the
memory block, updating it during memory onlining/offlining.
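For scenario (2), the boot-time detection can be sketched as follows (a
simplified sketch; the helper name and details are illustrative rather
than the exact kernel code):

static struct zone *early_zone_for_memory_block(int nid, unsigned long start_pfn,
						unsigned long nr_pages)
{
	pg_data_t *pgdat = NODE_DATA(nid);
	struct zone *zone, *matching_zone = NULL;
	int i;

	for (i = 0; i < MAX_NR_ZONES; i++) {
		zone = pgdat->node_zones + i;
		if (!zone_intersects(zone, start_pfn, nr_pages))
			continue;
		if (matching_zone)
			return NULL;	/* spans multiple zones: keep zone == NULL */
		matching_zone = zone;
	}
	return matching_zone;
}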
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reported-by: Rafael Parra <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Rafael Parra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
register_memory_block_under_node()
Patch series "drivers/base/memory: determine and store zone for single-zone memory blocks", v2.
I remember talking to Michal in the past about removing
test_pages_in_a_zone(), which we use for:
* verifying that a memory block we intend to offline is really only managed
by a single zone. We don't support offlining of memory blocks that are
managed by multiple zones (e.g., multiple nodes, DMA and DMA32)
* exposing that zone to user space via
/sys/devices/system/memory/memory*/valid_zones
Now that I identified some more cases where test_pages_in_a_zone() might
go wrong, and we received a UBSAN report (see patch #3), let's get rid of
this PFN walker.
So instead of detecting the zone at runtime with test_pages_in_a_zone() by
scanning the memmap, let's determine and remember for each memory block if
it's managed by a single zone. The stored zone can then be used for the
above two cases, avoiding a manual lookup using test_pages_in_a_zone().
This avoids eventually stumbling over uninitialized memmaps in corner
cases, especially when ZONE_DEVICE ranges partly fall into memory blocks
(that are responsible for managing System RAM).
Handling memory onlining is easy, because we online to exactly one zone.
Handling boot memory is more tricky, because we want to avoid scanning all
zones of all nodes to detect possible zones that overlap with the physical
memory region of interest. Fortunately, we already have code that
determines the applicable nodes for a memory block, to create sysfs links
-- we'll hook into that.
Patch #1 is a simple cleanup I have had lying around for a long time.
Patch #2 contains the main logic to remove test_pages_in_a_zone() and
further details.
[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]
This patch (of 2):
Let's adjust the stale terminology, making it match
unregister_memory_block_under_nodes() and
do_register_memory_block_under_node(). We're dealing with memory block
devices, which span 1..X memory sections.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Oscar Salvador <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Rafael Parra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It's misplaced since commit 7960509329c2 ("mm, memory_hotplug: print
reason for the offlining failure"). Move it to the right place.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We can use the helper macro node_spanned_pages to check whether a node
spans pages. And we can change the parameter of check_cpu_on_node to nid,
as that's what it really cares about. Thus we can further get rid of the
local variable pgdat and improve readability a bit.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If zid reaches ZONE_NORMAL, the caller will always get the NORMAL zone no
matter what zone_intersects() returns. So we can save some possible CPU
cycles by avoiding calling zone_intersects() for ZONE_NORMAL.
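The resulting lookup can be sketched as below (simplified; modelled on the
default_kernel_zone_for_pfn()-style logic the text describes):

static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn,
						unsigned long nr_pages)
{
	struct pglist_data *pgdat = NODE_DATA(nid);
	int zid;

	/* probe the zones below ZONE_NORMAL only */
	for (zid = 0; zid < ZONE_NORMAL; zid++) {
		struct zone *zone = &pgdat->node_zones[zid];

		if (zone_intersects(zone, start_pfn, nr_pages))
			return zone;
	}
	/* ZONE_NORMAL is the default; no zone_intersects() call needed */
	return &pgdat->node_zones[ZONE_NORMAL];
}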
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "A few cleanup patches around memory_hotplug".
This series contains a few patches to fix obsolete and misplaced comments,
clean up the try_offline_node function and so on.
This patch (of 4):
Since commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded
memory to zones until online"), there is no need to pass in the zone.
[[email protected]: remove the comment altogether, per David]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
node_dev_init()
... and call node_dev_init() after memory_dev_init() from driver_init(),
so before any of the existing arch/subsys calls. All online nodes should
be known at that point: early during boot, arch code determines node and
zone ranges and sets the relevant nodes online; usually this happens in
setup_arch().
This is in line with memory_dev_init(), which initializes the memory
device subsystem and creates all memory block devices.
Similar to memory_dev_init(), panic() if anything goes wrong; we don't
want to continue with such basic initialization errors.
The important part is that node_dev_init() gets called after
memory_dev_init() and after cpu_dev_init(), but before any of the relevant
archs call register_cpu() to register the new cpu device under the node
device. The latter should be the case for the current users of
topology_init().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Tested-by: Anatoly Pugachev <[email protected]> (sparc64)
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
succeeded
If register_memory() fails, we free the memory block but have already
added it to the group list, which is not good. Let's defer adding the
block to the memory group until after registering the memory block device.
We do handle it properly during unregister_memory(), but that's not
called when the registration fails.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 028fc57a1c36 ("drivers/base/memory: introduce "memory groups" to logically group memory blocks")
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
alloc_mem_cgroup_per_node_info() is called for each possible node, and
this used to be a problem because !node_online nodes didn't have the
appropriate data structure allocated. This has changed with "mm: handle
uninitialized numa nodes gracefully", so we can drop the special casing
here.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Alexey Makhalov <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Nico Pache <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Rafael Aquini <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
free_area_init_node is also called from the memoryless node
initialization path (free_area_init_memoryless_node). It doesn't really
make much sense to display the physical memory range for those nodes:
Initmem setup node XX [mem 0x0000000000000000-0x0000000000000000]
Instead, be explicit that the node is memoryless:
Initmem setup node XX as memoryless
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Rafael Aquini <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Alexey Makhalov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Nico Pache <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When a !node_online node is brought up, it needs hotplug-specific
initialization because the node could either not be initialized yet or it
could have been recycled after a previous hot-remove. hotadd_init_pgdat
is responsible for that.
Internal pgdat state is currently initialized in two places:
- hotadd_init_pgdat
- free_area_init_core_hotplug
There is no really clear cut as to what should go where, but this patch
chooses to move the whole internal state initialization into
free_area_init_core_hotplug. hotadd_init_pgdat is still responsible for
pulling all the parts together - most notably for initializing zonelists,
because those depend on the overall topology.
This patch doesn't introduce any functional change.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Rafael Aquini <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Alexey Makhalov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nico Pache <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Prior to "mm: handle uninitialized numa nodes gracefully" memory hotplug
used to allocate pgdat when memory has been added to a node
(hotadd_init_pgdat) arch_free_nodedata has been only used in the failure
path because once the pgdat is exported (to be visible by NODA_DATA(nid))
it cannot really be freed because there is no synchronization available
for that.
pgdat is allocated for each possible nodes now so the memory hotplug
doesn't need to do the ever use arch_free_nodedata so drop it.
This patch doesn't introduce any functional change.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Rafael Aquini <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Alexey Makhalov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Nico Pache <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We have had several reports [1][2][3] that the page allocator blows up
when an allocation from a possible node is requested. The underlying
reason is that NODE_DATA for the specific node is not allocated.
NUMA specific initialization is arch specific and it can vary a lot. E.g.
x86 tries to initialize all nodes that have some cpu affinity (see
init_cpu_to_node) but this can be insufficient because the node might be
cpuless for example.
One way to address this problem would be to check for !node_online nodes
when trying to get a zonelist and silently fall back to another node.
That unfortunately adds a branch to the allocator hot path and it doesn't
handle any other potential NODE_DATA users.
This patch takes a different approach (following the lead of [3]) and
pre-allocates pgdat for all possible nodes in arch-independent code -
free_area_init. All uninitialized nodes are treated as memoryless nodes.
node_state of the node is not changed because that would lead to other
side effects - e.g. sysfs representation of such a node and from past
discussions [4] it is known that some tools might have problems digesting
that.
Newly allocated pgdat only gets a minimal initialization and the rest of
the work is expected to be done by the memory hotplug - hotadd_new_pgdat
(renamed to hotadd_init_pgdat).
generic_alloc_nodedata is changed to use the memblock allocator because
neither page nor slab allocators are available at the stage when all
pgdats are allocated. Hotplug doesn't allocate pgdat anymore so we can
use the early boot allocator. The only arch specific implementation is
ia64 and that is changed to use the early allocator as well.
[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]
[3] http://lkml.kernel.org/r/[email protected]
[4] http://lkml.kernel.org/r/[email protected]
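A rough sketch of the idea in free_area_init() (error handling and the
exact helpers differ in the real patch):

	for_each_node(nid) {
		pg_data_t *pgdat;

		if (node_online(nid))
			continue;

		/* page/slab allocators are not up yet: use memblock */
		pgdat = memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);
		if (!pgdat)
			panic("Cannot allocate %zuB for node %d.\n",
			      sizeof(*pgdat), nid);

		arch_refresh_nodedata(nid, pgdat);	/* make NODE_DATA(nid) valid */
		free_area_init_memoryless_node(nid);
	}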
[[email protected]: replace comment, per Mike]
Link: https://lkml.kernel.org/r/[email protected]
Reported-by: Alexey Makhalov <[email protected]>
Tested-by: Alexey Makhalov <[email protected]>
Reported-by: Nico Pache <[email protected]>
Acked-by: Rafael Aquini <[email protected]>
Tested-by: Rafael Aquini <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
CONFIG_MEMORY_HOTPLUG
Patch series "mm, memory_hotplug: handle unitialized numa node gracefully".
The core of the fix is patch 2 which also links existing bug reports. The
high level goal is to have all possible numa nodes have their pgdat
allocated and initialized so
for_each_possible_node(nid)
NODE_DATA(nid)
will never return garbage. This has proven to be a problem in several
places when an offline NUMA node is used for an allocation, just to
realize that node_data and therefore the allocation fallback zonelists are
not initialized and such an allocation request blows up.
There were attempts to address that by checking node_online in several
places including the page allocator. This patchset approaches the problem
from a different perspective and instead of special casing, which just
adds a runtime overhead, it allocates pglist_data for each possible node.
This can add some memory overhead for platforms with a high number of
possible nodes if they do not contain any memory. This should be a rather
rare configuration though.
How to test this? David has provided an excellent howto:
http://lkml.kernel.org/r/[email protected]
Patches 1 and 3-6 are mostly cleanups. The patchset has been reviewed by
Rafael (thanks!) and the core fix tested by Rafael and Alexey (thanks to
both). David has tested as per instructions above and hasn't found any
fallouts in the memory hotplug scenarios.
This patch (of 6):
This is a preparatory patch and it doesn't introduce any functional
change. It merely pulls out arch_alloc_nodedata (and co) outside of
CONFIG_MEMORY_HOTPLUG because the following patch will need to call this
from the generic MM code.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Rafael Aquini <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Cc: Alexey Makhalov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Nico Pache <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The process_madvise() system call is expected to skip holes in the VMA
ranges passed through the 'struct iovec' vector list. But do_madvise(),
which process_madvise() calls for each entry, returns ENOMEM in case of
unmapped holes, despite the VMA being processed.
Thus process_madvise() should treat ENOMEM as expected, consider the
passed VMA as processed, and continue processing the other entries in the
vector list. Returning -ENOMEM to the user despite the VMA being
processed leaves the user unable to figure out where to start the next
madvise.
Link: https://lkml.kernel.org/r/4f091776142f2ebf7b94018146de72318474e686.1647008754.git.quic_charante@quicinc.com
Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
Signed-off-by: Charan Teja Kalla <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: madvise: return correct bytes processed with
process_madvise", v2. With the process_madvise(), always choose to return
non zero processed bytes over an error. This can help the user to know on
which VMA, passed in the 'struct iovec' vector list, is failed to advise
thus can take the decission of retrying/skipping on that VMA.
This patch (of 2):
The process_madvise() system call returns an error even after processing
some of the VMAs passed in the 'struct iovec' vector list, which leaves
the user confused about where to restart the advice next. It is also
against the syscall's man page[1], which mentions that "the return value
may be less than the total number of requested bytes, if an error occurred
after some iovec elements were already processed.".
Consider a user passing 10 VMAs in the 'struct iovec' vector list, of
which 9 are processed but one fails. Then the call just returns the error
caused by that failed VMA despite the first 9 VMAs being processed,
leaving the user confused about which VMA it failed on. Returning the
number of bytes processed here can help the user know which VMA failed
and thus retry/skip the advice on that VMA.
[1]https://man7.org/linux/man-pages/man2/process_madvise.2.html.
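From user space, the intended semantics can be sketched like this (a
hedged example; error handling is omitted and MADV_PAGEOUT is just one of
the advice values process_madvise() accepts):

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <sys/syscall.h>

int main(void)
{
	size_t len = 4096;
	void *p1 = mmap(NULL, len, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	void *p2 = mmap(NULL, len, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct iovec vec[2] = {
		{ .iov_base = p1, .iov_len = len },
		{ .iov_base = p2, .iov_len = len },
	};
	int pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);
	ssize_t ret = syscall(SYS_process_madvise, pidfd, vec, 2,
			      MADV_PAGEOUT, 0);

	if (ret < 0)
		perror("process_madvise");
	else	/* may be smaller than 2 * len if a later iovec entry failed */
		printf("advised %zd of %zu bytes\n", ret, 2 * len);
	return 0;
}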
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/125b61a0edcee5c2db8658aed9d06a43a19ccafc.1647008754.git.quic_charante@quicinc.com
Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
Signed-off-by: Charan Teja Kalla <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Using vma_lookup() verifies the start address is contained in the found
vma. This makes the code easier to read.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Hardware poison is tracked on a per-page basis, not on the head page.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use the helper macro __ATTR_RW to define KSM_ATTR to make the code
clearer. Minor readability improvement.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When a page that used to be a KSM page is faulted in from swap, and that
page had been swapped in before, the system has to make a copy and leaves
remerging the pages to a later pass of ksmd.
That is not good for performance; we'd better reduce this kind of copying.
There are some ways to reduce it, for example lessening swappiness or
applying madvise(, , MADV_MERGEABLE) to the range. So add this event to
support such tuning, just like the patch "mm, THP, swap: add THP swapping
out fallback counting".
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Yang <[email protected]>
Reviewed-by: Ran Xiaokai <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Saravanan D <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Once upon a time, all swapins counted toward memory pressure[1]. Then
Joonsoo introduced workingset detection for anonymous pages and we gained
the ability to distinguish hot from cold swapins[2][3]. But we failed to
update swap_readpage() accordingly, and now we account partial memory
pressure in the swapin path of cold memory.
Not for all situations - which adds more inconsistency: paths using the
conventional submit_bio() and lock_page() route will not see much pressure
- unless storage itself is heavily congested and the bio submissions
stall. ZRAM and ZSWAP do most of the work directly from swap_readpage()
and will see all swapins reflected as pressure.
IOW, a workload doing cold swapins could see little to no pressure
reported with on-disk swap, but potentially high pressure with a zram or
zswap backend. That confuses any psi-based health monitoring, load
shedding, proactive reclaim, or userspace OOM killing schemes that might
be in place for the workload.
Restore consistency by making all swapin stall accounting conditional on
the page actually being part of the workingset.
[1] commit 937790699be9 ("mm/page_io.c: annotate refault stalls from swap_readpage")
[2] commit aae466b0052e ("mm/swap: implement workingset detection for anonymous LRU")
[3] commit cad8320b4b39 ("mm/swap: don't SetPageWorkingset unconditionally during swapin")
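The intent in swap_readpage() can be sketched as follows (simplified; the
actual I/O submission code is elided into a comment):

	unsigned long pflags;
	bool workingset = PageWorkingset(page);

	if (workingset)
		psi_memstall_enter(&pflags);

	/* ... read the page in from zram, zswap or the swap device ... */

	if (workingset)
		psi_memstall_leave(&pflags);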
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: CGEL <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If NUMA balancing isn't used to optimize the page placement among
sockets but only among memory types, the hot pages in the fast memory
node can't be migrated (promoted) anywhere. So it's unnecessary to scan
the pages in the fast memory node by changing their PTE/PMD mappings to
PROT_NONE; this way the page faults can be avoided too.
In the test, if only the memory tiering NUMA balancing mode is enabled,
the number of NUMA balancing hint faults for the DRAM node is reduced to
almost 0 with the patch, while the benchmark score doesn't change
visibly.
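The check added to the PROT_NONE scanning path can be sketched with an
illustrative helper (not the exact kernel code; it relies on
sysctl_numa_balancing_mode and node_is_toptier() from this series):

static bool numa_scan_skip_page(struct page *page)
{
	/*
	 * Fast-memory (top-tier) pages cannot be promoted anywhere when
	 * only the memory tiering mode is enabled, so don't bother making
	 * them PROT_NONE.
	 */
	return !(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
	       (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
	       node_is_toptier(page_to_nid(page));
}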
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Tested-by: Baolin Wang <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: zhongjiang-ali <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory). The
memory subsystem of these machines can be called a memory tiering system,
because the performance of the different types of memory is usually
different.
In such a system, because of changes in the memory access pattern etc.,
some pages in the slow memory may become globally hot. So in this patch,
the NUMA balancing mechanism is enhanced to dynamically optimize the page
placement among the different memory types according to hot/cold.
In a typical memory tiering system, there are CPUs, fast memory and slow
memory in each physical NUMA node. The CPUs and the fast memory will be
put in one logical node (called fast memory node), while the slow memory
will be put in another (faked) logical node (called slow memory node).
That is, the fast memory is regarded as local while the slow memory is
regarded as remote. So it's possible for the recently accessed pages in
the slow memory node to be promoted to the fast memory node via the
existing NUMA balancing mechanism.
The original NUMA balancing mechanism will stop migrating pages if the
free memory of the target node falls below the high watermark. This is a
reasonable policy if there's only one memory type. But it makes the
original NUMA balancing mechanism almost not work for optimizing page
placement among different memory types. Details are as follows.
It is common for the working-set size of the workload to be larger than
the size of the fast memory nodes; otherwise it's unnecessary to use the
slow memory at all. So there are almost never enough free pages in the
fast memory nodes, and thus the globally hot pages in the slow memory
node cannot be promoted to the fast memory node. To solve the issue, we
have 2 choices as follows,
a. Ignore the free pages watermark checking when promoting hot pages
from the slow memory node to the fast memory node. This will
create some memory pressure in the fast memory node, thus trigger
the memory reclaiming. So that, the cold pages in the fast memory
node will be demoted to the slow memory node.
b. Define a new watermark called wmark_promo which is higher than
wmark_high, and have kswapd reclaiming pages until free pages reach
such watermark. The scenario is as follows: when we want to promote
hot-pages from a slow memory to a fast memory, but fast memory's free
pages would go lower than high watermark with such promotion, we wake
up kswapd with wmark_promo watermark in order to demote cold pages and
free us up some space. So, next time we want to promote hot-pages we
might have a chance of doing so.
The choice "a" may create high memory pressure in the fast memory node.
If the memory pressure of the workload is high, the memory pressure
may become so high that the memory allocation latency of the workload
is influenced, e.g. the direct reclaiming may be triggered.
The choice "b" works much better at this aspect. If the memory
pressure of the workload is high, the hot pages promotion will stop
earlier because its allocation watermark is higher than that of the
normal memory allocation. So in this patch, choice "b" is implemented.
A new zone watermark (WMARK_PROMO) is added, which is larger than the
high watermark and can be controlled via watermark_scale_factor.
In addition to the original page placement optimization among sockets,
the NUMA balancing mechanism is extended to be used to optimize page
placement according to hot/cold among different memory types. So the
sysctl user space interface (numa_balancing) is extended in a backward
compatible way as follows, so that users can enable/disable this
functionality individually.
The sysctl is converted from a Boolean value to a bit field. The
definitions of the flags are:
- 0: NUMA_BALANCING_DISABLED
- 1: NUMA_BALANCING_NORMAL
- 2: NUMA_BALANCING_MEMORY_TIERING
We have tested the patch with the pmbench memory accessing benchmark
with the 80:20 read/write ratio and the Gauss access address
distribution on a 2 socket Intel server with Optane DC Persistent
Memory Model. The test results show that the pmbench score can improve
by up to 95.9%.
Thanks to Andrew Morton for helping fix the documentation format error.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Tested-by: Baolin Wang <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: zhongjiang-ali <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Feng Tang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "NUMA balancing: optimize memory placement for memory tiering system", v13
With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory). The
memory subsystem of these machines can be called a memory tiering system,
because the performance of the different types of memory is different.
After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory for
use like normal RAM"), the PMEM could be used as the cost-effective
volatile memory in separate NUMA nodes. In a typical memory tiering
system, there are CPUs, DRAM and PMEM in each physical NUMA node. The
CPUs and the DRAM will be put in one logical node, while the PMEM will
be put in another (faked) logical node.
To optimize the system overall performance, the hot pages should be
placed in DRAM node. To do that, we need to identify the hot pages in
the PMEM node and migrate them to DRAM node via NUMA migration.
In the original NUMA balancing, there is already a set of existing
mechanisms to identify the pages recently accessed by the CPUs in a node
and migrate the pages to the node. So we can reuse these mechanisms to
build the mechanisms to optimize the page placement in the memory
tiering system. This is implemented in this patchset.
On the other hand, the cold pages should be placed in the PMEM node. So
we also need to identify the cold pages in the DRAM node and migrate them
to the PMEM node.
In commit 26aa2d199d6f ("mm/migrate: demote pages during reclaim"), a
mechanism to demote the cold DRAM pages to PMEM node under memory
pressure is implemented. Based on that, the cold DRAM pages can be
demoted to PMEM node proactively to free some memory space on DRAM node
to accommodate the promoted hot PMEM pages. This is implemented in this
patchset too.
We have tested the solution with the pmbench memory accessing benchmark
with the 80:20 read/write ratio and the Gauss access address
distribution on a 2 socket Intel server with Optane DC Persistent Memory
Model. The test results show that the pmbench score can improve by up to
95.9%.
This patch (of 3):
In a system with multiple memory types, e.g. DRAM and PMEM, the CPU
and DRAM in one socket will be put in one NUMA node as before, while
the PMEM will be put in another NUMA node, as described in commit
c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like
normal RAM"). So, the NUMA balancing
mechanism will identify all PMEM accesses as remote access and try to
promote the PMEM pages to DRAM.
To distinguish the number of inter-type promoted pages from that of
inter-socket migrated pages, a new vmstat counter is added. The counter
is per-node (counted in the target node), so it can be used to identify
promotion imbalance among the NUMA nodes.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Tested-by: Baolin Wang <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: zhongjiang-ali <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
With commit a4e92ce8e4c8 ("powerpc/fadump: Reservationless firmware
assisted dump"), Linux kernel's Contiguous Memory Allocator (CMA) based
reservation was introduced in fadump. That change was aimed at using CMA
to let applications utilize the memory reserved for fadump while blocking
it from being used for kernel pages. The assumption was, even if CMA
activation fails for whatever reason, the memory still remains reserved to
avoid it from being used for kernel pages. But commit 072355c1cf2d
("mm/cma: expose all pages to the buddy if activation of an area fails")
breaks this assumption as it started exposing all pages to buddy allocator
on CMA activation failure. It led to warning messages like the one below
while running crash-utility on the vmcore of a kernel having the above two
commits:
crash: seek error: kernel virtual address: <from reserved region>
To fix this problem, opt out from exposing pages to buddy allocator on CMA
activation failure for fadump reserved memory.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hari Bathini <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Michael Ellerman <[email protected]>
Cc: Mahesh Salgaonkar <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Sourabh Jain <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "powerpc/fadump: handle CMA activation failure appropriately", v3.
Commit 072355c1cf2d ("mm/cma: expose all pages to the buddy if
activation of an area fails") started exposing all pages to buddy
allocator on CMA activation failure. But there can be CMA users that
want to handle the reserved memory differently on CMA allocation
failure.
Provide an option to opt out from exposing pages to buddy for such
cases.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hari Bathini <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mahesh Salgaonkar <[email protected]>
Cc: Sourabh Jain <[email protected]>
Cc: Michael Ellerman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|