blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2020-01-04	mm/memory_hotplug: shrink zones when offlining memory	David Hildenbrand	1	-2/+5
	We currently try to shrink a single zone when removing memory. We use the zone of the first page of the memory we are removing. If that memmap was never initialized (e.g., memory was never onlined), we will read garbage and can trigger kernel BUGs (due to a stale pointer): BUG: unable to handle page fault for address: 000000000000353d #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] SMP PTI CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4 Workqueue: kacpi_hotplug acpi_hotplug_work_fn RIP: 0010:clear_zone_contiguous+0x5/0x10 Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840 RSP: 0018:ffffad2400043c98 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000 RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40 RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000 R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680 FS: 0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __remove_pages+0x4b/0x640 arch_remove_memory+0x63/0x8d try_remove_memory+0xdb/0x130 __remove_memory+0xa/0x11 acpi_memory_device_remove+0x70/0x100 acpi_bus_trim+0x55/0x90 acpi_device_hotplug+0x227/0x3a0 acpi_hotplug_work_fn+0x1a/0x30 process_one_work+0x221/0x550 worker_thread+0x50/0x3b0 kthread+0x105/0x140 ret_from_fork+0x3a/0x50 Modules linked in: CR2: 000000000000353d Instead, shrink the zones when offlining memory or when onlining failed. Introduce and use remove_pfn_range_from_zone(() for that. We now properly shrink the zones, even if we have DIMMs whereby - Some memory blocks fall into no zone (never onlined) - Some memory blocks fall into multiple zones (offlined+re-onlined) - Multiple memory blocks that fall into different zones Drop the zone parameter (with a potential dubious value) from __remove_pages() and __remove_section(). Link: http://lkml.kernel.org/r/[email protected] Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Dan Williams <[email protected]> Cc: Logan Gunthorpe <[email protected]> Cc: <[email protected]> [5.0+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01	mm/memory_hotplug.c: remove __online_page_set_limits()	Souptick Joarder	1	-2/+0
	__online_page_set_limits() is a dummy function - remove it and all callers. Link: http://lkml.kernel.org/r/8e1bc9d3b492f6bde16e95ebc1dee11d6aefabd7.1567889743.git.jrdr.linux@gmail.com Link: http://lkml.kernel.org/r/854db2cf8145d9635249c95584d9a91fd774a229.1567889743.git.jrdr.linux@gmail.com Link: http://lkml.kernel.org/r/9afe6c5a18158f3884a6b302ac2c772f3da49ccc.1567889743.git.jrdr.linux@gmail.com Signed-off-by: Souptick Joarder <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Juergen Gross <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01	include/linux/memory_hotplug.h: move definitions of {set,clear}_zone_contiguous	Ben Dooks (Codethink)	1	-3/+3
	The {set,clear}_zone_contiguous are built whatever the configuratoon so move the definitions outside the current ifdef to avoid the following compiler warnings: mm/page_alloc.c:1550:6: warning: no previous prototype for 'set_zone_contiguous' [-Wmissing-prototypes] mm/page_alloc.c:1571:6: warning: no previous prototype for 'clear_zone_contiguous' [-Wmissing-prototypes] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ben Dooks (Codethink) <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01	mm/memory_hotplug: remove __online_page_free() and ↵	David Hildenbrand	1	-2/+0
	__online_page_increment_counters() Let's drop the now unused functions. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Wei Yang <[email protected]> Cc: Dan Williams <[email protected]> Cc: Qian Cai <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Stephen Hemminger <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01	mm/memory_hotplug: export generic_online_page()	David Hildenbrand	1	-0/+1
	Patch series "mm/memory_hotplug: Export generic_online_page()". Let's replace the __online_page...() functions by generic_online_page(). Hyper-V only wants to delay the actual onlining of un-backed pages, so we can simpy re-use the generic function. This patch (of 3): Let's expose generic_online_page() so online_page_callback users can simply fall back to the generic implementation when actually deciding to online the pages. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Dan Williams <[email protected]> Cc: Wei Yang <[email protected]> Cc: Qian Cai <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Stephen Hemminger <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-18	mm/sparsemem: support sub-section hotplug	Dan Williams	1	-1/+1
	The libnvdimm sub-system has suffered a series of hacks and broken workarounds for the memory-hotplug implementation's awkward section-aligned (128MB) granularity. For example the following backtrace is emitted when attempting arch_add_memory() with physical address ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section: # cat /proc/iomem \| grep -A1 -B1 Persistent\ Memory 100000000-1ffffffff : System RAM 200000000-303ffffff : Persistent Memory (legacy) 304000000-43fffffff : System RAM 440000000-23ffffffff : Persistent Memory 2400000000-43bfffffff : Persistent Memory 2400000000-43bfffffff : namespace2.0 WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60 [..] RIP: 0010:add_pages+0x5c/0x60 [..] Call Trace: devm_memremap_pages+0x460/0x6e0 pmem_attach_disk+0x29e/0x680 [nd_pmem] ? nd_dax_probe+0xfc/0x120 [libnvdimm] nvdimm_bus_probe+0x66/0x160 [libnvdimm] It was discovered that the problem goes beyond RAM vs PMEM collisions as some platform produce PMEM vs PMEM collisions within a given section. The libnvdimm workaround for that case revealed that the libnvdimm section-alignment-padding implementation has been broken for a long while. A fix for that long-standing breakage introduces as many problems as it solves as it would require a backward-incompatible change to the namespace metadata interpretation. Instead of that dubious route [1], address the root problem in the memory-hotplug implementation. Note that EEXIST is no longer treated as success as that is how sparse_add_section() reports subsection collisions, it was also obviated by recent changes to perform the request_region() for 'System RAM' before arch_add_memory() in the add_memory() sequence. [1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com [[email protected]: fix deactivate_section for early sections] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Oscar Salvador <[email protected]> Tested-by: Aneesh Kumar K.V <[email protected]> [ppc64] Reviewed-by: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Logan Gunthorpe <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jane Chu <[email protected]> Cc: Jeff Moyer <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Wei Yang <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-18	mm/sparsemem: prepare for sub-section ranges	Dan Williams	1	-2/+3
	Prepare the memory hot-{add,remove} paths for handling sub-section ranges by plumbing the starting page frame and number of pages being handled through arch_{add,remove}_memory() to sparse_{add,remove}_one_section(). This is simply plumbing, small cleanups, and some identifier renames. No intended functional changes. Link: http://lkml.kernel.org/r/156092353780.979959.9713046515562743194.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <[email protected]> Reviewed-by: Pavel Tatashin <[email protected]> Tested-by: Aneesh Kumar K.V <[email protected]> [ppc64] Reviewed-by: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Logan Gunthorpe <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jane Chu <[email protected]> Cc: Jeff Moyer <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Wei Yang <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-18	mm/memory_hotplug: move and simplify walk_memory_blocks()	David Hildenbrand	1	-2/+0
	Let's move walk_memory_blocks() to the place where memory block logic resides and simplify it. While at it, add a type for the callback function. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Andrew Banman <[email protected]> Cc: Mike Travis <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Wei Yang <[email protected]> Cc: Arun KS <[email protected]> Cc: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-18	mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of ↵	David Hildenbrand	1	-1/+1
	pfns walk_memory_range() was once used to iterate over sections. Now, it iterates over memory blocks. Rename the function, fixup the documentation. Also, pass start+size instead of PFNs, which is what most callers already have at hand. (we'll rework link_mem_sections() most probably soon) Follow-up patches will rework, simplify, and move walk_memory_blocks() to drivers/base/memory.c. Note: walk_memory_blocks() only works correctly right now if the start_pfn is aligned to a section start. This is the case right now, but we'll generalize the function in a follow up patch so the semantics match the documentation. [[email protected]: remove unused variable] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Len Brown <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Rashmica Gupta <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Michael Neuling <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Wei Yang <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Qian Cai <[email protected]> Cc: Arun KS <[email protected]> Cc: Nick Desaulniers <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-18	mm/memory_hotplug: remove "zone" parameter from sparse_remove_one_section	David Hildenbrand	1	-1/+1
	The parameter is unused, so let's drop it. Memory removal paths should never care about zones. This is the job of memory offlining and will require more refactorings. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Wei Yang <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Alex Deucher <[email protected]> Cc: Andrew Banman <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Arun KS <[email protected]> Cc: Baoquan He <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chintan Pandya <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Chris Wilson <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jonathan Cameron <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Jun Yao <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Logan Gunthorpe <[email protected]> Cc: Mark Brown <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: "[email protected]" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Qian Cai <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Rich Felker <[email protected]> Cc: Rob Herring <[email protected]> Cc: Robin Murphy <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-18	mm/memory_hotplug: drop MHP_MEMBLOCK_API	David Hildenbrand	1	-8/+0
	No longer needed, the callers of arch_add_memory() can handle this manually. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Wei Yang <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Qian Cai <[email protected]> Cc: Arun KS <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Alex Deucher <[email protected]> Cc: Andrew Banman <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Baoquan He <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chintan Pandya <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Chris Wilson <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jonathan Cameron <[email protected]> Cc: Jun Yao <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Logan Gunthorpe <[email protected]> Cc: Mark Brown <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: "[email protected]" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Rich Felker <[email protected]> Cc: Rob Herring <[email protected]> Cc: Robin Murphy <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-18	mm/memory_hotplug: allow arch_remove_memory() without CONFIG_MEMORY_HOTREMOVE	David Hildenbrand	1	-2/+0
	We want to improve error handling while adding memory by allowing to use arch_remove_memory() and __remove_pages() even if CONFIG_MEMORY_HOTREMOVE is not set to e.g., implement something like: arch_add_memory() rc = do_something(); if (rc) { arch_remove_memory(); } We won't get rid of CONFIG_MEMORY_HOTREMOVE for now, as it will require quite some dependencies for memory offlining. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Pavel Tatashin <[email protected]> Cc: Tony Luck <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Rich Felker <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Alex Deucher <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Mark Brown <[email protected]> Cc: Chris Wilson <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Rob Herring <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: "[email protected]" <[email protected]> Cc: Andrew Banman <[email protected]> Cc: Arun KS <[email protected]> Cc: Qian Cai <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Baoquan He <[email protected]> Cc: Logan Gunthorpe <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chintan Pandya <[email protected]> Cc: Dan Williams <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jonathan Cameron <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Jun Yao <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Robin Murphy <[email protected]> Cc: Wei Yang <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-16	mm/hotplug: make remove_memory() interface usable	Pavel Tatashin	1	-2/+6
	Presently the remove_memory() interface is inherently broken. It tries to remove memory but panics if some memory is not offline. The problem is that it is impossible to ensure that all memory blocks are offline as this function also takes lock_device_hotplug that is required to change memory state via sysfs. So, between calling this function and offlining all memory blocks there is always a window when lock_device_hotplug is released, and therefore, there is always a chance for a panic during this window. Make this interface to return an error if memory removal fails. This way it is safe to call this function without panicking machine, and also makes it symmetric to add_memory() which already returns an error. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Pavel Tatashin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Fengguang Wu <[email protected]> Cc: Huang Ying <[email protected]> Cc: James Morris <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Keith Busch <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Takashi Iwai <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: Vishal Verma <[email protected]> Cc: Yaowei Bai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14	mm/memory_hotplug: make __remove_pages() and arch_remove_memory() never fail	David Hildenbrand	1	-4/+4
	All callers of arch_remove_memory() ignore errors. And we should really try to remove any errors from the memory removal path. No more errors are reported from __remove_pages(). BUG() in s390x code in case arch_remove_memory() is triggered. We may implement that properly later. WARN in case powerpc code failed to remove the section mapping, which is better than ignoring the error completely right now. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Tony Luck <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Rich Felker <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Stefan Agner <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Arun KS <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Rob Herring <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Wei Yang <[email protected]> Cc: Qian Cai <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Andrew Banman <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Mike Travis <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14	mm, memory_hotplug: provide a more generic restrictions for memory hotplug	Michal Hocko	1	-7/+24
	arch_add_memory, __add_pages take a want_memblock which controls whether the newly added memory should get the sysfs memblock user API (e.g. ZONE_DEVICE users do not want/need this interface). Some callers even want to control where do we allocate the memmap from by configuring altmap. Add a more generic hotplug context for arch_add_memory and __add_pages. struct mhp_restrictions contains flags which contains additional features to be enabled by the memory hotplug (MHP_MEMBLOCK_API currently) and altmap for alternative memmap allocator. This patch shouldn't introduce any functional change. [[email protected]: build fix] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Oscar Salvador <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14	mm, memory_hotplug: cleanup memory offline path	Michal Hocko	1	-1/+2
	check_pages_isolated_cb currently accounts the whole pfn range as being offlined if test_pages_isolated suceeds on the range. This is based on the assumption that all pages in the range are freed which is currently the case in most cases but it won't be with later changes, as pages marked as vmemmap won't be isolated. Move the offlined pages counting to offline_isolated_pages_cb and rely on __offline_isolated_pages to return the correct value. check_pages_isolated_cb will still do it's primary job and check the pfn range. While we are at it remove check_pages_isolated and offline_isolated_pages and use directly walk_system_ram_range as do in online_pages. Link: http://lkml.kernel.org/r/[email protected] Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Oscar Salvador <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-03-11	Merge tag 'for-linus-5.1a-rc1-tag' of ↵	Linus Torvalds	1	-0/+2
	git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: "xen fixes and features: - remove fallback code for very old Xen hypervisors - three patches for fixing Xen dom0 boot regressions - an old patch for Xen PCI passthrough which was never applied for unknown reasons - some more minor fixes and cleanup patches" * tag 'for-linus-5.1a-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: fix dom0 boot on huge systems xen, cpu_hotplug: Prevent an out of bounds access xen: remove pre-xen3 fallback handlers xen/ACPI: Switch to bitmap_zalloc() x86/xen: dont add memory above max allowed allocation x86: respect memory size limiting via mem= parameter xen/gntdev: Check and release imported dma-bufs on close xen/gntdev: Do not destroy context while dma-bufs are in use xen/pciback: Don't disable PCI_COMMAND on PCI device reset. xen-scsiback: mark expected switch fall-through xen: mark expected switch fall-through
2019-03-05	mm/page_alloc.c: memory hotplug: free pages as higher order	Arun KS	1	-1/+1
	When freeing pages are done with higher order, time spent on coalescing pages by buddy allocator can be reduced. With section size of 256MB, hot add latency of a single section shows improvement from 50-60 ms to less than 1 ms, hence improving the hot add latency by 60 times. Modify external providers of online callback to align with the change. [[email protected]: v11] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: remove unused local, per Arun] [[email protected]: avoid return of void-returning __free_pages_core(), per Oscar] [[email protected]: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch] [[email protected]: v8] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: v9] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arun KS <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Alexander Duyck <[email protected]> Cc: K. Y. Srinivasan <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Dan Williams <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Souptick Joarder <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Srivatsa Vaddagiri <[email protected]> Cc: Vinayak Menon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-18	x86: respect memory size limiting via mem= parameter	Juergen Gross	1	-0/+2
	When limiting memory size via kernel parameter "mem=" this should be respected even in case of memory made accessible via a PCI card. Today this kind of memory won't be made usable in initial memory setup as the memory won't be visible in E820 map, but it might be added when adding PCI devices due to corresponding ACPI table entries. Not respecting "mem=" can be corrected by adding a global max_mem_size variable set by parse_memopt() which will result in rejecting adding memory areas resulting in a memory size above the allowed limit. Signed-off-by: Juergen Gross <[email protected]> Acked-by: Ingo Molnar <[email protected]> Reviewed-by: William Kucharski <[email protected]> Signed-off-by: Juergen Gross <[email protected]>
2019-02-01	mm/hotplug: invalid PFNs from pfn_to_online_page()	Qian Cai	1	-8/+10
	On an arm64 ThunderX2 server, the first kmemleak scan would crash [1] with CONFIG_DEBUG_VM_PGFLAGS=y due to page_to_nid() found a pfn that is not directly mapped (MEMBLOCK_NOMAP). Hence, the page->flags is uninitialized. This is due to the commit 9f1eb38e0e11 ("mm, kmemleak: little optimization while scanning") starts to use pfn_to_online_page() instead of pfn_valid(). However, in the CONFIG_MEMORY_HOTPLUG=y case, pfn_to_online_page() does not call memblock_is_map_memory() while pfn_valid() does. Historically, the commit 68709f45385a ("arm64: only consider memblocks with NOMAP cleared for linear mapping") causes pages marked as nomap being no long reassigned to the new zone in memmap_init_zone() by calling __init_single_page(). Since the commit 2d070eab2e82 ("mm: consider zone which is not fully populated to have holes") introduced pfn_to_online_page() and was designed to return a valid pfn only, but it is clearly broken on arm64. Therefore, let pfn_to_online_page() call pfn_valid_within(), so it can handle nomap thanks to the commit f52bb98f5ade ("arm64: mm: always enable CONFIG_HOLES_IN_ZONE"), while it will be optimized away on architectures where have no HOLES_IN_ZONE. [1] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000006 Mem abort info: ESR = 0x96000005 Exception class = DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000005 CM = 0, WnR = 0 Internal error: Oops: 96000005 [#1] SMP CPU: 60 PID: 1408 Comm: kmemleak Not tainted 5.0.0-rc2+ #8 pstate: 60400009 (nZCv daif +PAN -UAO) pc : page_mapping+0x24/0x144 lr : __dump_page+0x34/0x3dc sp : ffff00003a5cfd10 x29: ffff00003a5cfd10 x28: 000000000000802f x27: 0000000000000000 x26: 0000000000277d00 x25: ffff000010791f56 x24: ffff7fe000000000 x23: ffff000010772f8b x22: ffff00001125f670 x21: ffff000011311000 x20: ffff000010772f8b x19: fffffffffffffffe x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: ffff802698b19600 x13: ffff802698b1a200 x12: ffff802698b16f00 x11: ffff802698b1a400 x10: 0000000000001400 x9 : 0000000000000001 x8 : ffff00001121a000 x7 : 0000000000000000 x6 : ffff0000102c53b8 x5 : 0000000000000000 x4 : 0000000000000003 x3 : 0000000000000100 x2 : 0000000000000000 x1 : ffff000010772f8b x0 : ffffffffffffffff Process kmemleak (pid: 1408, stack limit = 0x(____ptrval____)) Call trace: page_mapping+0x24/0x144 __dump_page+0x34/0x3dc dump_page+0x28/0x4c kmemleak_scan+0x4ac/0x680 kmemleak_scan_thread+0xb4/0xdc kthread+0x12c/0x13c ret_from_fork+0x10/0x18 Code: d503201f f9400660 36000040 d1000413 (f9400661) ---[ end trace 4d4bd7f573490c8e ]--- Kernel panic - not syncing: Fatal exception SMP: stopping secondary CPUs Kernel Offset: disabled CPU features: 0x002,20000c38 Memory Limit: none ---[ end Kernel panic - not syncing: Fatal exception ]--- Link: http://lkml.kernel.org/r/[email protected] Fixes: 9f1eb38e0e11 ("mm, kmemleak: little optimization while scanning") Signed-off-by: Qian Cai <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-12-28	include/linux/memory_hotplug.h: remove duplicate declaration of offline_pages()	Wei Yang	1	-1/+0
	offline_pages() is already declared in this file. Just remove the duplicated one. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-12-28	mm, sparse: pass nid instead of pgdat to sparse_add_one_section()	Wei Yang	1	-2/+2
	Since the information needed in sparse_add_one_section() is node id to allocate proper memory, it is not necessary to pass its pgdat. This patch changes the prototype of sparse_add_one_section() to pass node id directly. This is intended to reduce misleading that sparse_add_one_section() would touch pgdat. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-12-28	mm, memory_hotplug: add nid parameter to arch_remove_memory	Oscar Salvador	1	-2/+2
	Patch series "Do not touch pages in hot-remove path", v2. This patchset aims for two things: 1) A better definition about offline and hot-remove stage 2) Solving bugs where we can access non-initialized pages during hot-remove operations [2] [3]. This is achieved by moving all page/zone handling to the offline stage, so we do not need to access pages when hot-removing memory. [1] https://patchwork.kernel.org/cover/10691415/ [2] https://patchwork.kernel.org/patch/10547445/ [3] https://www.spinics.net/lists/linux-mm/msg161316.html This patch (of 5): This is a preparation for the following-up patches. The idea of passing the nid is that it will allow us to get rid of the zone parameter afterwards. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Pavel Tatashin <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Jonathan Cameron <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-12-28	mm/memory_hotplug: drop "online" parameter from add_memory_resource()	David Hildenbrand	1	-1/+1
	Userspace should always be in charge of how to online memory and if memory should be onlined automatically in the kernel. Let's drop the parameter to overwrite this - XEN passes memhp_auto_online, just like add_memory(), so we can directly use that instead internally. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Acked-by: Juergen Gross <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Dan Williams <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Arun KS <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-31	mm/memory_hotplug: make add_memory() take the device_hotplug_lock	David Hildenbrand	1	-0/+1
	add_memory() currently does not take the device_hotplug_lock, however is aleady called under the lock from arch/powerpc/platforms/pseries/hotplug-memory.c drivers/acpi/acpi_memhotplug.c to synchronize against CPU hot-remove and similar. In general, we should hold the device_hotplug_lock when adding memory to synchronize against online/offline request (e.g. from user space) - which already resulted in lock inversions due to device_lock() and mem_hotplug_lock - see 30467e0b3be ("mm, hotplug: fix concurrent memory hot-add deadlock"). add_memory()/add_memory_resource() will create memory block devices, so this really feels like the right thing to do. Holding the device_hotplug_lock makes sure that a memory block device can really only be accessed (e.g. via .online/.state) from user space, once the memory has been fully added to the system. The lock is not held yet in drivers/xen/balloon.c arch/powerpc/platforms/powernv/memtrace.c drivers/s390/char/sclp_cmd.c drivers/hv/hv_balloon.c So, let's either use the locked variants or take the lock. Don't export add_memory_resource(), as it once was exported to be used by XEN, which is never built as a module. If somebody requires it, we also have to export a locked variant (as device_hotplug_lock is never exported). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Pavel Tatashin <[email protected]> Reviewed-by: Rafael J. Wysocki <[email protected]> Reviewed-by: Rashmica Gupta <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Len Brown <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Nathan Fontenot <[email protected]> Cc: John Allen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Dan Williams <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: YASUAKI ISHIMATSU <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kate Stewart <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michael Neuling <[email protected]> Cc: Philippe Ombredanne <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-31	mm/memory_hotplug: make remove_memory() take the device_hotplug_lock	David Hildenbrand	1	-1/+2
	Patch series "mm: online/offline_pages called w.o. mem_hotplug_lock", v3. Reading through the code and studying how mem_hotplug_lock is to be used, I noticed that there are two places where we can end up calling device_online()/device_offline() - online_pages()/offline_pages() without the mem_hotplug_lock. And there are other places where we call device_online()/device_offline() without the device_hotplug_lock. While e.g. echo "online" > /sys/devices/system/memory/memory9/state is fine, e.g. echo 1 > /sys/devices/system/memory/memory9/online Will not take the mem_hotplug_lock. However the device_lock() and device_hotplug_lock. E.g. via memory_probe_store(), we can end up calling add_memory()->online_pages() without the device_hotplug_lock. So we can have concurrent callers in online_pages(). We e.g. touch in online_pages() basically unprotected zone->present_pages then. Looks like there is a longer history to that (see Patch #2 for details), and fixing it to work the way it was intended is not really possible. We would e.g. have to take the mem_hotplug_lock in device/base/core.c, which sounds wrong. Summary: We had a lock inversion on mem_hotplug_lock and device_lock(). More details can be found in patch 3 and patch 6. I propose the general rules (documentation added in patch 6): 1. add_memory/add_memory_resource() must only be called with device_hotplug_lock. 2. remove_memory() must only be called with device_hotplug_lock. This is already documented and holds for all callers. 3. device_online()/device_offline() must only be called with device_hotplug_lock. This is already documented and true for now in core code. Other callers (related to memory hotplug) have to be fixed up. 4. mem_hotplug_lock is taken inside of add_memory/remove_memory/ online_pages/offline_pages. To me, this looks way cleaner than what we have right now (and easier to verify). And looking at the documentation of remove_memory, using lock_device_hotplug also for add_memory() feels natural. This patch (of 6): remove_memory() is exported right now but requires the device_hotplug_lock, which is not exported. So let's provide a variant that takes the lock and only export that one. The lock is already held in arch/powerpc/platforms/pseries/hotplug-memory.c drivers/acpi/acpi_memhotplug.c arch/powerpc/platforms/powernv/memtrace.c Apart from that, there are not other users in the tree. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Pavel Tatashin <[email protected]> Reviewed-by: Rafael J. Wysocki <[email protected]> Reviewed-by: Rashmica Gupta <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Len Brown <[email protected]> Cc: Rashmica Gupta <[email protected]> Cc: Michael Neuling <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Nathan Fontenot <[email protected]> Cc: John Allen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Dan Williams <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: YASUAKI ISHIMATSU <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kate Stewart <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Philippe Ombredanne <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22	mm/page_alloc: Introduce free_area_init_core_hotplug	Oscar Salvador	1	-0/+1
	Currently, whenever a new node is created/re-used from the memhotplug path, we call free_area_init_node()->free_area_init_core(). But there is some code that we do not really need to run when we are coming from such path. free_area_init_core() performs the following actions: 1) Initializes pgdat internals, such as spinlock, waitqueues and more. 2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on when creating hash tables. 3) Account number of managed_pages per zone, substracting dma_reserved and memmap pages. 4) Initializes some fields of the zone structure data 5) Calls init_currently_empty_zone to initialize all the freelists 6) Calls memmap_init to initialize all pages belonging to certain zone When called from memhotplug path, free_area_init_core() only performs actions #1 and #4. Action #2 is pointless as the zones do not have any pages since either the node was freed, or we are re-using it, eitherway all zones belonging to this node should have 0 pages. For the same reason, action #3 results always in manages_pages being 0. Action #5 and #6 are performed later on when onlining the pages: online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone() online_pages()->move_pfn_range_to_zone()->memmap_init_zone() This patch does two things: First, moves the node/zone initializtion to their own function, so it allows us to create a small version of free_area_init_core, where we only perform: 1) Initialization of pgdat internals, such as spinlock, waitqueues and more 4) Initialization of some fields of the zone structure data These two functions are: pgdat_init_internals() and zone_init_internals(). The second thing this patch does, is to introduce free_area_init_core_hotplug(), the memhotplug version of free_area_init_core(): Currently, we call free_area_init_node() from the memhotplug path. In there, we set some pgdat's fields, and call calculate_node_totalpages(). calculate_node_totalpages() calculates the # of pages the node has. Since the node is either new, or we are re-using it, the zones belonging to this node should not have any pages, so there is no point to calculate this now. Actually, we re-set these values to 0 later on with the calls to: reset_node_managed_pages() reset_node_present_pages() The # of pages per node and the # of pages per zone will be calculated when onlining the pages: online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range() online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range() Also, since free_area_init_core/free_area_init_node will now only get called during early init, let us replace __paginginit with __init, so their code gets freed up. [[email protected]: fix section usage] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: v6] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Reviewed-by: Pavel Tatashin <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07	mm: move is_pageblock_removable_nolock() to mm/memory_hotplug.c	Mathieu Malaterre	1	-1/+0
	is_pageblock_removable_nolock() is not used outside of mm/memory_hotplug.c. Move it next to unique caller is_mem_section_removable() and make it static. Remove prototype in <linux/memory_hotplug.h> to silence gcc warning (W=1): mm/page_alloc.c:7704:6: warning: no previous prototype for `is_pageblock_removable_nolock' [-Wmissing-prototypes] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mathieu Malaterre <[email protected]> Suggested-by: Michal Hocko <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-05-24	Revert "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE"	Joonsoo Kim	1	-0/+3
	This reverts the following commits that change CMA design in MM. 3d2054ad8c2d ("ARM: CMA: avoid double mapping to the CMA area if CONFIG_HIGHMEM=y") 1d47a3ec09b5 ("mm/cma: remove ALLOC_CMA") bad8c6c0b114 ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE") Ville reported a following error on i386. Inode-cache hash table entries: 65536 (order: 6, 262144 bytes) microcode: microcode updated early to revision 0x4, date = 2013-06-28 Initializing CPU#0 Initializing HighMem for node 0 (000377fe:00118000) Initializing Movable for node 0 (00000001:00118000) BUG: Bad page state in process swapper pfn:377fe page:f53effc0 count:0 mapcount:-127 mapping:00000000 index:0x0 flags: 0x80000000() raw: 80000000 00000000 00000000 ffffff80 00000000 00000100 00000200 00000001 page dumped because: nonzero mapcount Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0-rc5-elk+ #145 Hardware name: Dell Inc. Latitude E5410/03VXMC, BIOS A15 07/11/2013 Call Trace: dump_stack+0x60/0x96 bad_page+0x9a/0x100 free_pages_check_bad+0x3f/0x60 free_pcppages_bulk+0x29d/0x5b0 free_unref_page_commit+0x84/0xb0 free_unref_page+0x3e/0x70 __free_pages+0x1d/0x20 free_highmem_page+0x19/0x40 add_highpages_with_active_regions+0xab/0xeb set_highmem_pages_init+0x66/0x73 mem_init+0x1b/0x1d7 start_kernel+0x17a/0x363 i386_start_kernel+0x95/0x99 startup_32_smp+0x164/0x168 The reason for this error is that the span of MOVABLE_ZONE is extended to whole node span for future CMA initialization, and, normal memory is wrongly freed here. I submitted the fix and it seems to work, but, another problem happened. It's so late time to fix the later problem so I decide to reverting the series. Reported-by: Ville Syrjälä <[email protected]> Acked-by: Laura Abbott <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Andrew Morton <[email protected]> Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11	mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE	Joonsoo Kim	1	-3/+0
	Patch series "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE", v2. 0. History This patchset is the follow-up of the discussion about the "Introduce ZONE_CMA (v7)" [1]. Please reference it if more information is needed. 1. What does this patch do? This patch changes the management way for the memory of the CMA area in the MM subsystem. Currently the memory of the CMA area is managed by the zone where their pfn is belong to. However, this approach has some problems since MM subsystem doesn't have enough logic to handle the situation that different characteristic memories are in a single zone. To solve this issue, this patch try to manage all the memory of the CMA area by using the MOVABLE zone. In MM subsystem's point of view, characteristic of the memory on the MOVABLE zone and the memory of the CMA area are the same. So, managing the memory of the CMA area by using the MOVABLE zone will not have any problem. 2. Motivation There are some problems with current approach. See following. Although these problem would not be inherent and it could be fixed without this conception change, it requires many hooks addition in various code path and it would be intrusive to core MM and would be really error-prone. Therefore, I try to solve them with this new approach. Anyway, following is the problems of the current implementation. o CMA memory utilization First, following is the freepage calculation logic in MM. - For movable allocation: freepage = total freepage - For unmovable allocation: freepage = total freepage - CMA freepage Freepages on the CMA area is used after the normal freepages in the zone where the memory of the CMA area is belong to are exhausted. At that moment that the number of the normal freepages is zero, so - For movable allocation: freepage = total freepage = CMA freepage - For unmovable allocation: freepage = 0 If unmovable allocation comes at this moment, allocation request would fail to pass the watermark check and reclaim is started. After reclaim, there would exist the normal freepages so freepages on the CMA areas would not be used. FYI, there is another attempt [2] trying to solve this problem in lkml. And, as far as I know, Qualcomm also has out-of-tree solution for this problem. Useless reclaim: There is no logic to distinguish CMA pages in the reclaim path. Hence, CMA page is reclaimed even if the system just needs the page that can be usable for the kernel allocation. Atomic allocation failure: This is also related to the fallback allocation policy for the memory of the CMA area. Consider the situation that the number of the normal freepages is zero since the bunch of the movable allocation requests come. Kswapd would not be woken up due to following freepage calculation logic. - For movable allocation: freepage = total freepage = CMA freepage If atomic unmovable allocation request comes at this moment, it would fails due to following logic. - For unmovable allocation: freepage = total freepage - CMA freepage = 0 It was reported by Aneesh [3]. Useless compaction: Usual high-order allocation request is unmovable allocation request and it cannot be served from the memory of the CMA area. In compaction, migration scanner try to migrate the page in the CMA area and make high-order page there. As mentioned above, it cannot be usable for the unmovable allocation request so it's just waste. 3. Current approach and new approach Current approach is that the memory of the CMA area is managed by the zone where their pfn is belong to. However, these memory should be distinguishable since they have a strong limitation. So, they are marked as MIGRATE_CMA in pageblock flag and handled specially. However, as mentioned in section 2, the MM subsystem doesn't have enough logic to deal with this special pageblock so many problems raised. New approach is that the memory of the CMA area is managed by the MOVABLE zone. MM already have enough logic to deal with special zone like as HIGHMEM and MOVABLE zone. So, managing the memory of the CMA area by the MOVABLE zone just naturally work well because constraints for the memory of the CMA area that the memory should always be migratable is the same with the constraint for the MOVABLE zone. There is one side-effect for the usability of the memory of the CMA area. The use of MOVABLE zone is only allowed for a request with GFP_HIGHMEM && GFP_MOVABLE so now the memory of the CMA area is also only allowed for this gfp flag. Before this patchset, a request with GFP_MOVABLE can use them. IMO, It would not be a big issue since most of GFP_MOVABLE request also has GFP_HIGHMEM flag. For example, file cache page and anonymous page. However, file cache page for blockdev file is an exception. Request for it has no GFP_HIGHMEM flag. There is pros and cons on this exception. In my experience, blockdev file cache pages are one of the top reason that causes cma_alloc() to fail temporarily. So, we can get more guarantee of cma_alloc() success by discarding this case. Note that there is no change in admin POV since this patchset is just for internal implementation change in MM subsystem. Just one minor difference for admin is that the memory stat for CMA area will be printed in the MOVABLE zone. That's all. 4. Result Following is the experimental result related to utilization problem. 8 CPUs, 1024 MB, VIRTUAL MACHINE make -j16 <Before> CMA area: 0 MB 512 MB Elapsed-time: 92.4 186.5 pswpin: 82 18647 pswpout: 160 69839 <After> CMA : 0 MB 512 MB Elapsed-time: 93.1 93.4 pswpin: 84 46 pswpout: 183 92 akpm: "kernel test robot" reported a 26% improvement in vm-scalability.throughput: http://lkml.kernel.org/r/20180330012721.GA3845@yexl-desktop [1]: lkml.kernel.org/r/[email protected] [2]: https://lkml.org/lkml/2014/10/15/623 [3]: http://www.spinics.net/lists/linux-mm/msg100562.html Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Tested-by: Tony Lindgren <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Russell King <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05	mm: disable interrupts while initializing deferred pages	Pavel Tatashin	1	-25/+28
	Vlastimil Babka reported about a window issue during which when deferred pages are initialized, and the current version of on-demand initialization is finished, allocations may fail. While this is highly unlikely scenario, since this kind of allocation request must be large, and must come from interrupt handler, we still want to cover it. We solve this by initializing deferred pages with interrupts disabled, and holding node_size_lock spin lock while pages in the node are being initialized. The on-demand deferred page initialization that comes later will use the same lock, and thus synchronize with deferred_init_memmap(). It is unlikely for threads that initialize deferred pages to be interrupted. They run soon after smp_init(), but before modules are initialized, and long before user space programs. This is why there is no adverse effect of having these threads running with interrupts disabled. [[email protected]: v6] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Pavel Tatashin <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Steven Sistare <[email protected]> Cc: Daniel Jordan <[email protected]> Cc: Masayoshi Mizuma <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: AKASHI Takahiro <[email protected]> Cc: Gioh Kim <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Yaowei Bai <[email protected]> Cc: Wei Yang <[email protected]> Cc: Paul Burton <[email protected]> Cc: Miles Chen <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-08	mm: pass the vmem_altmap to memmap_init_zone	Christoph Hellwig	1	-1/+1
	Pass the vmem_altmap two levels down instead of needing a lookup. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2018-01-08	mm: pass the vmem_altmap to vmemmap_free	Christoph Hellwig	1	-1/+1
	We can just pass this on instead of having to do a radix tree lookup without proper locking a few levels into the callchain. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2018-01-08	mm: pass the vmem_altmap to arch_remove_memory and __remove_pages	Christoph Hellwig	1	-2/+3
	We can just pass this on instead of having to do a radix tree lookup without proper locking 2 levels into the callchain. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2018-01-08	mm: pass the vmem_altmap to vmemmap_populate	Christoph Hellwig	1	-1/+2
	We can just pass this on instead of having to do a radix tree lookup without proper locking a few levels into the callchain. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2018-01-08	mm: pass the vmem_altmap to arch_add_memory and __add_pages	Christoph Hellwig	1	-7/+10
	We can just pass this on instead of having to do a radix tree lookup without proper locking 2 levels into the callchain. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2017-11-02	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	Greg Kroah-Hartman	1	-0/+1
	Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a /uapi/ one with no licensing information in it, - file was a /uapi/ one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non /uapi/ files that summary was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a /uapi/ path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the /uapi/ ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------\|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <[email protected]> Reviewed-by: Philippe Ombredanne <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2017-09-08	mm/memory_hotplug: introduce add_pages	Michal Hocko	1	-0/+11
	There are new users of memory hotplug emerging. Some of them require different subset of arch_add_memory. There are some which only require allocation of struct pages without mapping those pages to the kernel address space. We currently have __add_pages for that purpose. But this is rather lowlevel and not very suitable for the code outside of the memory hotplug. E.g. x86_64 wants to update max_pfn which should be done by the caller. Introduce add_pages() which should care about those details if they are needed. Each architecture should define its implementation and select CONFIG_ARCH_HAS_ADD_PAGES. All others use the currently existing __add_pages. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Jérôme Glisse <[email protected]> Acked-by: Balbir Singh <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Nellans <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mark Hairgrove <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Sherry Cheung <[email protected]> Cc: Subhash Gutti <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06	mm, memory_hotplug: display allowed zones in the preferred ordering	Michal Hocko	1	-1/+1
	Prior to commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") we used to allow to change the valid zone types of a memory block if it is adjacent to a different zone type. This fact was reflected in memoryNN/valid_zones by the ordering of printed zones. The first one was default (echo online > memoryNN/state) and the other one could be onlined explicitly by online_{movable,kernel}. This behavior was removed by the said patch and as such the ordering was not all that important. In most cases a kernel zone would be default anyway. The only exception is movable_node handled by "mm, memory_hotplug: support movable_node for hotpluggable nodes". Let's reintroduce this behavior again because later patch will remove the zone overlap restriction and so user will be allowed to online kernel resp. movable block regardless of its placement. Original behavior will then become significant again because it would be non-trivial for users to see what is the default zone to online into. Implementation is really simple. Pull out zone selection out of move_pfn_range into zone_for_pfn_range helper and use it in show_valid_zones to display the zone for default onlining and then both kernel and movable if they are allowed. Default online zone is not duplicated. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Joonsoo Kim <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Kani Toshimitsu <[email protected]> Cc: <[email protected]> Cc: Daniel Kiper <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06	mm, memory_hotplug: move movable_node to the hotplug proper	Michal Hocko	1	-0/+10
	movable_node_is_enabled is defined in memblock proper while it is initialized from the memory hotplug proper. This is quite messy and it makes a dependency between the two so move movable_node along with the helper functions to memory_hotplug. To make it more entertaining the kernel parameter is ignored unless CONFIG_HAVE_MEMBLOCK_NODE_MAP=y because we do not have the node information for each memblock otherwise. So let's warn when the option is disabled. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Reza Arbab <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Kani Toshimitsu <[email protected]> Cc: Chen Yucong <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Andi Kleen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Daniel Kiper <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06	mm, memory_hotplug: remove unused cruft after memory hotplug rework	Michal Hocko	1	-2/+0
	zone_for_memory doesn't have any user anymore as well as the whole zone shifting infrastructure so drop them all. This shouldn't introduce any functional changes. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Dan Williams <[email protected]> Cc: Daniel Kiper <[email protected]> Cc: David Rientjes <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Tobias Regnery <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06	mm, memory_hotplug: replace for_device by want_memblock in arch_add_memory	Michal Hocko	1	-1/+1
	arch_add_memory gets for_device argument which then controls whether we want to create memblocks for created memory sections. Simplify the logic by telling whether we want memblocks directly rather than going through pointless negation. This also makes the api easier to understand because it is clear what we want rather than nothing telling for_device which can mean anything. This shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Tested-by: Dan Williams <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Daniel Kiper <[email protected]> Cc: David Rientjes <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Tobias Regnery <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06	mm, memory_hotplug: do not assume ZONE_NORMAL is default kernel zone	Michal Hocko	1	-0/+2
	Heiko Carstens has noticed that he can generate overlapping zones for ZONE_DMA and ZONE_NORMAL: DMA [mem 0x0000000000000000-0x000000007fffffff] Normal [mem 0x0000000080000000-0x000000017fffffff] $ cat /sys/devices/system/memory/block_size_bytes 10000000 $ cat /sys/devices/system/memory/memory5/valid_zones DMA $ echo 0 > /sys/devices/system/memory/memory5/online $ cat /sys/devices/system/memory/memory5/valid_zones Normal $ echo 1 > /sys/devices/system/memory/memory5/online Normal $ cat /proc/zoneinfo Node 0, zone DMA spanned 524288 <----- present 458752 managed 455078 start_pfn: 0 <----- Node 0, zone Normal spanned 720896 present 589824 managed 571648 start_pfn: 327680 <----- The reason is that we assume that the default zone for kernel onlining is ZONE_NORMAL. This was a simplification introduced by the memory hotplug rework and it is easily fixable by checking the range overlap in the zone order and considering the first matching zone as the default one. If there is no such zone then assume ZONE_NORMAL as we have been doing so far. Fixes: "mm, memory_hotplug: do not associate hotadded memory to zones until online" Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reported-by: Heiko Carstens <[email protected]> Tested-by: Heiko Carstens <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Dan Williams <[email protected]> Cc: Reza Arbab <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06	mm, memory_hotplug: do not associate hotadded memory to zones until online	Michal Hocko	1	-6/+7
	The current memory hotplug implementation relies on having all the struct pages associate with a zone/node during the physical hotplug phase (arch_add_memory->__add_pages->__add_section->__add_zone). In the vast majority of cases this means that they are added to ZONE_NORMAL. This has been so since 9d99aaa31f59 ("[PATCH] x86_64: Support memory hotadd without sparsemem") and it wasn't a big deal back then because movable onlining didn't exist yet. Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable onlining 511c2aba8f07 ("mm, memory-hotplug: dynamic configure movable memory and portion memory") and then things got more complicated. Rather than reconsidering the zone association which was no longer needed (because the memory hotplug already depended on SPARSEMEM) a convoluted semantic of zone shifting has been developed. Only the currently last memblock or the one adjacent to the zone_movable can be onlined movable. This essentially means that the online type changes as the new memblocks are added. Let's simulate memory hot online manually $ echo 0x100000000 > /sys/devices/system/memory/probe $ grep . /sys/devices/system/memory/memory32/valid_zones Normal Movable $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe $ grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal Movable $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe $ grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal /sys/devices/system/memory/memory34/valid_zones:Normal Movable $ echo online_movable > /sys/devices/system/memory/memory34/state $ grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Movable Normal This is an awkward semantic because an udev event is sent as soon as the block is onlined and an udev handler might want to online it based on some policy (e.g. association with a node) but it will inherently race with new blocks showing up. This patch changes the physical online phase to not associate pages with any zone at all. All the pages are just marked reserved and wait for the onlining phase to be associated with the zone as per the online request. There are only two requirements - existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap - ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses the latter one is not an inherent requirement and can be changed in the future. It preserves the current behavior and made the code slightly simpler. This is subject to change in future. This means that the same physical online steps as above will lead to the following state: Normal Movable /sys/devices/system/memory/memory32/valid_zones:Normal Movable /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory32/valid_zones:Normal Movable /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Normal Movable /sys/devices/system/memory/memory32/valid_zones:Normal Movable /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Movable Implementation: The current move_pfn_range is reimplemented to check the above requirements (allow_online_pfn_range) and then updates the respective zone (move_pfn_range_to_zone), the pgdat and links all the pages in the pfn range with the zone/node. __add_pages is updated to not require the zone and only initializes sections in the range. This allowed to simplify the arch_add_memory code (s390 could get rid of quite some of code). devm_memremap_pages is the only user of arch_add_memory which relies on the zone association because it only hooks into the memory hotplug only half way. It uses it to associate the new memory with ZONE_DEVICE but doesn't allow it to be {on,off}lined via sysfs. This means that this particular code path has to call move_pfn_range_to_zone explicitly. The original zone shifting code is kept in place and will be removed in the follow up patch for an easier review. Please note that this patch also changes the original behavior when offlining a memory block adjacent to another zone (Normal vs. Movable) used to allow to change its movable type. This will be handled later. [[email protected]: simplify zone_intersects()] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: remove duplicate call for set_page_links] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: remove unused local `i'] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Wei Yang <[email protected]> Tested-by: Dan Williams <[email protected]> Tested-by: Reza Arbab <[email protected]> Acked-by: Heiko Carstens <[email protected]> # For s390 bits Acked-by: Vlastimil Babka <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Daniel Kiper <[email protected]> Cc: David Rientjes <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Tobias Regnery <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06	mm: consider zone which is not fully populated to have holes	Michal Hocko	1	-0/+22
	__pageblock_pfn_to_page has two users currently, set_zone_contiguous which checks whether the given zone contains holes and pageblock_pfn_to_page which then carefully returns a first valid page from the given pfn range for the given zone. This doesn't handle zones which are not fully populated though. Memory pageblocks can be offlined or might not have been onlined yet. In such a case the zone should be considered to have holes otherwise pfn walkers can touch and play with offline pages. Current callers of pageblock_pfn_to_page in compaction seem to work properly right now because they only isolate PageBuddy (isolate_freepages_block) or PageLRU resp. __PageMovable (isolate_migratepages_block) which will be always false for these pages. It would be safer to skip these pages altogether, though. In order to do this patch adds a new memory section state (SECTION_IS_ONLINE) which is set in memory_present (during boot time) or in online_pages_range during the memory hotplug. Similarly offline_mem_sections clears the bit and it is called when the memory range is offlined. pfn_to_online_page helper is then added which check the mem section and only returns a page if it is onlined already. Use the new helper in __pageblock_pfn_to_page and skip the whole page block in such a case. [[email protected]: check valid section number in pfn_to_online_page (Vlastimil), mark sections online after all struct pages are initialized in online_pages_range (Vlastimil)] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Dan Williams <[email protected]> Cc: Daniel Kiper <[email protected]> Cc: David Rientjes <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Tobias Regnery <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06	mm, memory_hotplug: get rid of is_zone_device_section	Michal Hocko	1	-1/+1
	Device memory hotplug hooks into regular memory hotplug only half way. It needs memory sections to track struct pages but there is no need/desire to associate those sections with memory blocks and export them to the userspace via sysfs because they cannot be onlined anyway. This is currently expressed by for_device argument to arch_add_memory which then makes sure to associate the given memory range with ZONE_DEVICE. register_new_memory then relies on is_zone_device_section to distinguish special memory hotplug from the regular one. While this works now, later patches in this series want to move __add_zone outside of arch_add_memory path so we have to come up with something else. Add want_memblock down the __add_pages path and use it to control whether the section->memblock association should be done. arch_add_memory then just trivially want memblock for everything but for_device hotplug. remove_memory_section doesn't need is_zone_device_section either. We can simply skip all the memblock specific cleanup if there is no memblock for the given section. This shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Tested-by: Dan Williams <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Daniel Kiper <[email protected]> Cc: David Rientjes <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Tobias Regnery <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-03	base/memory, hotplug: fix a kernel oops in show_valid_zones()	Toshi Kani	1	-1/+2
	Reading a sysfs "memoryN/valid_zones" file leads to the following oops when the first page of a range is not backed by struct page. show_valid_zones() assumes that 'start_pfn' is always valid for page_zone(). BUG: unable to handle kernel paging request at ffffea017a000000 IP: show_valid_zones+0x6f/0x160 This issue may happen on x86-64 systems with 64GiB or more memory since their memory block size is bumped up to 2GiB. [1] An example of such systems is desribed below. 0x3240000000 is only aligned by 1GiB and this memory block starts from 0x3200000000, which is not backed by struct page. BIOS-e820: [mem 0x0000003240000000-0x000000603fffffff] usable Since test_pages_in_a_zone() already checks holes, fix this issue by extending this function to return 'valid_start' and 'valid_end' for a given range. show_valid_zones() then proceeds with the valid range. [1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64 systems")' Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Toshi Kani <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Zhang Zhen <[email protected]> Cc: Reza Arbab <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dan Williams <[email protected]> Cc: <[email protected]> [4.4+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24	memory_hotplug: make zone_can_shift() return a boolean value	Yasuaki Ishimatsu	1	-2/+2
	online_{kernel\|movable} is used to change the memory zone to ZONE_{NORMAL\|MOVABLE} and online the memory. To check that memory zone can be changed, zone_can_shift() is used. Currently the function returns minus integer value, plus integer value and 0. When the function returns minus or plus integer value, it means that the memory zone can be changed to ZONE_{NORNAL\|MOVABLE}. But when the function returns 0, there are two meanings. One of the meanings is that the memory zone does not need to be changed. For example, when memory is in ZONE_NORMAL and onlined by online_kernel the memory zone does not need to be changed. Another meaning is that the memory zone cannot be changed. When memory is in ZONE_NORMAL and onlined by online_movable, the memory zone may not be changed to ZONE_MOVALBE due to memory online limitation(see Documentation/memory-hotplug.txt). In this case, memory must not be onlined. The patch changes the return type of zone_can_shift() so that memory online operation fails when memory zone cannot be changed as follows: Before applying patch: # grep -A 35 "Node 2" /proc/zoneinfo Node 2, zone Normal <snip> node_scanned 0 spanned 8388608 present 7864320 managed 7864320 # echo online_movable > memory4097/state # grep -A 35 "Node 2" /proc/zoneinfo Node 2, zone Normal <snip> node_scanned 0 spanned 8388608 present 8388608 managed 8388608 online_movable operation succeeded. But memory is onlined as ZONE_NORMAL, not ZONE_MOVABLE. After applying patch: # grep -A 35 "Node 2" /proc/zoneinfo Node 2, zone Normal <snip> node_scanned 0 spanned 8388608 present 7864320 managed 7864320 # echo online_movable > memory4097/state bash: echo: write error: Invalid argument # grep -A 35 "Node 2" /proc/zoneinfo Node 2, zone Normal <snip> node_scanned 0 spanned 8388608 present 7864320 managed 7864320 online_movable operation failed because of failure of changing the memory zone from ZONE_NORMAL to ZONE_MOVABLE Fixes: df429ac03936 ("memory-hotplug: more general validation of zone during online") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yasuaki Ishimatsu <[email protected]> Reviewed-by: Reza Arbab <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-07-26	memory-hotplug: more general validation of zone during online	Reza Arbab	1	-0/+2
	When memory is onlined, we are only able to rezone from ZONE_MOVABLE to ZONE_KERNEL, or from (ZONE_MOVABLE - 1) to ZONE_MOVABLE. To be more flexible, use the following criteria instead; to online memory from zone X into zone Y, * Any zones between X and Y must be unused. * If X is lower than Y, the onlined memory must lie at the end of X. * If X is higher than Y, the onlined memory must lie at the start of X. Add zone_can_shift() to make this determination. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Reza Arbab <[email protected]> Reviewd-by: Yasuaki Ishimatsu <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Daniel Kiper <[email protected]> Cc: Dan Williams <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Tang Chen <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: David Vrabel <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: David Rientjes <[email protected]> Cc: Andrew Banman <[email protected]> Cc: Chen Yucong <[email protected]> Cc: Yasunori Goto <[email protected]> Cc: Zhang Zhen <[email protected]> Cc: Shaohua Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-05-27	mm: fix section mismatch warning	Linus Torvalds	1	-1/+1
	The register_page_bootmem_info_node() function needs to be marked __init in order to avoid a new warning introduced by commit f65e91df25aa ("mm: use early_pfn_to_nid in register_page_bootmem_info_node"). Otherwise you'll get a warning about how a non-init function calls early_pfn_to_nid (which is __meminit) Cc: Yang Shi <[email protected]> Cc: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>