aboutsummaryrefslogtreecommitdiff
path: root/mm/memory_hotplug.c
AgeCommit message (Collapse)AuthorFilesLines
2012-12-12hotplug: update nodemasks managementLai Jiangshan1-15/+72
Update nodemasks management for N_MEMORY. [[email protected]: fix build] Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Lin Feng <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-11memory_hotplug: ensure every online node has NORMAL memoryLai Jiangshan1-0/+40
Old memory hotplug code and new online/movable may cause a online node don't have any normal memory, but memory-management acts bad when we have nodes which is online but don't have any normal memory. Example: it may cause a bound task fail on all kernel allocation and cause the task can't create task or create other kernel object. So we disable non-normal-memory-node here, we will enable it when we prepared. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Jiang Liu <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Greg KH <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-11memory_hotplug: handle empty zone when online_movable/online_kernelLai Jiangshan1-6/+45
Make online_movable/online_kernel can empty a zone or can move memory to a empty zone. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Jiang Liu <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Greg KH <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-11mm, memory-hotplug: dynamic configure movable memory and portion memoryLai Jiangshan1-1/+99
Add online_movable and online_kernel for logic memory hotplug. This is the dynamic version of "movablecore" & "kernelcore". We have the same reason to introduce it as to introduce "movablecore" & "kernelcore". It has the same motive as "movablecore" & "kernelcore", but it is dynamic/running-time: o We can configure memory as kernelcore or movablecore after boot. Userspace workload is increased, we need more hugepage, we can't use "online_movable" to add memory and allow the system use more THP(transparent-huge-page), vice-verse when kernel workload is increase. Also help for virtualization to dynamic configure host/guest's memory, to save/(reduce waste) memory. Memory capacity on Demand o When a new node is physically online after boot, we need to use "online_movable" or "online_kernel" to configure/portion it as we expected when we logic-online it. This configuration also helps for physically-memory-migrate. o all benefit as the same as existed "movablecore" & "kernelcore". o Preparing for movable-node, which is very important for power-saving, hardware partitioning and high-available-system(hardware fault management). (Note, we don't introduce movable-node here.) Action behavior: When a memoryblock/memorysection is onlined by "online_movable", the kernel will not have directly reference to the page of the memoryblock, thus we can remove that memory any time when needed. When it is online by "online_kernel", the kernel can use it. When it is online by "online", the zone type doesn't changed. Current constraints: Only the memoryblock which is adjacent to the ZONE_MOVABLE can be online from ZONE_NORMAL to ZONE_MOVABLE. [[email protected]: use min_t, cleanups] Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Jiang Liu <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Greg KH <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-11mm/memory_hotplug.c: update start_pfn in zone and pg_data when spanned_pages ↵Tang Chen1-2/+2
== 0. If we hot-remove memory only and leave the cpus alive, the corresponding node will not be removed. But the node_start_pfn and node_spanned_pages in pg_data will be reset to 0. In this case, when we hot-add the memory back next time, the node_start_pfn will always be 0 because no pfn is less than 0. After that, if we hot-remove the memory again, it will cause kernel panic in function find_biggest_section_pfn() when it tries to scan all the pfns. The zone will also have the same problem. This patch sets start_pfn to the start_pfn of the section being added when spanned_pages of the zone or pg_data is 0. ---How to reproduce--- 1. hot-add a container with some memory and cpus; 2. hot-remove the container's memory, and leave cpus there; 3. hot-add these memory again; 4. hot-remove them again; then, the kernel will panic. ---Call trace--- BUG: unable to handle kernel paging request at 00000fff82a8cc38 IP: [<ffffffff811c0d55>] find_biggest_section_pfn+0xe5/0x180 ...... Call Trace: [<ffffffff811c1124>] __remove_zone+0x184/0x1b0 [<ffffffff811c11dc>] __remove_section+0x8c/0xb0 [<ffffffff811c12e7>] __remove_pages+0xe7/0x120 [<ffffffff81654f7c>] arch_remove_memory+0x2c/0x80 [<ffffffff81655bb6>] remove_memory+0x56/0x90 [<ffffffff813da0c8>] acpi_memory_device_remove_memory+0x48/0x73 [<ffffffff813da55a>] acpi_memory_device_notify+0x153/0x274 [<ffffffff813b6786>] acpi_ev_notify_dispatch+0x41/0x5f [<ffffffff813a3867>] acpi_os_execute_deferred+0x27/0x34 [<ffffffff81090589>] process_one_work+0x219/0x680 [<ffffffff810923be>] worker_thread+0x12e/0x320 [<ffffffff81098396>] kthread+0xc6/0xd0 [<ffffffff8167c7c4>] kernel_thread_helper+0x4/0x10 ...... ---[ end trace 96d845dbf33fee11 ]--- Signed-off-by: Tang Chen <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Wen Congyang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-11memory_hotplug: fix possible incorrect node_states[N_NORMAL_MEMORY]Lai Jiangshan1-16/+120
Currently memory_hotplug only manages the node_states[N_HIGH_MEMORY], it forgets to manage node_states[N_NORMAL_MEMORY]. This may cause node_states[N_NORMAL_MEMORY] to become incorrect. Example, if a node is empty before online, and we online a memory which is in ZONE_NORMAL. And after online, node_states[N_HIGH_MEMORY] is correct, but node_states[N_NORMAL_MEMORY] is incorrect, the online code doesn't set the new online node to node_states[N_NORMAL_MEMORY]. The same thing will happen when offlining (the offline code doesn't clear the node from node_states[N_NORMAL_MEMORY] when needed). Some memory managment code depends node_states[N_NORMAL_MEMORY], so we have to fix up the node_states[N_NORMAL_MEMORY]. We add node_states_check_changes_online() and node_states_check_changes_offline() to detect whether node_states[N_HIGH_MEMORY] and node_states[N_NORMAL_MEMORY] are changed while hotpluging. Also add @status_change_nid_normal to struct memory_notify, thus the memory hotplug callbacks know whether the node_states[N_NORMAL_MEMORY] are changed. (We can add a @flags and reuse @status_change_nid instead of introducing @status_change_nid_normal, but it will add much more complexity in memory hotplug callback in every subsystem. So introducing @status_change_nid_normal is better and it doesn't change the sematics of @status_change_nid) Signed-off-by: Lai Jiangshan <[email protected]> Cc: David Rientjes <[email protected]> Cc: Minchan Kim <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Rob Landley <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Kay Sievers <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Wen Congyang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-11memory-hotplug: allocate zone's pcp before onlining pagesWen Congyang1-2/+6
We use __free_page() to put a page to buddy system when onlining pages. __free_page() will store NR_FREE_PAGES in zone's pcp.vm_stat_diff, so we should allocate zone's pcp before onlining pages, otherwise we will lose some free pages. [[email protected]: make zone_pcp_reset independent of MEMORY_HOTREMOVE] Signed-off-by: Wen Congyang <[email protected]> Cc: David Rientjes <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Len Brown <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Minchan Kim <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-11memory-hotplug: skip HWPoisoned page when offlining pagesWen Congyang1-2/+3
hwpoisoned may be set when we offline a page by the sysfs interface /sys/devices/system/memory/soft_offline_page or /sys/devices/system/memory/hard_offline_page. If we don't clear this flag when onlining pages, this page can't be freed, and will not in free list. So we can't offline these pages again. So we should skip such page when offlining pages. Signed-off-by: Wen Congyang <[email protected]> Cc: David Rientjes <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Len Brown <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Minchan Kim <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-11mm: migrate: Add a tracepoint for migrate_pagesMel Gorman1-1/+2
The pgmigrate_success and pgmigrate_fail vmstat counters tells the user about migration activity but not the type or the reason. This patch adds a tracepoint to identify the type of page migration and why the page is being migrated. Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]>
2012-11-19various: Fix spelling of "asynchronous" in comments.Adam Buchbinder1-3/+3
"Asynchronous" is misspelled in some comments. No code changes. Signed-off-by: Adam Buchbinder <[email protected]> Signed-off-by: Jiri Kosina <[email protected]>
2012-11-16revert "mm: fix-up zone present pages"Andrew Morton1-7/+0
Revert commit 7f1290f2f2a4 ("mm: fix-up zone present pages") That patch tried to fix a issue when calculating zone->present_pages, but it caused a regression on 32bit systems with HIGHMEM. With that change, reset_zone_present_pages() resets all zone->present_pages to zero, and fixup_zone_present_pages() is called to recalculate zone->present_pages when the boot allocator frees core memory pages into buddy allocator. Because highmem pages are not freed by bootmem allocator, all highmem zones' present_pages becomes zero. Various options for improving the situation are being discussed but for now, let's return to the 3.6 code. Cc: Jianguo Wu <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Petr Tesarik <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Acked-by: David Rientjes <[email protected]> Tested-by: Chris Clayton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09memory-hotplug: suppress "Trying to free nonexistent resource ↵Yasuaki Ishimatsu1-2/+2
<XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>" warning When our x86 box calls __remove_pages(), release_mem_region() shows many warnings. And x86 box cannot unregister iomem_resource. "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>" release_mem_region() has been changed to be called in each PAGES_PER_SECTION by commit de7f0cba9678 ("memory hotplug: release memory regions in PAGES_PER_SECTION chunks"). Because powerpc registers iomem_resource in each PAGES_PER_SECTION chunk. But when I hot add memory on x86 box, iomem_resource is register in each _CRS not PAGES_PER_SECTION chunk. So x86 box unregisters iomem_resource. The patch fixes the problem. Signed-off-by: Yasuaki Ishimatsu <[email protected]> Cc: David Rientjes <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Len Brown <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Andrew Morton <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Nathan Fontenot <[email protected]> Cc: Badari Pulavarty <[email protected]> Cc: Yasunori Goto <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09memory-hotplug: update memory block's state and notify userspaceWen Congyang1-1/+32
remove_memory() will be called when hot removing a memory device. But even if offlining memory, we cannot notice it. So the patch updates the memory block's state and sends notification to userspace. Additionally, the memory device may contain more than one memory block. If the memory block has been offlined, __offline_pages() will fail. So we should try to offline one memory block at a time. Thus remove_memory() also check each memory block's state. So there is no need to check the memory block's state before calling remove_memory(). Signed-off-by: Wen Congyang <[email protected]> Signed-off-by: Yasuaki Ishimatsu <[email protected]> Cc: David Rientjes <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Len Brown <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Minchan Kim <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09memory-hotplug: preparation to notify memory block's state at memory hot removeWen Congyang1-2/+11
remove_memory() is called in two cases: 1. echo offline >/sys/devices/system/memory/memoryXX/state 2. hot remove a memory device In the 1st case, the memory block's state is changed and the notification that memory block's state changed is sent to userland after calling remove_memory(). So user can notice memory block is changed. But in the 2nd case, the memory block's state is not changed and the notification is not also sent to userspcae even if calling remove_memory(). So user cannot notice memory block is changed. For adding the notification at memory hot remove, the patch just prepare as follows: 1st case uses offline_pages() for offlining memory. 2nd case uses remove_memory() for offlining memory and changing memory block's state and notifing the information. The patch does not implement notification to remove_memory(). Signed-off-by: Wen Congyang <[email protected]> Signed-off-by: Yasuaki Ishimatsu <[email protected]> Cc: David Rientjes <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Len Brown <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Minchan Kim <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09mm: fix-up zone present pagesJianguo Wu1-0/+7
I think zone->present_pages indicates pages that buddy system can management, it should be: zone->present_pages = spanned pages - absent pages - bootmem pages, but is now: zone->present_pages = spanned pages - absent pages - memmap pages. spanned pages: total size, including holes. absent pages: holes. bootmem pages: pages used in system boot, managed by bootmem allocator. memmap pages: pages used by page structs. This may cause zone->present_pages less than it should be. For example, numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other bootmem will be allocated from ZONE_MOVABLE, so ZONE_NORMAL's present_pages should be spanned pages - absent pages, but now it also minus memmap pages(free_area_init_core), which are actually allocated from ZONE_MOVABLE. When offlining all memory of a zone, this will cause zone->present_pages less than 0, because present_pages is unsigned long type, it is actually a very large integer, it indirectly caused zone->watermark[WMARK_MIN] becomes a large integer(setup_per_zone_wmarks()), than cause totalreserve_pages become a large integer(calculate_totalreserve_pages()), and finally cause memory allocating failure when fork process(__vm_enough_memory()). [root@localhost ~]# dmesg -bash: fork: Cannot allocate memory I think the bug described in http://marc.info/?l=linux-mm&m=134502182714186&w=2 is also caused by wrong zone present pages. This patch intends to fix-up zone->present_pages when memory are freed to buddy system on x86_64 and IA64 platforms. Signed-off-by: Jianguo Wu <[email protected]> Signed-off-by: Jiang Liu <[email protected]> Reported-by: Petr Tesarik <[email protected]> Tested-by: Petr Tesarik <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09memory-hotplug: don't replace lowmem pages with highmemMinchan Kim1-9/+6
The changelog for commit 6a6dccba2fdc ("mm: cma: don't replace lowmem pages with highmem") mentioned that lowmem pages can be replaced by highmem pages during CMA migration. 6a6dccba2fdc fixed that issue. Quote from that changelog: : The filesystem layer expects pages in the block device's mapping to not : be in highmem (the mapping's gfp mask is set in bdget()), but CMA can : currently replace lowmem pages with highmem pages, leading to crashes in : filesystem code such as the one below: : : Unable to handle kernel NULL pointer dereference at virtual address 00000400 : pgd = c0c98000 : [00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000 : Internal error: Oops: 817 [#1] PREEMPT SMP ARM : CPU: 0 Not tainted (3.5.0-rc5+ #80) : PC is at __memzero+0x24/0x80 : ... : Process fsstress (pid: 323, stack limit = 0xc0cbc2f0) : Backtrace: : [<c010e3f0>] (ext4_getblk+0x0/0x180) from [<c010e58c>] (ext4_bread+0x1c/0x98) : [<c010e570>] (ext4_bread+0x0/0x98) from [<c0117944>] (ext4_mkdir+0x160/0x3bc) : r4:c15337f0 : [<c01177e4>] (ext4_mkdir+0x0/0x3bc) from [<c00c29e0>] (vfs_mkdir+0x8c/0x98) : [<c00c2954>] (vfs_mkdir+0x0/0x98) from [<c00c2a60>] (sys_mkdirat+0x74/0xac) : r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0 : [<c00c29ec>] (sys_mkdirat+0x0/0xac) from [<c00c2ab8>] (sys_mkdir+0x20/0x24) : r6:beccdcf0 r5:00074000 r4:beccdbbc : [<c00c2a98>] (sys_mkdir+0x0/0x24) from [<c000e3c0>] (ret_fast_syscall+0x0/0x30) Memory-hotplug has same problem as CMA has so the same fix can be applied to memory-hotplug as well. Fix it by reusing. Signed-off-by: Minchan Kim <[email protected]> Cc: Kamezawa Hiroyuki <[email protected]> Reviewed-by: Yasuaki Ishimatsu <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Wen Congyang <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09memory-hotplug: build zonelists when offlining pagesXishi Qiu1-1/+6
online_pages() does build_all_zonelists() and zone_pcp_update(), I think offline_pages() should do it too. When the zone has no memory to allocate, remove it from other nodes' zonelists. zone_batchsize() depends on zone's present pages, if zone's present pages are changed, zone's pcp should be updated. Signed-off-by: Xishi Qiu <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-09-17memory hotplug: fix section info double registration bugqiuxishi1-6/+10
There may be a bug when registering section info. For example, on my Itanium platform, the pfn range of node0 includes the other nodes, so other nodes' section info will be double registered, and memmap's page count will equal to 3. node0: start_pfn=0x100, spanned_pfn=0x20fb00, present_pfn=0x7f8a3, => 0x000100-0x20fc00 node1: start_pfn=0x80000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x080000-0x100000 node2: start_pfn=0x100000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x100000-0x180000 node3: start_pfn=0x180000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x180000-0x200000 free_all_bootmem_node() register_page_bootmem_info_node() register_page_bootmem_info_section() When hot remove memory, we can't free the memmap's page because page_count() is 2 after put_page_bootmem(). sparse_remove_one_section() free_section_usemap() free_map_bootmem() put_page_bootmem() [[email protected]: add code comment] Signed-off-by: Xishi Qiu <[email protected]> Signed-off-by: Jiang Liu <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31mm/hotplug: free zone->pageset when a zone becomes emptyJiang Liu1-0/+3
When a zone becomes empty after memory offlining, free zone->pageset. Otherwise it will cause memory leak when adding memory to the empty zone again because build_all_zonelists() will allocate zone->pageset for an empty zone. Signed-off-by: Jiang Liu <[email protected]> Signed-off-by: Wei Wang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: David Rientjes <[email protected]> Cc: Keping Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31mm/hotplug: correctly add new zone to all other nodes' zone listsJiang Liu1-7/+8
When online_pages() is called to add new memory to an empty zone, it rebuilds all zone lists by calling build_all_zonelists(). But there's a bug which prevents the new zone to be added to other nodes' zone lists. online_pages() { build_all_zonelists() ..... node_set_state(zone_to_nid(zone), N_HIGH_MEMORY) } Here the node of the zone is put into N_HIGH_MEMORY state after calling build_all_zonelists(), but build_all_zonelists() only adds zones from nodes in N_HIGH_MEMORY state to the fallback zone lists. build_all_zonelists() ->__build_all_zonelists() ->build_zonelists() ->find_next_best_node() ->for_each_node_state(n, N_HIGH_MEMORY) So memory in the new zone will never be used by other nodes, and it may cause strange behavor when system is under memory pressure. So put node into N_HIGH_MEMORY state before calling build_all_zonelists(). Signed-off-by: Jianguo Wu <[email protected]> Signed-off-by: Jiang Liu <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: David Rientjes <[email protected]> Cc: Keping Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31mm/hotplug: correctly setup fallback zonelists when creating new pgdatJiang Liu1-2/+2
When hotadd_new_pgdat() is called to create new pgdat for a new node, a fallback zonelist should be created for the new node. There's code to try to achieve that in hotadd_new_pgdat() as below: /* * The node we allocated has no zone fallback lists. For avoiding * to access not-initialized zonelist, build here. */ mutex_lock(&zonelists_mutex); build_all_zonelists(pgdat, NULL); mutex_unlock(&zonelists_mutex); But it doesn't work as expected. When hotadd_new_pgdat() is called, the new node is still in offline state because node_set_online(nid) hasn't been called yet. And build_all_zonelists() only builds zonelists for online nodes as: for_each_online_node(nid) { pg_data_t *pgdat = NODE_DATA(nid); build_zonelists(pgdat); build_zonelist_cache(pgdat); } Though we hope to create zonelist for the new pgdat, but it doesn't. So add a new parameter "pgdat" the build_all_zonelists() to build pgdat for the new pgdat too. Signed-off-by: Jiang Liu <[email protected]> Signed-off-by: Xishi Qiu <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: David Rientjes <[email protected]> Cc: Keping Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-11mm/memory_hotplug.c: release memory resources if hotadd_new_pgdat() failsWen Congyang1-1/+1
We should goto error to release memory resource if hotadd_new_pgdat() failed. Signed-off-by: Wen Congyang <[email protected]> Cc: Yasuaki ISIMATU <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Len Brown <[email protected]> Cc: "Brown, Len" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-05-29mm: print physical addresses consistently with other parts of kernelBjorn Helgaas1-6/+8
Print physical address info in a style consistent with the %pR style used elsewhere in the kernel. For example: -Zone PFN ranges: +Zone ranges: - DMA32 0x00000010 -> 0x00100000 + DMA32 [mem 0x00010000-0xffffffff] - Normal 0x00100000 -> 0x01080000 + Normal [mem 0x100000000-0x107fffffff] Signed-off-by: Bjorn Helgaas <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-05-21mm: page_isolation: MIGRATE_CMA isolation functions addedMichal Nazarewicz1-3/+3
This commit changes various functions that change pages and pageblocks migrate type between MIGRATE_ISOLATE and MIGRATE_MOVABLE in such a way as to allow to work with MIGRATE_CMA migrate type. Signed-off-by: Michal Nazarewicz <[email protected]> Signed-off-by: Marek Szyprowski <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Tested-by: Rob Clark <[email protected]> Tested-by: Ohad Ben-Cohen <[email protected]> Tested-by: Benjamin Gaignard <[email protected]> Tested-by: Robert Nelson <[email protected]> Tested-by: Barry Song <[email protected]>
2012-01-12mm: compaction: introduce sync-light migration for use by compactionMel Gorman1-1/+1
This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT mode that avoids writing back pages to backing storage. Async compaction maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT. For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is used. This avoids sync compaction stalling for an excessive length of time, particularly when copying files to a USB stick where there might be a large number of dirty pages backed by a filesystem that does not support ->writepages. [[email protected]: This patch is heavily based on Andrea's work] [[email protected]: fix fs/nfs/write.c build] [[email protected]: fix fs/btrfs/disk-io.c build] Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Dave Jones <[email protected]> Cc: Jan Kara <[email protected]> Cc: Andy Isaacson <[email protected]> Cc: Nai Xia <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-10-31mm: Map most files to use export.h instead of module.hPaul Gortmaker1-1/+1
The files changed within are only using the EXPORT_SYMBOL macro variants. They are not using core modular infrastructure and hence don't need module.h but only the export.h header. Signed-off-by: Paul Gortmaker <[email protected]>
2011-07-25mm: extend memory hotplug API to allow memory hotplug in virtual machinesDaniel Kiper1-3/+65
This patch contains online_page_callback and apropriate functions for registering/unregistering online page callbacks. It allows to do some machine specific tasks during online page stage which is required to implement memory hotplug in virtual machines. Currently this patch is required by latest memory hotplug support for Xen balloon driver patch which will be posted soon. Additionally, originial online_page() function was splited into following functions doing "atomic" operations: - __online_page_set_limits() - set new limits for memory management code, - __online_page_increment_counters() - increment totalram_pages and totalhigh_pages, - __online_page_free() - free page to allocator. It was done to: - not duplicate existing code, - ease hotplug code devolpment by usage of well defined interface, - avoid stupid bugs which are unavoidable when the same code (by design) is developed in many places. [[email protected]: use explicit indirect-call syntax] Signed-off-by: Daniel Kiper <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]> Cc: Ian Campbell <[email protected]> Cc: Jeremy Fitzhardinge <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-06-22mm, hotplug: protect zonelist building with zonelists_mutexDavid Rientjes1-0/+2
Commit 959ecc48fc75 ("mm/memory_hotplug.c: fix building of node hotplug zonelist") does not protect the build_all_zonelists() call with zonelists_mutex as needed. This can lead to races in constructing zonelist ordering if a concurrent build is underway. Protecting this with lock_memory_hotplug() is insufficient since zonelists can be rebuild though sysfs as well. Signed-off-by: David Rientjes <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-06-22mm, hotplug: fix error handling in mem_online_node()David Rientjes1-1/+1
The error handling in mem_online_node() is incorrect: hotadd_new_pgdat() returns NULL if the new pgdat could not have been allocated and a pointer to it otherwise. mem_online_node() should fail if hotadd_new_pgdat() fails, not the inverse. This fixes an issue when memoryless nodes are not onlined and their sysfs interface is not registered when their first cpu is brought up. The bug was introduced by commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before local memory online") iow v2.6.35. Signed-off-by: David Rientjes <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Cc: [email protected] Signed-off-by: Linus Torvalds <[email protected]>
2011-06-15mm/memory_hotplug.c: fix building of node hotplug zonelistKAMEZAWA Hiroyuki1-0/+6
During memory hotplug we refresh zonelists when we online a page in a new zone. It means that the node's zonelist is not initialized until pages are onlined. So for example, "nid" passed by MEM_GOING_ONLINE notifier will point to NODE_DATA(nid) which has no zone fallback list. Moreover, if we hot-add cpu-only nodes, alloc_pages() will do no fallback. This patch makes a zonelist when a new pgdata is available. Note: in production, at fujitsu, memory should be onlined before cpu and our server didn't have any memory-less nodes and had no problems. But recent changes in MEM_GOING_ONLINE+page_cgroup will access not initialized zonelist of node. Anyway, there are memory-less node and we need some care. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm: remove dependency on CONFIG_FLATMEM from online_page()Daniel Kiper1-4/+0
online_pages() is only compiled for CONFIG_MEMORY_HOTPLUG_SPARSE, so there is no need to support CONFIG_FLATMEM code within it. This patch removes code that is never used. Signed-off-by: Daniel Kiper <[email protected]> Acked-by: Dave Hansen <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mem-hotplug: call isolate_lru_page with elevated refcountKonstantin Khlebnikov1-1/+3
isolate_lru_page() must be called only with stable reference to page. So, let's grab normal page reference. Signed-off-by: Konstantin Khlebnikov <[email protected]> Cc: Andi Kleen <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm, mem-hotplug: recalculate lowmem_reserve when memory hotplug occursKOSAKI Motohiro1-4/+5
Currently, memory hotplug calls setup_per_zone_wmarks() and calculate_zone_inactive_ratio(), but doesn't call setup_per_zone_lowmem_reserve(). It means the number of reserved pages aren't updated even if memory hot plug occur. This patch fixes it. Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Mel Gorman <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm, mem-hotplug: fix section mismatch. setup_per_zone_inactive_ratio() ↵KOSAKI Motohiro1-2/+2
should be __meminit. Commit bce7394a3e ("page-allocator: reset wmark_min and inactive ratio of zone when hotplug happens") introduced invalid section references. Now, setup_per_zone_inactive_ratio() is marked __init and then it can't be referenced from memory hotplug code. This patch marks it as __meminit and also marks caller as __ref. Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: Yasunori Goto <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-04-14mm: optimize pfn calculation in online_page()Daniel Kiper1-1/+1
If CONFIG_FLATMEM is enabled pfn is calculated in online_page() more than once. It is possible to optimize that and use value established at beginning of that function. Signed-off-by: Daniel Kiper <[email protected]> Acked-by: Dave Hansen <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Wu Fengguang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Christoph Lameter <[email protected]> Acked-by: David Rientjes <[email protected]> Reviewed-by: Jesper Juhl <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-03-31Fix common misspellingsLucas De Marchi1-1/+1
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <[email protected]>
2011-01-15Merge branch 'slub/hotplug' into slab/urgentPekka Enberg1-0/+4
2011-01-13thp: remove PG_buddyAndrea Arcangeli1-6/+8
PG_buddy can be converted to _mapcount == -2. So the PG_compound_lock can be added to page->flags without overflowing (because of the sparse section bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also has to move the memory hotplug code from _mapcount to lru.next to avoid any risk of clashes. We can't use lru.next for PG_buddy removal, but memory hotplug can use lru.next even more easily than the mapcount instead. Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: migration: cleanup migrate_pages API by matching types for offlining and ↵Mel Gorman1-1/+1
sync With the introduction of the boolean sync parameter, the API looks a little inconsistent as offlining is still an int. Convert offlining to a bool for the sake of being tidy. Signed-off-by: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Rik van Riel <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Andy Whitcroft <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: migration: allow migration to operate asynchronously and avoid ↵Mel Gorman1-1/+2
synchronous compaction in the faster path Migration synchronously waits for writeback if the initial passes fails. Callers of memory compaction do not necessarily want this behaviour if the caller is latency sensitive or expects that synchronous migration is not going to have a significantly better success rate. This patch adds a sync parameter to migrate_pages() allowing the caller to indicate if wait_on_page_writeback() is allowed within migration or not. For reclaim/compaction, try_to_compact_pages() is first called asynchronously, direct reclaim runs and then try_to_compact_pages() is called synchronously as there is a greater expectation that it'll succeed. [[email protected]: build/merge fix] Signed-off-by: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Rik van Riel <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Andy Whitcroft <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-11memory hotplug: one more lock on memory hotplugKAMEZAWA Hiroyuki1-0/+4
Now, memory_hotplug_(un)lock() is used for add/remove/offline pages for avoiding races with hibernation. But this should be held in online_pages(), too. It seems asymmetric. There are cases where one has to avoid a race with memory hotplug notifier and his own local code, and hotplug v.s. hotplug. This will add a generic solution for avoiding races. In other view, having lock here has no big impacts. online pages is tend to be done by udev script at el against each memory section one by one. Then, it's better to have lock here, too. Cc: <[email protected]> # 2.6.37 Reviewed-by: Christoph Lameter <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Pekka Enberg <[email protected]>
2010-12-02mem-hotplug: introduce {un}lock_memory_hotplug()KOSAKI Motohiro1-7/+24
Presently hwpoison is using lock_system_sleep() to prevent a race with memory hotplug. However lock_system_sleep() is a no-op if CONFIG_HIBERNATION=n. Therefore we need a new lock. Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Kamezawa Hiroyuki <[email protected]> Suggested-by: Hugh Dickins <[email protected]> Acked-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26mm: do_migrate_range: reduce list_empty() checkBob Liu1-12/+9
Simple code for reducing list_empty(&source) check. Signed-off-by: Bob Liu <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Wu Fengguang <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26mm: do_migrate_range: exit loop if not_managed is trueBob Liu1-4/+6
If not_managed is true all pages will be putback to lru, so break the loop earlier to skip other pages isolate. Signed-off-by: Bob Liu <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Wu Fengguang <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26mm/memory_hotplug.c: make scan_lru_pages() staticAndrew Morton1-1/+1
Reported-by: KOSAKI Motohiro <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26memory hotplug: unify is_removable and offline detection codeKAMEZAWA Hiroyuki1-15/+2
Now, sysfs interface of memory hotplug shows whether the section is removable or not. But it checks only migrateype of pages and doesn't check details of cluster of pages. Next, memory hotplug's set_migratetype_isolate() has the same kind of check, too. This patch adds the function __count_unmovable_pages() and makes above 2 checks to use the same logic. Then, is_removable and hotremove code uses the same logic. No changes in the hotremove logic itself. TODO: need to find a way to check RECLAMABLE. But, considering bit, calling shrink_slab() against a range before starting memory hotremove sounds better. If so, this patch's logic doesn't need to be changed. [[email protected]: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Reported-by: Michal Hocko <[email protected]> Cc: Wu Fengguang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26mm: compaction: fix COMPACTPAGEFAILED countingMinchan Kim1-0/+2
Presently update_nr_listpages() doesn't have a role. That's because lists passed is always empty just after calling migrate_pages. The migrate_pages cleans up page list which have failed to migrate before returning by aaa994b3. [PATCH] page migration: handle freeing of pages in migrate_pages() Do not leave pages on the lists passed to migrate_pages(). Seems that we will not need any postprocessing of pages. This will simplify the handling of pages by the callers of migrate_pages(). At that time, we thought we don't need any postprocessing of pages. But the situation is changed. The compaction need to know the number of failed to migrate for COMPACTPAGEFAILED stat This patch makes new rule for caller of migrate_pages to call putback_lru_pages. So caller need to clean up the lists so it has a chance to postprocess the pages. [suggested by Christoph Lameter] Signed-off-by: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andi Kleen <[email protected]> Reviewed-by: Mel Gorman <[email protected]> Reviewed-by: Wu Fengguang <[email protected]> Acked-by: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26mm: fix return value of scan_lru_pages in memory unplugKAMEZAWA Hiroyuki1-1/+1
scan_lru_pages returns pfn. So, it's type should be "unsigned long" not "int". Note: I guess this has been work until now because memory hotplug tester's machine has not very big memory.... physical address < 32bit << PAGE_SHIFT. Reported-by: KOSAKI Motohiro <[email protected]> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-19memory_hotplug: drop spurious calls to flush_scheduled_work()Tejun Heo1-2/+0
lru_add_drain_all() uses schedule_on_each_cpu() which is synchronous. There is no reason to call flush_scheduled_work() after lru_add_drain_all(). Drop the spurious calls. This is to prepare for the deprecation and removal of flush_scheduled_work(). Signed-off-by: Tejun Heo <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Acked-by: Mel Gorman <[email protected]>
2010-09-09memory hotplug: fix next block calculation in is_removableKAMEZAWA Hiroyuki1-8/+8
next_active_pageblock() is for finding next _used_ freeblock. It skips several blocks when it finds there are a chunk of free pages lager than pageblock. But it has 2 bugs. 1. We have no lock. page_order(page) - pageblock_order can be minus. 2. pageblocks_stride += is wrong. it should skip page_order(p) of pages. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Wu Fengguang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>