Age  Commit message  Author  Files  Lines
2012-12-12  mm/bootmem.c: remove unused wrapper function reserve_bootmem_generic()  Lin Feng  2  -9/+0
reserve_bootmem_generic() has no callers, so remove it. Signed-off-by: Lin Feng <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  mm/memory.c: remove unused code from do_wp_page()  Dominik Dingel  1  -6/+1
page_mkwrite is initialized to zero and only set once; from that point on there is no way to reach the oom or oom_free_new labels. [[email protected]: cleanup] Signed-off-by: Dominik Dingel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  asm-generic, mm: pgtable: consolidate zero page helpers  Kirill A. Shutemov  5  -35/+28
We have two different implementations of the is_zero_pfn() and my_zero_pfn() helpers: one for architectures with and one for those without zero page coloring. Let's consolidate them in <asm-generic/pgtable.h>. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
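As an illustration (a minimal sketch, not the verbatim patch), the default non-colored variants consolidated into <asm-generic/pgtable.h> might look like this, assuming the architecture exports zero_pfn and defines __HAVE_COLOR_ZERO_PAGE when it supplies its own colored helpers:

    #ifndef __HAVE_COLOR_ZERO_PAGE

    static inline int is_zero_pfn(unsigned long pfn)
    {
            extern unsigned long zero_pfn;
            return pfn == zero_pfn;
    }

    static inline unsigned long my_zero_pfn(unsigned long addr)
    {
            extern unsigned long zero_pfn;
            return zero_pfn;        /* no coloring: one zero page for all addresses */
    }

    #endif /* !__HAVE_COLOR_ZERO_PAGE */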
2012-12-12  mm/hugetlb.c: fix warning on freeing hwpoisoned hugepage  Naoya Horiguchi  1  -1/+7
Fix the warning from __list_del_entry(), which is triggered when a process tries to free_huge_page() a hwpoisoned hugepage. free_huge_page() can be called for a hwpoisoned hugepage from unpoison_memory(). This function takes the refcount once and clears PageHWPoison, and then puts the refcount twice to return the hugepage back to the free pool. The second put_page() finally reaches free_huge_page(). Signed-off-by: Naoya Horiguchi <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Tony Luck <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  hwpoison, hugetlbfs: fix RSS-counter warning  Naoya Horiguchi  1  -5/+7
Memory error handling on hugepages can break an RSS counter, which emits a message like "Bad rss-counter state mm:ffff88040abecac0 idx:1 val:-1". This is because PageAnon returns true for a hugepage (this behavior is necessary for reverse mapping to work on hugetlbfs). [[email protected]: clean up code layout] Signed-off-by: Naoya Horiguchi <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Tony Luck <[email protected]> Cc: Wu Fengguang <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  hwpoison, hugetlbfs: fix "bad pmd" warning in unmapping hwpoisoned hugepage  Naoya Horiguchi  1  -1/+3
When a process which used a hwpoisoned hugepage tries to exit() or munmap(), the kernel can print a "bad pmd" message because the page table walker in free_pgtables() encounters a 'hwpoisoned entry' in the pmd. This is because we currently fail to clear the hwpoisoned entry in __unmap_hugepage_range(), so this patch simply does it. Signed-off-by: Naoya Horiguchi <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Tony Luck <[email protected]> Cc: Wu Fengguang <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
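Sketched inline (a fragment from the __unmap_hugepage_range() pte loop, assuming the existing is_hugetlb_entry_hwpoisoned() helper), the fix amounts to clearing such entries so they never survive into free_pgtables():

    if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
            /* clear the entry so the page table walker never sees it */
            huge_pte_clear(mm, address, ptep);
            continue;
    }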
2012-12-12  mm: protect against concurrent vma expansion  Michel Lespinasse  1  -0/+28
expand_stack() runs with a shared mmap_sem lock. Because of this, there could be multiple concurrent stack expansions in the same mm, which may cause problems in the vma gap update code. I propose to solve this by taking the mm->page_table_lock around such vma expansions, in order to avoid the concurrency issue. We only have to worry about concurrent expand_stack() calls here, since we hold a shared mmap_sem lock and all vma modifications other than expand_stack() are done under an exclusive mmap_sem lock. I previously tried to achieve the same effect by making sure all growable vmas in a given mm would share the same anon_vma, which we already lock here. However this turned out to be difficult - all of the schemes I tried for refcounting the growable anon_vma and clearing it when no longer needed turned out ugly. So, I'm now proposing only the minimal fix. The overhead of taking the page table lock during stack expansion is expected to be small: glibc doesn't use expandable stacks for the threads it creates, so having multiple growable stacks is actually uncommon and we don't expect the page table lock to get bounced between threads. Signed-off-by: Michel Lespinasse <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
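A hedged sketch of the locking described (field and helper names as in the 3.8-era mm, e.g. vma_gap_update(); not the verbatim patch):

    /* growsup case inside expand_stack(), mmap_sem held for read */
    if (vma->vm_end < address) {
            spin_lock(&vma->vm_mm->page_table_lock);
            vma->vm_end = address;
            vma_gap_update(vma);    /* rbtree gap info, must be serialized */
            spin_unlock(&vma->vm_mm->page_table_lock);
    }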
2012-12-12  memcg: do not check for mm in __mem_cgroup_count_vm_event  Michal Hocko  1  -3/+0
The mm given to __mem_cgroup_count_vm_event() cannot be NULL because the function is either called from the page fault path or vma->vm_mm is used. So the check can be dropped. The check was introduced by commit 456f998ec817 ("memcg: add the pagefault count into memcg stats") because the originally proposed patch used current->mm for shmem but this has been changed to vma->vm_mm later on without the check being removed (thanks to Hugh for this recollection). Signed-off-by: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ying Han <[email protected]> Cc: David Rientjes <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)  Hugh Dickins  1  -1/+91
Revert 3.5's commit f21f8062201f ("tmpfs: revert SEEK_DATA and SEEK_HOLE") to reinstate 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE"), with the intervening additional arg to generic_file_llseek_size(). In 3.8, ext4 is expected to join btrfs, ocfs2 and xfs with proper SEEK_DATA and SEEK_HOLE support; and a good case has now been made for it on tmpfs, so let's join the party. It's quite easy for tmpfs to scan the radix_tree to support llseek's new SEEK_DATA and SEEK_HOLE options: so add them while the minutiae are still on my mind (in particular, the !PageUptodate-ness of pages fallocated but still unwritten). [[email protected]: fix warning with CONFIG_TMPFS=n] Signed-off-by: Hugh Dickins <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Jaegeuk Hanse <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Zheng Liu <[email protected]> Cc: Jeff liu <[email protected]> Cc: Paul Eggert <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Marco Stornelli <[email protected]> Cc: Chris Mason <[email protected]> Cc: Sunil Mushran <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
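For illustration, a minimal sketch of the llseek dispatch (shmem_seek_hole_data() is assumed here to be the radix-tree walk described above; generic_file_llseek_size() takes the extra size arguments mentioned in the changelog):

    static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
    {
            struct address_space *mapping = file->f_mapping;
            struct inode *inode = mapping->host;

            if (whence != SEEK_DATA && whence != SEEK_HOLE)
                    return generic_file_llseek_size(file, offset, whence,
                                    MAX_LFS_FILESIZE, i_size_read(inode));

            mutex_lock(&inode->i_mutex);
            /* scan the radix tree; !PageUptodate fallocated pages count as holes */
            offset = shmem_seek_hole_data(mapping, offset,
                                          i_size_read(inode), whence);
            mutex_unlock(&inode->i_mutex);
            return offset;
    }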
2012-12-12  mm: provide more accurate estimation of pages occupied by memmap  Jiang Liu  1  -2/+24
If SPARSEMEM is enabled, it won't build page structures for non-existing pages (holes) within a zone, so provide a more accurate estimation of pages occupied by memmap if there are bigger holes within the zone. And pages for highmem zones' memmap will be allocated from lowmem, so charge nr_kernel_pages for that. [[email protected]: mark calc_memmap_size __paging_init] Signed-off-by: Jiang Liu <[email protected]> Cc: Wen Congyang <[email protected]> Cc: David Rientjes <[email protected]> Cc: Maciej Rutecki <[email protected]> Cc: Chris Clayton <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Michal Hocko <[email protected]> Tested-by: Jianguo Wu <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
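A sketch of the estimate, close in spirit to the patch (the 1/16 threshold is this sketch's reading of the changelog):

    static unsigned long __paginginit calc_memmap_size(unsigned long spanned_pages,
                                                       unsigned long present_pages)
    {
            unsigned long pages = spanned_pages;

            /*
             * With SPARSEMEM, memmap is only instantiated for present
             * sections, so a zone whose span is mostly holes should be
             * charged for its present pages only.
             */
            if (spanned_pages > present_pages + (present_pages >> 4) &&
                IS_ENABLED(CONFIG_SPARSEMEM))
                    pages = present_pages;

            return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
    }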
2012-12-12  fs/buffer.c: remove redundant initialization in alloc_page_buffers()  Yan Hong  1  -3/+0
buffer_head comes from kmem_cache_zalloc(), no need to zero its fields. Signed-off-by: Yan Hong <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  fs/buffer.c: do not inline exported function  Yan Hong  1  -2/+1
It makes no sense to inline an exported function. Signed-off-by: Yan Hong <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  writeback: fix a typo in comment  Yan Hong  1  -1/+1
Signed-off-by: Yan Hong <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  mm: introduce new field "managed_pages" to struct zone  Jiang Liu  6  -23/+121
Currently a zone's present_pages is calculated as below, which is inaccurate and may cause trouble for memory hotplug: spanned_pages - absent_pages - memmap_pages - dma_reserve. While fixing bugs caused by the inaccurate zone->present_pages, we found that zone->present_pages has been abused. The field zone->present_pages may have different meanings in different contexts: 1) pages existing in a zone. 2) pages managed by the buddy system. For more discussions about the issue, please refer to: http://lkml.org/lkml/2012/11/5/866 https://patchwork.kernel.org/patch/1346751/ This patchset introduces a new field named "managed_pages" to struct zone, which counts "pages managed by the buddy system", and reverts zone->present_pages to counting "physical pages existing in a zone", which also keeps it consistent with pgdat->node_present_pages. We set an initial value for zone->managed_pages in free_area_init_core() and adjust it later if the initial value is inaccurate. For DMA/normal zones, the initial value is set to: (spanned_pages - absent_pages - memmap_pages - dma_reserve). Later zone->managed_pages will be adjusted to the accurate value when the bootmem allocator frees all free pages to the buddy system in free_all_bootmem_node() and free_all_bootmem(). The bootmem allocator doesn't touch highmem pages, so highmem zones' managed_pages is set to the accurate value "spanned_pages - absent_pages" in free_area_init_core() and won't be updated anymore. This patch also adds the new field "managed_pages" to /proc/zoneinfo and sysrq showmem. [[email protected]: small comment tweaks] Signed-off-by: Jiang Liu <[email protected]> Cc: Wen Congyang <[email protected]> Cc: David Rientjes <[email protected]> Cc: Maciej Rutecki <[email protected]> Tested-by: Chris Clayton <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Jianguo Wu <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
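Sketch of the new accounting fields (names from the changelog, neighboring fields elided):

    struct zone {
            /* ... */
            unsigned long spanned_pages;  /* total size, including holes */
            unsigned long present_pages;  /* physical pages existing in the zone */
            unsigned long managed_pages;  /* pages managed by the buddy system */
            /* ... */
    };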
2012-12-12  mm, oom: remove statically defined arch functions of same name  David Rientjes  3  -42/+27
out_of_memory() is a globally defined function to call the oom killer. x86, sh, and powerpc all use a function of the same name within file scope in their respective fault.c unnecessarily. Inline the functions into the pagefault handlers to clean the code up. Signed-off-by: David Rientjes <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Paul Mundt <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  mm, oom: remove redundant sleep in pagefault oom handler  David Rientjes  1  -1/+0
out_of_memory() will already cause current to schedule if it has not been killed, so doing it again in pagefault_out_of_memory() is redundant. Remove it. Signed-off-by: David Rientjes <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  mm, oom: cleanup pagefault oom handler  David Rientjes  1  -42/+7
To lock the entire system from parallel oom killing, it's possible to pass in a zonelist with all zones rather than using for_each_populated_zone() for the iteration. This obsoletes try_set_system_oom() and clear_system_oom() so that they can be removed. Signed-off-by: David Rientjes <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
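The resulting pagefault handler might reduce to something like this sketch (try_set_zonelist_oom()/clear_zonelist_oom() are the zonelist-based serialization helpers of that era; treat the exact calls as this sketch's assumption):

    void pagefault_out_of_memory(void)
    {
            struct zonelist *zonelist = node_zonelist(first_online_node,
                                                      GFP_KERNEL);

            /* a zonelist containing all zones locks out parallel oom kills */
            if (try_set_zonelist_oom(zonelist, GFP_KERNEL)) {
                    out_of_memory(NULL, 0, 0, NULL, false);
                    clear_zonelist_oom(zonelist, GFP_KERNEL);
            }
    }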
2012-12-12  memory_hotplug: allow online/offline memory to result in a movable node  Lai Jiangshan  1  -0/+16
Now memory management can handle a movable node or nodes which don't have any normal memory, so we can dynamically configure and add a movable node by: onlining ZONE_MOVABLE memory on a previously offline node, or offlining the last normal memory, which results in a node without normal memory. Movable nodes are very important for power saving, hardware partitioning and highly available systems (hardware fault management). Signed-off-by: Lai Jiangshan <[email protected]> Tested-by: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Jiang Liu <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Greg KH <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  numa: add CONFIG_MOVABLE_NODE for movable-dedicated node  Lai Jiangshan  4  -0/+21
We need a node which only contains movable memory. This feature is very important for node hotplug. If a node has normal or high memory, the memory may be used by the kernel and can't be offlined. If the node only contains movable memory, we can offline the memory and the node. Everything is prepared, so we can actually introduce N_MEMORY. Add CONFIG_MOVABLE_NODE so we can use it for movable-dedicated nodes. [[email protected]: fix Kconfig text] Signed-off-by: Lai Jiangshan <[email protected]> Tested-by: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Jiang Liu <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Greg KH <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  mm, memcg: avoid unnecessary function call when memcg is disabled  David Rientjes  2  -3/+12
While profiling numa/core v16 with cgroup_disable=memory on the command line, I noticed mem_cgroup_count_vm_event() still showed up as high as 0.60% in perftop. This occurs because the function is called extremely often even when memcg is disabled. To fix this, inline the check for mem_cgroup_disabled() so we avoid the unnecessary function call if memcg is disabled. Signed-off-by: David Rientjes <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Glauber Costa <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
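The shape of the fix, sketched (this mirrors the usual inline-wrapper pattern; the exact kernel diff may differ):

    static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
                                                 enum vm_event_item idx)
    {
            if (mem_cgroup_disabled())
                    return;         /* no function call when memcg is disabled */
            __mem_cgroup_count_vm_event(mm, idx);
    }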
2012-12-12  mm: add a reminder comment for __GFP_BITS_SHIFT  Andrew Morton  1  -0/+1
Cc: Glauber Costa <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  mm: WARN_ON_ONCE if f_op->mmap() changes vma's start address  Joonsoo Kim  1  -0/+4
While reviewing the source code, I found a comment which mentions that after f_op->mmap(), vma's start address can be changed. I didn't verify that this is really possible, because there are so many f_op->mmap() implementations. But if some mmap() implementation changes vma's start address, it is a possible error situation, because we have already prepared the prev vma, rb_link and rb_parent, and these are related to the original address. So add a WARN_ON_ONCE to find out whether this situation really happens. Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
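A sketch of where the check lands (a fragment from mmap_region()-like context; surrounding error handling abbreviated):

    error = file->f_op->mmap(file, vma);
    if (error)
            goto unmap_and_free_vma;

    /* prev, rb_link and rb_parent were computed for the original addr */
    WARN_ON_ONCE(addr != vma->vm_start);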
2012-12-12  res_counter: delete res_counter_write()  Greg Thelen  2  -27/+0
Since commit 628f42355389 ("memcg: limit change shrink usage") both res_counter_write() and write_strategy_fn have been unused. This patch deletes them both. Signed-off-by: Greg Thelen <[email protected]> Cc: Glauber Costa <[email protected]> Cc: Tejun Heo <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Frederic Weisbecker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  hotplug: update nodemasks management  Lai Jiangshan  3  -16/+77
Update nodemasks management for N_MEMORY. [[email protected]: fix build] Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Lin Feng <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  page_alloc: use N_MEMORY instead of N_HIGH_MEMORY, change the node_states initialization  Lai Jiangshan  2  -19/+25
N_HIGH_MEMORY stands for the nodes that have normal or high memory. N_MEMORY stands for the nodes that have any memory. The code here needs to handle the nodes which have memory, so we should use N_MEMORY instead. Since we introduced N_MEMORY, we update the initialization of node_states. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Lin Feng <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Hillf Danton <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  vmscan: use N_MEMORY instead of N_HIGH_MEMORY  Lai Jiangshan  1  -2/+2
N_HIGH_MEMORY stands for the nodes that have normal or high memory. N_MEMORY stands for the nodes that have any memory. The code here needs to handle the nodes which have memory, so we should use N_MEMORY instead. Signed-off-by: Lai Jiangshan <[email protected]> Acked-by: Hillf Danton <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Lin Feng <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  init: use N_MEMORY instead of N_HIGH_MEMORY  Lai Jiangshan  1  -1/+1
N_HIGH_MEMORY stands for the nodes that have normal or high memory. N_MEMORY stands for the nodes that have any memory. The code here needs to handle the nodes which have memory, so we should use N_MEMORY instead. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Lin Feng <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  kthread: use N_MEMORY instead of N_HIGH_MEMORY  Lai Jiangshan  1  -1/+1
N_HIGH_MEMORY stands for the nodes that have normal or high memory. N_MEMORY stands for the nodes that have any memory. The code here needs to handle the nodes which have memory, so we should use N_MEMORY instead. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Lin Feng <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  vmstat: use N_MEMORY instead of N_HIGH_MEMORY  Lai Jiangshan  1  -2/+2
N_HIGH_MEMORY stands for the nodes that have normal or high memory. N_MEMORY stands for the nodes that have any memory. The code here needs to handle the nodes which have memory, so we should use N_MEMORY instead. Signed-off-by: Lai Jiangshan <[email protected]> Acked-by: Christoph Lameter <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Lin Feng <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  hugetlb: use N_MEMORY instead of N_HIGH_MEMORY  Lai Jiangshan  2  -13/+13
N_HIGH_MEMORY stands for the nodes that have normal or high memory. N_MEMORY stands for the nodes that have any memory. The code here needs to handle the nodes which have memory, so we should use N_MEMORY instead. Signed-off-by: Lai Jiangshan <[email protected]> Acked-by: Hillf Danton <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Lin Feng <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  mempolicy: use N_MEMORY instead of N_HIGH_MEMORY  Lai Jiangshan  1  -6/+6
N_HIGH_MEMORY stands for the nodes that have normal or high memory. N_MEMORY stands for the nodes that have any memory. The code here needs to handle the nodes which have memory, so we should use N_MEMORY instead. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Lin Feng <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  mm,migrate: use N_MEMORY instead of N_HIGH_MEMORY  Lai Jiangshan  1  -1/+1
N_HIGH_MEMORY stands for the nodes that have normal or high memory. N_MEMORY stands for the nodes that have any memory. The code here needs to handle the nodes which have memory, so we should use N_MEMORY instead. Signed-off-by: Lai Jiangshan <[email protected]> Acked-by: Christoph Lameter <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Lin Feng <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  oom: use N_MEMORY instead of N_HIGH_MEMORY  Lai Jiangshan  1  -1/+1
N_HIGH_MEMORY stands for the nodes that have normal or high memory. N_MEMORY stands for the nodes that have any memory. The code here needs to handle the nodes which have memory, so we should use N_MEMORY instead. Signed-off-by: Lai Jiangshan <[email protected]> Acked-by: Hillf Danton <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Lin Feng <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  memcontrol: use N_MEMORY instead of N_HIGH_MEMORY  Lai Jiangshan  2  -10/+10
N_HIGH_MEMORY stands for the nodes that have normal or high memory. N_MEMORY stands for the nodes that have any memory. The code here needs to handle the nodes which have memory, so we should use N_MEMORY instead. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Lin Feng <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  procfs: use N_MEMORY instead of N_HIGH_MEMORY  Lai Jiangshan  2  -3/+3
N_HIGH_MEMORY stands for the nodes that have normal or high memory. N_MEMORY stands for the nodes that have any memory. The code here needs to handle the nodes which have memory, so we should use N_MEMORY instead. Signed-off-by: Lai Jiangshan <[email protected]> Acked-by: Hillf Danton <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Lin Feng <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  cpuset: use N_MEMORY instead of N_HIGH_MEMORY  Lai Jiangshan  3  -18/+18
N_HIGH_MEMORY stands for the nodes that have normal or high memory. N_MEMORY stands for the nodes that have any memory. The code here needs to handle the nodes which have memory, so we should use N_MEMORY instead. Signed-off-by: Lai Jiangshan <[email protected]> Acked-by: Hillf Danton <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Lin Feng <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  mm: node_states: introduce N_MEMORY  Lai Jiangshan  1  -0/+1
We have N_NORMAL_MEMORY standing for the nodes that have normal memory with zone_type <= ZONE_NORMAL. And we have N_HIGH_MEMORY standing for the nodes that have normal or high memory. But we don't have any word to stand for the nodes that have *any* memory. And we have N_CPU but no N_MEMORY. Current code reuses N_HIGH_MEMORY for this purpose because any node which has memory must currently have high memory or normal memory. A) But this reuse is bad for *readability*, because the name N_HIGH_MEMORY just stands for high or normal: A.example 1) mem_cgroup_nr_lru_pages(): for_each_node_state(nid, N_HIGH_MEMORY) The user will be confused (why does this function only count high or normal memory nodes? does it count ZONE_MOVABLE's lru pages?) until someone else tells them N_HIGH_MEMORY is reused to stand for nodes that have any memory. A.cont) If we introduce N_MEMORY, we can reduce this confusion AND make the code clearer: A.example 2) mm/page_cgroup.c uses N_HIGH_MEMORY twice: The first is in page_cgroup_init(void): for_each_node_state(nid, N_HIGH_MEMORY) { It means: if the node has memory, we will allocate a page_cgroup map for the node. We should use N_MEMORY here to gain more clarity. The second use is in alloc_page_cgroup(): if (node_state(nid, N_HIGH_MEMORY)) addr = vzalloc_node(size, nid); It means: if the node has high or normal memory that can be allocated from the kernel. We should keep N_HIGH_MEMORY here, and it will be better if the "any memory" semantic of N_HIGH_MEMORY is removed. B) This reuse is outdated if we introduce a MOVABLE-dedicated node. The MOVABLE-dedicated node should appear in neither node_states[N_HIGH_MEMORY] nor node_states[N_NORMAL_MEMORY], because a MOVABLE-dedicated node has no high or normal memory. On x86_64, N_HIGH_MEMORY=N_NORMAL_MEMORY, so if a MOVABLE-dedicated node is in node_states[N_HIGH_MEMORY], that also means it is in node_states[N_NORMAL_MEMORY], which breaks SLUB: SLUB uses for_each_node_state(nid, N_NORMAL_MEMORY) and would create a kmem_cache_node for the MOVABLE-dedicated node, causing problems. In one word, we need N_MEMORY. We just introduce it as an alias to N_HIGH_MEMORY and fix all improper usages of N_HIGH_MEMORY in later patches. Signed-off-by: Lai Jiangshan <[email protected]> Acked-by: Christoph Lameter <[email protected]> Acked-by: Hillf Danton <[email protected]> Signed-off-by: Wen Congyang <[email protected]> Cc: Lin Feng <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
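The addition itself, sketched against include/linux/nodemask.h (surrounding states abbreviated; the alias form matches the "introduce it as an alias" wording above):

    enum node_states {
            N_POSSIBLE,             /* The node could become online at some point */
            N_ONLINE,               /* The node is online */
            N_NORMAL_MEMORY,        /* The node has regular memory */
    #ifdef CONFIG_HIGHMEM
            N_HIGH_MEMORY,          /* The node has regular or high memory */
    #else
            N_HIGH_MEMORY = N_NORMAL_MEMORY,
    #endif
            N_MEMORY = N_HIGH_MEMORY, /* The node has any memory; alias for now */
            N_CPU,                  /* The node has one or more cpus */
            NR_NODE_STATES
    };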
2012-12-12  mm: use migrate_prep() instead of migrate_prep_local()  Marek Szyprowski  1  -1/+1
__alloc_contig_migrate_range() should use all possible ways to get all the pages migrated from the given memory range, so pruning the per-cpu lru lists for all CPUs is required, regardless of the cost of that operation. Otherwise some pages which got stuck on a per-cpu lru list might get missed by the migration procedure, causing the contiguous allocation to fail. Reported-by: SeongHwan Yoon <[email protected]> Signed-off-by: Marek Szyprowski <[email protected]> Signed-off-by: Kyungmin Park <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  mm: compaction: Fix compiler warning  Thierry Reding  1  -54/+54
compact_capture_page() is only used if compaction is enabled so it should be moved into the corresponding #ifdef. Signed-off-by: Thierry Reding <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  thp: avoid race on multiple parallel page faults to the same page  Kirill A. Shutemov  1  -5/+24
pmd value is stable only with mm->page_table_lock taken. After taking the lock we need to check that nobody modified the pmd before changing it. Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: David Rientjes <[email protected]> Reviewed-by: Bob Liu <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
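The pattern, sketched (a fragment; haddr/entry as in the surrounding THP fault code, fallback label assumed):

    spin_lock(&mm->page_table_lock);
    if (unlikely(!pmd_none(*pmd))) {
            /* raced with a parallel fault: somebody already installed a pmd */
            spin_unlock(&mm->page_table_lock);
            goto fallback;
    }
    set_pmd_at(mm, haddr, pmd, entry);
    spin_unlock(&mm->page_table_lock);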
2012-12-12  thp: introduce sysfs knob to disable huge zero page  Kirill A. Shutemov  3  -2/+30
By default the kernel tries to use the huge zero page on read page faults. It's possible to disable the huge zero page by writing 0, or enable it back by writing 1: echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andi Kleen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  thp, vmstat: implement HZP_ALLOC and HZP_ALLOC_FAILED events  Kirill A. Shutemov  4  -1/+16
hzp_alloc is incremented every time a huge zero page is successfully allocated. It includes allocations which were dropped due to a race with another allocation. Note, it doesn't count every map of the huge zero page, only its allocation. hzp_alloc_failed is incremented if the kernel fails to allocate a huge zero page and falls back to using small pages. Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andi Kleen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  thp: implement refcounting for huge zero page  Kirill A. Shutemov  1  -25/+88
H. Peter Anvin doesn't like the huge zero page sticking in memory forever after the first allocation. Here's an implementation of lockless refcounting for the huge zero page. We have two basic primitives: {get,put}_huge_zero_page(). They manipulate a reference counter. If the counter is 0, get_huge_zero_page() allocates a new huge page and takes two references: one for the caller and one for the shrinker. We free the page only in the shrinker callback if the counter is 1 (only the shrinker holds a reference). put_huge_zero_page() only decrements the counter. The counter is never zero in put_huge_zero_page() since the shrinker holds one reference. Freeing the huge zero page in the shrinker callback helps to avoid frequent allocate-free cycles. Refcounting has a cost. On a 4 socket machine I observe ~1% slowdown on parallel (40 processes) read page faulting compared to lazy huge page allocation. I think that's pretty reasonable for a synthetic benchmark. [[email protected]: fix mismerge] Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andi Kleen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
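A condensed sketch of the two primitives (counter semantics per the changelog; the allocation slow path is sketched separately under the lazy-allocation entry below):

    static atomic_t huge_zero_refcount;

    static unsigned long get_huge_zero_page(void)
    {
            /* fast path: page already exists, counter >= 1 (shrinker's ref) */
            if (atomic_inc_not_zero(&huge_zero_refcount))
                    return huge_zero_pfn;
            /* slow path: allocate and take two refs (caller + shrinker) */
            return huge_zero_pfn_alloc();   /* assumed helper, see below */
    }

    static void put_huge_zero_page(void)
    {
            /* never drops to zero here: the shrinker holds the last ref */
            BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
    }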
2012-12-12  thp: lazy huge zero page allocation  Kirill A. Shutemov  1  -10/+10
Instead of allocating huge zero page on hugepage_init() we can postpone it until first huge zero page map. It saves memory if THP is not in use. cmpxchg() is used to avoid race on huge_zero_pfn initialization. Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andi Kleen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
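The race handling, sketched (the GFP mask and pfn variable follow the changelog era; treat the specifics as this sketch's assumptions):

    static unsigned long huge_zero_pfn __read_mostly;

    static unsigned long huge_zero_pfn_alloc(void)
    {
            struct page *zero_page;

            zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
                                    HPAGE_PMD_ORDER);
            if (!zero_page)
                    return 0;       /* caller falls back to small pages */

            /* several faulting threads may race; cmpxchg picks one winner */
            if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page)))
                    __free_pages(zero_page, HPAGE_PMD_ORDER);

            return huge_zero_pfn;
    }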
2012-12-12  thp: setup huge zero page on non-write page fault  Kirill A. Shutemov  1  -0/+10
All code paths seem covered. Now we can map the huge zero page on read page faults. We set it up in do_huge_pmd_anonymous_page() if the area around the fault address is suitable for THP and we've got a read page fault. If we fail to set up the huge zero page (ENOMEM) we fall back to handle_pte_fault() as we normally do in THP. Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andi Kleen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  thp: implement splitting pmd for huge zero page  Kirill A. Shutemov  1  -1/+42
We can't split the huge zero page itself (and it's a bug if we try), but we can split the pmd which points to it. On splitting the pmd we create a table with all ptes set to the normal zero page. [[email protected]: fix build error] Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andi Kleen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
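Sketched (a fragment; deposit/withdraw and flush details follow the 3.8-era THP code, simplified):

    pmdp_clear_flush(vma, haddr, pmd);
    pgtable = pgtable_trans_huge_withdraw(mm);  /* deposited at fault time */
    pmd_populate(mm, &_pmd, pgtable);

    for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
            pte_t *pte, entry;

            entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
            entry = pte_mkspecial(entry);
            pte = pte_offset_map(&_pmd, haddr);
            VM_BUG_ON(!pte_none(*pte));
            set_pte_at(mm, haddr, pte, entry);
            pte_unmap(pte);
    }
    pmd_populate(mm, pmd, pgtable);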
2012-12-12  thp: change split_huge_page_pmd() interface  Kirill A. Shutemov  10  -16/+37
Pass vma instead of mm and add an address parameter. In most cases we already have the vma on the stack. We provide split_huge_page_pmd_mm() for the few cases when we have an mm but not a vma. This change is a preparation for the huge zero pmd splitting implementation. Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andi Kleen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
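The signature change, in sketch form:

    /* before */
    void split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);

    /* after: vma + address, needed to rebuild ptes for the zero page case */
    void split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
                             pmd_t *pmd);

    /* wrapper for the few callers that only have an mm */
    void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
                                pmd_t *pmd);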
2012-12-12  thp: change_huge_pmd(): make sure we don't try to make a page writable  Kirill A. Shutemov  1  -0/+1
The mprotect core never tries to make a page writable using change_huge_pmd(). Let's add an assert that the assumption is true. It's important to be sure we will not make the huge zero page writable. Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andi Kleen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  thp: do_huge_pmd_wp_page(): handle huge zero page  Kirill A. Shutemov  3  -22/+104
On write access to the huge zero page we allocate a new huge page and clear it. On ENOMEM, graceful fallback: we create a new pmd table and set the pte around the fault address to a newly allocated normal (4k) page. All other ptes in the pmd are set to the normal zero page. Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andi Kleen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Mel Gorman <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  thp: copy_huge_pmd(): copy huge zero page  Kirill A. Shutemov  1  -0/+22
It's easy to copy the huge zero page. Just set the destination pmd to the huge zero page. It's safe to copy the huge zero page since we have none yet :-p [[email protected]: fix comment] Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andi Kleen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
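Sketched (a fragment from the copy_huge_pmd() path; set_huge_zero_page() is assumed to be the helper that installs the zero pmd plus the deposited pgtable):

    if (is_huge_zero_pmd(pmd)) {
            /* the zero page is global and read-only: just point at it */
            set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
            ret = 0;
            goto out_unlock;
    }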