Age | Commit message | Author | Files | Lines (-/+)
2012-07-31 | mm/hotplug: mark memory hotplug code in page_alloc.c as __meminit | Jiang Liu | 1 file | -32/+34
Mark functions used by both boot and memory hotplug as __meminit to reduce memory footprint when memory hotplug is disabled. Also guard zone_pcp_update() with CONFIG_MEMORY_HOTPLUG because it's only used by memory hotplug code.
Signed-off-by: Jiang Liu <[email protected]> Cc: Wei Wang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: David Rientjes <[email protected]> Cc: Keping Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
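A minimal sketch of the annotation pattern described above (the first helper name is illustrative; only zone_pcp_update() is named by the patch):

    /* shared by boot and hotplug: discarded after init unless
     * CONFIG_MEMORY_HOTPLUG keeps the .meminit sections around */
    static void __meminit example_zone_setup(struct zone *zone)
    {
            /* ... */
    }

    #ifdef CONFIG_MEMORY_HOTPLUG
    /* hotplug-only: compiled out entirely otherwise */
    void zone_pcp_update(struct zone *zone)
    {
            /* ... */
    }
    #endif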
2012-07-31 | mm/hotplug: free zone->pageset when a zone becomes empty | Jiang Liu | 3 files | -0/+17
When a zone becomes empty after memory offlining, free zone->pageset. Otherwise it will cause a memory leak when adding memory to the empty zone again, because build_all_zonelists() will allocate zone->pageset for an empty zone.
Signed-off-by: Jiang Liu <[email protected]> Signed-off-by: Wei Wang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: David Rientjes <[email protected]> Cc: Keping Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
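A sketch of the kind of teardown this implies, assuming the zone falls back to the boot-time pageset (helper name illustrative):

    #ifdef CONFIG_MEMORY_HOTPLUG
    void zone_pcp_reset(struct zone *zone)
    {
            /* drop the per-cpu pagesets of a now-empty zone and fall back
             * to the boot-time set, so a later online starts clean */
            if (zone->pageset != &boot_pageset) {
                    free_percpu(zone->pageset);
                    zone->pageset = &boot_pageset;
            }
    }
    #endif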
2012-07-31 | mm/hotplug: correctly add new zone to all other nodes' zone lists | Jiang Liu | 1 file | -7/+8
When online_pages() is called to add new memory to an empty zone, it rebuilds all zone lists by calling build_all_zonelists(). But there's a bug which prevents the new zone from being added to other nodes' zone lists.

    online_pages()
    {
            build_all_zonelists()
            .....
            node_set_state(zone_to_nid(zone), N_HIGH_MEMORY)
    }

Here the node of the zone is put into N_HIGH_MEMORY state after calling build_all_zonelists(), but build_all_zonelists() only adds zones from nodes in N_HIGH_MEMORY state to the fallback zone lists.

    build_all_zonelists()
        ->__build_all_zonelists()
            ->build_zonelists()
                ->find_next_best_node()
                    ->for_each_node_state(n, N_HIGH_MEMORY)

So memory in the new zone will never be used by other nodes, and it may cause strange behavior when the system is under memory pressure. So put the node into N_HIGH_MEMORY state before calling build_all_zonelists().
Signed-off-by: Jianguo Wu <[email protected]> Signed-off-by: Jiang Liu <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: David Rientjes <[email protected]> Cc: Keping Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
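The corrected ordering, sketched in the same style as the snippet above (argument lists elided):

    online_pages()
    {
            /* mark the node first, so find_next_best_node() sees it */
            node_set_state(zone_to_nid(zone), N_HIGH_MEMORY)
            build_all_zonelists()
            .....
    }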
2012-07-31 | mm/hotplug: correctly setup fallback zonelists when creating new pgdat | Jiang Liu | 5 files | -10/+17
When hotadd_new_pgdat() is called to create a new pgdat for a new node, a fallback zonelist should be created for the new node. There's code that tries to achieve that in hotadd_new_pgdat(), as below:

    /*
     * The node we allocated has no zone fallback lists. For avoiding
     * to access not-initialized zonelist, build here.
     */
    mutex_lock(&zonelists_mutex);
    build_all_zonelists(pgdat, NULL);
    mutex_unlock(&zonelists_mutex);

But it doesn't work as expected. When hotadd_new_pgdat() is called, the new node is still in the offline state because node_set_online(nid) hasn't been called yet. And build_all_zonelists() only builds zonelists for online nodes:

    for_each_online_node(nid) {
            pg_data_t *pgdat = NODE_DATA(nid);

            build_zonelists(pgdat);
            build_zonelist_cache(pgdat);
    }

So although we hope to create a zonelist for the new pgdat, it doesn't get built. Add a new parameter "pgdat" to build_all_zonelists() so it builds zonelists for the new pgdat too.
Signed-off-by: Jiang Liu <[email protected]> Signed-off-by: Xishi Qiu <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: David Rientjes <[email protected]> Cc: Keping Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
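A sketch of how __build_all_zonelists() can honor the new parameter for a pgdat whose node is not yet online (structured around the loop quoted above; the "self" handling is an illustration):

    static int __build_all_zonelists(void *data)
    {
            pg_data_t *self = data;
            int nid;

            /* the hot-added pgdat is not online yet, so handle it explicitly */
            if (self && !node_online(self->node_id)) {
                    build_zonelists(self);
                    build_zonelist_cache(self);
            }

            for_each_online_node(nid) {
                    pg_data_t *pgdat = NODE_DATA(nid);

                    build_zonelists(pgdat);
                    build_zonelist_cache(pgdat);
            }
            return 0;
    }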
2012-07-31 | mm/memcg: replace nonexistent move_lock_page_cgroup() with move_lock_mem_cgroup() in comment | Wanpeng Li | 1 file | -2/+2
Signed-off-by: Wanpeng Li <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | mm/memcg: mem_cgroup_resize_xxx_limit can guarantee memcg->res.limit <= memcg->memsw.limit | Wanpeng Li | 1 file | -2/+2
Signed-off-by: Wanpeng Li <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | mm/memcg: complete documentation for tcp memcg files | Wanpeng Li | 1 file | -0/+2
Signed-off-by: Wanpeng Li <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | mm: setup pageblock_order before it's used by sparsemem | Xishi Qiu | 3 files | -2/+7
On architectures with CONFIG_HUGETLB_PAGE_SIZE_VARIABLE set, such as Itanium, pageblock_order is a variable with a default value of 0. It's set to the right value by set_pageblock_order() in free_area_init_core(). But pageblock_order may be used by sparse_init() before free_area_init_core() is called, along this path:

    sparse_init()
        ->sparse_early_usemaps_alloc_node()
            ->usemap_size()
                ->SECTION_BLOCKFLAGS_BITS
                    ->((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS)

The uninitialized pageblock_order will cause memory wasting because usemap_size() returns a much bigger value than is really needed. For example, on an Itanium platform:

    sparse_init()                 pageblock_order=0   usemap_size=24576
    free_area_init_core() before  pageblock_order=0,  usemap_size=24576
    free_area_init_core() after   pageblock_order=12, usemap_size=8

That means 24K of memory has been wasted for each section, so fix it by calling set_pageblock_order() from sparse_init().
Signed-off-by: Xishi Qiu <[email protected]> Signed-off-by: Jiang Liu <[email protected]> Cc: Tony Luck <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: David Rientjes <[email protected]> Cc: Keping Chen <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
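The fix, sketched (exact placement inside sparse_init() assumed):

    void __init sparse_init(void)
    {
            /* pageblock_order must be final before usemap_size() runs */
            set_pageblock_order();

            /* ... allocate usemaps, mem_maps, etc. ... */
    }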
2012-07-31 | mm/memory.c:print_vma_addr(): call up_read(&mm->mmap_sem) directly | Jeff Liu | 1 file | -1/+1
Call up_read(&mm->mmap_sem) directly since we have already got mm via current->mm at the beginning of print_vma_addr(). Signed-off-by: Jie Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | vmscan: remove obsolete shrink_control comment | Minchan Kim | 2 files | -2/+1
09f363c7 ("vmscan: fix shrinker callback bug in fs/super.c") fixed a shrinker callback which was returning -1 when nr_to_scan is zero, which caused excessive slab scanning. But 635697c6 ("vmscan: fix initial shrinker size handling") fixed the underlying problem, so shrinkers may freely return -1 even when nr_to_scan is zero. So let's revert 09f363c7, because the comment it added made an unnecessary rule.
Signed-off-by: Minchan Kim <[email protected]> Cc: Al Viro <[email protected]> Cc: Mikulas Patocka <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | mm: CONFIG_HAVE_MEMBLOCK_NODE -> CONFIG_HAVE_MEMBLOCK_NODE_MAP | Rabin Vincent | 1 file | -1/+1
0ee332c14518699 ("memblock: Kill early_node_map[]") wanted to replace CONFIG_ARCH_POPULATES_NODE_MAP with CONFIG_HAVE_MEMBLOCK_NODE_MAP but ended up replacing one occurrence with a reference to the non-existent symbol CONFIG_HAVE_MEMBLOCK_NODE. The resulting omission of code would probably have been causing problems for 32-bit machines with memory hotplug.
Signed-off-by: Rabin Vincent <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | mm: have order > 0 compaction start off where it left off | Rik van Riel | 4 files | -5/+73
Order > 0 compaction stops when enough free pages of the correct page order have been coalesced. When doing subsequent higher order allocations, it is possible for compaction to be invoked many times. However, the compaction code always starts out looking for things to compact at the start of the zone, and for free pages to compact things to at the end of the zone. This can cause quadratic behaviour, with isolate_freepages starting at the end of the zone each time, even though previous invocations of the compaction code already filled up all free memory on that end of the zone. This can cause isolate_freepages to take enormous amounts of CPU with certain workloads on larger memory systems. The obvious solution is to have isolate_freepages remember where it left off last time, and continue at that point the next time it gets invoked for an order > 0 compaction. This could cause compaction to fail if cc->free_pfn and cc->migrate_pfn are close together initially; in that case we restart from the end of the zone and try once more. Forced full (order == -1) compactions are left alone.
[[email protected]: checkpatch fixes] [[email protected]: s/laste/last/, use 80 cols] Signed-off-by: Rik van Riel <[email protected]> Reported-by: Jim Schutt <[email protected]> Tested-by: Jim Schutt <[email protected]> Cc: Minchan Kim <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
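A sketch of the resume logic (the cached-field name is an assumption for illustration):

    static void isolate_freepages(struct zone *zone, struct compact_control *cc)
    {
            /* resume where the previous order > 0 compaction left off,
             * instead of rescanning from the end of the zone */
            unsigned long high_pfn = zone->compact_cached_free_pfn;

            /* ... isolate free pages, moving high_pfn downwards ... */

            /* remember the position for the next invocation */
            zone->compact_cached_free_pfn = high_pfn;
    }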
2012-07-31 | memcg: rename mem_control_xxx to memcg_xxx | Wanpeng Li | 1 file | -4/+4
Replace memory_cgroup_xxx() with memcg_xxx() Signed-off-by: Wanpeng Li <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | memcg: fix bad behavior in use_hierarchy file | Glauber Costa | 1 file | -0/+6
I have an application that does the following:

    * copy the state of all controllers attached to a hierarchy
    * replicate it as a child of the current level

I would expect writes to the files to mostly succeed, since they are inheriting sane values from parents. But that is not the case for use_hierarchy. If it is set to 0, we succeed ok. If we're set to 1, the value of the file is automatically set to 1 in the children, but if userspace tries to write the very same 1, it will fail. That same situation happens if we set use_hierarchy, create a child, and then try to write 1 again. Now, there is no reason whatsoever for failing to write a value that is already there. It doesn't even match the comment, which states:

    /* If parent's use_hierarchy is set, we can't make any modifications
     * in the child subtrees...

since we are not changing anything. So test the new value against the one we're storing, and automatically return 0 if we're not proposing a change.
Signed-off-by: Glauber Costa <[email protected]> Cc: Dhaval Giani <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Kamezawa Hiroyuki <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Ying Han <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
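The test described in the last sentence, sketched (variable names assumed):

    /* writing back the value that is already stored is not a
     * modification, so accept it */
    if (memcg->use_hierarchy == val) {
            retval = 0;
            goto out;
    }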
2012-07-31 | mm: remove unused LRU_ALL_EVICTABLE | Wanpeng Li | 1 file | -1/+0
Signed-off-by: Wanpeng Li <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | memcg: rename config variables | Andrew Morton | 33 files | -78/+78
Sanity:

    CONFIG_CGROUP_MEM_RES_CTLR              -> CONFIG_MEMCG
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP         -> CONFIG_MEMCG_SWAP
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
    CONFIG_CGROUP_MEM_RES_CTLR_KMEM         -> CONFIG_MEMCG_KMEM

[[email protected]: fix missed bits] Cc: Glauber Costa <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | mm: clean up __count_immobile_pages() | Minchan Kim | 1 file | -16/+18
The __count_immobile_pages() naming is rather awkward. Choose a clearer name and add a comment.
Signed-off-by: Minchan Kim <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Bartlomiej Zolnierkiewicz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | mm: do not use page_count() without a page pin | Minchan Kim | 1 file | -1/+8
d179e84ba ("mm: vmscan: do not use page_count without a page pin") fixed this problem in vmscan.c, but the same problem exists in __count_immobile_pages(). d179e84ba's description applies here as well: "It is unsafe to run page_count during the physical pfn scan because compound_head could trip on a dangling pointer when reading page->first_page if the compound page is being freed by another CPU."
Signed-off-by: Minchan Kim <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Bartlomiej Zolnierkiewicz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
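A sketch of the safe check (reading the raw refcount avoids compound_head() and its possibly dangling page->first_page):

    /* don't use page_count(): it goes through compound_head(), which can
     * chase a dangling first_page while another CPU frees the compound page */
    if (!atomic_read(&page->_count)) {
            if (PageBuddy(page))
                    iter += (1 << page_order(page)) - 1;
            continue;
    }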
2012-07-31 | mm, oom: replace some information in tasklist dump | David Rientjes | 2 files | -8/+10
The number of ptes and swap entries are used in the oom killer's badness heuristic, so they should be shown in the tasklist dump. This patch adds those fields and replaces the cpu and oom_adj values that are currently emitted. Cpu isn't interesting, and oom_adj is deprecated and will be removed later this year; the same information is already displayed as oom_score_adj, which is used internally. At the same time, make the documentation a little clearer in stating that this information is helpful to determine why the oom killer chose the task it did to kill.
Signed-off-by: David Rientjes <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | mm, oom: fix potential killing of thread that is disabled from oom killing | David Rientjes | 1 file | -2/+2
/proc/sys/vm/oom_kill_allocating_task will immediately kill current when the oom killer is called to avoid a potentially expensive tasklist scan for large systems. Currently, however, it is not checking current's oom_score_adj value which may be OOM_SCORE_ADJ_MIN, meaning that it has been disabled from oom killing. This patch avoids killing current in such a condition and simply falls back to the tasklist scan since memory still needs to be freed. Signed-off-by: David Rientjes <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Acked-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
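A sketch of the added eligibility check (surrounding helper names follow the oom killer code of that era and are partly assumptions):

    if (sysctl_oom_kill_allocating_task && current->mm &&
        !oom_unkillable_task(current, NULL, nodemask) &&
        current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
            /* current is eligible: kill it and skip the tasklist scan */
            oom_kill_process(current, gfp_mask, order, 0, totalpages,
                             NULL, nodemask,
                             "Out of memory (oom_kill_allocating_task)");
            goto out;
    }
    /* otherwise fall through to the full tasklist scan */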
2012-07-31 | mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator again | KOSAKI Motohiro | 1 file | -3/+6
Commit 2ff754fa8f ("mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator again") fixed one free_pcppages_bulk() misuse, but two other misuses still exist. This patch fixes them.
Signed-off-by: KOSAKI Motohiro <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: Wu Fengguang <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | mm, fadvise: don't return -EINVAL when filesystem cannot implement fadvise() | KOSAKI Motohiro | 1 file | -11/+7
Eric Wong reported his test suite failed when /tmp is tmpfs: https://lkml.org/lkml/2012/2/24/479 Currently the input check of POSIX_FADV_WILLNEED has two problems:

    - It requires a_ops->readpage. But in fact, force_page_cache_readahead()
      only requires that the target filesystem has either ->readpage or
      ->readpages.

    - It returns -EINVAL when the filesystem doesn't have ->readpage. But
      POSIX says that fadvise is merely a hint. Thus fadvise() should return
      0 if the filesystem has no means of implementing fadvise().

The userland application should not know nor care which type of filesystem backs the TMPDIR directory, as Eric pointed out. There is nothing which userspace can do to solve this error. So change the return value to 0 when the filesystem doesn't support readahead.
[[email protected]: checkpatch fixes] Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Eric Wong <[email protected]> Tested-by: Eric Wong <[email protected]> Reviewed-by: Wanlong Gao <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
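One way the rule can look in the POSIX_FADV_WILLNEED case, sketched (not the verbatim patch):

    case POSIX_FADV_WILLNEED:
            /* readahead needs ->readpage or ->readpages; if the backing
             * filesystem has neither, the hint simply does nothing */
            if (!mapping->a_ops->readpage && !mapping->a_ops->readpages)
                    break;          /* returns 0, not -EINVAL */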
2012-07-31 | mm/compaction: cleanup on compaction_deferred | Gavin Shan | 1 file | -2/+2
When CONFIG_COMPACTION is enabled, compaction_deferred() tries to recalculate the deferred limit again, which isn't necessary. When CONFIG_COMPACTION is disabled, compaction_deferred() should return "true" or "false" since it has "bool" for its return value. Signed-off-by: Gavin Shan <[email protected]> Acked-by: Minchan Kim <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | memcg: make mem_cgroup_force_empty_list() return bool | KAMEZAWA Hiroyuki | 1 file | -12/+7
mem_cgroup_force_empty_list() just returns 0 or -EBUSY and -EBUSY indicates 'you need to retry'. Make mem_cgroup_force_empty_list() return a bool to simplify the logic. [[email protected]: rework mem_cgroup_force_empty_list()'s comment] Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | memcg: mem_cgroup_move_parent() doesn't need gfp_mask | KAMEZAWA Hiroyuki | 1 file | -3/+2
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | memcg: clean up force_empty_list() return value check | Kamezawa Hiroyuki | 1 file | -5/+0
After bf544fdc241da8 "memcg: move charges to root cgroup if use_hierarchy=0 in mem_cgroup_move_hugetlb_parent()" mem_cgroup_move_parent() returns only -EBUSY or -EINVAL. So we can remove the -ENOMEM and -EINTR checks. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | memcg: remove check for signal_pending() during rmdir() | Kamezawa Hiroyuki | 1 file | -3/+0
After bf544fdc241da8 "memcg: move charges to root cgroup if use_hierarchy=0 in mem_cgroup_move_hugetlb_parent()", no memory reclaim will occur when removing a memory cgroup. If -EINTR is returned here, cgroup will show a warning. We don't need to handle any user interruption signal. Remove this. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | mm/memblock.c:memblock_double_array(): cosmetic cleanups | Andrew Morton | 1 file | -17/+18
This function is an 80-column eyesore, quite unnecessarily. Clean that up, and use standard comment layout style. Cc: Benjamin Herrenschmidt <[email protected]> Cc: Greg Pearson <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | mm, oom: do not schedule if current has been killed | David Rientjes | 1 file | -6/+5
The oom killer currently schedules away from current in an uninterruptible sleep if it does not have access to memory reserves. It's possible that current was killed because it shares memory with the oom killed thread or because it was killed by the user in the interim, however. This patch only schedules away from current if it does not have a pending kill, i.e. if it does not share memory with the oom killed thread. It's possible that it will immediately retry its memory allocation and fail, but it will immediately be given access to memory reserves if it calls the oom killer again. This prevents the delay of memory freeing when threads that share memory with the oom killed thread get unnecessarily scheduled. Signed-off-by: David Rientjes <[email protected]> Cc: Oleg Nesterov <[email protected]> Acked-by: KOSAKI Motohiro <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
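A sketch of the change (a killable sleep returns immediately when current itself already has a pending SIGKILL):

    /* Give killed threads a chance to exit before current retries its
     * allocation; don't stall a task that has itself been killed. */
    if (killed)
            schedule_timeout_killable(1);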
2012-07-31 | hugetlb/cgroup: remove exclude and wakeup rmdir calls from migrate | Aneesh Kumar K.V | 1 file | -2/+4
We already hold the hugetlb_lock. That should prevent a parallel cgroup rmdir from touching page's hugetlb cgroup. So remove the exclude and wakeup calls. Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | hugetlb/cgroup: assign the page hugetlb cgroup when we move the page to active list | Aneesh Kumar K.V | 2 files | -14/+13
A page's hugetlb cgroup assignment and movement to the active list should occur with hugetlb_lock held. Otherwise, when we remove the hugetlb cgroup, we will iterate the active list and find pages with a NULL hugetlb cgroup value.
Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | hugetlb: move all the in-use pages to active list | Aneesh Kumar K.V | 1 file | -1/+10
When we fail to allocate pages from the reserve pool, hugetlb tries to allocate huge pages using alloc_buddy_huge_page. Add these to the active list. We also need to add to the active list the huge page we allocate when we soft offline the old page.
Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | hugetlb/cgroup: add HugeTLB controller documentation | Aneesh Kumar K.V | 1 file | -0/+45
Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | hugetlb/cgroup: migrate hugetlb cgroup info from oldpage to new page during migration | Aneesh Kumar K.V | 3 files | -0/+33
With HugeTLB pages, the hugetlb cgroup is uncharged in the compound page destructor. Since we are holding a hugepage reference, we can be sure that the old page won't get uncharged till the last put_page().
Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: David Rientjes <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | hugetlb/cgroup: add hugetlb cgroup control files | Aneesh Kumar K.V | 4 files | -0/+148
Add the control files for the hugetlb controller.
[[email protected]: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g] [[email protected]: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/] Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: David Rientjes <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | hugetlb/cgroup: add support for cgroup removal | Aneesh Kumar K.V | 1 file | -2/+68
Add support for cgroup removal. If we don't have a parent cgroup, the charges are moved to the root cgroup.
Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: David Rientjes <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | hugetlb/cgroup: add charge/uncharge routines for hugetlb cgroup | Aneesh Kumar K.V | 3 files | -1/+133
Add the charge and uncharge routines for the hugetlb cgroup. We do cgroup charging in the page allocator and uncharging in the compound page destructor. Assigning a page's hugetlb cgroup is protected by hugetlb_lock.
[[email protected]: add huge_page_order check to avoid incorrect uncharge] Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: David Rientjes <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | hugetlb/cgroup: add the cgroup pointer to page lru | Aneesh Kumar K.V | 2 files | -0/+41
Add the hugetlb cgroup pointer to the 3rd page's lru.next. This limits hugetlb cgroup usage to hugepages with 3 or more normal pages. I guess that is an acceptable limitation.
Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: David Rientjes <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
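A sketch of the accessor this implies (helper name and the minimum-order guard are assumptions):

    static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
    {
            /* the pointer lives in the 3rd page's lru.next; order < 2
             * means fewer than 3 pages, so the slot doesn't exist */
            if (compound_order(page) < 2)
                    return NULL;
            return (struct hugetlb_cgroup *)page[2].lru.next;
    }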
2012-07-31 | mm/hugetlb: add new HugeTLB cgroup | Aneesh Kumar K.V | 5 files | -0/+179
Implement a new controller that allows us to control HugeTLB allocations. The extension allows limiting the HugeTLB usage per control group and enforces the controller limit during page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit at page fault time implies that the application will get a SIGBUS signal if it tries to access HugeTLB pages beyond its limit. This requires the application to know beforehand how many HugeTLB pages it will require. The charge/uncharge calls will be added to HugeTLB code in a later patch. Support for cgroup removal will be added in later patches.
[[email protected]: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g] [[email protected]: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/g] Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | hugetlb: make some static variables global | Aneesh Kumar K.V | 2 files | -5/+7
We will use them later in hugetlb_cgroup.c Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: David Rientjes <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | hugetlb: add a list for tracking in-use HugeTLB pages | Aneesh Kumar K.V | 2 files | -5/+8
hugepage_activelist will be used to track currently used HugeTLB pages. We need to find the in-use HugeTLB pages to support HugeTLB cgroup removal. On cgroup removal we update the page's HugeTLB cgroup to point to parent cgroup. Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | hugetlb: simplify migrate_huge_page() | Aneesh Kumar K.V | 3 files | -57/+27
Since we migrate only one hugepage, don't use a linked list for passing the page around. Directly pass the page that needs to be migrated as an argument. This also removes the usage of page->lru in the migrate path.
Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages | Aneesh Kumar K.V | 4 files | -33/+59
Use an mmu_gather instead of a temporary linked list for accumulating pages when we unmap a hugepage range.
Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | hugetlb: add an inline helper for finding hstate index | Aneesh Kumar K.V | 2 files | -9/+17
Add an inline helper and use it in the code. Signed-off-by: Aneesh Kumar K.V <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Hillf Danton <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
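Presumably the helper is of this shape (a sketch: an hstate's index is its offset in the global hstates[] array):

    static inline int hstate_index(struct hstate *h)
    {
            return h - hstates;
    }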
2012-07-31 | hugetlb: don't use ERR_PTR with VM_FAULT* values | Aneesh Kumar K.V | 1 file | -5/+13
The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure VM_FAULT_* values will not exceed the MAX_ERRNO value. Decouple the VM_FAULT_* values from MAX_ERRNO.
Signed-off-by: Aneesh Kumar K.V <[email protected]> Acked-by: Hillf Danton <[email protected]> Acked-by: KOSAKI Motohiro <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
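A sketch of the decoupling (callers translate plain errnos into VM_FAULT_* codes; the exact errno choices are assumptions):

    page = alloc_huge_page(vma, address, 0);
    if (IS_ERR(page)) {
            /* alloc_huge_page() now returns ordinary errnos via
             * ERR_PTR(); map them to fault codes here instead of
             * encoding VM_FAULT_* values, which are not bounded by
             * MAX_ERRNO */
            ret = PTR_ERR(page) == -ENOMEM ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
            goto out;
    }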
2012-07-31 | hugetlb: rename max_hstate to hugetlb_max_hstate | Aneesh Kumar K.V | 1 file | -7/+7
This patchset implements a cgroup resource controller for HugeTLB pages. The controller allows limiting the HugeTLB usage per control group and enforces the controller limit during page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit at page fault time implies that the application will get a SIGBUS signal if it tries to access HugeTLB pages beyond its limit. This requires the application to know beforehand how many HugeTLB pages it will require. The goal is to control how many HugeTLB pages a group of tasks can allocate. It can be looked at as an extension of the existing quota interface, which limits the number of HugeTLB pages per hugetlbfs superblock. HPC job schedulers require jobs to specify their resource requirements in the job file. Once those requirements can be met, job schedulers like SLURM will schedule the job. We need to make sure that the jobs won't consume more resources than requested. If they do, we should either error out or kill the application. This patch: Rename max_hstate to hugetlb_max_hstate. We will be using this from other subsystems, like the hugetlb controller, in later patches.
Signed-off-by: Aneesh Kumar K.V <[email protected]> Acked-by: David Rientjes <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Hillf Danton <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | mm: prepare for removal of obsolete /proc/sys/vm/nr_pdflush_threads | Wanpeng Li | 9 files | -27/+40
Since per-BDI flusher threads were introduced in 2.6, the pdflush mechanism is no longer used. But the old interface exported through /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless. For backward compatibility, print a warning and return 2 to notify users that the interface is scheduled for removal.
Signed-off-by: Wanpeng Li <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
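A sketch of such a compatibility handler (the name and the user-copy plumbing are illustrative):

    int pdflush_proc_obsolete(struct ctl_table *table, int write,
                              void __user *buffer, size_t *lenp, loff_t *ppos)
    {
            char kbuf[] = "0\n";

            if (*ppos || *lenp < sizeof(kbuf)) {
                    *lenp = 0;
                    return 0;
            }
            if (copy_to_user(buffer, kbuf, sizeof(kbuf)))
                    return -EFAULT;

            /* warn once that the tunable no longer does anything */
            printk_once(KERN_WARNING "%s exported in /proc is scheduled for removal\n",
                        table->procname);
            *lenp = 2;
            *ppos += *lenp;
            return 2;
    }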
2012-07-31 | mm/buddy: cleanup on should_fail_alloc_page | Gavin Shan | 1 file | -7/+7
Currently, should_fail() has "bool" for its return value, so it's reasonable to change the return value of should_fail_alloc_page() to "bool" as well.
Signed-off-by: Gavin Shan <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | mm: account the total_vm in the vm_stat_account() | Huang Shijie | 5 files | -9/+4
vm_stat_account() currently accounts shared_vm, stack_vm and reserved_vm. But we can also account for total_vm in vm_stat_account(), which makes the code tidier. Even for mprotect_fixup(), we get the right result in the end.
Signed-off-by: Huang Shijie <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31 | documentation: update how page-cluster affects swap I/O | Christian Ehrhardt | 1 file | -2/+10
Fix the documentation of /proc/sys/vm/page-cluster to match the behavior of the code, and add some comments about what the tunable changes in that behavior.
Signed-off-by: Christian Ehrhardt <[email protected]> Acked-by: Jens Axboe <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
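For reference, the tunable is logarithmic: a value of N means up to 2^N consecutive pages are read from swap in a single attempt. The default of 3 reads 2^3 = 8 pages (32KB with 4KB pages), while 0 disables clustering and reads a single page.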