blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message	Author	Files	Lines
2014-06-04	mm: page_alloc: convert hot/cold parameter and immediate callers to bool	Mel Gorman	1	-10/+10
	cold is a bool, make it one. Make the likely case the "if" part of the block instead of the else as according to the optimisation manual this is preferred. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Jan Kara <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04	mm: page_alloc: use unsigned int for order in more places	Mel Gorman	1	-20/+23
	X86 prefers the use of unsigned types for iterators and there is a tendency to mix signed and unsigned types for page order. This converts a number of sites in mm/page_alloc.c to use unsigned int for order where possible. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Jan Kara <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04	mm: page_alloc: lookup pageblock migratetype with IRQs enabled during free	Mel Gorman	1	-1/+1
	get_pageblock_migratetype() is called during free with IRQs disabled. This is unnecessary and disables IRQs for longer than necessary. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Jan Kara <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04	mm: page_alloc: reduce number of times page_to_pfn is called	Mel Gorman	1	-15/+19
	In the free path we calculate page_to_pfn multiple times. Reduce that. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Jan Kara <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04	mm: page_alloc: use word-based accesses for get/set pageblock bitmaps	Mel Gorman	1	-20/+32
	The test_bit operations in get/set pageblock flags are expensive. This patch reads the bitmap on a word basis and uses shifts and masks to isolate the bits of interest. Similarly, masks are used to set a local copy of the bitmap, and cmpxchg is then used to update the bitmap if there have been no other changes made in parallel. In a test running dd onto tmpfs the overhead of the pageblock-related functions went from 1.27% in profiles to 0.5%. In addition to the performance benefits, this patch closes races that are possible between: a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype() reads part of the bits before and the other part after set_pageblock_migratetype() has updated them. b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic read-modify-update set bit operation in set_pageblock_skip() will cause lost updates to some bits changed in set_pageblock_migratetype(). Joonsoo Kim first reported case a) via code inspection. Vlastimil Babka's testing with a debug patch showed that either a) or b) occurs roughly once per mmtests' stress-highalloc benchmark (although not necessarily in the same pageblock). Furthermore, during development of unrelated compaction patches, it was observed that with frequent calls to {start,undo}_isolate_page_range() the race occurs several thousand times and has resulted in NULL pointer dereferences in move_freepages() and free_one_page() in places where free_list[migratetype] is manipulated by e.g. list_move(). Further debugging confirmed that migratetype had an invalid value of 6, causing out of bounds access to the free_list array. That confirmed that the race exists, although it may be extremely rare, and is currently only fatal where page isolation is performed due to memory hot remove. 
Races on pageblocks being updated by set_pageblock_migratetype(), where both old and new migratetype are lower than MIGRATE_RESERVE, currently cannot result in an invalid value being observed, although theoretically they may still lead to unexpected creation or destruction of MIGRATE_RESERVE pageblocks. Furthermore, things could get suddenly worse when memory isolation is used more, or when new migratetypes are added. After this patch, the race has no longer been observed in testing. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reported-by: Joonsoo Kim <[email protected]> Reported-and-tested-by: Vlastimil Babka <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jan Kara <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
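The word-based access described in the commit above can be illustrated with a small userspace sketch. This is not the kernel code: it uses C11 atomics in place of the kernel's cmpxchg, and the names get_flags_word() and set_flags_word() are hypothetical. It also assumes the flag field never straddles a word boundary.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Read a flag field with one word-sized load, then shift and mask. */
static unsigned long get_flags_word(_Atomic unsigned long *bitmap,
                                    size_t bitidx, unsigned long mask)
{
    size_t word = bitidx / BITS_PER_WORD;
    unsigned int shift = (unsigned int)(bitidx % BITS_PER_WORD);

    /* Assumption of this sketch: the field fits inside one word. */
    assert(((mask << shift) >> shift) == mask);

    return (atomic_load(&bitmap[word]) >> shift) & mask;
}

/*
 * Update a flag field on a local copy of the word and publish it with
 * compare-and-swap. If another thread changed the word in the meantime,
 * the exchange fails, 'old' is refreshed and the update is recomputed,
 * so no concurrent change to neighbouring bits is lost.
 */
static void set_flags_word(_Atomic unsigned long *bitmap, size_t bitidx,
                           unsigned long flags, unsigned long mask)
{
    size_t word = bitidx / BITS_PER_WORD;
    unsigned int shift = (unsigned int)(bitidx % BITS_PER_WORD);
    unsigned long old = atomic_load(&bitmap[word]);
    unsigned long updated;

    assert(((mask << shift) >> shift) == mask);

    do {
        updated = (old & ~(mask << shift)) | ((flags & mask) << shift);
    } while (!atomic_compare_exchange_weak(&bitmap[word], &old, updated));
}
```

Because the reader takes a single snapshot of the word, it cannot observe some bits from before an update and others from after it, which is the shape of race a) in the commit message.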
2014-06-04	mm: page_alloc: take the ALLOC_NO_WATERMARK check out of the fast path	Mel Gorman	1	-3/+5
	ALLOC_NO_WATERMARK is set in a few cases. Always by kswapd, always for __GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc. Each of these cases are relatively rare events but the ALLOC_NO_WATERMARK check is an unlikely branch in the fast path. This patch moves the check out of the fast path and after it has been determined that the watermarks have not been met. This helps the common fast path at the cost of making the slow path slower and hitting kswapd with a performance cost. It's a reasonable tradeoff. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Jan Kara <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04	mm: page_alloc: only check the alloc flags and gfp_mask for dirty once	Mel Gorman	1	-2/+3
	Currently it's calculated once per zone in the zonelist. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Jan Kara <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04	mm: page_alloc: only check the zone id check if pages are buddies	Mel Gorman	1	-3/+13
	A node/zone index is used to check if pages are compatible for merging but this happens unconditionally even if the buddy page is not free. Defer the calculation as long as possible. Ideally we would check the zone boundary but nodes can overlap. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Jan Kara <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04	mm: page_alloc: use jump labels to avoid checking number_of_cpusets	Mel Gorman	1	-1/+2
	If cpusets are not in use then we still check a global variable on every page allocation. Use jump labels to avoid the overhead. Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Jan Kara <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04	mm: page_alloc: do not treat a zone that cannot be used for dirty pages as "full"	Mel Gorman	1	-1/+1
	If a zone cannot be used for a dirty page then it gets marked "full" which is cached in the zlc and later potentially skipped by allocation requests that have nothing to do with dirty zones. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04	mm: page_alloc: do not update zlc unless the zlc is active	Mel Gorman	1	-1/+1
	The zlc is used on NUMA machines to quickly skip over zones that are full. However it is always updated, even for the first zone scanned when the zlc might not even be active. As it's a write to a bitmap that potentially bounces cache line it's deceptively expensive and most machines will not care. Only update the zlc if it was active. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04	mm, thp: avoid excessive compaction latency during fault	David Rientjes	1	-1/+8
	Synchronous memory compaction can be very expensive: it can iterate an enormous amount of memory without aborting, constantly rescheduling, waiting on page locks and lru_lock, etc, if a pageblock cannot be defragmented. Unfortunately, it's too expensive for transparent hugepage page faults and it's much better to simply fallback to pages. On 128GB machines, we find that synchronous memory compaction can take O(seconds) for a single thp fault. Now that async compaction remembers where it left off without strictly relying on sync compaction, this makes thp allocations best-effort without causing egregious latency during fault. We still need to retry async compaction after reclaim, but this won't stall for seconds. Signed-off-by: David Rientjes <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04	mm, compaction: embed migration mode in compact_control	David Rientjes	1	-22/+17
	We're going to want to manipulate the migration mode for compaction in the page allocator, and currently compact_control's sync field is only a bool. Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction depending on the value of this bool. Convert the bool to enum migrate_mode and pass the migration mode in directly. Later, we'll want to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault patch to avoid unnecessary latency. This also alters compaction triggered from sysfs, either for the entire system or for a node, to force MIGRATE_SYNC. [[email protected]: fix build] [[email protected]: use MIGRATE_SYNC in alloc_contig_range()] Signed-off-by: David Rientjes <[email protected]> Suggested-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04	mm, migration: add destination page freeing callback	David Rientjes	1	-1/+1
	Memory migration uses a callback defined by the caller to determine how to allocate destination pages. When migration fails for a source page, however, it frees the destination page back to the system. This patch adds a memory migration callback defined by the caller to determine how to free destination pages. If a caller, such as memory compaction, builds its own freelist for migration targets, this can reuse already freed memory instead of scanning additional memory. If the caller provides a function to handle freeing of destination pages, it is called when page migration fails. If the caller passes NULL then freeing back to the system will be handled as usual. This patch introduces no functional change. Signed-off-by: David Rientjes <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Greg Thelen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
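The callback pair described in the commit above can be sketched in userspace as follows. All names here (alloc_dst_fn, free_dst_fn, struct freelist, migrate_one) are hypothetical and chosen for illustration; malloc()/free() stand in for "the system", and the copy_ok flag stands in for whether the migration itself succeeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct page { int id; };

/* Caller-defined hooks: how to obtain and how to release a destination. */
typedef struct page *(*alloc_dst_fn)(void *priv);
typedef void (*free_dst_fn)(struct page *dst, void *priv);

/* A caller-owned freelist, loosely in the spirit of compaction's list. */
struct freelist {
    struct page *slots[8];
    size_t count;
};

static struct page *freelist_alloc(void *priv)
{
    struct freelist *fl = priv;

    if (fl->count > 0)
        return fl->slots[--fl->count];
    return malloc(sizeof(struct page));
}

static void freelist_free(struct page *dst, void *priv)
{
    struct freelist *fl = priv;

    if (fl->count < 8)
        fl->slots[fl->count++] = dst; /* keep it for the next attempt */
    else
        free(dst);
}

/*
 * Try to migrate one page. On failure the destination goes back through
 * put_dst when the caller supplied one, otherwise back to the system.
 */
static bool migrate_one(bool copy_ok, alloc_dst_fn get_dst,
                        free_dst_fn put_dst, void *priv, struct page **out)
{
    struct page *dst;

    assert(get_dst != NULL);
    dst = get_dst(priv);
    if (!dst)
        return false;
    if (copy_ok) {
        *out = dst;
        return true;
    }
    if (put_dst)
        put_dst(dst, priv);
    else
        free(dst);
    return false;
}
```

Passing NULL for put_dst keeps the old behaviour of freeing straight back to the system, which matches the commit's statement that callers passing NULL see no change.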
2014-06-04	mm: debug: make bad_range() output more usable and readable	Dave Hansen	1	-2/+3
	Nobody outputs memory addresses in decimal. PFNs are essentially addresses, and they're gibberish in decimal. Output them in hex. Also, add the nid and zone name to give a little more context to the message. Signed-off-by: Dave Hansen <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04	mm/page_alloc: prevent MIGRATE_RESERVE pages from being misplaced	Vlastimil Babka	1	-10/+13
	For the MIGRATE_RESERVE pages, it is useful when they do not get misplaced on free_list of other migratetype, otherwise they might get allocated prematurely and e.g. fragment the MIGRATE_RESERVE pageblocks. While this cannot be avoided completely when allocating new MIGRATE_RESERVE pageblocks in min_free_kbytes sysctl handler, we should prevent the misplacement where possible. Currently, it is possible for the misplacement to happen when a MIGRATE_RESERVE page is allocated on pcplist through rmqueue_bulk() as a fallback for other desired migratetype, and then later freed back through free_pcppages_bulk() without being actually used. This happens because free_pcppages_bulk() uses get_freepage_migratetype() to choose the free_list, and rmqueue_bulk() calls set_freepage_migratetype() with the desired migratetype and not the page's original MIGRATE_RESERVE migratetype. This patch fixes the problem by moving the call to set_freepage_migratetype() from rmqueue_bulk() down to __rmqueue_smallest() and __rmqueue_fallback() where the actual page's migratetype (e.g. the free_list from which the page is taken) is used. Note that this migratetype might be different from the pageblock's migratetype due to freepage stealing decisions. This is OK, as page stealing never uses MIGRATE_RESERVE as a fallback, and also takes care to leave all MIGRATE_CMA pages on the correct freelist. Therefore, as an additional benefit, the call to get_pageblock_migratetype() from rmqueue_bulk() when CMA is enabled, can be removed completely. This relies on the fact that MIGRATE_CMA pageblocks are created only during system init, and the above. The related is_migrate_isolate() check is also unnecessary, as memory isolation has other ways to move pages between freelists, and drain pcp lists containing pages that should be isolated. The buffered_rmqueue() can also benefit from calling get_freepage_migratetype() instead of get_pageblock_migratetype(). 
Signed-off-by: Vlastimil Babka <[email protected]> Reported-by: Yong-Taek Lee <[email protected]> Reported-by: Bartlomiej Zolnierkiewicz <[email protected]> Suggested-by: Joonsoo Kim <[email protected]> Acked-by: Joonsoo Kim <[email protected]> Suggested-by: Mel Gorman <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: "Wang, Yalin" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04	mm: page_alloc: do not cache reclaim distances	Mel Gorman	1	-15/+2
	pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by zone_reclaim due to its distance. As it is expected that zone_reclaim_mode will be rarely enabled it is unreasonable for all machines to take a penalty. Fortunately, the zone_reclaim_mode() path is already slow and it is the path that takes the hit. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Zhang Yanfei <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04	mm: disable zone_reclaim_mode by default	Mel Gorman	1	-2/+0
	When it was introduced, zone_reclaim_mode made sense as NUMA distances punished and workloads were generally partitioned to fit into a NUMA node. NUMA machines are now common but few of the workloads are NUMA-aware and it's routine to see major performance degradation due to zone_reclaim_mode being enabled but relatively few can identify the problem. Those that require zone_reclaim_mode are likely to be able to detect when it needs to be enabled and tune appropriately so lets have a sensible default for the bulk of users. This patch (of 2): zone_reclaim_mode causes processes to prefer reclaiming memory from local node instead of spilling over to other nodes. This made sense initially when NUMA machines were almost exclusively HPC and the workload was partitioned into nodes. The NUMA penalties were sufficiently high to justify reclaiming the memory. On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default. Users that are sophisticated enough to know they need zone_reclaim_mode will detect it. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Zhang Yanfei <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04	mm: get rid of __GFP_KMEMCG	Vladimir Davydov	1	-21/+35
	Currently to allocate a page that should be charged to kmemcg (e.g. threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page allocated is then to be freed by free_memcg_kmem_pages. Apart from looking asymmetrical, this also requires intrusion to the general allocation path. So let's introduce separate functions that will alloc/free pages charged to kmemcg. The new functions are called alloc_kmem_pages and free_kmem_pages. They should be used when the caller actually would like to use kmalloc, but has to fall back to the page allocator for the allocation is large. They only differ from alloc_pages and free_pages in that besides allocating or freeing pages they also charge them to the kmem resource counter of the current memory cgroup. [[email protected]: export kmalloc_order() to modules] Signed-off-by: Vladimir Davydov <[email protected]> Acked-by: Greg Thelen <[email protected]> Cc: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Glauber Costa <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-07	mm/page_alloc.c: change mm debug routines back to EXPORT_SYMBOL	John Hubbard	1	-1/+1
	A new dump_page() routine was recently added, and marked EXPORT_SYMBOL_GPL. dump_page() was also added to the VM_BUG_ON_PAGE() macro, and so the end result is that non-GPL code can no longer call get_page() and a few other routines. This only happens if the kernel was compiled with CONFIG_DEBUG_VM. Change dump_page() to be EXPORT_SYMBOL. Longer explanation: Prior to commit 309381feaee5 ("mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE"), it was possible to build MIT-licensed (non-GPL) drivers on Fedora. Fedora is semi-unique, in that it sets CONFIG_DEBUG_VM. Because Fedora sets CONFIG_DEBUG_VM, they end up pulling in dump_page(), via VM_BUG_ON_PAGE, via get_page(). As one of the authors of NVIDIA's new, open source, "UVM-Lite" kernel module, I originally chose to use the kernel's get_page() routine from within nvidia_uvm_page_cache.c, because get_page() has always seemed to be very clearly intended for use by non-GPL, driver code. So I'm hoping that making get_page() widely accessible again will not be too controversial. We did check with Fedora first, and they responded (https://bugzilla.redhat.com/show_bug.cgi?id=1074710#c3) that we should try to get upstream changed, before asking Fedora to change. Their reasoning seems beneficial to Linux: leaving CONFIG_DEBUG_VM set allows Fedora to help catch mm bugs. Signed-off-by: John Hubbard <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Josh Boyer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-07	memblock: use for_each_memblock()	Emil Medve	1	-5/+5
	This is a small cleanup. Signed-off-by: Emil Medve <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-07	mm: page_alloc: spill to remote nodes before waking kswapd	Johannes Weiner	1	-44/+45
	On NUMA systems, a node may start thrashing cache or even swap anonymous pages while there are still free pages on remote nodes. This is a result of commits 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") and fff4068cba48 ("mm: page_alloc: revert NUMA aspect of fair allocation policy"). Before those changes, the allocator would first try all allowed zones, including those on remote nodes, before waking any kswapds. But now, the allocator fastpath doubles as the fairness pass, which in turn can only consider the local node to prevent remote spilling based on exhausted fairness batches alone. Remote nodes are only considered in the slowpath, after the kswapds are woken up. But if remote nodes still have free memory, kswapd should not be woken to rebalance the local node or it may thrash cache or swap prematurely. Fix this by adding one more unfair pass over the zonelist that is allowed to spill to remote nodes after the local fairness pass fails but before entering the slowpath and waking the kswapds. This also gets rid of the GFP_THISNODE exemption from the fairness protocol because the unfair pass is no longer tied to kswapd, which GFP_THISNODE is not allowed to wake up. However, because remote spills can be more frequent now - we prefer them over local kswapd reclaim - the allocation batches on remote nodes could underflow more heavily. When resetting the batches, use atomic_long_read() directly instead of zone_page_state() to calculate the delta as the latter filters negative counter values. Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: <[email protected]> [3.12+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-07	mm: use 'const char *' instead of 'char *' for reason in dump_page()	Kirill A. Shutemov	1	-5/+7
	I tried to use 'dump_page(page, __func__)' for debugging, but it triggers warning: warning: passing argument 2 of `dump_page' discards `const' qualifier from pointer target type [enabled by default] Let's convert 'reason' to 'const char *' in dump_page() and friends: we shouldn't modify it anyway. Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-07	mm: exclude memoryless nodes from zone_reclaim	Michal Hocko	1	-2/+3
	We had a report about strange OOM killer strikes on a PPC machine although there was a lot of free swap and tons of anonymous memory which could be swapped out. In the end it turned out that the OOM was a side effect of zone reclaim which wasn't unmapping and swapping out and so the system was pushed to the OOM. Although this sounds like a bug somewhere in the kswapd vs. zone reclaim vs. direct reclaim interaction, numactl on the said hardware suggests that the zone reclaim should not have been set in the first place: node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 node 0 size: 0 MB node 0 free: 0 MB node 2 cpus: node 2 size: 7168 MB node 2 free: 6019 MB node distances: node 0 2 0: 10 40 2: 40 10 So all the CPUs are associated with Node0 which doesn't have any memory while Node2 contains all the available memory. Node distances cause an automatic zone_reclaim_mode enabling. Zone reclaim is intended to keep the allocations local but this doesn't make any sense on the memoryless nodes. So let's exclude such nodes for init_zone_allows_reclaim which evaluates zone reclaim behavior and suitable reclaim_nodes. Signed-off-by: Michal Hocko <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Nishanth Aravamudan <[email protected]> Tested-by: Nishanth Aravamudan <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03	mm: optimize put_mems_allowed() usage	Mel Gorman	1	-4/+4
	Since put_mems_allowed() is strictly optional, as it's a seqcount retry, we don't need to evaluate the function if the allocation was in fact successful, saving an smp_rmb, some loads and comparisons on some relatively fast paths. Since the naming, get/put_mems_allowed() does suggest a mandatory pairing, rename the interface, as suggested by Mel, to resemble the seqcount interface. This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(), where it is important to note that the return value of the latter call is inverted from its previous incarnation. Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
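The begin/retry pattern from the commit above, including the short-circuit that skips the retry check after a successful allocation, can be sketched in userspace roughly as follows. This is a simplified sequence counter built on C11 atomics, not the kernel's seqcount, and every name in it (struct seq_guard, read_begin, read_retry, alloc_with_retry, demo_alloc) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct seq_guard { atomic_uint seq; };

/* Writers make the counter odd while updating and even again when done. */
static void write_begin(struct seq_guard *g) { atomic_fetch_add(&g->seq, 1); }
static void write_end(struct seq_guard *g)   { atomic_fetch_add(&g->seq, 1); }

/* Wait for an even (stable) counter value and return it as a cookie. */
static unsigned int read_begin(struct seq_guard *g)
{
    unsigned int s;

    while ((s = atomic_load_explicit(&g->seq, memory_order_acquire)) & 1u)
        ; /* a writer is in progress */
    return s;
}

/* True when the protected data changed since read_begin(): retry. */
static bool read_retry(struct seq_guard *g, unsigned int cookie)
{
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&g->seq, memory_order_relaxed) != cookie;
}

typedef void *(*try_alloc_fn)(void *ctx);

/*
 * '&&' short-circuits, so read_retry() and its barrier only run when the
 * allocation attempt failed. A successful attempt skips the check.
 */
static void *alloc_with_retry(struct seq_guard *g, try_alloc_fn try_alloc,
                              void *ctx)
{
    unsigned int cookie;
    void *page;

    assert(try_alloc != NULL);
    do {
        cookie = read_begin(g);
        page = try_alloc(ctx);
    } while (!page && read_retry(g, cookie));
    return page;
}

/* Demo allocator: fails once while simulating a concurrent update. */
struct demo_ctx { struct seq_guard *guard; int calls; int slot; };

static void *demo_alloc(void *arg)
{
    struct demo_ctx *ctx = arg;

    if (ctx->calls++ == 0) {
        write_begin(ctx->guard);
        write_end(ctx->guard);
        return NULL;
    }
    return &ctx->slot;
}
```

Note that read_retry() here returns true when a retry is needed, which mirrors the inverted return value the commit message calls out for read_mems_allowed_retry().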
2014-03-04	mm: page_alloc: exempt GFP_THISNODE allocations from zone fairness	Johannes Weiner	1	-4/+22
	Jan Stancek reports manual page migration encountering allocation failures after some pages when there is still plenty of memory free, and bisected the problem down to commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy"). The problem is that GFP_THISNODE obeys the zone fairness allocation batches on one hand, but doesn't reset them and wake kswapd on the other hand. After a few of those allocations, the batches are exhausted and the allocations fail. Fixing this means either having GFP_THISNODE wake up kswapd, or GFP_THISNODE not participating in zone fairness at all. The latter seems safer as an acute bugfix, we can clean up later. Reported-by: Jan Stancek <[email protected]> Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: <[email protected]> [3.12+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-03-04	mm: close PageTail race	David Rientjes	1	-1/+3
	Commit bf6bddf1924e ("mm: introduce compaction and migration for ballooned pages") introduces page_count(page) into memory compaction which dereferences page->first_page if PageTail(page). This results in a very rare NULL pointer dereference on the aforementioned page_count(page). Indeed, anything that does compound_head(), including page_count() is susceptible to racing with prep_compound_page() and seeing a NULL or dangling page->first_page pointer. This patch uses Andrea's implementation of compound_trans_head() that deals with such a race and makes it the default compound_head() implementation. This includes a read memory barrier that ensures that if PageTail(head) is true that we return a head page that is neither NULL nor dangling. The patch then adds a store memory barrier to prep_compound_page() to ensure page->first_page is set. This is the safest way to ensure we see the head page that we are expecting, PageTail(page) is already in the unlikely() path and the memory barriers are unfortunately required. Hugetlbfs is the exception, we don't enforce a store memory barrier during init since no race is possible. Signed-off-by: David Rientjes <[email protected]> Cc: Holger Kiehl <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Rafael Aquini <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-23	mm: show message when updating min_free_kbytes in thp	Han Pingtian	1	-1/+1
	min_free_kbytes may be raised during THP's initialization. Sometimes, this will change the value which was set by the user. Showing this message will clarify this confusion. Only show this message when changing a value which was set by the user according to Michal Hocko's suggestion. Show the old value of min_free_kbytes according to Dave Hansen's suggestion. This will give user the chance to restore old value of min_free_kbytes. Signed-off-by: Han Pingtian <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-23	mm: prevent setting of a value less than 0 to min_free_kbytes	Han Pingtian	1	-1/+6
	If echo -1 > /proc/sys/vm/min_free_kbytes, the system will hang. Changing proc_dointvec() to proc_dointvec_minmax() in the min_free_kbytes_sysctl_handler() prevents this from happening. mhocko said: : You can still do echo $BIG_VALUE > /proc/sys/vm/min_free_kbytes and make : your machine unusable but I agree that proc_dointvec_minmax is more : suitable here as we already have: : : .proc_handler = min_free_kbytes_sysctl_handler, : .extra1 = &zero, : : It used to work properly but then 6fce56ec91b5 ("sysctl: Remove references : to ctl_name and strategy from the generic sysctl table") has removed : sysctl_intvec strategy and so extra1 is ignored. Signed-off-by: Han Pingtian <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
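The effect of honouring `.extra1` can be illustrated with a small userspace model. This is a sketch, not the sysctl code: `struct demo_ctl` and `demo_write_minmax()` are hypothetical, and the only behaviour assumed from proc_dointvec_minmax() is that an out-of-range write is rejected with -EINVAL instead of being stored.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a sysctl table entry with optional bounds. */
struct demo_ctl {
	int *data;       /* the tunable being written */
	const int *min;  /* like .extra1; NULL means no lower bound */
	const int *max;  /* like .extra2; NULL means no upper bound */
};

/* Store val only if it lies inside the configured range. */
static int demo_write_minmax(const struct demo_ctl *ctl, int val)
{
	assert(ctl != NULL && ctl->data != NULL);
	if (ctl->min && val < *ctl->min)
		return -EINVAL;
	if (ctl->max && val > *ctl->max)
		return -EINVAL;
	*ctl->data = val;
	return 0;
}
```

With a lower bound of zero and no upper bound, writing -1 fails and leaves the old value in place, while a very large value is still accepted, which matches the caveat in the quoted review comment.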
2014-01-23	mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE	Sasha Levin	1	-10/+11
	Most of the VM_BUG_ON assertions are performed on a page. Usually, when one of these assertions fails we'll get a BUG_ON with a call stack and the registers. I've recently noticed based on the requests to add a small piece of code that dumps the page to various VM_BUG_ON sites that the page dump is quite useful to people debugging issues in mm. This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what VM_BUG_ON() does, also dumps the page before executing the actual BUG_ON. [[email protected]: fix up includes] Signed-off-by: Sasha Levin <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-23	mm: print more details for bad_page()	Dave Hansen	1	-18/+54
	bad_page() is cool in that it prints out a bunch of data about the page. But, I can never remember which page flags are good and which are bad, or whether ->index or ->mapping is required to be NULL. This patch allows bad/dump_page() callers to specify a string about why they are dumping the page and adds explanation strings to a number of places. It also adds a 'bad_flags' argument to bad_page(), which it then dumps out separately from the flags which are actually set. This way, the messages will show specifically why the page was bad, specifically which flags it is complaining about, if it was a page flag combination which was the problem. [[email protected]: switch to pr_alert] Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Cc: Andi Kleen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21	mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure	David Rientjes	1	-1/+8
	__GFP_NOFAIL may return NULL when coupled with GFP_NOWAIT or GFP_ATOMIC. Luckily, nothing currently does such craziness. So instead of causing such allocations to loop (potentially forever), we maintain the current behavior and also warn about the new users of the deprecated flag. Suggested-by: Andrew Morton <[email protected]> Signed-off-by: David Rientjes <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21	mm: compaction: encapsulate defer reset logic	Vlastimil Babka	1	-4/+1
	Currently there are several functions to manipulate the deferred compaction state variables. The remaining case where the variables are touched directly is when a successful allocation occurs in direct compaction, or is expected to be successful in the future by kswapd. Here, the lowest order that is expected to fail is updated, and in the case of successful allocation, the deferred status and counter is reset completely. Create a new function compaction_defer_reset() to encapsulate this functionality and make it easier to understand the code. No functional change. Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
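A userspace sketch of what such a helper might look like follows. The struct `demo_zone` and its field names are assumptions made for illustration (the commit message does not name the state variables), and the function is a model of the described behaviour rather than the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of per-zone deferred-compaction state. */
struct demo_zone {
	unsigned int compact_considered;  /* attempts skipped since last failure */
	unsigned int compact_defer_shift; /* back-off exponent */
	int compact_order_failed;         /* lowest order expected to fail */
};

/* On success, clear the deferral state completely. In every case, raise
 * the lowest order expected to fail above the order that just worked
 * (or is expected to work). */
static void demo_compaction_defer_reset(struct demo_zone *zone, int order,
					bool alloc_success)
{
	assert(zone != NULL);
	if (alloc_success) {
		zone->compact_considered = 0;
		zone->compact_defer_shift = 0;
	}
	if (order >= zone->compact_order_failed)
		zone->compact_order_failed = order + 1;
}
```

Keeping both updates in one function means the direct-compaction and kswapd call sites differ only in the `alloc_success` argument.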
2014-01-21	mm/page_alloc.c: use memblock apis for early memory allocations	Santosh Shilimkar	1	-12/+15
	Switch to memblock interfaces for early memory allocator instead of bootmem allocator. No functional change in behavior from the bootmem users' point of view. Archs already converted to NO_BOOTMEM now directly use memblock interfaces instead of bootmem wrappers built on top of memblock. And for the archs which still use bootmem, these new APIs just fall back to the existing bootmem APIs. Signed-off-by: Grygorii Strashko <[email protected]> Signed-off-by: Santosh Shilimkar <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tejun Heo <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Russell King <[email protected]> Cc: Tony Lindgren <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21	x86, numa, acpi, memory-hotplug: make movable_node have higher priority	Tang Chen	1	-2/+26
	If users specify the original movablecore=nn@ss boot option, the kernel will arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot option is similar except it specifies ZONE_NORMAL ranges. Now, if users specify "movable_node" in kernel commandline, the kernel will arrange hotpluggable memory in SRAT as ZONE_MOVABLE. And if users do this, all the other movablecore=nn@ss and kernelcore=nn@ss options should be ignored. For those who don't want this, just specify nothing. The kernel will act as before. Signed-off-by: Tang Chen <[email protected]> Signed-off-by: Zhang Yanfei <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Chen Tang <[email protected]> Cc: Gong Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Len Brown <[email protected]> Cc: Liu Jiang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Taku Izumi <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Renninger <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasilis Liaskovitis <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21	mm, show_mem: remove SHOW_MEM_FILTER_PAGE_COUNT	Mel Gorman	1	-7/+0
	Commit 4b59e6c47309 ("mm, show_mem: suppress page counts in non-blockable contexts") introduced SHOW_MEM_FILTER_PAGE_COUNT to suppress PFN walks on large memory machines. Commit c78e93630d15 ("mm: do not walk all of system memory during show_mem") avoided a PFN walk in the generic show_mem helper which removes the requirement for SHOW_MEM_FILTER_PAGE_COUNT in that case. This patch removes PFN walkers from the arch-specific implementations that report on a per-node or per-zone granularity. ARM and unicore32 still do a PFN walk as they report memory usage on each bank which is a much finer granularity where the debugging information may still be of use. As the remaining arches doing PFN walks have relatively small amounts of memory, this patch simply removes SHOW_MEM_FILTER_PAGE_COUNT. [[email protected]: fix parisc] Signed-off-by: Mel Gorman <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Tony Luck <[email protected]> Cc: Russell King <[email protected]> Cc: James Bottomley <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21	mm: get rid of unnecessary pageblock scanning in setup_zone_migrate_reserve	Yasuaki Ishimatsu	1	-0/+13
	Yasuaki Ishimatsu reported memory hot-add spent more than 5 _hours_ on a 9TB memory machine since onlining memory sections is too slow. And we found out setup_zone_migrate_reserve spent >90% of the time. The problem is, setup_zone_migrate_reserve scans all pageblocks unconditionally, but it is only necessary if the number of reserved blocks was reduced (i.e. memory hot remove). Moreover, maximum MIGRATE_RESERVE per zone is currently 2. It means that the number of reserved pageblocks is almost always unchanged. This patch adds zone->nr_migrate_reserve_block to maintain the number of MIGRATE_RESERVE pageblocks and it reduces the overhead of setup_zone_migrate_reserve dramatically. The following table shows time of onlining a memory section.
	Amount of memory     | 128GB | 192GB | 256GB |
	---------------------------------------------
	linux-3.12           |  23.9 |  31.4 |  44.5 |
	This patch           |   8.3 |   8.3 |   8.6 |
	Mel's proposal patch |  10.9 |  19.2 |  31.3 |
	---------------------------------------------
	(millisecond)
	128GB : 4 nodes and each node has 32GB of memory
	192GB : 6 nodes and each node has 32GB of memory
	256GB : 8 nodes and each node has 32GB of memory
	(*1) Mel proposed his idea by the following threads. https://lkml.org/lkml/2013/10/30/272
	[[email protected]: tweak comment] Signed-off-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Yasuaki Ishimatsu <[email protected]> Reported-by: Yasuaki Ishimatsu <[email protected]> Tested-by: Yasuaki Ishimatsu <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
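The idea of caching the reserve count so the expensive scan can be skipped might be sketched as below. Everything here is a simplified assumption: `struct demo_rzone`, `DEMO_MAX_RESERVE` and the `scans` counter are hypothetical, the scan itself is reduced to a counter increment, and the sketch skips whenever the count is unchanged, which may be coarser or finer than the condition in the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The commit message states that at most 2 reserve blocks exist per zone. */
#define DEMO_MAX_RESERVE 2

/* Hypothetical zone model: the cached count plus a scan counter. */
struct demo_rzone {
	int nr_migrate_reserve_block; /* cached number of reserve pageblocks */
	unsigned long scans;          /* how often the full scan ran */
};

/* Returns true if the full pageblock scan had to run. */
static bool demo_setup_migrate_reserve(struct demo_rzone *zone, int wanted)
{
	int reserve;

	assert(zone != NULL);
	reserve = wanted < DEMO_MAX_RESERVE ? wanted : DEMO_MAX_RESERVE;

	/* Unchanged count: nothing to move, so skip the scan entirely. */
	if (reserve == zone->nr_migrate_reserve_block)
		return false;

	zone->nr_migrate_reserve_block = reserve;
	zone->scans++; /* stand-in for walking every pageblock in the zone */
	return true;
}
```

Because the requested count is clamped to the same small maximum on every call, repeated hot-add events hit the early return almost every time.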
2013-12-20	mm: page_alloc: revert NUMA aspect of fair allocation policy	Johannes Weiner	1	-10/+9
	Commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") meant to bring aging fairness among zones in the system, but it was overzealous and badly regressed basic workloads on NUMA systems. Due to the way kswapd and the page allocator interact, we still want to make sure that all zones in any given node are used equally for all allocations to maximize memory utilization and prevent thrashing on the highest zone in the node. While the same principle applies to NUMA nodes - memory utilization is obviously improved by spreading allocations throughout all nodes - remote references can be costly and so many workloads prefer locality over memory utilization. The original change assumed that zone_reclaim_mode would be a good enough predictor for that, but it turned out to be as indicative as a coin flip. Revert the NUMA aspect of the fairness until we can find a proper way to make it configurable and agree on a sane default. Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Cc: <[email protected]> # 3.12 Signed-off-by: Linus Torvalds <[email protected]>
2013-12-20	Revert "mm: page_alloc: exclude unreclaimable allocations from zone fairness policy"	Mel Gorman	1	-2/+1
	This reverts commit 73f038b863df. The NUMA behaviour of this patch is less than ideal. An alternative approach is to interleave allocations only within local zones which is implemented in the next patch. Cc: [email protected] Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-12-18	mm: page_alloc: exclude unreclaimable allocations from zone fairness policy	Johannes Weiner	1	-1/+2
	Dave Hansen noted a regression in a microbenchmark that loops around open() and close() on an 8-node NUMA machine and bisected it down to commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy"). That change forces the slab allocations of the file descriptor to spread out to all 8 nodes, causing remote references in the page allocator and slab. The round-robin policy is only there to provide fairness among memory allocations that are reclaimed involuntarily based on pressure in each zone. It does not make sense to apply it to unreclaimable kernel allocations that are freed manually, in this case instantly after the allocation, and incur the remote reference costs twice for no reason. Only round-robin allocations that are usually freed through page reclaim or slab shrinking. Bisected by Dave Hansen. Signed-off-by: Johannes Weiner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-11-13	mm/page_alloc.c: fix comment in zlc_setup()	Zhi Yong Wu	1	-1/+1
	Signed-off-by: Zhi Yong Wu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-11-13	mm: __rmqueue_fallback() should respect pageblock type	KOSAKI Motohiro	1	-10/+5
	When __rmqueue_fallback() doesn't find a free block with the required size it splits a larger page and puts the rest of the page onto the free list. But it has one serious mistake. When putting back, __rmqueue_fallback() always uses start_migratetype if type is not CMA. However, __rmqueue_fallback() is only called when all of the start_migratetype queue is empty. That said, __rmqueue_fallback() always puts back memory to the wrong queue except when try_to_steal_freepages() changed the pageblock type (i.e. requested size is smaller than half of page block). The end result is that the antifragmentation framework increases fragmentation instead of decreasing it. Mel's original anti fragmentation does the right thing. But commit 47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") broke it. This patch restores sane and old behavior. It also removes an incorrect comment which was introduced by commit fef903efcf0c ("mm/page_alloc.c: restructure free-page stealing code and fix a bug"). Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Nazarewicz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-11-13	mm: get rid of unnecessary overhead of trace_mm_page_alloc_extfrag()	KOSAKI Motohiro	1	-3/+2
	In general, every tracepoint should be zero overhead if it is disabled. However, trace_mm_page_alloc_extfrag() is an exception. It evaluates "new_type == start_migratetype" even if the tracepoint is disabled. However, the code can be moved into the tracepoint's TP_fast_assign(), and TP_fast_assign() exists for exactly such a purpose. This patch does it. Signed-off-by: KOSAKI Motohiro <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-11-13	mm: fix page_group_by_mobility_disabled breakage	KOSAKI Motohiro	1	-2/+2
	Currently, set_pageblock_migratetype() screws up MIGRATE_CMA and MIGRATE_ISOLATE if page_group_by_mobility_disabled is true. It rewrites the argument to MIGRATE_UNMOVABLE and these attributes are lost. The problem was introduced by commit 49255c619fbd ("page allocator: move check for disabled anti-fragmentation out of fastpath"). So a 4 year old issue may mean that nobody uses page_group_by_mobility_disabled. But anyway, this patch fixes the problem. Signed-off-by: KOSAKI Motohiro <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-11-13	mm/page_alloc.c: remove unused macro LONG_ALIGN	Zhang Yanfei	1	-2/+0
	Signed-off-by: Zhang Yanfei <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-11-13	mm: add a helper function to check may oom condition	Qiang Huang	1	-1/+1
	Use helper function to check if we need to deal with oom condition. Signed-off-by: Qiang Huang <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-11-13	mm: use populated_zone() instead of if(zone->present_pages)	Xishi Qiu	1	-2/+2
	Use "if (populated_zone(zone))" instead of "if (zone->present_pages)". Simplify the code, no functional change. Signed-off-by: Xishi Qiu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
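The substitution amounts to wrapping the field test in a named predicate. A minimal userspace sketch, with a hypothetical `struct demo_zone` that carries only the one field and a `demo_` prefix to mark the helpers as illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-field model of a zone. */
struct demo_zone {
	unsigned long present_pages;
};

/* Named predicate replacing an open-coded "if (zone->present_pages)". */
static inline bool demo_populated_zone(const struct demo_zone *zone)
{
	assert(zone != NULL);
	return zone->present_pages != 0;
}

/* Example caller: count the zones that actually contain memory. */
static size_t demo_count_populated(const struct demo_zone *zones, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (demo_populated_zone(&zones[i]))
			count++;
	return count;
}
```

The predicate reads as intent at the call site and leaves a single place to change if the definition of "populated" ever changes.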
2013-10-09	mm: numa: Change page last {nid,pid} into {cpu,pid}	Peter Zijlstra	1	-2/+2
	Change the per page last fault tracking to use cpu,pid instead of nid,pid. This will allow us to try and lookup the alternate task more easily. Note that even though it is the cpu that is stored in the page flags, the mpol_misplaced decision is still based on the node. Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Srikar Dronamraju <[email protected]> Link: http://lkml.kernel.org/r/[email protected] [ Fixed build failure on 32-bit systems. ] Signed-off-by: Ingo Molnar <[email protected]>
2013-10-09	sched/numa: Set preferred NUMA node based on number of private faults	Mel Gorman	1	-2/+2
	Ideally it would be possible to distinguish between NUMA hinting faults that are private to a task and those that are shared. If treated identically there is a risk that shared pages bounce between nodes depending on the order they are referenced by tasks. Ultimately what is desirable is that task private pages remain local to the task while shared pages are interleaved between sharing tasks running on different nodes to give good average performance. This is further complicated by THP as even applications that partition their data may not be partitioning on a huge page boundary. To start with, this patch assumes that multi-threaded or multi-process applications partition their data and that the private accesses are more important for cpu->memory locality in the general case. Also, no new infrastructure is required to treat private pages properly but interleaving for shared pages requires additional infrastructure. To detect private accesses the pid of the last accessing task is required but the storage requirements are high. This patch borrows heavily from Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking" to encode some bits from the last accessing task in the page flags as well as the node information. Collisions will occur but it is better than just depending on the node information. Node information is then used to determine if a page needs to migrate. The PID information is used to detect private/shared accesses. The preferred NUMA node is selected based on where the maximum number of approximately private faults were measured. Shared faults are not taken into consideration for a few reasons. First, if there are many tasks sharing the page then they'll all move towards the same node. The node will be compute overloaded and then scheduled away later only to bounce back again. Alternatively the shared tasks would just bounce around nodes because the fault information is effectively noise. Either way accounting for shared faults the same as private faults can result in lower performance overall. The second reason is based on a hypothetical workload that has a small number of very important, heavily accessed private pages but a large shared array. The shared array would dominate the number of faults and be selected as a preferred node even though it's the wrong decision. The third reason is that multiple threads in a process will race each other to fault the shared page making the fault information unreliable. Signed-off-by: Mel Gorman <[email protected]> [ Fix compilation error when !NUMA_BALANCING. ] Reviewed-by: Rik van Riel <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Srikar Dronamraju <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
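Encoding a few bits of the last accessing task next to a location field can be sketched with plain bit packing. The widths and names below (`DEMO_PID_BITS`, `DEMO_LOC_BITS`, the `demo_` helpers) are assumptions chosen for illustration, not the kernel's layout; the point is that only the low PID bits are kept, so two tasks whose PIDs share those bits are indistinguishable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical field widths; the real layout depends on kernel config. */
#define DEMO_PID_BITS 8
#define DEMO_PID_MASK ((1u << DEMO_PID_BITS) - 1)
#define DEMO_LOC_BITS 8
#define DEMO_LOC_MASK ((1u << DEMO_LOC_BITS) - 1)

/* Pack a location (node or cpu) and the low bits of a pid into one word. */
static unsigned int demo_pack_last(unsigned int loc, unsigned int pid)
{
	assert(loc <= DEMO_LOC_MASK);
	return (loc << DEMO_PID_BITS) | (pid & DEMO_PID_MASK);
}

/* Recover the location field. */
static unsigned int demo_last_loc(unsigned int packed)
{
	return (packed >> DEMO_PID_BITS) & DEMO_LOC_MASK;
}

/* Treat a fault as "private" when the faulting pid matches the stored
 * low bits. Collisions are possible by construction. */
static bool demo_fault_is_private(unsigned int packed, unsigned int pid)
{
	return (packed & DEMO_PID_MASK) == (pid & DEMO_PID_MASK);
}
```

With 8 PID bits, PIDs 0x1234 and 0x5634 collide, so the private/shared classification is approximate, as the commit message acknowledges.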
2013-09-30	revert "mm/memory-hotplug: fix lowmem count overflow when offline pages"	Joonyoung Shim	1	-4/+0
	This reverts commit cea27eb2a202 ("mm/memory-hotplug: fix lowmem count overflow when offline pages"). The bug fixed by commit cea27eb2a202 was fixed in another way by commit 3dcc0571cd64 ("mm: correctly update zone->managed_pages"). That commit enhances memory_hotplug.c to adjust totalhigh_pages when hot-removing memory, for details please refer to: http://marc.info/?l=linux-mm&m=136957578620221&w=2 As a result, commit cea27eb2a202 currently causes totalhigh_pages to be decreased twice, thus the revert. Signed-off-by: Joonyoung Shim <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Cc: Jiang Liu <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Bartlomiej Zolnierkiewicz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>