blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2011-05-25	mm: convert mm->cpu_vm_cpumask into cpumask_var_t	KOSAKI Motohiro	7	-9/+44
	cpumask_t is very big struct and cpu_vm_mask is placed wrong position. It might lead to reduce cache hit ratio. This patch has two change. 1) Move the place of cpumask into last of mm_struct. Because usually cpumask is accessed only front bits when the system has cpu-hotplug capability 2) Convert cpu_vm_mask into cpumask_var_t. It may help to reduce memory footprint if cpumask_size() will use nr_cpumask_bits properly in future. In addition, this patch change the name of cpu_vm_mask with cpu_vm_mask_var. It may help to detect out of tree cpu_vm_mask users. This patch has no functional change. [[email protected]: build fix] [[email protected]: coding-style fixes] Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: David Howells <[email protected]> Cc: Koichi Yasutake <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Chris Metcalf <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: thp: optimize memcg charge in khugepaged	Andrea Arcangeli	1	-10/+11
	We don't need to hold the mmmap_sem through mem_cgroup_newpage_charge(), the mmap_sem is only hold for keeping the vma stable and we don't need the vma stable anymore after we return from alloc_hugepage_vma(). Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: uninline large generic tlb.h functions	Peter Zijlstra	2	-124/+135
	Some of these functions have grown beyond inline sanity, move them out-of-line. Signed-off-by: Peter Zijlstra <[email protected]> Requested-by: Andrew Morton <[email protected]> Requested-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: optimize page_lock_anon_vma() fast-path	Peter Zijlstra	1	-4/+82
	Optimize the page_lock_anon_vma() fast path to be one atomic op, instead of two. Signed-off-by: Peter Zijlstra <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Tony Luck <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: convert anon_vma->lock to a mutex	Peter Zijlstra	6	-25/+21
	Straightforward conversion of anon_vma->lock to a mutex. Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Hugh Dickins <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: use refcounts for page_lock_anon_vma()	Peter Zijlstra	2	-28/+31
	Convert page_lock_anon_vma() over to use refcounts. This is done to prepare for the conversion of anon_vma from spinlock to mutex. Sadly this inceases the cost of page_lock_anon_vma() from one to two atomics, a follow up patch addresses this, lets keep that simple for now. Signed-off-by: Peter Zijlstra <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Tony Luck <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: improve page_lock_anon_vma() comment	Peter Zijlstra	1	-2/+16
	A slightly more verbose comment to go along with the trickery in page_lock_anon_vma(). Signed-off-by: Peter Zijlstra <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Tony Luck <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: revert page_lock_anon_vma() lock annotation	Peter Zijlstra	2	-17/+2
	Its beyond ugly and gets in the way. Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: Convert i_mmap_lock to a mutex	Peter Zijlstra	17	-58/+58
	Straightforward conversion of i_mmap_lock to a mutex. Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: Remove i_mmap_lock lockbreak	Peter Zijlstra	8	-188/+28
	Hugh says: "The only significant loser, I think, would be page reclaim (when concurrent with truncation): could spin for a long time waiting for the i_mmap_mutex it expects would soon be dropped? " Counter points: - cpu contention makes the spin stop (need_resched()) - zap pages should be freeing pages at a higher rate than reclaim ever can I think the simplification of the truncate code is definitely worth it. Effectively reverts: 2aa15890f3c ("mm: prevent concurrent unmap_mapping_range() on the same inode") and takes out the code that caused its problem. Signed-off-by: Peter Zijlstra <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Tony Luck <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	lockdep, mutex: provide mutex_lock_nest_lock	Peter Zijlstra	3	-8/+29
	In order to convert i_mmap_lock to a mutex we need a mutex equivalent to spin_lock_nest_lock(), thus provide the mutex_lock_nest_lock() annotation. As with spin_lock_nest_lock(), mutex_lock_nest_lock() allows annotation of the locking pattern where an outer lock serializes the acquisition order of nested locks. That is, if every time you lock multiple locks A, say A1 and A2 you first acquire N, the order of acquiring A1 and A2 is irrelevant. Signed-off-by: Peter Zijlstra <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: extended batches for generic mmu_gather	Peter Zijlstra	2	-47/+84
	Instead of using a single batch (the small on-stack, or an allocated page), try and extend the batch every time it runs out and only flush once either the extend fails or we're done. Signed-off-by: Peter Zijlstra <[email protected]> Requested-by: Nick Piggin <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Tony Luck <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm, powerpc: move the RCU page-table freeing into generic code	Peter Zijlstra	10	-125/+150
	In case other architectures require RCU freed page-tables to implement gup_fast() and software filled hashes and similar things, provide the means to do so by moving the logic into generic code. Signed-off-by: Peter Zijlstra <[email protected]> Requested-by: David Miller <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: now that all old mmu_gather code is gone, remove the storage	Peter Zijlstra	21	-41/+0
	Fold all the mmu_gather rework patches into one for submission Signed-off-by: Peter Zijlstra <[email protected]> Reported-by: Hugh Dickins <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	um: mmu_gather rework	Peter Zijlstra	1	-18/+11
	Fix up the um mmu_gather code to conform to the new API. Signed-off-by: Peter Zijlstra <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	ia64: mmu_gather rework	Peter Zijlstra	1	-20/+46
	Fix up the ia64 mmu_gather code to conform to the new API. Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Tony Luck <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	sh: mmu_gather rework	Peter Zijlstra	1	-11/+17
	Fix up the sh mmu_gather code to conform to the new API. Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Paul Mundt <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	arm: mmu_gather rework	Peter Zijlstra	1	-16/+37
	Fix up the arm mmu_gather code to conform to the new API. Signed-off-by: Peter Zijlstra <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	s390: mmu_gather rework	Peter Zijlstra	1	-25/+37
	Adapt the stand-alone s390 mmu_gather implementation to the new preemptible mmu_gather interface. Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Martin Schwidefsky <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	sparc: mmu_gather rework	Peter Zijlstra	6	-116/+63
	Rework the sparc mmu_gather usage to conform to the new world order :-) Sparc mmu_gather does two things: - tracks vaddrs to unhash - tracks pages to free Split these two things like powerpc has done and keep the vaddrs in per-cpu data structures and flush them on context switch. The remaining bits can then use the generic mmu_gather. Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: David Miller <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	powerpc: mmu_gather rework	Peter Zijlstra	8	-17/+46
	Fix up powerpc to the new mmu_gather stuff. PPC has an extra batching queue to RCU free the actual pagetable allocations, use the ARCH extentions for that for now. For the ppc64_tlb_batch, which tracks the vaddrs to unhash from the hardware hash-table, keep using per-cpu arrays but flush on context switch and use a TLF bit to track the lazy_mmu state. Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: mmu_gather rework	Peter Zijlstra	5	-65/+107
	Rework the existing mmu_gather infrastructure. The direct purpose of these patches was to allow preemptible mmu_gather, but even without that I think these patches provide an improvement to the status quo. The first 9 patches rework the mmu_gather infrastructure. For review purpose I've split them into generic and per-arch patches with the last of those a generic cleanup. The next patch provides generic RCU page-table freeing, and the followup is a patch converting s390 to use this. I've also got 4 patches from DaveM lined up (not included in this series) that uses this to implement gup_fast() for sparc64. Then there is one patch that extends the generic mmu_gather batching. After that follow the mm preemptibility patches, these make part of the mm a lot more preemptible. It converts i_mmap_lock and anon_vma->lock to mutexes which together with the mmu_gather rework makes mmu_gather preemptible as well. Making i_mmap_lock a mutex also enables a clean-up of the truncate code. This also allows for preemptible mmu_notifiers, something that XPMEM I think wants. Furthermore, it removes the new and universially detested unmap_mutex. This patch: Remove the first obstacle towards a fully preemptible mmu_gather. The current scheme assumes mmu_gather is always done with preemption disabled and uses per-cpu storage for the page batches. Change this to try and allocate a page for batching and in case of failure, use a small on-stack array to make some progress. Preemptible mmu_gather is desired in general and usable once i_mmap_lock becomes a mutex. Doing it before the mutex conversion saves us from having to rework the code by moving the mmu_gather bits inside the pte_lock. Also avoid flushing the tlb batches from under the pte lock, this is useful even without the i_mmap_lock conversion as it significantly reduces pte lock hold times. [[email protected]: fix comment tpyo] Signed-off-by: Peter Zijlstra <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Tony Luck <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Hugh Dickins <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: make expand_downwards() symmetrical with expand_upwards()	Michal Hocko	4	-11/+8
	Currently we have expand_upwards exported while expand_downwards is accessible only via expand_stack or expand_stack_downwards. check_stack_guard_page is a nice example of the asymmetry. It uses expand_stack for VM_GROWSDOWN while expand_upwards is called for VM_GROWSUP case. Let's clean this up by exporting both functions and make those names consistent. Let's use expand_{upwards,downwards} because expanding doesn't always involve stack manipulation (an example is ia64_do_page_fault which uses expand_upwards for registers backing store expansion). expand_downwards has to be defined for both CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards version in the early process initialization phase for growsup configuration. Signed-off-by: Michal Hocko <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: James Bottomley <[email protected]> Cc: "Luck, Tony" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm/vmalloc: remove guard page from between vmap blocks	Johannes Weiner	1	-3/+3
	The vmap allocator is used to, among other things, allocate per-cpu vmap blocks, where each vmap block is naturally aligned to its own size. Obviously, leaving a guard page after each vmap area forbids packing vmap blocks efficiently and can make the kernel run out of possible vmap blocks long before overall vmap space is exhausted. The new interface to map a user-supplied page array into linear vmalloc space (vm_map_ram) insists on allocating from a vmap block (instead of falling back to a custom area) when the area size is below a certain threshold. With heavy users of this interface (e.g. XFS) and limited vmalloc space on 32-bit, vmap block exhaustion is a real problem. Remove the guard page from the core vmap allocator. vmalloc and the old vmap interface enforce a guard page on their own at a higher level. Note that without this patch, we had accidental guard pages after those vm_map_ram areas that happened to be at the end of a vmap block, but not between every area. This patch removes this accidental guard page only. If we want guard pages after every vm_map_ram area, this should be done separately. And just like with vmalloc and the old interface on a different level, not in the core allocator. Mel pointed out: "If necessary, the guard page could be reintroduced as a debugging-only option (CONFIG_DEBUG_PAGEALLOC?). Otherwise it seems reasonable." Signed-off-by: Johannes Weiner <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Dave Chinner <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	include/linux/gfp.h: convert BUG_ON() into VM_BUG_ON()	Dave Hansen	1	-3/+1
	VM_BUG_ON() if effectively a BUG_ON() undef #ifdef CONFIG_DEBUG_VM. That is exactly what we have here now, and two different folks have suggested doing it this way. Signed-off-by: Dave Hansen <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	include/linux/gfp.h: work around apparent sparse confusion	Dave Hansen	1	-6/+1
	Running sparse on page_alloc.c today, it errors out: include/linux/gfp.h:254:17: error: bad constant expression include/linux/gfp.h:254:17: error: cannot size expression which is a line in gfp_zone(): BUILD_BUG_ON((GFP_ZONE_BAD >> bit) & 1); That's really unfortunate, because it ends up hiding all of the other legitimate sparse messages like this: mm/page_alloc.c:5315:59: warning: incorrect type in argument 1 (different base types) mm/page_alloc.c:5315:59: expected unsigned long [unsigned] [usertype] size mm/page_alloc.c:5315:59: got restricted gfp_t [usertype] <noident> ... Having sparse be able to catch these very oopsable bugs is a lot more important than keeping a BUILD_BUG_ON(). Kill the BUILD_BUG_ON(). Compiles on x86_64 with and without CONFIG_DEBUG_VM=y. defconfig boots fine for me. Signed-off-by: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	oom: replace PF_OOM_ORIGIN with toggling oom_score_adj	David Rientjes	5	-14/+38
	There's a kernel-wide shortage of per-process flags, so it's always helpful to trim one when possible without incurring a significant penalty. It's even more important when you're planning on adding a per- process flag yourself, which I plan to do shortly for transparent hugepages. PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a tendency to allocate large amounts of memory and should be preferred for killing over other tasks. We'd rather immediately kill the task making the errant syscall rather than penalizing an innocent task. This patch removes PF_OOM_ORIGIN since its behavior is equivalent to setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX. The process's old oom_score_adj is stored and then set to OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old value is then reinstated when the process should no longer be considered a high priority for oom killing. Signed-off-by: David Rientjes <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Izik Eidus <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm/compaction: reverse the change that forbade sync migraton with ↵	Andrea Arcangeli	1	-1/+1
	__GFP_NO_KSWAPD It's uncertain this has been beneficial, so it's safer to undo it. All other compaction users would still go in synchronous mode if a first attempt at async compaction failed. Hopefully we don't need to force special behavior for THP (which is the only __GFP_NO_KSWAPD user so far and it's the easier to exercise and to be noticeable). This also make __GFP_NO_KSWAPD return to its original strict semantics specific to bypass kswapd, as THP allocations have khugepaged for the async THP allocations/compactions. Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Alex Villacis Lasso <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm, mem-hotplug: update pcp->stat_threshold when memory hotplug occur	KOSAKI Motohiro	3	-2/+6
	Currently, cpu hotplug updates pcp->stat_threshold, but memory hotplug doesn't. There is no reason for this. [[email protected]: fix CONFIG_SMP=n build] Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm, mem-hotplug: recalculate lowmem_reserve when memory hotplug occurs	KOSAKI Motohiro	3	-7/+8
	Currently, memory hotplug calls setup_per_zone_wmarks() and calculate_zone_inactive_ratio(), but doesn't call setup_per_zone_lowmem_reserve(). It means the number of reserved pages aren't updated even if memory hot plug occur. This patch fixes it. Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Mel Gorman <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm, mem-hotplug: fix section mismatch. setup_per_zone_inactive_ratio() ↵	KOSAKI Motohiro	2	-4/+4
	should be __meminit. Commit bce7394a3e ("page-allocator: reset wmark_min and inactive ratio of zone when hotplug happens") introduced invalid section references. Now, setup_per_zone_inactive_ratio() is marked __init and then it can't be referenced from memory hotplug code. This patch marks it as __meminit and also marks caller as __ref. Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: Yasunori Goto <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	x86,mm: make pagefault killable	KOSAKI Motohiro	3	-8/+36
	When an oom killing occurs, almost all processes are getting stuck at the following two points. 1) __alloc_pages_nodemask 2) __lock_page_or_retry 1) is not very problematic because TIF_MEMDIE leads to an allocation failure and getting out from page allocator. 2) is more problematic. In an OOM situation, zones typically don't have page cache at all and memory starvation might lead to greatly reduced IO performance. When a fork bomb occurs, TIF_MEMDIE tasks don't die quickly, meaning that a fork bomb may create new process quickly rather than the oom-killer killing it. Then, the system may become livelocked. This patch makes the pagefault interruptible by SIGKILL. Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: introduce wait_on_page_locked_killable()	KOSAKI Motohiro	2	-0/+20
	commit 2687a356 ("Add lock_page_killable") introduced killable lock_page(). Similarly this patch introdues killable wait_on_page_locked(). Signed-off-by: KOSAKI Motohiro <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: per-node vmstat: show proper vmstats	KOSAKI Motohiro	3	-135/+144
	commit 2ac390370a ("writeback: add /sys/devices/system/node/<node>/vmstat") added vmstat entry. But strangely it only show nr_written and nr_dirtied. # cat /sys/devices/system/node/node20/vmstat nr_written 0 nr_dirtied 0 Of course, It's not adequate. With this patch, the vmstat show all vm stastics as /proc/vmstat. # cat /sys/devices/system/node/node0/vmstat nr_free_pages 899224 nr_inactive_anon 201 nr_active_anon 17380 nr_inactive_file 31572 nr_active_file 28277 nr_unevictable 0 nr_mlock 0 nr_anon_pages 17321 nr_mapped 8640 nr_file_pages 60107 nr_dirty 33 nr_writeback 0 nr_slab_reclaimable 6850 nr_slab_unreclaimable 7604 nr_page_table_pages 3105 nr_kernel_stack 175 nr_unstable 0 nr_bounce 0 nr_vmscan_write 0 nr_writeback_temp 0 nr_isolated_anon 0 nr_isolated_file 0 nr_shmem 260 nr_dirtied 1050 nr_written 938 numa_hit 962872 numa_miss 0 numa_foreign 0 numa_interleave 8617 numa_local 962872 numa_other 0 nr_anon_transparent_hugepages 0 [[email protected]: no externs in .c files] Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: Michael Rubin <[email protected]> Cc: Wu Fengguang <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: nommu: fix a compile warning in do_mmap_pgoff()	Namhyung Kim	1	-3/+3
	Because 'ret' is declared as int, not unsigned long, no need to cast the error contants into unsigned long. If you compile this code on a 64-bit machine somehow, you'll see following warning: CC mm/nommu.o mm/nommu.c: In function `do_mmap_pgoff': mm/nommu.c:1411: warning: overflow in implicit constant conversion Signed-off-by: Namhyung Kim <[email protected]> Acked-by: Greg Ungerer <[email protected]> Cc: David Howells <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: nommu: fix a potential memory leak in do_mmap_private()	Namhyung Kim	1	-1/+1
	If f_op->read() fails and sysctl_nr_trim_pages > 1, there could be a memory leak between @region->vm_end and @region->vm_top. Signed-off-by: Namhyung Kim <[email protected]> Acked-by: Greg Ungerer <[email protected]> Cc: David Howells <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: nommu: check the vma list when unmapping file-mapped vma	Namhyung Kim	1	-4/+2
	Now we have the sorted vma list, use it in do_munmap() to check that we have an exact match. Signed-off-by: Namhyung Kim <[email protected]> Acked-by: Greg Ungerer <[email protected]> Cc: David Howells <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: nommu: find vma using the sorted vma list	Namhyung Kim	1	-8/+4
	Now we have the sorted vma list, use it in the find_vma[_exact]() rather than doing linear search on the rb-tree. Signed-off-by: Namhyung Kim <[email protected]> Acked-by: Greg Ungerer <[email protected]> Cc: David Howells <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: nommu: don't scan the vma list when deleting	Namhyung Kim	1	-7/+8
	Since commit 297c5eee3724 ("mm: make the vma list be doubly linked") made it a doubly linked list, we don't need to scan the list when deleting @vma. And the original code didn't update the prev pointer. Fix it too. Signed-off-by: Namhyung Kim <[email protected]> Acked-by: Greg Ungerer <[email protected]> Cc: David Howells <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: nommu: sort mm->mmap list properly	Namhyung Kim	4	-45/+44
	When I was reading nommu code, I found that it handles the vma list/tree in an unusual way. IIUC, because there can be more than one identical/overrapped vmas in the list/tree, it sorts the tree more strictly and does a linear search on the tree. But it doesn't applied to the list (i.e. the list could be constructed in a different order than the tree so that we can't use the list when finding the first vma in that order). Since inserting/sorting a vma in the tree and link is done at the same time, we can easily construct both of them in the same order. And linear searching on the tree could be more costly than doing it on the list, it can be converted to use the list. Also, after the commit 297c5eee3724 ("mm: make the vma list be doubly linked") made the list be doubly linked, there were a couple of code need to be fixed to construct the list properly. Patch 1/6 is a preparation. It maintains the list sorted same as the tree and construct doubly-linked list properly. Patch 2/6 is a simple optimization for the vma deletion. Patch 3/6 and 4/6 convert tree traversal to list traversal and the rest are simple fixes and cleanups. This patch: @vma added into @mm should be sorted by start addr, end addr and VMA struct addr in that order because we may get identical VMAs in the @mm. However this was true only for the rbtree, not for the list. This patch fixes this by remembering 'rb_prev' during the tree traversal like find_vma_prepare() does and linking the @vma via __vma_link_list(). After this patch, we can iterate the whole VMAs in correct order simply by using @mm->mmap list. [[email protected]: avoid duplicating __vma_link_list()] Signed-off-by: Namhyung Kim <[email protected]> Acked-by: Greg Ungerer <[email protected]> Cc: David Howells <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: remove unused zone_idx variable from set_migratetype_isolate	Sergey Senozhatsky	1	-2/+0
	Signed-off-by: Sergey Senozhatsky <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mmap: avoid merging cloned VMAs	Shaohua Li	1	-5/+13
	Avoid merging a VMA with another VMA which is cloned from the parent process. The cloned VMA shares the anon_vma lock with the parent process's VMA. If we do the merge, more vmas (even the new range is only for current process) use the perent process's anon_vma lock. This introduces scalability issues. find_mergeable_anon_vma() already considers this case. Signed-off-by: Shaohua Li <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andi Kleen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mmap: avoid unnecessary anon_vma lock	Shaohua Li	1	-1/+1
	If we only change vma->vm_end, we can avoid taking anon_vma lock even if 'insert' isn't NULL, which is the case of split_vma. As I understand it, we need the lock before because rmap must get the 'insert' VMA when we adjust old VMA's vm_end (the 'insert' VMA is linked to anon_vma list in __insert_vm_struct before). But now this isn't true any more. The 'insert' VMA is already linked to anon_vma list in __split_vma(with anon_vma_clone()) instead of __insert_vm_struct. There is no race rmap can't get required VMAs. So the anon_vma lock is unnecessary, and this can reduce one locking in brk case and improve scalability. Signed-off-by: Shaohua Li<[email protected]> Cc: Rik van Riel <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Andi Kleen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mmap: add alignment for some variables	Shaohua Li	1	-3/+7
	Make some variables have correct alignment/section to avoid cache issue. In a workload which heavily does mmap/munmap, the variables will be used frequently. Signed-off-by: Shaohua Li <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	arch, mm: filter disallowed nodes from arch specific show_mem functions	David Rientjes	13	-36/+34
	Architectures that implement their own show_mem() function did not pass the filter argument to show_free_areas() to appropriately avoid emitting the state of nodes that are disallowed in the current context. This patch now passes the filter argument to show_free_areas() so those nodes are now avoided. This patch also removes the show_free_areas() wrapper around __show_free_areas() and converts existing callers to pass an empty filter. ia64 emits additional information for each node, so skip_free_areas_zone() must be made global to filter disallowed nodes and it is converted to use a nid argument rather than a zone for this use case. Signed-off-by: David Rientjes <[email protected]> Cc: Russell King <[email protected]> Cc: Tony Luck <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Kyle McMartin <[email protected]> Cc: Helge Deller <[email protected]> Cc: James Bottomley <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Guan Xuetao <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	xtensa/mm: remove WANT_PAGE_VIRTUAL	Sebastian Andrzej Siewior	1	-4/+0
	This is not useful: it provides page->virtual and is used with highmem. xtensa has no support for highmem and those HIGHMEM bits which are found by grep are partly implemented. The interesting functions like kmap() are missing. If someone actually implements the complete HIGHMEM support he could use HASHED_PAGE_VIRTUAL like most others do. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Cc: Chris Zankel <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	xtensa/mm: remove pgtable.c	Sebastian Andrzej Siewior	1	-72/+0
	It is not referenced by any Makefile. pte_alloc_one_kernel() and pte_alloc_one() is implemented in arch/xtensa/include/asm/pgalloc.h. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Cc: Chris Zankel <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	drivers/video/backlight/adp5520_bl.c: check strict_strtoul() return value	Liu Yuan	1	-1/+5
	It should check if strict_strtoul() succeeds. [[email protected]: don't override strict_strtoul() return value] Signed-off-by: Liu Yuan <[email protected]> Acked-by: Michael Hennerich <[email protected]> Cc: Richard Purdie <[email protected]> Cc: Paul Mundt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: vmscan: correctly check if reclaimer should schedule during shrink_slab	Minchan Kim	1	-2/+7
	It has been reported on some laptops that kswapd is consuming large amounts of CPU and not being scheduled when SLUB is enabled during large amounts of file copying. It is expected that this is due to kswapd missing every cond_resched() point because; shrink_page_list() calls cond_resched() if inactive pages were isolated which in turn may not happen if all_unreclaimable is set in shrink_zones(). If for whatver reason, all_unreclaimable is set on all zones, we can miss calling cond_resched(). balance_pgdat() only calls cond_resched if the zones are not balanced. For a high-order allocation that is balanced, it checks order-0 again. During that window, order-0 might have become unbalanced so it loops again for order-0 and returns that it was reclaiming for order-0 to kswapd(). It can then find that a caller has rewoken kswapd for a high-order and re-enters balance_pgdat() without ever calling cond_resched(). shrink_slab only calls cond_resched() if we are reclaiming slab pages. If there are a large number of direct reclaimers, the shrinker_rwsem can be contended and prevent kswapd calling cond_resched(). This patch modifies the shrink_slab() case. If the semaphore is contended, the caller will still check cond_resched(). After each successful call into a shrinker, the check for cond_resched() remains in case one shrinker is particularly slow. [[email protected]: preserve call to cond_resched after each call into shrinker] Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Minchan Kim <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Wu Fengguang <[email protected]> Cc: James Bottomley <[email protected]> Tested-by: Colin King <[email protected]> Cc: Raghavendra D Prabhu <[email protected]> Cc: Jan Kara <[email protected]> Cc: Chris Mason <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Rik van Riel <[email protected]> Cc: <[email protected]> [2.6.38+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25	mm: vmscan: correct use of pgdat_balanced in sleeping_prematurely	Johannes Weiner	1	-1/+1
	There are a few reports of people experiencing hangs when copying large amounts of data with kswapd using a large amount of CPU which appear to be due to recent reclaim changes. SLUB using high orders is the trigger but not the root cause as SLUB has been using high orders for a while. The root cause was bugs introduced into reclaim which are addressed by the following two patches. Patch 1 corrects logic introduced by commit 1741c877 ("mm: kswapd: keep kswapd awake for high-order allocations until a percentage of the node is balanced") to allow kswapd to go to sleep when balanced for high orders. Patch 2 notes that it is possible for kswapd to miss every cond_resched() and updates shrink_slab() so it'll at least reach that scheduling point. Chris Wood reports that these two patches in isolation are sufficient to prevent the system hanging. AFAIK, they should also resolve similar hangs experienced by James Bottomley. This patch: Johannes Weiner poined out that the logic in commit 1741c877 ("mm: kswapd: keep kswapd awake for high-order allocations until a percentage of the node is balanced") is backwards. Instead of allowing kswapd to go to sleep when balancing for high order allocations, it keeps it kswapd running uselessly. Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Wu Fengguang <[email protected]> Cc: James Bottomley <[email protected]> Tested-by: Colin King <[email protected]> Cc: Raghavendra D Prabhu <[email protected]> Cc: Jan Kara <[email protected]> Cc: Chris Mason <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Rik van Riel <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Reviewed-by: Wu Fengguang <[email protected]> Cc: <[email protected]> [2.6.38+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>