blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2010-08-09	Merge branch 'merge' of ↵	Linus Torvalds	1	-1/+1
	git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: powerpc: fix build with make 3.82 Revert "Input: appletouch - fix integer overflow issue" memblock: Fix memblock_is_region_reserved() to return a boolean powerpc: Trim defconfigs powerpc: fix i8042 module build error sound/soc: mpc5200_psc_ac97: Use gpio pins for cold reset powerpc/5200: add mpc5200_psc_ac97_gpio_reset
2010-08-09	hibernation: freeze swap at hibernation	KAMEZAWA Hiroyuki	1	-22/+72
	When taking a memory snapshot in hibernate_snapshot(), all (directly called) memory allocations use GFP_ATOMIC. Hence swap misusage during hibernation never occurs. But from a pessimistic point of view, there is no guarantee that no page allcation has __GFP_WAIT. It is better to have a global indication "we enter hibernation, don't use swap!". This patch tries to freeze new-swap-allocation during hibernation. (All user processes are frozenm so swapin is not a concern). This way, no updates will happen to swap_map[] between hibernate_snapshot() and save_image(). Swap is thawed when swsusp_free() is called. We can be assured that swap corruption will not occur. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Ondrej Zary <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	mm: fix corruption of hibernation caused by reusing swap during image saving	KAMEZAWA Hiroyuki	1	-2/+4
	Since 2.6.31, swap_map[]'s refcounting was changed to show that a used swap entry is just for swap-cache, can be reused. Then, while scanning free entry in swap_map[], a swap entry may be able to be reclaimed and reused. It was caused by commit c9e444103b5e7a5 ("mm: reuse unused swap entry if necessary"). But this caused deta corruption at resume. The scenario is - Assume a clean-swap cache, but mapped. - at hibernation_snapshot[], clean-swap-cache is saved as clean-swap-cache and swap_map[] is marked as SWAP_HAS_CACHE. - then, save_image() is called. And reuse SWAP_HAS_CACHE entry to save image, and break the contents. After resume: - the memory reclaim runs and finds clean-not-referenced-swap-cache and discards it because it's marked as clean. But here, the contents on disk and swap-cache is inconsistent. Hance memory is corrupted. This patch avoids the bug by not reclaiming swap-entry during hibernation. This is a quick fix for backporting. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Reported-by: Ondreg Zary <[email protected]> Tested-by: Ondreg Zary <[email protected]> Tested-by: Andrea Gelmini <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	ksm: cleanup for mm_slots_hash	Lai Jiangshan	1	-29/+9
	Use compile-allocated memory instead of dynamic allocated memory for mm_slots_hash. Use hash_ptr() instead divisions for bucket calculation. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Izik Eidus <[email protected]> Cc: Avi Kivity <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	vmscan: raise the bar to PAGEOUT_IO_SYNC stalls	Wu Fengguang	1	-8/+43
	Fix "system goes unresponsive under memory pressure and lots of dirty/writeback pages" bug. http://lkml.org/lkml/2010/4/4/86 In the above thread, Andreas Mohr described that Invoking any command locked up for minutes (note that I'm talking about attempted additional I/O to the _other_, _unaffected_ main system HDD - such as loading some shell binaries -, NOT the external SSD18M!!). This happens when the two conditions are both meet: - under memory pressure - writing heavily to a slow device OOM also happens in Andreas' system. The OOM trace shows that 3 processes are stuck in wait_on_page_writeback() in the direct reclaim path. One in do_fork() and the other two in unix_stream_sendmsg(). They are blocked on this condition: (sc->order && priority < DEF_PRIORITY - 2) which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC) one year ago. That condition may be too permissive. In Andreas' case, 512MB/1024 = 512KB. If the direct reclaim for the order-1 fork() allocation runs into a range of 512KB hard-to-reclaim LRU pages, it will be stalled. It's a severe problem in three ways. Firstly, it can easily happen in daily desktop usage. vmscan priority can easily go below (DEF_PRIORITY - 2) on _local_ memory pressure. Even if the system has 50% globally reclaimable pages, it still has good opportunity to have 0.1% sized hard-to-reclaim ranges. For example, a simple dd can easily create a big range (up to 20%) of dirty pages in the LRU lists. And order-1 to order-3 allocations are more than common with SLUB. Try "grep -v '1 :' /proc/slabinfo" to get the list of high order slab caches. For example, the order-1 radix_tree_node slab cache may stall applications at swap-in time; the order-3 inode cache on most filesystems may stall applications when trying to read some file; the order-2 proc_inode_cache may stall applications when trying to open a /proc file. Secondly, once triggered, it will stall unrelated processes (not doing IO at all) in the system. This "one slow USB device stalls the whole system" avalanching effect is very bad. Thirdly, once stalled, the stall time could be intolerable long for the users. When there are 20MB queued writeback pages and USB 1.1 is writing them in 1MB/s, wait_on_page_writeback() will stuck for up to 20 seconds. Not to mention it may be called multiple times. So raise the bar to only enable PAGEOUT_IO_SYNC when priority goes below DEF_PRIORITY/3, or 6.25% LRU size. As the default dirty throttle ratio is 20%, it will hardly be triggered by pure dirty pages. We'd better treat PAGEOUT_IO_SYNC as some last resort workaround -- its stall time is so uncomfortably long (easily goes beyond 1s). The bar is only raised for (order < PAGE_ALLOC_COSTLY_ORDER) allocations, which are easy to satisfy in 1TB memory boxes. So, although 6.25% of memory could be an awful lot of pages to scan on a system with 1TB of memory, it won't really have to busy scan that much. Andreas tested an older version of this patch and reported that it mostly fixed his problem. Mel Gorman helped improve it and KOSAKI Motohiro will fix it further in the next patch. Reported-by: Andreas Mohr <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Wu Fengguang <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	mm/vmalloc.c: check kmalloc() return value	Kulikov Vasiliy	1	-1/+4
	kmalloc() may fail, if so return -ENOMEM. Signed-off-by: Kulikov Vasiliy <[email protected]> Acked-by: Pekka Enberg <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	memcg: add mm_vmscan_memcg_isolate tracepoint	KOSAKI Motohiro	1	-0/+6
	Memcg also need to trace page isolation information as global reclaim. This patch does it. Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	memcg, vmscan: add memcg reclaim tracepoint	KOSAKI Motohiro	1	-1/+19
	Memcg also need to trace reclaim progress as direct reclaim. This patch add it. Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	vmscan: shrink_slab() requires the number of lru_pages, not the page order	KOSAKI Motohiro	1	-4/+13
	Presently shrink_slab() has the following scanning equation. lru_scanned max_pass basic_scan_objects = 4 x ------------- x ----------------------------- lru_pages shrinker->seeks (default:2) scan_objects = min(basic_scan_objects, max_pass * 2) If we pass very small value as lru_pages instead real number of lru pages, shrink_slab() drop much objects rather than necessary. And now, __zone_reclaim() pass 'order' as lru_pages by mistake. That produces a bad result. For example, if we receive very low memory pressure (scan = 32, order = 0), shrink_slab() via zone_reclaim() always drop _all_ icache/dcache objects. (see above equation, very small lru_pages make very big scan_objects result). This patch fixes it. [[email protected]: fix layout, typos] Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Acked-by: Christoph Lameter <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	mmu-notifiers: remove mmu notifier calls in apply_to_page_range()	Jeremy Fitzhardinge	1	-3/+2
	It is not appropriate for apply_to_page_range() to directly call any mmu notifiers, because it is a general purpose function whose effect depends on what context it is called in and what the callback function does. In particular, if it is being used as part of an mmu notifier implementation, the recursive calls can be particularly problematic. It is up to apply_to_page_range's caller to do any notifier calls if necessary. It does not affect any in-tree users because they all operate on init_mm, and mmu notifiers only pertain to usermode mappings. [[email protected]: remove unused local `start'] Signed-off-by: Jeremy Fitzhardinge <[email protected]> Signed-off-by: Stefano Stabellini <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Avi Kivity <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	vmscan: protect reading of reclaim_stat with lru_lock	KOSAKI Motohiro	1	-11/+9
	Rik van Riel pointed out reading reclaim_stat should be protected lru_lock, otherwise vmscan might sweep 2x much pages. This fault was introduced by commit 4f98a2fee8acdb4ac84545df98cccecfd130f8db Author: Rik van Riel <[email protected]> Date: Sat Oct 18 20:26:32 2008 -0700 vmscan: split LRU lists into anon & file sets Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: Rik van Riel <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	vmscan: avoid subtraction of unsigned types	KOSAKI Motohiro	1	-7/+8
	'slab_reclaimable' and 'nr_pages' are unsigned. Subtraction is unsafe because negative results would be misinterpreted. Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	mm: set VM_FAULT_WRITE in do_swap_page()	Andrea Arcangeli	1	-0/+1
	Set the flag if do_swap_page is decowing the page the same way do_wp_page would too. Signed-off-by: Andrea Arcangeli <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	rmap: add exclusive page to private anon_vma on swapin	Rik van Riel	2	-2/+15
	On swapin it is fairly common for a page to be owned exclusively by one process. In that case we want to add the page to the anon_vma of that process's VMA, instead of to the root anon_vma. This will reduce the amount of rmap searching that the swapout code needs to do. Signed-off-by: Rik van Riel <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	oom: badness heuristic rewrite	David Rientjes	2	-148/+129
	This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of "allowable" memory. "Allowable," in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of "allowable" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <[email protected]> Cc: Nick Piggin <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	oom: multi threaded process coredump don't make deadlock	KOSAKI Motohiro	1	-1/+1
	Oleg pointed out current PF_EXITING check is wrong. Because PF_EXITING is per-thread flag, not per-process flag. He said, Two threads, group-leader L and its sub-thread T. T dumps the code. In this case both threads have ->mm != NULL, L has PF_EXITING. The first problem is, select_bad_process() always return -1 in this case (even if the caller is T, this doesn't matter). The second problem is that we should add TIF_MEMDIE to T, not L. I think we can remove this dubious PF_EXITING check. but as first step, This patch add the protection of multi threaded issue. Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Minchan Kim <[email protected]> Cc: David Rientjes <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	oom: give the dying task a higher priority	Luis Claudio R. Goncalves	1	-3/+31
	In a system under heavy load it was observed that even after the oom-killer selects a task to die, the task may take a long time to die. Right after sending a SIGKILL to the task selected by the oom-killer this task has its priority increased so that it can exit() soon, freeing memory. That is accomplished by: /* * We give our sacrificial lamb high priority and access to * all the memory it needs. That way it should be able to * exit() and clear out its resources quickly... */ p->rt.time_slice = HZ; set_tsk_thread_flag(p, TIF_MEMDIE); It sounds plausible giving the dying task an even higher priority to be sure it will be scheduled sooner and free the desired memory. It was suggested on LKML using SCHED_FIFO:1, the lowest RT priority so that this task won't interfere with any running RT task. If the dying task is already an RT task, leave it untouched. Another good suggestion, implemented here, was to avoid boosting the dying task priority in case of mem_cgroup OOM. Signed-off-by: Luis Claudio R. Goncalves <[email protected]> Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: David Rientjes <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	oom: remove child->mm check from oom_kill_process()	KOSAKI Motohiro	1	-3/+0
	The current "child->mm == p->mm" check prevents selection of vfork()ed task. But we don't have any reason to don't consider vfork(). Removed. Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: Minchan Kim <[email protected]> Cc: David Rientjes <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	oom: cleanup has_intersects_mems_allowed()	KOSAKI Motohiro	1	-2/+2
	presently has_intersects_mems_allowed() has own thread iterate logic, but it should use while_each_thread(). It slightly improve the code readability. Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: David Rientjes <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	oom: move OOM_DISABLE check from oom_kill_task to out_of_memory()	KOSAKI Motohiro	1	-2/+3
	Presently if oom_kill_allocating_task is enabled and current have OOM_DISABLED, following printk in oom_kill_process is called twice. pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n", message, task_pid_nr(p), p->comm, points); So, OOM_DISABLE check should be more early. Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: Minchan Kim <[email protected]> Cc: David Rientjes <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	oom: kill duplicate OOM_DISABLE check	KOSAKI Motohiro	1	-3/+0
	select_bad_process() and badness() have the same OOM_DISABLE check. This patch kills one. Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: David Rientjes <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	oom: /proc/<pid>/oom_score treat kernel thread honestly	KOSAKI Motohiro	1	-6/+7
	If a kernel thread is using use_mm(), badness() returns a positive value. This is not a big issue because caller take care of it correctly. But there is one exception, /proc/<pid>/oom_score calls badness() directly and doesn't care that the task is a regular process. Another example, /proc/1/oom_score return !0 value. But it's unkillable. This incorrectness makes administration a little confusing. This patch fixes it. Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: Minchan Kim <[email protected]> Cc: David Rientjes <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	oom: oom_kill_process() needs to check that p is unkillable	KOSAKI Motohiro	1	-1/+2
	When oom_kill_allocating_task is enabled, an argument task of oom_kill_process is not selected by select_bad_process(), It's just out_of_memory() caller task. It mean the task can be unkillable. check it first. Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: David Rientjes <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	oom: make oom_unkillable_task() helper function	KOSAKI Motohiro	1	-11/+22
	Presently we have the same task check in two places. Unify it. Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: David Rientjes <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	oom: oom_kill_process() doesn't select kthread child	KOSAKI Motohiro	1	-0/+2
	Presently select_bad_process() has a PF_KTHREAD check, but oom_kill_process doesn't. It mean oom_kill_process() may choose wrong task, especially, when the child are using use_mm(). Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: David Rientjes <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	oom: don't try to kill oom_unkillable child	KOSAKI Motohiro	1	-3/+6
	Presently, badness() doesn't care about either CPUSET nor mempolicy. Then if the victim child process have disjoint nodemask, OOM Killer might kill innocent process. This patch fixes it. [[email protected]: coding-style fixes] Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: Minchan Kim <[email protected]> Cc: David Rientjes <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	vmscan: update isolated page counters outside of main path in ↵	Mel Gorman	1	-25/+38
	shrink_inactive_list() When shrink_inactive_list() isolates pages, it updates a number of counters using temporary variables to gather them. These consume stack and it's in the main path that calls ->writepage(). This patch moves the accounting updates outside of the main path to reduce stack usage. Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Johannes Weiner <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Chris Mason <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michael Rubin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	vmscan: set up pagevec as late as possible in shrink_page_list()	Mel Gorman	1	-8/+28
	shrink_page_list() sets up a pagevec to release pages as according as they are free. It uses significant amounts of stack on the pagevec. This patch adds pages to be freed via pagevec to a linked list which is then freed en-masse at the end. This avoids using stack in the main path that potentially calls writepage(). Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Chris Mason <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michael Rubin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	vmscan: set up pagevec as late as possible in shrink_inactive_list()	Mel Gorman	1	-43/+56
	shrink_inactive_list() sets up a pagevec to release unfreeable pages. It uses significant amounts of stack doing this. This patch splits shrink_inactive_list() to take the stack usage out of the main path so that callers to writepage() do not contain an unused pagevec on the stack. Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Johannes Weiner <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Chris Mason <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michael Rubin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	vmscan: remove unnecessary temporary vars in do_try_to_free_pages	Mel Gorman	1	-5/+4
	Remove temporary variable that is only used once and does not help clarify code. [[email protected]: coding-style fixes] Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Chris Mason <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michael Rubin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	vmscan: simplify shrink_inactive_list()	KOSAKI Motohiro	1	-110/+102
	Now, max_scan of shrink_inactive_list() is always passed less than SWAP_CLUSTER_MAX. then, we can remove scanning pages loop in it. This patch also help stack diet. detail - remove "while (nr_scanned < max_scan)" loop - remove nr_freed (now, we use nr_reclaimed directly) - remove nr_scan (now, we use nr_scanned directly) - rename max_scan to nr_to_scan - pass nr_to_scan into isolate_pages() directly instead using SWAP_CLUSTER_MAX [[email protected]: coding-style fixes] Signed-off-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Johannes Weiner <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Chris Mason <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michael Rubin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	vmscan: kill prev_priority completely	KOSAKI Motohiro	4	-92/+0
	Since 2.6.28 zone->prev_priority is unused. Then it can be removed safely. It reduce stack usage slightly. Now I have to say that I'm sorry. 2 years ago, I thought prev_priority can be integrate again, it's useful. but four (or more) times trying haven't got good performance number. Thus I give up such approach. The rest of this changelog is notes on prev_priority and why it existed in the first place and why it might be not necessary any more. This information is based heavily on discussions between Andrew Morton, Rik van Riel and Kosaki Motohiro who is heavily quotes from. Historically prev_priority was important because it determined when the VM would start unmapping PTE pages. i.e. there are no balances of note within the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there is a potential risk of unnecessarily increasing minor faults as a large amount of read activity of use-once pages could push mapped pages to the end of the LRU and get unmapped. There is no proof this is still a problem but currently it is not considered to be. Active files are not deactivated if the active file list is smaller than the inactive list reducing the liklihood that file-mapped pages are being pushed off the LRU and referenced executable pages are kept on the active list to avoid them getting pushed out by read activity. Even if it is a problem, prev_priority prev_priority wouldn't works nowadays. First of all, current vmscan still a lot of UP centric code. it expose some weakness on some dozens CPUs machine. I think we need more and more improvement. The problem is, current vmscan mix up per-system-pressure, per-zone-pressure and per-task-pressure a bit. example, prev_priority try to boost priority to other concurrent priority. but if the another task have mempolicy restriction, it is unnecessary, but also makes wrong big latency and exceeding reclaim. per-task based priority + prev_priority adjustment make the emulation of per-system pressure. but it have two issue 1) too rough and brutal emulation 2) we need per-zone pressure, not per-system. Another example, currently DEF_PRIORITY is 12. it mean the lru rotate about 2 cycle (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking OOM-Killer. but if 10,0000 thrreads enter DEF_PRIORITY reclaim at the same time, the system have higher memory pressure than priority==0 (1/4096*10,000 > 2). prev_priority can't solve such multithreads workload issue. In other word, prev_priority concept assume the sysmtem don't have lots threads." Signed-off-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Johannes Weiner <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Chris Mason <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michael Rubin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	vmscan: tracing: add trace event when a page is written	Mel Gorman	1	-0/+2
	Add a trace event for when page reclaim queues a page for IO and records whether it is synchronous or asynchronous. Excessive synchronous IO for a process can result in noticeable stalls during direct reclaim. Excessive IO from page reclaim may indicate that the system is seriously under provisioned for the amount of dirty pages that exist. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: Larry Woodman <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Chris Mason <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michael Rubin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	vmscan: tracing: add trace events for LRU page isolation	Mel Gorman	1	-0/+16
	Add an event for when pages are isolated en-masse from the LRU lists. This event augments the information available on LRU traffic and can be used to evaluate lumpy reclaim. [[email protected]: coding-style fixes] Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: Larry Woodman <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Chris Mason <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michael Rubin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	vmscan: tracing: add trace events for kswapd wakeup, sleeping and direct reclaim	Mel Gorman	1	-4/+20
	Add two trace events for kswapd waking up and going asleep for the purposes of tracking kswapd activity and two trace events for direct reclaim beginning and ending. The information can be used to work out how much time a process or the system is spending on the reclamation of pages and in the case of direct reclaim, how many pages were reclaimed for that process. High frequency triggering of these events could point to memory pressure problems. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: Larry Woodman <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Chris Mason <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michael Rubin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	vmscan: recalculate lru_pages on each priority	KOSAKI Motohiro	1	-13/+8
	shrink_zones() need relatively long time and lru_pages can change dramatically during shrink_zones(). So lru_pages should be recalculated for each priority. Signed-off-by: KOSAKI Motohiro <[email protected]> Acked-by: Rik van Riel <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	vmscan: zone_reclaim don't call disable_swap_token()	KOSAKI Motohiro	1	-1/+0
	Swap token don't works when zone reclaim is enabled since it was born. Because __zone_reclaim() always call disable_swap_token() unconditionally. This kill swap token feature completely. As far as I know, nobody want to that. Remove it. Signed-off-by: KOSAKI Motohiro <[email protected]> Acked-by: Rik van Riel <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	mm: implement writeback livelock avoidance using page tagging	Jan Kara	1	-17/+53
	We try to avoid livelocks of writeback when some steadily creates dirty pages in a mapping we are writing out. For memory-cleaning writeback, using nr_to_write works reasonably well but we cannot really use it for data integrity writeback. This patch tries to solve the problem. The idea is simple: Tag all pages that should be written back with a special tag (TOWRITE) in the radix tree. This can be done rather quickly and thus livelocks should not happen in practice. Then we start doing the hard work of locking pages and sending them to disk only for those pages that have TOWRITE tag set. Note: Adding new radix tree tag grows radix tree node from 288 to 296 bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs. However, the number of slab/slub items per page remains the same (13 and 7 respectively). Signed-off-by: Jan Kara <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Chris Mason <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	rmap: add anon_vma bug checks	Andrea Arcangeli	1	-0/+3
	Verify the refcounting doesn't go wrong, and resurrect the check in __page_check_anon_rmap as in old anon-vma code. Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	rmap: resurrect page_address_in_vma anon_vma check	Andrea Arcangeli	1	-3/+4
	With root anon-vma it's trivial to keep doing the usual check as in old-anon-vma code. Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	rmap: always use anon_vma root pointer	Andrea Arcangeli	1	-6/+12
	Always use anon_vma->root pointer instead of anon_vma_chain.prev. Also optimize the map-paths, if a mapping is already established no need to overwrite it with root anon-vma list, we can keep the more finegrined anon-vma and skip the overwrite: see the PageAnon check in !exclusive case. This is also the optimization that hidden the ksm bug as this tends to make ksm_might_need_to_copy skip the copy, but only the proper fix to ksm_might_need_to_copy guarantees not triggering the ksm bug unless ksm is in use. this is an optimization only... [[email protected]: coding-style fixes] Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Rik van Riel <[email protected]> [[email protected]: fix false positive BUG_ON in __page_set_anon_rmap] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	rmap: always add new vmas at the end	Andrea Arcangeli	1	-1/+1
	Make sure to always add new VMAs at the end of the list. This is important so rmap_walk does not miss a VMA that was created during the rmap_walk. The old code got this right most of the time due to luck, but was buggy when anon_vma_prepare reused a mergeable anon_vma. Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	mmap: remove unnecessary lock from __vma_link	Andrea Arcangeli	1	-2/+0
	There's no anon-vma related mangling happening inside __vma_link anymore so no need of anon_vma locking there. Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	shmem: reduce pagefault lock contention	Shaohua Li	1	-21/+49
	I'm running a shmem pagefault test case (see attached file) under a 64 CPU system. Profile shows shmem_inode_info->lock is heavily contented and 100% CPUs time are trying to get the lock. In the pagefault (no swap) case, shmem_getpage gets the lock twice, the last one is avoidable if we prealloc a page so we could reduce one time of locking. This is what below patch does. The result of the test case: 2.6.35-rc3: ~20s 2.6.35-rc3 + patch: ~12s so this is 40% improvement. One might argue if we could have better locking for shmem. But even shmem is lockless, the pagefault will soon have pagecache lock heavily contented because shmem must add new page to pagecache. So before we have better locking for pagecache, improving shmem locking doesn't have too much improvement. I did a similar pagefault test against a ramfs file, the test result is ~10.5s. [[email protected]: fix comment, clean up code layout, elimintate code duplication] Signed-off-by: Shaohua Li <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: "Zhang, Yanmin" <[email protected]> Cc: Tim Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	tmpfs: make tmpfs scalable with percpu_counter for used blocks	Tim Chen	1	-23/+17
	The current implementation of tmpfs is not scalable. We found that stat_lock is contended by multiple threads when we need to get a new page, leading to useless spinning inside this spin lock. This patch makes use of the percpu_counter library to maintain local count of used blocks to speed up getting and returning of pages. So the acquisition of stat_lock is unnecessary for getting and returning blocks, improving the performance of tmpfs on system with large number of cpus. On a 4 socket 32 core NHM-EX system, we saw improvement of 270%. The implementation below has a slight chance of race between threads causing a slight overshoot of the maximum configured blocks. However, any overshoot is small, and is bounded by the number of cpus. This happens when the number of used blocks is slightly below the maximum configured blocks when a thread checks the used block count, and another thread allocates the last block before the current thread does. This should not be a problem for tmpfs, as the overshoot is most likely to be a few blocks and bounded. If a strict limit is really desired, then configured the max blocks to be the limit less the number of cpus in system. Signed-off-by: Tim Chen <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	gcc-4.6: mm: fix unused but set warnings	Andi Kleen	3	-5/+1
	No real bugs, just some dead code and some fixups. Signed-off-by: Andi Kleen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	mempolicy: reduce stack size of migrate_pages()	KOSAKI Motohiro	1	-13/+25
	migrate_pages() is using >500 bytes stack. Reduce it. mm/mempolicy.c: In function 'sys_migrate_pages': mm/mempolicy.c:1344: warning: the frame size of 528 bytes is larger than 512 bytes [[email protected]: don't play with a might-be-NULL pointer] Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	mm: use for_each_online_cpu() in vmstat	Minchan Kim	1	-3/+3
	The sum_vm_events passes cpumask for for_each_cpu(). But it's useless since we have for_each_online_cpu. Althougth it's tirival overhead, it's not good about coding consistency. Let's use for_each_online_cpu instead of for_each_cpu with cpumask argument. Signed-off-by: Minchan Kim <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Christoph Lameter <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	oom: fold __out_of_memory into out_of_memory	David Rientjes	1	-36/+29
	__out_of_memory() only has a single caller, so fold it into out_of_memory() and add a comment about locking for its call to oom_kill_process(). Signed-off-by: David Rientjes <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Acked-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09	oom: remove constraint argument from select_bad_process and __out_of_memory	David Rientjes	1	-10/+8
	select_bad_process() and __out_of_memory() doe not need their enum oom_constraint arguments: it's possible to pass a NULL nodemask if constraint == CONSTRAINT_MEMORY_POLICY in the caller, out_of_memory(). Signed-off-by: David Rientjes <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Acked-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>