The block in sys_swapon which does the final adjustments to the
swap_info_struct and to swap_list is the same as the block which
re-inserts it at sys_swapoff on failure of try_to_unuse(), except
for the order of the operations within the lock. Since the order should
not matter, arbitrarily change sys_swapoff to match sys_swapon, in
preparation for making both share the same code.
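For reference, the shared block in question has roughly this shape (a
sketch assuming the 2.6.38-era swapfile.c fields, not the verbatim hunk):
        spin_lock(&swap_lock);
        p->flags |= SWP_WRITEOK;
        nr_swap_pages += p->pages;
        total_swap_pages += p->pages;
        /* insert the entry into the priority-ordered swap_list */
        prev = -1;
        for (i = swap_list.head; i >= 0; i = swap_info[i]->next) {
                if (p->prio >= swap_info[i]->prio)
                        break;
                prev = i;
        }
        p->next = i;
        if (prev < 0)
                swap_list.head = swap_list.next = p->type;
        else
                swap_info[prev]->next = p->type;
        spin_unlock(&swap_lock);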
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The block in sys_swapon which does the final adjustments to the
swap_info_struct and to swap_list is the same as the block which
re-inserts it at sys_swapoff on failure of try_to_unuse(). To be
able to make both share the same code, move the printk() call from the
middle of it to just after it.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It still exists within setup_swap_map_and_extents(), but after it
nr_good_pages == p->pages.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.
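As an illustration, the transformation is of this shape (a sketch;
sys_swapon does begin with a capability check of this kind):
        /* before */
        if (!capable(CAP_SYS_ADMIN)) {
                error = -EPERM;
                goto out;
        }
        /* after */
        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;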
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move the code which parses the bad block list and the extents to a
separate function. Only code movement, no functional changes.
This change uses the fact that, after the success path, nr_good_pages ==
p->pages.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The call to swap_cgroup_swapon is in the middle of loading the swap map
and extents. As it only does memory allocation and does not depend on
the swapfile layout (map/extents), it can be called earlier (or later).
Move it to just after the allocation of swap_map, since it is
conceptually similar (allocates a map).
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move the code which parses and checks the swapfile header (except for
the bad block list) to a separate function. Only code movement, no
functional changes.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There is no reason I can see to read inode->i_size long before it is
needed. Move its read to just before it is needed, to reduce the
variable lifetime.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Jesper Juhl <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move the code which claims the bdev (S_ISBLK) or locks the inode
(S_ISREG) to a separate function. Only code movement, no functional
changes.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
sys_swapon currently has two error labels, bad_swap and bad_swap_2.
bad_swap does the same as bad_swap_2 plus destroy_swap_extents() and
swap_cgroup_swapoff(); both are noops in the places where bad_swap_2 is
jumped to. With a single extra test for inode (matching the one in the
S_ISREG case below), all the error paths in the function can go to
bad_swap.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The only way error is 0 in the cleanup blocks is when the function is
returning successfully. In this case, the cleanup blocks were setting
S_SWAPFILE in the S_ISREG case. But this is not a cleanup.
Move the setting of S_SWAPFILE to just before the "goto out;" to make
this more clear. At this point, we do not need to test for inode because
it will never be NULL.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The bdev variable is always equivalent to (S_ISBLK(inode->i_mode) ?
p->bdev : NULL), as long as its assignment is moved slightly earlier. Use
this fact to remove the bdev variable.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move the setting of the error variable nearer the goto in a few places.
This avoids calling PTR_ERR() when not IS_ERR() in two places, and makes
the error condition more explicit in two others.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Jesper Juhl <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since mutex_lock(&inode->i_mutex) is called just after setting inode,
did_down is always equivalent to (inode && S_ISREG(inode->i_mode)).
Use this fact to remove the did_down variable.
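The unlock in the cleanup path then reads roughly (a sketch):
        if (inode && S_ISREG(inode->i_mode))
                mutex_unlock(&inode->i_mutex);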
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now there is nothing which jumps to the cleanup blocks before the name
variable is set. There is no need to set it initially to NULL anymore.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
At this point in sys_swapon, there is nothing to free. Return directly
instead of jumping to the cleanup block at the end of the function.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move the swap_info allocation to its own function. Only code movement,
no functional changes.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Within sys_swapon, after the swap_info entry has been allocated, we
always have type == p->type and swap_info[type] == p. Use this fact to
reduce the dependency on the "type" local variable within the function,
as a preparation to move the allocation of the swap_info entry to a
separate function.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Changelogs belong in the git history instead of in the source code.
Also, "The swapon system call" is redundant with
"SYSCALL_DEFINE2(swapon, ...)".
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: Jesper Juhl <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
[ Gaah. That's a _historical_ comment. But the patch-series depends on removal ]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch series refactors the sys_swapon function.
sys_swapon is currently a very large function, with 313 lines (more than
12 25-line screens), which can make it a bit hard to read. This patch
series reduces this size by half, by extracting large chunks of related
code to new helper functions.
One of these chunks of code was nearly identical to the part of
sys_swapoff which is used in case of a failure return from
try_to_unuse(), so this patch series also makes both share the same
code.
As a side effect of all this refactoring, the compiled code gets a bit
smaller (from v1 of this patch series):
text data bss dec hex filename
14012 944 276 15232 3b80 mm/swapfile.o.before
13941 944 276 15161 3b39 mm/swapfile.o.after
This patch:
Use vzalloc() instead of vmalloc/memset.
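The change is of this shape (a sketch of the pattern, not the exact
hunk):
        /* before */
        swap_map = vmalloc(maxpages);
        if (!swap_map)
                goto bad_swap;
        memset(swap_map, 0, maxpages);
        /* after */
        swap_map = vzalloc(maxpages);
        if (!swap_map)
                goto bad_swap;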
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: Jesper Juhl <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Pass __GFP_OTHER_NODE for transparent hugepages NUMA allocations done by the
hugepages daemon. This way the low level accounting for local versus
remote pages works correctly.
Contains improvements from Andrea Arcangeli
[[email protected]: coding-style fixes]
Signed-off-by: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in
zone_statistics() that an allocation is on behalf of another thread. This
way the local and remote counters can be still correct, even when
background daemons like khugepaged are changing memory mappings.
This only affects the accounting, but I think it's worth doing that right
to avoid confusing users.
I first tried to just pass down the right node, but this required a lot of
changes to pass down this parameter and at least one addition of a 10th
argument to a 9 argument function. Using the flag is a lot less
intrusive.
Open question: should this also be used for migration?
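A sketch of how the flag could be consumed in zone_statistics() (shape
assumed, not verbatim):
        static inline void zone_statistics(struct zone *preferred_zone,
                                           struct zone *z, gfp_t flags)
        {
                /*
                 * Account the page against the node the work is done on
                 * behalf of, not the node of the daemon doing the
                 * allocation.
                 */
                if (z->node == ((flags & __GFP_OTHER_NODE) ?
                                preferred_zone->node : numa_node_id()))
                        __inc_zone_state(z, NUMA_LOCAL);
                else
                        __inc_zone_state(z, NUMA_OTHER);
        }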
[[email protected]: coding-style fixes]
Signed-off-by: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
__GFP_NO_KSWAPD allocations are usually very expensive and not mandatory
to succeed, as they have a graceful fallback. Waiting for I/O in them
tends to be overkill in terms of latency, so we can reduce their latency
by disabling sync migration.
Unfortunately, even with async migration it's still possible for the
process to be blocked waiting for a request slot (e.g. get_request_wait
in the block layer) when ->writepage is called. To prevent
__GFP_NO_KSWAPD blocking, this patch prevents ->writepage being called on
dirty page cache for asynchronous migration.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=31142
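The asynchronous case then skips ->writepage along these lines (a sketch
of the idea, not the exact hunk):
        if (PageDirty(page)) {
                /* only allow writeback in full synchronous migration */
                if (!sync)
                        return -EBUSY;
                return writeout(mapping, page);
        }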
[[email protected]: Avoid writebacks for NFS, retry locked pages, use bool]
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Cc: Arthur Marsh <[email protected]>
Cc: Clemens Ladisch <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Minchan Kim <[email protected]>
Reported-by: Alex Villacis Lasso <[email protected]>
Tested-by: Alex Villacis Lasso <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
compaction_alloc() isolates pages for migration in isolate_migratepages.
While it's scanning, IRQs are disabled on the mistaken assumption the
scanning should be short. Tests show this to be true for the most part
but contention times on the LRU lock can be increased. Before this patch,
the IRQ disabled times for a simple test looked like
Total sampled time IRQs off (not real total time): 5493
Event shrink_inactive_list..shrink_zone 1596 us count 1
Event shrink_inactive_list..shrink_zone 1530 us count 1
Event shrink_inactive_list..shrink_zone 956 us count 1
Event shrink_inactive_list..shrink_zone 541 us count 1
Event shrink_inactive_list..shrink_zone 531 us count 1
Event split_huge_page..add_to_swap 232 us count 1
Event save_args..call_softirq 36 us count 1
Event save_args..call_softirq 35 us count 2
Event __wake_up..__wake_up 1 us count 1
This patch reduces the worst-case IRQs-disabled latencies by releasing
the lock every SWAP_CLUSTER_MAX pages scanned and releasing the CPU if
necessary. The cost is that the process performing compaction will be
slower, but IRQs being disabled for too long has worse consequences, as
the following report shows:
Total sampled time IRQs off (not real total time): 4367
Event shrink_inactive_list..shrink_zone 881 us count 1
Event shrink_inactive_list..shrink_zone 875 us count 1
Event shrink_inactive_list..shrink_zone 868 us count 1
Event shrink_inactive_list..shrink_zone 555 us count 1
Event split_huge_page..add_to_swap 495 us count 1
Event compact_zone..compact_zone_order 269 us count 1
Event split_huge_page..add_to_swap 266 us count 1
Event shrink_inactive_list..shrink_zone 85 us count 1
Event save_args..call_softirq 36 us count 2
Event __wake_up..__wake_up 1 us count 1
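The periodic release in the migration scanner looks roughly like this (a
sketch; variable names assumed):
        /* give IRQs a chance every SWAP_CLUSTER_MAX scanned pages */
        if (!(low_pfn % SWAP_CLUSTER_MAX) && locked) {
                spin_unlock_irq(&zone->lru_lock);
                locked = false;
                cond_resched();
                spin_lock_irq(&zone->lru_lock);
                locked = true;
        }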
[[email protected]: simplify with s/unlocked/locked/]
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Arthur Marsh <[email protected]>
Cc: Clemens Ladisch <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
compaction_alloc() isolates free pages to be used as migration targets.
While it's scanning, IRQs are disabled on the mistaken assumption the
scanning should be short. Analysis showed that IRQs were in fact being
disabled for substantial time. A simple test was run using large
anonymous mappings with transparent hugepage support enabled to trigger
frequent compactions. A monitor sampled what the worst IRQ-off latencies
were and a post-processing tool found the following:
Total sampled time IRQs off (not real total time): 22355
Event compaction_alloc..compaction_alloc 8409 us count 1
Event compaction_alloc..compaction_alloc 7341 us count 1
Event compaction_alloc..compaction_alloc 2463 us count 1
Event compaction_alloc..compaction_alloc 2054 us count 1
Event shrink_inactive_list..shrink_zone 1864 us count 1
Event shrink_inactive_list..shrink_zone 88 us count 1
Event save_args..call_softirq 36 us count 1
Event save_args..call_softirq 35 us count 2
Event __make_request..__blk_run_queue 24 us count 1
Event __alloc_pages_nodemask..__alloc_pages_nodemask 6 us count 1
i.e. compaction disabled IRQs for a prolonged period of time, 8ms in
one instance. The full report generated by the tool can be found at
http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-vanilla-micro.report
This patch reduces the time IRQs are disabled by simply disabling IRQs at
the last possible minute. An updated IRQs-off summary report then looks
like:
Total sampled time IRQs off (not real total time): 5493
Event shrink_inactive_list..shrink_zone 1596 us count 1
Event shrink_inactive_list..shrink_zone 1530 us count 1
Event shrink_inactive_list..shrink_zone 956 us count 1
Event shrink_inactive_list..shrink_zone 541 us count 1
Event shrink_inactive_list..shrink_zone 531 us count 1
Event split_huge_page..add_to_swap 232 us count 1
Event save_args..call_softirq 36 us count 1
Event save_args..call_softirq 35 us count 2
Event __wake_up..__wake_up 1 us count 1
A full report is again available at
http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-minimiseirq-free-v1r4-micro.report
As should be obvious, IRQ disabled latencies due to compaction are
almost eliminated for this particular test.
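The fix amounts to taking zone->lock only around the isolation step in
the free-page scanner, roughly (a sketch; helper names as in the
mm/compaction.c of the time, but not the verbatim hunk):
        /* scan for a suitable block without holding the lock... */
        if (!suitable_migration_target(page))
                continue;
        /* ...and disable IRQs only around the isolation itself */
        spin_lock_irqsave(&zone->lock, flags);
        isolated = isolate_freepages_block(zone, pfn, freelist);
        spin_unlock_irqrestore(&zone->lock, flags);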
[[email protected]: Fix initialisation of isolated]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Acked-by: Andrea Arcangeli <[email protected]>
Cc: Arthur Marsh <[email protected]>
Cc: Clemens Ladisch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Callers of find_get_pages(), or its wrapper pagevec_lookup() - notably
truncate_inode_pages_range() - stop looking further when it returns 0.
But if an interrupt comes just after its radix_tree_gang_lookup_slot(),
especially if we have preemptible RCU enabled, isn't it conceivable that
all 14 pages returned could be removed from the page cache by
shrink_page_list(), before find_get_pages() gets to process them? That
would cause it to return 0 although there may be plenty more pages beyond.
Make find_get_pages() and find_get_pages_tag() check for this unlikely
case, and restart should it occur; but callers of find_get_pages_contig()
have no such expectation, it's okay for that to return 0 early.
I have not seen this in practice, just worried by the possibility.
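The added check has essentially this shape (a sketch against the
gang-lookup loop's ret/nr_found variables):
        /*
         * If all the slots found were filled but we could not take a
         * reference to any of the pages, there may be more beyond: retry.
         */
        if (unlikely(!ret && nr_found))
                goto restart;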
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Nick Piggin <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Salman Qazi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The radix_tree_deref_retry() case in find_get_pages() has a strange little
excrescence, not seen in the other gang lookups: it looks like the start
of an abandoned attempt to guarantee forward progress in a case that
cannot arise.
ret should always be 0 here: if it isn't, then going back to restart will
leak references to pages already gotten. There used to be a comment
saying nr_found is necessarily 1 here: that's not quite true, but the
radix_tree_deref_retry() case is peculiar to the entry at index 0, when we
race with it being moved out of the radix_tree root or back.
Remove the worrisome two lines, add a brief comment here and in
find_get_pages_contig() and find_get_pages_tag(), and a WARN_ON in
find_get_pages() should it ever be seen elsewhere than at 0.
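One plausible form of that WARN_ON (assumed, not verbatim):
        if (radix_tree_deref_retry(page)) {
                /* only expected for the entry at index 0 */
                WARN_ON(start | i);
                goto restart;
        }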
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Nick Piggin <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Salman Qazi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When the user inserts a negative value into /proc/sys/vm/nr_hugepages it
will cause the kernel to allocate as many hugepages as possible and to
then update /proc/meminfo to reflect this.
This changes the behavior so that the negative input will result in
nr_hugepages value being unchanged.
Signed-off-by: Petr Holasek <[email protected]>
Signed-off-by: Anton Arapov <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When reclaiming for order-0 pages, kswapd requires that all zones be
balanced. Each cycle through balance_pgdat() does background ageing on
all zones if necessary and applies equal pressure on the inactive zone
unless a lot of pages are free already.
A "lot of free pages" is defined as a "balance gap" above the high
watermark which is currently 7*high_watermark. Historically this was
reasonable as min_free_kbytes was small. However, on systems using huge
pages, it is recommended that min_free_kbytes is higher and it is tuned
with hugeadm --set-recommended-min_free_kbytes. With the introduction of
transparent huge page support, this recommended value is also applied. On
X86-64 with 4G of memory, min_free_kbytes becomes 67584 so one would
expect around 68M of memory to be free. The Normal zone is approximately
35000 pages so under even normal memory pressure such as copying a large
file, it gets exhausted quickly. As it is getting exhausted, kswapd
applies pressure equally to all zones, including the DMA32 zone. DMA32 is
approximately 700,000 pages with a high watermark of around 23,000 pages.
In this situation, kswapd will reclaim around 23000*8 pages (where 8 is
the high watermark plus the balance gap of 7 times the high watermark),
or 718M of pages, before the zone is ignored. What the user sees is free
memory far higher than it should be.
To avoid an excessive number of pages being reclaimed from the larger
zones, this patch explicitly defines the "balance gap" to be either 1% of
the zone or the low watermark for the zone, whichever is smaller. While
kswapd will check all zones to apply pressure, it will ignore zones that
meet the (high_wmark + balance_gap) watermark.
To test this, 80G were copied from a partition and the amount of memory
being used was recorded. A comparison of patched and unpatched kernels
can be seen at
http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps
and shows that kswapd is not reclaiming as much memory with the patch
applied.
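The gap computation described above, as a sketch (assuming a
KSWAPD_ZONE_BALANCE_GAP_RATIO-style constant of 100 to express the 1%):
        /* gap = min(1% of the zone, the zone's low watermark) */
        balance_gap = min(low_wmark_pages(zone),
                (zone->present_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
                KSWAPD_ZONE_BALANCE_GAP_RATIO);
        if (zone_watermark_ok_safe(zone, order,
                        high_wmark_pages(zone) + balance_gap,
                        end_zone, 0))
                continue;       /* balanced enough; skip this zone */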
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: "Chen, Tim C" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The 'flags' field is already checked, no need to do it again.
Signed-off-by: Namhyung Kim <[email protected]>
Cc: Bob Liu <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Right now, if a mm_walk has either ->pte_entry or ->pmd_entry set, it will
unconditionally split any transparent huge pages it runs into. In
practice, that means that anyone doing a
cat /proc/$pid/smaps
will unconditionally break down every huge page in the process and depend
on khugepaged to re-collapse it later. This is fairly suboptimal.
This patch changes that behavior. It teaches each ->pmd_entry handler
(there are five) that they must break down the THPs themselves. Also, the
_generic_ code will never break down a THP unless a ->pte_entry handler is
actually set.
This means that the ->pmd_entry handlers can now choose to deal with THPs
without breaking them down.
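Each ->pmd_entry handler that wants pte granularity then opens with
something like this (a sketch; the handler name is hypothetical):
        static int some_pmd_entry(pmd_t *pmd, unsigned long addr,
                                  unsigned long end, struct mm_walk *walk)
        {
                /* the handler is now responsible for splitting THPs */
                split_huge_page_pmd(walk->mm, pmd);
                /* ... then walk the ptes as before ... */
                return 0;
        }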
[[email protected]: coding-style fixes]
Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: David Rientjes <[email protected]>
Reviewed-by: Eric B Munson <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Cc: Michael J Wolf <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matt Mackall <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
invalidate_mapping_pages() is a very strong hint to the reclaimer: it
means the user doesn't want to use the page any more. So, in order to
prevent working set page eviction, this patch moves the page to the tail
of the inactive list by setting PG_reclaim.
Please remember that pages on the inactive list are part of the working
set just as pages on the active list are. If we don't move pages to the
inactive list's tail, pages near the tail of the inactive list can be
evicted even though we have a strong clue about which pages are useless.
That's totally bad.
PG_readahead and PG_reclaim currently share a page flag bit. Commit
fe3cba17 added ClearPageReclaim to clear_page_dirty_for_io() to prevent a
readahead-marked page from being reclaimed too quickly.
In this series, PG_reclaim is used for invalidated pages, too. If the VM
finds that a page has been invalidated and it is dirty, it sets
PG_reclaim so the page is reclaimed as soon as possible. Then, when the
dirty page is written back, clear_page_dirty_for_io() clears PG_reclaim
unconditionally, which defeats this series' goal.
I think it's okay to clear PG_readahead when the page is dirtied, rather
than at writeback time, so this patch moves ClearPageReadahead. In v4,
ClearPageReadahead in set_page_dirty() had a problem, reported by Steven
Barrett, due to compound pages: some drivers (e.g. audio) call
set_page_dirty() with a compound page which isn't on the LRU, but my
patch did ClearPageReclaim on the compound page. With
non-CONFIG_PAGEFLAGS_EXTENDED, that breaks the PageTail flag.
I think it doesn't affect THP, and it passes my tests with THP enabled,
but Andrea is Cc'ed for a double check.
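The per-page deactivation might look like this (a condensed sketch;
locking and the zone/LRU lookups are elided):
        static void lru_deactivate_fn(struct page *page, void *arg)
        {
                ClearPageActive(page);
                ClearPageReferenced(page);
                if (PageWriteback(page) || PageDirty(page)) {
                        /*
                         * PG_reclaim rotates the page to the inactive
                         * tail once writeback completes.
                         */
                        SetPageReclaim(page);
                } else {
                        /* move it to the inactive tail right away */
                        list_move_tail(&page->lru, &zone->lru[lru].list);
                }
        }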
Signed-off-by: Minchan Kim <[email protected]>
Reported-by: Steven Barrett <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The rotate_reclaimable_page function moves just written out pages, which
the VM wanted to reclaim, to the end of the inactive list. That way the
VM will find those pages first next time it needs to free memory.
This patch applies the same rule to memcg. It can help prevent
unnecessary eviction of memcg working set pages.
Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Recently there have been reports of thrashing problems
(http://marc.info/?l=rsync&m=128885034930933&w=2) caused by backup
workloads (e.g. nightly rsync). That's because the workload produces
use-once pages but touches each page twice, which promotes the page to
the active list and so results in working set page eviction.
Some application developers want POSIX_FADV_NOREUSE to be supported, but
other OSes don't support it either.
(http://marc.info/?l=linux-mm&m=128928979512086&w=2)
As an alternative approach, application developers use
POSIX_FADV_DONTNEED, but it has a problem: if the kernel finds a page
under writeback during invalidate_mapping_pages, it can't invalidate it.
This makes the interface hard to use, since applications always have to
sync data before calling fadvise(..POSIX_FADV_DONTNEED) to make sure the
pages can be discarded. As a result, they can't benefit from the
kernel's deferred writes and may see a performance loss.
(http://insights.oetiker.ch/linux/fadvise.html)
In fact, invalidation is a very strong hint to the reclaimer: it means we
don't use the page any more. So let's move a page that is being written
to the inactive list's head if we can't truncate it right now.
The reason the page is moved to the head of the LRU in this patch is
that dirty/writeback pages will be flushed sooner or later anyway. This
prevents writeout via pageout(), which is less efficient than the
flusher's writeout.
Originally I reused Peter's lru_demote patch with some changes, so his
Signed-off-by is added.
Signed-off-by: Minchan Kim <[email protected]>
Reported-by: Ben Gamari <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Wu Fengguang <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch changes the anon_vma refcount to be 0 when the object is free.
It does this by counting the anon_vma being in use (i.e. the
anon_vma->head list not being empty) as one reference.
This allows a simpler release scheme without having to check both the
refcount and the list, and avoids taking a ref for each entry on the
list.
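With the in-use reference counted, the release path collapses to the
usual pattern (a sketch, not the verbatim code):
        void put_anon_vma(struct anon_vma *anon_vma)
        {
                /*
                 * Unlinking the last vma dropped the "in use" ref, so
                 * the final put really does free.
                 */
                if (atomic_dec_and_test(&anon_vma->refcount))
                        anon_vma_free(anon_vma);
        }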
Signed-off-by: Peter Zijlstra <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We need the anon_vma refcount unconditionally to simplify the anon_vma
lifetime rules.
Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The normal code pattern used in the kernel is: get/put.
Signed-off-by: Peter Zijlstra <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix kconfig dependency warning to satisfy dependencies:
warning: (PAGE_POISONING) selects DEBUG_PAGEALLOC which has unmet
direct dependencies (DEBUG_KERNEL && ARCH_SUPPORTS_DEBUG_PAGEALLOC &&
(!HIBERNATION || !PPC && !SPARC) && !KMEMCHECK)
Signed-off-by: Akinobu Mita <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
free_pcppages_bulk() frees pages from the pcp lists in a round-robin
fashion by keeping a batch_free counter. But it doesn't need to spin if
there is only one non-empty list, which can be detected by checking
batch_free == MIGRATE_PCPTYPES.
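The short-circuit is essentially (a sketch against free_pcppages_bulk()'s
counters):
        /*
         * This is the only non-empty list left; take everything from
         * it instead of spinning over the empty ones again.
         */
        if (batch_free == MIGRATE_PCPTYPES)
                batch_free = to_free;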
[[email protected]: fix comment]
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now that remove_from_page_cache() has been renamed to
delete_from_page_cache(), change the name of the internal page cache
handling function too, for consistency with __remove_from_swap_cache()
and remove_from_swap_cache().
Signed-off-by: Minchan Kim <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now that delete_from_page_cache() replaces remove_from_page_cache(),
remove the old function, so that filesystems or other code out of
mainline will notice at compile time and can be fixed.
Signed-off-by: Minchan Kim <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch series changes remove_from_page_cache()'s page ref counting
rule. The page cache ref count is now decreased in
delete_from_page_cache(), so callers no longer need to drop the page
reference themselves.
Signed-off-by: Minchan Kim <[email protected]>
Cc: Dan Magenheimer <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Al Viro <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch series changes remove_from_page_cache()'s page ref counting
rule. The page cache ref count is now decreased in
delete_from_page_cache(), so callers no longer need to drop the page
reference themselves.
Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Presently we increase the page refcount in add_to_page_cache() but don't
decrease it in remove_from_page_cache(). Such asymmetry adds confusion,
requiring that callers notice it and add a comment explaining why they
release a page reference. It's not a good API.
A long time ago, Hugh tried it (http://lkml.org/lkml/2004/10/24/140) but
gave up because reiser4's drop_page() had to unlock the page between
removing it from the page cache and doing the page_cache_release(). But
now the situation has changed: I think at least nothing in current
mainline has any such obstacle. The problem is for out-of-mainline
filesystems: if they do such things as reiser4 did, this patch could be a
problem, but they will discover that at compile time, since we remove
remove_from_page_cache().
This patch:
This function is just a wrapper around remove_from_page_cache(); the
difference is that it drops the page reference itself, so the caller has
to make sure it holds a page reference before calling.
This patch prepares for the removal of remove_from_page_cache().
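A sketch of the wrapper's shape:
        void delete_from_page_cache(struct page *page)
        {
                /*
                 * Same as remove_from_page_cache(), but also drops the
                 * page reference on behalf of the caller.
                 */
                remove_from_page_cache(page);
                page_cache_release(page);
        }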
Signed-off-by: Minchan Kim <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Edward Shishkin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This function basically does:
remove_from_page_cache(old);
page_cache_release(old);
add_to_page_cache_locked(new);
Except it does this atomically, so there's no possibility for the "add" to
fail because of a race.
If memory cgroups are enabled, then the memory cgroup charge is also moved
from the old page to the new.
This function is currently used by fuse to move pages into the page cache
on read, instead of copying the page contents.
[[email protected]: add freepage() hook to replace_page_cache_page()]
Signed-off-by: Miklos Szeredi <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A GUP user may want to try to acquire a reference to a page if it is
already in memory, but not if IO to bring it in is needed. For example,
KVM may tell the vcpu to schedule another guest process if the current
one is trying to access a swapped-out page. Meanwhile, the page will be
swapped in and the guest process that depends on it will be able to run
again.
This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and
FOLL_NOWAIT follow_page flags. FAULT_FLAG_RETRY_NOWAIT, when used in
conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that
it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY
instead.
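The flag translation in the GUP path is along these lines (a sketch;
local variable names assumed):
        /* FOLL_NOWAIT means retry semantics, but never sleep */
        if (foll_flags & FOLL_NOWAIT)
                fault_flags |= FAULT_FLAG_ALLOW_RETRY |
                               FAULT_FLAG_RETRY_NOWAIT;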
[[email protected]: improve FOLL_NOWAIT comment]
Signed-off-by: Gleb Natapov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Hugh Dickins <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
While looking at some other notifier callbacks I noticed this code could
use a simple cleanup.
The callback no longer needs the if (ret)/else conditional around
notifier_from_errno(); that same conditional is now done inside
notifier_from_errno().
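The cleanup pattern is (a sketch; notifier_from_errno(0) itself returns
NOTIFY_OK):
        /* before */
        if (ret)
                return notifier_from_errno(ret);
        return NOTIFY_OK;

        /* after */
        return notifier_from_errno(ret);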
Signed-off-by: Prarit Bhargava <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Li Zefan <[email protected]>
Acked-by: Pekka Enberg <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|