The block in sys_swapon which does the final adjustments to the
swap_info_struct and to swap_list is the same as the block which
re-inserts it at sys_swapoff on failure of try_to_unuse(), except
for the order of the operations within the lock. Since the order should
not matter, arbitrarily change sys_swapoff to match sys_swapon, in
preparation for making both share the same code.
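For reference, the shared block in question has roughly this shape (a
sketch assuming the 2.6.38-era swapfile.c fields, not the verbatim hunk):
        spin_lock(&swap_lock);
        p->flags |= SWP_WRITEOK;
        nr_swap_pages += p->pages;
        total_swap_pages += p->pages;
        /* insert the entry into the priority-ordered swap_list */
        prev = -1;
        for (i = swap_list.head; i >= 0; i = swap_info[i]->next) {
                if (p->prio >= swap_info[i]->prio)
                        break;
                prev = i;
        }
        p->next = i;
        if (prev < 0)
                swap_list.head = swap_list.next = p->type;
        else
                swap_info[prev]->next = p->type;
        spin_unlock(&swap_lock);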
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The block in sys_swapon which does the final adjustments to the
swap_info_struct and to swap_list is the same as the block which
re-inserts it at sys_swapoff on failure of try_to_unuse(). To be
able to make both share the same code, move the printk() call from the
middle of it to just after it.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It still exists within setup_swap_map_and_extents(), but after it
nr_good_pages == p->pages.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.
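As an illustration, the transformation is of this shape (a sketch;
sys_swapon does begin with a capability check of this kind):
        /* before */
        if (!capable(CAP_SYS_ADMIN)) {
                error = -EPERM;
                goto out;
        }
        /* after */
        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;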
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move the code which parses the bad block list and the extents to a
separate function. Only code movement, no functional changes.
This change uses the fact that, after the success path, nr_good_pages ==
p->pages.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The call to swap_cgroup_swapon is in the middle of loading the swap map
and extents. As it only does memory allocation and does not depend on
the swapfile layout (map/extents), it can be called earlier (or later).
Move it to just after the allocation of swap_map, since it is
conceptually similar (allocates a map).
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move the code which parses and checks the swapfile header (except for
the bad block list) to a separate function. Only code movement, no
functional changes.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There is no reason I can see to read inode->i_size long before it is
needed. Move its read to just before it is needed, to reduce the
variable lifetime.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Jesper Juhl <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move the code which claims the bdev (S_ISBLK) or locks the inode
(S_ISREG) to a separate function. Only code movement, no functional
changes.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
sys_swapon currently has two error labels, bad_swap and bad_swap_2.
bad_swap does the same as bad_swap_2 plus destroy_swap_extents() and
swap_cgroup_swapoff(); both are noops in the places where bad_swap_2 is
jumped to. With a single extra test for inode (matching the one in the
S_ISREG case below), all the error paths in the function can go to
bad_swap.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The only way error is 0 in the cleanup blocks is when the function is
returning successfully. In this case, the cleanup blocks were setting
S_SWAPFILE in the S_ISREG case. But this is not a cleanup.
Move the setting of S_SWAPFILE to just before the "goto out;" to make
this more clear. At this point, we do not need to test for inode because
it will never be NULL.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The bdev variable is always equivalent to (S_ISBLK(inode->i_mode) ?
p->bdev : NULL), as long as its assignment is moved slightly earlier. Use
this fact to remove the bdev variable.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move the setting of the error variable nearer the goto in a few places.
This avoids calling PTR_ERR() when not IS_ERR() in two places, and makes
the error condition more explicit in two others.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Jesper Juhl <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since mutex_lock(&inode->i_mutex) is called just after setting inode,
did_down is always equivalent to (inode && S_ISREG(inode->i_mode)).
Use this fact to remove the did_down variable.
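The unlock in the cleanup path then reads roughly (a sketch):
        if (inode && S_ISREG(inode->i_mode))
                mutex_unlock(&inode->i_mutex);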
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now there is nothing which jumps to the cleanup blocks before the name
variable is set. There is no need to set it initially to NULL anymore.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since there is no cleanup to do, there is no reason to jump to a label.
Return directly instead.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
At this point in sys_swapon, there is nothing to free. Return directly
instead of jumping to the cleanup block at the end of the function.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move the swap_info allocation to its own function. Only code movement,
no functional changes.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Within sys_swapon, after the swap_info entry has been allocated, we
always have type == p->type and swap_info[type] == p. Use this fact to
reduce the dependency on the "type" local variable within the function,
as a preparation to move the allocation of the swap_info entry to a
separate function.
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Changelogs belong in the git history instead of in the source code.
Also, "The swapon system call" is redundant with
"SYSCALL_DEFINE2(swapon, ...)".
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: Jesper Juhl <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
[ Gaah. That's a _historical_ comment. But the patch-series depends on removal ]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch series refactors the sys_swapon function.
sys_swapon is currently a very large function, with 313 lines (more than
12 25-line screens), which can make it a bit hard to read. This patch
series reduces this size by half, by extracting large chunks of related
code to new helper functions.
One of these chunks of code was nearly identical to the part of
sys_swapoff which is used in case of a failure return from
try_to_unuse(), so this patch series also makes both share the same
code.
As a side effect of all this refactoring, the compiled code gets a bit
smaller (from v1 of this patch series):
text data bss dec hex filename
14012 944 276 15232 3b80 mm/swapfile.o.before
13941 944 276 15161 3b39 mm/swapfile.o.after
This patch:
Use vzalloc() instead of vmalloc/memset.
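The change is of this shape (a sketch of the pattern, not the exact
hunk):
        /* before */
        swap_map = vmalloc(maxpages);
        if (!swap_map)
                goto bad_swap;
        memset(swap_map, 0, maxpages);
        /* after */
        swap_map = vzalloc(maxpages);
        if (!swap_map)
                goto bad_swap;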
Signed-off-by: Cesar Eduardo Barros <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: Jesper Juhl <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Pass __GFP_OTHER_NODE for transparent hugepages NUMA allocations done by the
hugepages daemon. This way the low level accounting for local versus
remote pages works correctly.
Contains improvements from Andrea Arcangeli
[[email protected]: coding-style fixes]
Signed-off-by: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in
zone_statistics() that an allocation is on behalf of another thread. This
way the local and remote counters can be still correct, even when
background daemons like khugepaged are changing memory mappings.
This only affects the accounting, but I think it's worth doing that right
to avoid confusing users.
I first tried to just pass down the right node, but this required a lot of
changes to pass down this parameter and at least one addition of a 10th
argument to a 9 argument function. Using the flag is a lot less
intrusive.
Open question: should this also be used for migration?
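A sketch of how the flag could be consumed in zone_statistics() (shape
assumed, not verbatim):
        static inline void zone_statistics(struct zone *preferred_zone,
                                           struct zone *z, gfp_t flags)
        {
                /*
                 * Account the page against the node the work is done on
                 * behalf of, not the node of the daemon doing the
                 * allocation.
                 */
                if (z->node == ((flags & __GFP_OTHER_NODE) ?
                                preferred_zone->node : numa_node_id()))
                        __inc_zone_state(z, NUMA_LOCAL);
                else
                        __inc_zone_state(z, NUMA_OTHER);
        }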
[[email protected]: coding-style fixes]
Signed-off-by: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
__GFP_NO_KSWAPD allocations are usually very expensive and not mandatory
to succeed, as they have a graceful fallback. Waiting for I/O in them
tends to be overkill in terms of latency, so we can reduce their latency
by disabling sync migration.
Unfortunately, even with async migration it's still possible for the
process to be blocked waiting for a request slot (e.g. get_request_wait
in the block layer) when ->writepage is called. To prevent
__GFP_NO_KSWAPD blocking, this patch prevents ->writepage being called on
dirty page cache for asynchronous migration.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=31142
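The asynchronous case then skips ->writepage along these lines (a sketch
of the idea, not the exact hunk):
        if (PageDirty(page)) {
                /* only allow writeback in full synchronous migration */
                if (!sync)
                        return -EBUSY;
                return writeout(mapping, page);
        }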
[[email protected]: Avoid writebacks for NFS, retry locked pages, use bool]
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Cc: Arthur Marsh <[email protected]>
Cc: Clemens Ladisch <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Minchan Kim <[email protected]>
Reported-by: Alex Villacis Lasso <[email protected]>
Tested-by: Alex Villacis Lasso <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
compaction_alloc() isolates pages for migration in isolate_migratepages.
While it's scanning, IRQs are disabled on the mistaken assumption the
scanning should be short. Tests show this to be true for the most part
but contention times on the LRU lock can be increased. Before this patch,
the IRQ disabled times for a simple test looked like
Total sampled time IRQs off (not real total time): 5493
Event shrink_inactive_list..shrink_zone 1596 us count 1
Event shrink_inactive_list..shrink_zone 1530 us count 1
Event shrink_inactive_list..shrink_zone 956 us count 1
Event shrink_inactive_list..shrink_zone 541 us count 1
Event shrink_inactive_list..shrink_zone 531 us count 1
Event split_huge_page..add_to_swap 232 us count 1
Event save_args..call_softirq 36 us count 1
Event save_args..call_softirq 35 us count 2
Event __wake_up..__wake_up 1 us count 1
This patch reduces the worst-case IRQs-disabled latencies by releasing
the lock every SWAP_CLUSTER_MAX pages scanned and releasing the CPU if
necessary. The cost is that the process performing compaction will be
slower, but IRQs being disabled for too long has worse consequences, as
the following report shows:
Total sampled time IRQs off (not real total time): 4367
Event shrink_inactive_list..shrink_zone 881 us count 1
Event shrink_inactive_list..shrink_zone 875 us count 1
Event shrink_inactive_list..shrink_zone 868 us count 1
Event shrink_inactive_list..shrink_zone 555 us count 1
Event split_huge_page..add_to_swap 495 us count 1
Event compact_zone..compact_zone_order 269 us count 1
Event split_huge_page..add_to_swap 266 us count 1
Event shrink_inactive_list..shrink_zone 85 us count 1
Event save_args..call_softirq 36 us count 2
Event __wake_up..__wake_up 1 us count 1
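The periodic release in the migration scanner looks roughly like this (a
sketch; variable names assumed):
        /* give IRQs a chance every SWAP_CLUSTER_MAX scanned pages */
        if (!(low_pfn % SWAP_CLUSTER_MAX) && locked) {
                spin_unlock_irq(&zone->lru_lock);
                locked = false;
                cond_resched();
                spin_lock_irq(&zone->lru_lock);
                locked = true;
        }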
[[email protected]: simplify with s/unlocked/locked/]
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Arthur Marsh <[email protected]>
Cc: Clemens Ladisch <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
compaction_alloc() isolates free pages to be used as migration targets.
While it's scanning, IRQs are disabled on the mistaken assumption the
scanning should be short. Analysis showed that IRQs were in fact being
disabled for substantial time. A simple test was run using large
anonymous mappings with transparent hugepage support enabled to trigger
frequent compactions. A monitor sampled what the worst IRQ-off latencies
were and a post-processing tool found the following:
Total sampled time IRQs off (not real total time): 22355
Event compaction_alloc..compaction_alloc 8409 us count 1
Event compaction_alloc..compaction_alloc 7341 us count 1
Event compaction_alloc..compaction_alloc 2463 us count 1
Event compaction_alloc..compaction_alloc 2054 us count 1
Event shrink_inactive_list..shrink_zone 1864 us count 1
Event shrink_inactive_list..shrink_zone 88 us count 1
Event save_args..call_softirq 36 us count 1
Event save_args..call_softirq 35 us count 2
Event __make_request..__blk_run_queue 24 us count 1
Event __alloc_pages_nodemask..__alloc_pages_nodemask 6 us count 1
i.e. compaction disabled IRQs for a prolonged period of time, 8ms in
one instance. The full report generated by the tool can be found at
http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-vanilla-micro.report
This patch reduces the time IRQs are disabled by simply disabling IRQs at
the last possible minute. An updated IRQs-off summary report then looks
like:
Total sampled time IRQs off (not real total time): 5493
Event shrink_inactive_list..shrink_zone 1596 us count 1
Event shrink_inactive_list..shrink_zone 1530 us count 1
Event shrink_inactive_list..shrink_zone 956 us count 1
Event shrink_inactive_list..shrink_zone 541 us count 1
Event shrink_inactive_list..shrink_zone 531 us count 1
Event split_huge_page..add_to_swap 232 us count 1
Event save_args..call_softirq 36 us count 1
Event save_args..call_softirq 35 us count 2
Event __wake_up..__wake_up 1 us count 1
A full report is again available at
http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-minimiseirq-free-v1r4-micro.report
As should be obvious, IRQ disabled latencies due to compaction are
almost eliminated for this particular test.
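The fix amounts to taking zone->lock only around the isolation step in
the free-page scanner, roughly (a sketch; helper names as in the
mm/compaction.c of the time, but not the verbatim hunk):
        /* scan for a suitable block without holding the lock... */
        if (!suitable_migration_target(page))
                continue;
        /* ...and disable IRQs only around the isolation itself */
        spin_lock_irqsave(&zone->lock, flags);
        isolated = isolate_freepages_block(zone, pfn, freelist);
        spin_unlock_irqrestore(&zone->lock, flags);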
[[email protected]: Fix initialisation of isolated]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Acked-by: Andrea Arcangeli <[email protected]>
Cc: Arthur Marsh <[email protected]>
Cc: Clemens Ladisch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Callers of find_get_pages(), or its wrapper pagevec_lookup() - notably
truncate_inode_pages_range() - stop looking further when it returns 0.
But if an interrupt comes just after its radix_tree_gang_lookup_slot(),
especially if we have preemptible RCU enabled, isn't it conceivable that
all 14 pages returned could be removed from the page cache by
shrink_page_list(), before find_get_pages() gets to process them? That
would cause it to return 0 although there may be plenty more pages beyond.
Make find_get_pages() and find_get_pages_tag() check for this unlikely
case, and restart should it occur; but callers of find_get_pages_contig()
have no such expectation, it's okay for that to return 0 early.
I have not seen this in practice, just worried by the possibility.
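The added check has essentially this shape (a sketch against the
gang-lookup loop's ret/nr_found variables):
        /*
         * If all the slots found were filled but we could not take a
         * reference to any of the pages, there may be more beyond: retry.
         */
        if (unlikely(!ret && nr_found))
                goto restart;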
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Nick Piggin <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Salman Qazi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The radix_tree_deref_retry() case in find_get_pages() has a strange little
excrescence, not seen in the other gang lookups: it looks like the start
of an abandoned attempt to guarantee forward progress in a case that
cannot arise.
ret should always be 0 here: if it isn't, then going back to restart will
leak references to pages already gotten. There used to be a comment
saying nr_found is necessarily 1 here: that's not quite true, but the
radix_tree_deref_retry() case is peculiar to the entry at index 0, when we
race with it being moved out of the radix_tree root or back.
Remove the worrisome two lines, add a brief comment here and in
find_get_pages_contig() and find_get_pages_tag(), and a WARN_ON in
find_get_pages() should it ever be seen elsewhere than at 0.
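One plausible form of that WARN_ON (assumed, not verbatim):
        if (radix_tree_deref_retry(page)) {
                /* only expected for the entry at index 0 */
                WARN_ON(start | i);
                goto restart;
        }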
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Nick Piggin <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Salman Qazi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When the user inserts a negative value into /proc/sys/vm/nr_hugepages it
will cause the kernel to allocate as many hugepages as possible and to
then update /proc/meminfo to reflect this.
This changes the behavior so that the negative input will result in
nr_hugepages value being unchanged.
Signed-off-by: Petr Holasek <[email protected]>
Signed-off-by: Anton Arapov <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Eric B Munson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When reclaiming for order-0 pages, kswapd requires that all zones be
balanced. Each cycle through balance_pgdat() does background ageing on
all zones if necessary and applies equal pressure on the inactive zone
unless a lot of pages are free already.
A "lot of free pages" is defined as a "balance gap" above the high
watermark which is currently 7*high_watermark. Historically this was
reasonable as min_free_kbytes was small. However, on systems using huge
pages, it is recommended that min_free_kbytes is higher and it is tuned
with hugeadm --set-recommended-min_free_kbytes. With the introduction of
transparent huge page support, this recommended value is also applied. On
X86-64 with 4G of memory, min_free_kbytes becomes 67584 so one would
expect around 68M of memory to be free. The Normal zone is approximately
35000 pages so under even normal memory pressure such as copying a large
file, it gets exhausted quickly. As it is getting exhausted, kswapd
applies pressure equally to all zones, including the DMA32 zone. DMA32 is
approximately 700,000 pages with a high watermark of around 23,000 pages.
In this situation, kswapd will reclaim around 23000*8 pages (where 8 is
the high watermark plus the balance gap of 7 times the high watermark),
or 718M of pages, before the zone is ignored. What the user sees is free
memory far higher than it should be.
To avoid an excessive number of pages being reclaimed from the larger
zones, this patch explicitly defines the "balance gap" to be either 1% of
the zone or the low watermark for the zone, whichever is smaller. While
kswapd will check all zones to apply pressure, it will ignore zones that
meet the (high_wmark + balance_gap) watermark.
To test this, 80G were copied from a partition and the amount of memory
being used was recorded. A comparison of patched and unpatched kernels
can be seen at
http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps
and shows that kswapd is not reclaiming as much memory with the patch
applied.
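The gap computation described above, as a sketch (assuming a
KSWAPD_ZONE_BALANCE_GAP_RATIO-style constant of 100 to express the 1%):
        /* gap = min(1% of the zone, the zone's low watermark) */
        balance_gap = min(low_wmark_pages(zone),
                (zone->present_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
                KSWAPD_ZONE_BALANCE_GAP_RATIO);
        if (zone_watermark_ok_safe(zone, order,
                        high_wmark_pages(zone) + balance_gap,
                        end_zone, 0))
                continue;       /* balanced enough; skip this zone */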
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: "Chen, Tim C" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The 'flags' field is already checked, no need to do it again.
Signed-off-by: Namhyung Kim <[email protected]>
Cc: Bob Liu <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Right now, if a mm_walk has either ->pte_entry or ->pmd_entry set, it will
unconditionally split any transparent huge pages it runs into. In
practice, that means that anyone doing a
cat /proc/$pid/smaps
will unconditionally break down every huge page in the process and depend
on khugepaged to re-collapse it later. This is fairly suboptimal.
This patch changes that behavior. It teaches each ->pmd_entry handler
(there are five) that they must break down the THPs themselves. Also, the
_generic_ code will never break down a THP unless a ->pte_entry handler is
actually set.
This means that the ->pmd_entry handlers can now choose to deal with THPs
without breaking them down.
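Each ->pmd_entry handler that wants pte granularity then opens with
something like this (a sketch; the handler name is hypothetical):
        static int some_pmd_entry(pmd_t *pmd, unsigned long addr,
                                  unsigned long end, struct mm_walk *walk)
        {
                /* the handler is now responsible for splitting THPs */
                split_huge_page_pmd(walk->mm, pmd);
                /* ... then walk the ptes as before ... */
                return 0;
        }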
[[email protected]: coding-style fixes]
Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: David Rientjes <[email protected]>
Reviewed-by: Eric B Munson <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Cc: Michael J Wolf <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matt Mackall <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
invalidate_mapping_pages() is a very strong hint to the reclaimer: it
means the user doesn't want to use the page any more. So, in order to
prevent working set page eviction, this patch moves the page to the tail
of the inactive list by setting PG_reclaim.
Please remember that pages on the inactive list are part of the working
set just as pages on the active list are. If we don't move pages to the
inactive list's tail, pages near the tail of the inactive list can be
evicted even though we have a strong clue about which pages are useless.
That's totally bad.
PG_readahead and PG_reclaim currently share a page flag bit. Commit
fe3cba17 added ClearPageReclaim to clear_page_dirty_for_io() to prevent a
readahead-marked page from being reclaimed too quickly.
In this series, PG_reclaim is used for invalidated pages, too. If the VM
finds that a page has been invalidated and it is dirty, it sets
PG_reclaim so the page is reclaimed as soon as possible. Then, when the
dirty page is written back, clear_page_dirty_for_io() clears PG_reclaim
unconditionally, which defeats this series' goal.
I think it's okay to clear PG_readahead when the page is dirtied, rather
than at writeback time, so this patch moves ClearPageReadahead. In v4,
ClearPageReadahead in set_page_dirty() had a problem, reported by Steven
Barrett, due to compound pages: some drivers (e.g. audio) call
set_page_dirty() with a compound page which isn't on the LRU, but my
patch did ClearPageReclaim on the compound page. With
non-CONFIG_PAGEFLAGS_EXTENDED, that breaks the PageTail flag.
I think it doesn't affect THP, and it passes my tests with THP enabled,
but Andrea is Cc'ed for a double check.
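The per-page deactivation might look like this (a condensed sketch;
locking and the zone/LRU lookups are elided):
        static void lru_deactivate_fn(struct page *page, void *arg)
        {
                ClearPageActive(page);
                ClearPageReferenced(page);
                if (PageWriteback(page) || PageDirty(page)) {
                        /*
                         * PG_reclaim rotates the page to the inactive
                         * tail once writeback completes.
                         */
                        SetPageReclaim(page);
                } else {
                        /* move it to the inactive tail right away */
                        list_move_tail(&page->lru, &zone->lru[lru].list);
                }
        }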
Signed-off-by: Minchan Kim <[email protected]>
Reported-by: Steven Barrett <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The rotate_reclaimable_page function moves just written out pages, which
the VM wanted to reclaim, to the end of the inactive list. That way the
VM will find those pages first next time it needs to free memory.
This patch applies the same rule to memcg. It can help prevent
unnecessary eviction of memcg working set pages.
Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Recently there have been reports of thrashing problems
(http://marc.info/?l=rsync&m=128885034930933&w=2) caused by backup
workloads (e.g. nightly rsync). That's because the workload produces
use-once pages but touches each page twice, which promotes the page to
the active list and so results in working set page eviction.
Some application developers want POSIX_FADV_NOREUSE to be supported, but
other OSes don't support it either.
(http://marc.info/?l=linux-mm&m=128928979512086&w=2)
As an alternative approach, application developers use
POSIX_FADV_DONTNEED, but it has a problem: if the kernel finds a page
under writeback during invalidate_mapping_pages, it can't invalidate it.
This makes the interface hard to use, since applications always have to
sync data before calling fadvise(..POSIX_FADV_DONTNEED) to make sure the
pages can be discarded. As a result, they can't benefit from the
kernel's deferred writes and may see a performance loss.
(http://insights.oetiker.ch/linux/fadvise.html)
In fact, invalidation is a very strong hint to the reclaimer: it means we
don't use the page any more. So let's move a page that is being written
to the inactive list's head if we can't truncate it right now.
The reason the page is moved to the head of the LRU in this patch is
that dirty/writeback pages will be flushed sooner or later anyway. This
prevents writeout via pageout(), which is less efficient than the
flusher's writeout.
Originally I reused Peter's lru_demote patch with some changes, so his
Signed-off-by is added.
Signed-off-by: Minchan Kim <[email protected]>
Reported-by: Ben Gamari <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Wu Fengguang <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch changes the anon_vma refcount to be 0 when the object is free.
It does this by counting the anon_vma being in use (i.e. the
anon_vma->head list not being empty) as one reference.
This allows a simpler release scheme without having to check both the
refcount and the list, and avoids taking a ref for each entry on the
list.
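With the in-use reference counted, the release path collapses to the
usual pattern (a sketch, not the verbatim code):
        void put_anon_vma(struct anon_vma *anon_vma)
        {
                /*
                 * Unlinking the last vma dropped the "in use" ref, so
                 * the final put really does free.
                 */
                if (atomic_dec_and_test(&anon_vma->refcount))
                        anon_vma_free(anon_vma);
        }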
Signed-off-by: Peter Zijlstra <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We need the anon_vma refcount unconditionally to simplify the anon_vma
lifetime rules.
Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The normal code pattern used in the kernel is: get/put.
Signed-off-by: Peter Zijlstra <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix kconfig dependency warning to satisfy dependencies:
warning: (PAGE_POISONING) selects DEBUG_PAGEALLOC which has unmet
direct dependencies (DEBUG_KERNEL && ARCH_SUPPORTS_DEBUG_PAGEALLOC &&
(!HIBERNATION || !PPC && !SPARC) && !KMEMCHECK)
Signed-off-by: Akinobu Mita <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
free_pcppages_bulk() frees pages from the pcp lists in a round-robin
fashion by keeping a batch_free counter. But it doesn't need to spin if
there is only one non-empty list, which can be detected by checking
batch_free == MIGRATE_PCPTYPES.
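The short-circuit is essentially (a sketch against free_pcppages_bulk()'s
counters):
        /*
         * This is the only non-empty list left; take everything from
         * it instead of spinning over the empty ones again.
         */
        if (batch_free == MIGRATE_PCPTYPES)
                batch_free = to_free;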
[[email protected]: fix comment]
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now that remove_from_page_cache() has been renamed to
delete_from_page_cache(), change the name of the internal page cache
handling function too, for consistency with __remove_from_swap_cache()
and remove_from_swap_cache().
Signed-off-by: Minchan Kim <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now that delete_from_page_cache() replaces remove_from_page_cache(),
remove the old function, so that filesystems or other code out of
mainline will notice at compile time and can be fixed.
Signed-off-by: Minchan Kim <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch series changes remove_from_page_cache()'s page ref counting
rule. The page cache ref count is now decreased in
delete_from_page_cache(), so callers no longer need to drop the page
reference themselves.
Signed-off-by: Minchan Kim <[email protected]>
Cc: Dan Magenheimer <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Al Viro <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch series changes remove_from_page_cache()'s page ref counting
rule. The page cache ref count is now decreased in
delete_from_page_cache(), so callers no longer need to drop the page
reference themselves.
Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Presently we increase the page refcount in add_to_page_cache() but don't
decrease it in remove_from_page_cache(). Such asymmetry adds confusion,
requiring that callers notice it and add a comment explaining why they
release a page reference. It's not a good API.
A long time ago, Hugh tried it (http://lkml.org/lkml/2004/10/24/140) but
gave up because reiser4's drop_page() had to unlock the page between
removing it from the page cache and doing the page_cache_release(). But
now the situation has changed: I think at least nothing in current
mainline has any such obstacle. The problem is for out-of-mainline
filesystems: if they do such things as reiser4 did, this patch could be a
problem, but they will discover that at compile time, since we remove
remove_from_page_cache().
This patch:
This function is just a wrapper around remove_from_page_cache(); the
difference is that it drops the page reference itself, so the caller has
to make sure it holds a page reference before calling.
This patch prepares for the removal of remove_from_page_cache().
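A sketch of the wrapper's shape:
        void delete_from_page_cache(struct page *page)
        {
                /*
                 * Same as remove_from_page_cache(), but also drops the
                 * page reference on behalf of the caller.
                 */
                remove_from_page_cache(page);
                page_cache_release(page);
        }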
Signed-off-by: Minchan Kim <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Edward Shishkin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This function basically does:
remove_from_page_cache(old);
page_cache_release(old);
add_to_page_cache_locked(new);
Except it does this atomically, so there's no possibility for the "add" to
fail because of a race.
If memory cgroups are enabled, then the memory cgroup charge is also moved
from the old page to the new.
This function is currently used by fuse to move pages into the page cache
on read, instead of copying the page contents.
[[email protected]: add freepage() hook to replace_page_cache_page()]
Signed-off-by: Miklos Szeredi <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A GUP user may want to try to acquire a reference to a page if it is
already in memory, but not if IO to bring it in is needed. For example,
KVM may tell the vcpu to schedule another guest process if the current
one is trying to access a swapped-out page. Meanwhile, the page will be
swapped in and the guest process that depends on it will be able to run
again.
This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and
FOLL_NOWAIT follow_page flags. FAULT_FLAG_RETRY_NOWAIT, when used in
conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that
it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY
instead.
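The flag translation in the GUP path is along these lines (a sketch;
local variable names assumed):
        /* FOLL_NOWAIT means retry semantics, but never sleep */
        if (foll_flags & FOLL_NOWAIT)
                fault_flags |= FAULT_FLAG_ALLOW_RETRY |
                               FAULT_FLAG_RETRY_NOWAIT;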
[[email protected]: improve FOLL_NOWAIT comment]
Signed-off-by: Gleb Natapov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Hugh Dickins <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
While looking at some other notifier callbacks I noticed this code could
use a simple cleanup.
The callback no longer needs the if (ret)/else conditional around
notifier_from_errno(); that same conditional is now done inside
notifier_from_errno().
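The cleanup pattern is (a sketch; notifier_from_errno(0) itself returns
NOTIFY_OK):
        /* before */
        if (ret)
                return notifier_from_errno(ret);
        return NOTIFY_OK;

        /* after */
        return notifier_from_errno(ret);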
Signed-off-by: Prarit Bhargava <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Li Zefan <[email protected]>
Acked-by: Pekka Enberg <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|