path: root/mm
Age  Commit message  Author  Files  Lines
2011-01-13mm: kswapd: keep kswapd awake for high-order allocations until a percentage of the node is balancedMel Gorman1-9/+49
When reclaiming for high orders, kswapd is responsible for balancing a node but it should not reclaim excessively. It avoids excessive reclaim by treating the node as balanced if any zone in the node is balanced. In the cases where there are imbalanced zone sizes (e.g. ZONE_DMA with both ZONE_DMA32 and ZONE_NORMAL), kswapd can go to sleep prematurely because just one small zone was balanced.

This alters the sleep logic of kswapd slightly. It counts the number of pages that make up the balanced zones. If the total number of balanced pages is more than a quarter of the node, kswapd will go back to sleep. This should keep a node balanced without reclaiming an excessive number of pages.

Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Reviewed-by: Eric B Munson <[email protected]> Cc: Simon Kirby <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
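As a rough illustration of the check described above (a hedged sketch only; the function name, the exact threshold and the committed code may differ):

  /*
   * Sketch: kswapd may go back to sleep once the zones that are balanced
   * add up to more than a quarter of the node's pages.
   */
  static bool node_is_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
                               int classzone_idx)
  {
          unsigned long present_pages = 0;
          int i;

          for (i = 0; i <= classzone_idx; i++)
                  present_pages += pgdat->node_zones[i].present_pages;

          return balanced_pages > (present_pages >> 2);
  }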
2011-01-13mm: kswapd: stop high-order balancing when any suitable zone is balancedMel Gorman2-12/+64
Simon Kirby reported the following problem:

   We're seeing cases on a number of servers where cache never fully grows to use all available memory. Sometimes we see servers with 4 GB of memory that never seem to have less than 1.5 GB free, even with a constantly-active VM. In some cases, these servers also swap out while this happens, even though they are constantly reading the working set into memory. We have been seeing this happening for a long time; I don't think it's anything recent, and it still happens on 2.6.36.

After some debugging work by Simon, Dave Hansen and others, the prevailing theory became that kswapd is being too aggressive about reclaiming the order-3 pages requested by SLUB.

There are two apparent problems here. On the target machine, there is a small Normal zone in comparison to DMA32. As kswapd tries to balance all zones, it would continually try reclaiming for Normal even though DMA32 was balanced enough for callers. The second problem is that sleeping_prematurely() does not use the same logic as balance_pgdat() when deciding whether to sleep or not. This keeps kswapd artificially awake.

A number of tests were run and the figures from previous postings will look very different for a few reasons. One, the old figures were forcing my network card to use GFP_ATOMIC in an attempt to replicate Simon's problem. Second, I previously specified slub_min_order=3 again in an attempt to reproduce Simon's problem. In this posting, I'm depending on Simon to say whether his problem is fixed or not and these figures are to show the impact on the ordinary cases. Finally, the "vmscan" figures are taken from /proc/vmstat instead of the tracepoints. There is less information but recording is less disruptive.

The first test of relevance was postmark with a process running in the background reading a large amount of anonymous memory in blocks. The objective was to vaguely simulate what was happening on Simon's machine and it's memory intensive enough to keep kswapd awake.

POSTMARK                                 traceonly          kanyzone
Transactions per second:             156.00 ( 0.00%)    153.00 (-1.96%)
Data megabytes read per second:       21.51 ( 0.00%)     21.52 ( 0.05%)
Data megabytes written per second:    29.28 ( 0.00%)     29.11 (-0.58%)
Files created alone per second:      250.00 ( 0.00%)    416.00 (39.90%)
Files create/transact per second:     79.00 ( 0.00%)     76.00 (-3.95%)
Files deleted alone per second:      520.00 ( 0.00%)    420.00 (-23.81%)
Files delete/transact per second:     79.00 ( 0.00%)     76.00 (-3.95%)

MMTests Statistics: duration
User/Sys Time Running Test (seconds)      16.58     17.4
Total Elapsed Time (seconds)             218.48   222.47

VMstat Reclaim Statistics: vmscan
Direct reclaims                               0        4
Direct reclaim pages scanned                  0      203
Direct reclaim pages reclaimed                0      184
Kswapd pages scanned                     326631   322018
Kswapd pages reclaimed                   312632   309784
Kswapd low wmark quickly                      1        4
Kswapd high wmark quickly                   122      475
Kswapd skip congestion_wait                   1        0
Pages activated                          700040   705317
Pages deactivated                        212113   203922
Pages written                              9875     6363
Total pages scanned                      326631   322221
Total pages reclaimed                    312632   309968
%age total pages scanned/reclaimed       95.71%   96.20%
%age total pages scanned/written          3.02%    1.97%

proc vmstat: Faults
Major Faults                                300      254
Minor Faults                             645183   660284
Page ins                                 493588   486704
Page outs                               4960088  4986704
Swap ins                                   1230      661
Swap outs                                  9869     6355

Performance is mildly affected because kswapd is no longer doing as much work and the background memory consumer process is getting in the way. Note that kswapd scanned and reclaimed fewer pages as it's less aggressive and overall fewer pages were scanned and reclaimed.
Swap in/out is particularly reduced, again reflecting kswapd throwing out fewer pages. The slight performance impact is unfortunate here but it looks like a direct result of kswapd being less aggressive. As the bug report is about too many pages being freed by kswapd, it may have to be accepted for now.

The second test is a streaming IO benchmark that was previously used by Johannes to show regressions in page reclaim.

MICRO                                  traceonly   kanyzone
User/Sys Time Running Test (seconds)       29.29      28.87
Total Elapsed Time (seconds)              492.18     488.79

VMstat Reclaim Statistics: vmscan
Direct reclaims                             2128       1460
Direct reclaim pages scanned             2284822    1496067
Direct reclaim pages reclaimed            148919     110937
Kswapd pages scanned                    15450014   16202876
Kswapd pages reclaimed                   8503697    8537897
Kswapd low wmark quickly                    3100       3397
Kswapd high wmark quickly                   1860       7243
Kswapd skip congestion_wait                  708        801
Pages activated                             9635       9573
Pages deactivated                           1432       1271
Pages written                                223       1130
Total pages scanned                     17734836   17698943
Total pages reclaimed                    8652616    8648834
%age total pages scanned/reclaimed        48.79%     48.87%
%age total pages scanned/written           0.00%      0.01%

proc vmstat: Faults
Major Faults                                 165        221
Minor Faults                             9655785    9656506
Page ins                                    3880       7228
Page outs                               37692940   37480076
Swap ins                                       0         69
Swap outs                                     19         15

Again fewer pages are scanned and reclaimed as expected and this time the test completed faster. Note that kswapd is hitting its watermarks faster (low and high wmark quickly) which I expect is due to kswapd reclaiming fewer pages.

I also ran fs-mark, iozone and sysbench but there is nothing interesting to report in the figures. Performance is not significantly changed and the reclaim statistics look reasonable.

This patch:

When the allocator enters its slow path, kswapd is woken up to balance the node. It continues working until all zones within the node are balanced. For order-0 allocations, this makes perfect sense but for higher orders it can have unintended side-effects. If the zone sizes are imbalanced, kswapd may reclaim heavily within a smaller zone, discarding an excessive number of pages. The user-visible behaviour is that kswapd is awake and reclaiming even though plenty of pages are free from a suitable zone.

This patch alters the "balance" logic for high-order reclaim, allowing kswapd to stop if any suitable zone becomes balanced, to reduce the number of pages it reclaims from other zones. kswapd still tries to ensure that order-0 watermarks for all zones are met before sleeping.

Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Reviewed-by: Eric B Munson <[email protected]> Cc: Simon Kirby <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: remove likely() from grab_cache_page_write_begin()Steven Rostedt1-1/+1
Running the annotated branch profiler on a box doing average work (firefox, evolution, xchat, distcc farm), the likely() used in grab_cache_page_write_begin() was incorrect most of the time:

 correct incorrect  %        Function                     File       Line
 ------- ---------  -        --------                     ----       ----
 1924262  71332401 97 grab_cache_page_write_begin    filemap.c       2206

Adding a trace_printk() and running the function tracer limited to just this function I can see:

gconfd-2-2696 [000] 4467.268935: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=7
gconfd-2-2696 [000] 4467.268946: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268947: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=8
gconfd-2-2696 [000] 4467.268959: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268960: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=9
gconfd-2-2696 [000] 4467.268972: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268973: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=10
gconfd-2-2696 [000] 4467.268991: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268992: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=11
gconfd-2-2696 [000] 4467.269005: grab_cache_page_write_begin <-ext3_write_begin

This shows that a lot of calls from ext3_write_begin result in the page returned by "find_lock_page" being NULL.

Signed-off-by: Steven Rostedt <[email protected]> Acked-by: Nick Piggin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
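For illustration, the change amounts to dropping the annotation on this test (paraphrased sketch, not the exact hunk; the wrapper name is hypothetical):

  /* Hot path of grab_cache_page_write_begin(), paraphrased. */
  static struct page *lookup_page_for_write(struct address_space *mapping,
                                            pgoff_t index)
  {
          struct page *page = find_lock_page(mapping, index);

          /* Was: if (likely(page)) -- but the profile shows NULL ~97% of the time. */
          if (page)
                  return page;

          return NULL;    /* caller falls back to allocating a new page */
  }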
2011-01-13vmalloc: remove redundant unlikely()Tobias Klauser1-1/+1
IS_ERR() already implies unlikely(), so it can be omitted here. Signed-off-by: Tobias Klauser <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
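The pattern being cleaned up looks roughly like this (illustrative sketch; the helper name is hypothetical):

  #include <linux/err.h>

  static void *check_vmap_result(void *va)
  {
          /*
           * Was: if (unlikely(IS_ERR(va))) -- redundant, because IS_ERR()
           * already wraps its test in unlikely().
           */
          if (IS_ERR(va))
                  return va;

          /* ... continue with the successfully allocated area ... */
          return va;
  }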
2011-01-13mempolicy: remove tasklist_lock from migrate_pagesKOSAKI Motohiro1-3/+0
Today, tasklist_lock in migrate_pages doesn't protect anything. rcu_read_lock() provides enough protection for the pid hash walk. Signed-off-by: KOSAKI Motohiro <[email protected]> Reported-by: Peter Zijlstra <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mlock: do not hold mmap_sem for extended periods of timeMichel Lespinasse4-25/+47
__get_user_pages gets a new 'nonblocking' parameter to signal that the caller is prepared to re-acquire mmap_sem and retry the operation if needed. This is used to split off long operations if they are going to block on a disk transfer, or when we detect contention on the mmap_sem. [[email protected]: remove ref to rwsem_is_contended()] Signed-off-by: Michel Lespinasse <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Nick Piggin <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: David Howells <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: move VM_LOCKED check to __mlock_vma_pages_range()Michel Lespinasse1-13/+13
Use a single code path for faulting in pages during mlock. The reason to have it in this patch series is that I did not want to update both code paths in a later change that releases mmap_sem when blocking on disk. Signed-off-by: Michel Lespinasse <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Nick Piggin <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: David Howells <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: add FOLL_MLOCK follow_page flag.Michel Lespinasse2-60/+27
Move the code to mlock pages from __mlock_vma_pages_range() to follow_page(). This allows __mlock_vma_pages_range() to not have to break down work into 16-page batches. An additional motivation for doing this within the present patch series is that it'll make it easier for a later change to drop mmap_sem when blocking on disk (we'd like to be able to resume at the page that was read from disk instead of at the start of a 16-page batch). Signed-off-by: Michel Lespinasse <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Nick Piggin <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: David Howells <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mlock: only hold mmap_sem in shared mode when faulting in pagesMichel Lespinasse1-17/+64
Currently mlock() holds mmap_sem in exclusive mode while the pages get faulted in. In the case of a large mlock, this can potentially take a very long time, during which various commands such as 'ps auxw' will block. This makes sysadmins unhappy:

real    14m36.232s
user    0m0.003s
sys     0m0.015s

(output from 'time ps auxw' while a 20GB file was being mlocked without being previously preloaded into page cache)

I propose that mlock() could release mmap_sem after the VM_LOCKED bits have been set in all appropriate VMAs. Then a second pass could be done to actually mlock the pages, in small batches, releasing mmap_sem when we block on disk access or when we detect some contention.

This patch:

Before this change, mlock() holds mmap_sem in exclusive mode while the pages get faulted in. In the case of a large mlock, this can potentially take a very long time. Various things will block while mmap_sem is held, including 'ps auxw'. This can make sysadmins angry.

I propose that mlock() could release mmap_sem after the VM_LOCKED bits have been set in all appropriate VMAs. Then a second pass could be done to actually mlock the pages with mmap_sem held for reads only. We need to recheck the vma flags after we re-acquire mmap_sem, but this is easy.

In the case where a vma has been munlocked before mlock completes, pages that were already marked as PageMlocked() are handled by the munlock() call, and mlock() is careful to not mark new page batches as PageMlocked() after the munlock() call has cleared the VM_LOCKED vma flags. So, the end result will be identical to what'd happen if munlock() had executed after the mlock() call.

In a later change, I will allow the second pass to release mmap_sem when blocking on disk accesses or when it is otherwise contended, so that it won't be held for long periods of time even in shared mode.

Signed-off-by: Michel Lespinasse <[email protected]> Tested-by: Valdis Kletnieks <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Nick Piggin <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: David Howells <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
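A rough sketch of the two-pass structure described above (hypothetical function name; the real code walks VMAs and batches faults, which is elided here):

  static int mlock_pages_two_pass(struct mm_struct *mm,
                                  unsigned long start, size_t len)
  {
          int error = 0;

          /* Pass 1: mark the affected VMAs VM_LOCKED under the exclusive lock. */
          down_write(&mm->mmap_sem);
          /* ... walk VMAs in [start, start + len) and set VM_LOCKED ... */
          up_write(&mm->mmap_sem);

          /* Pass 2: fault the pages in while holding mmap_sem for read only. */
          down_read(&mm->mmap_sem);
          /*
           * Recheck vma->vm_flags here: a racing munlock() may have cleared
           * VM_LOCKED, in which case new pages must not be marked PageMlocked.
           */
          /* ... __mlock_vma_pages_range() in batches ... */
          up_read(&mm->mmap_sem);

          return error;
  }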
2011-01-13mlock: avoid dirtying pages and triggering writebackMichel Lespinasse2-2/+12
When faulting in pages for mlock(), we want to break COW for anonymous or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is no need to write-fault into VM_SHARED vmas since shared file pages can be mlocked first and dirtied later, when/if they actually get written to. Skipping the write fault is desirable, as we don't want to unnecessarily cause these pages to be dirtied and queued for writeback. Signed-off-by: Michel Lespinasse <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Kosaki Motohiro <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Theodore Tso <[email protected]> Cc: Michael Rubin <[email protected]> Cc: Suleiman Souhlal <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
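The decision boils down to a flag choice along these lines (sketch with a hypothetical helper; the surrounding plumbing is omitted):

  /*
   * Only request a write fault (to break COW) for writable, non-shared
   * mappings.  Shared file pages can be mlocked clean and dirtied later,
   * when/if they are actually written.
   */
  static unsigned int mlock_gup_flags(struct vm_area_struct *vma)
  {
          unsigned int gup_flags = FOLL_TOUCH | FOLL_MLOCK;

          if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
                  gup_flags |= FOLL_WRITE;

          return gup_flags;
  }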
2011-01-13do_wp_page: clarify dirty_page handlingMichel Lespinasse1-34/+38
Reorganize the code so that dirty pages are handled closer to the place that makes them dirty (handling write fault into shared, writable VMAs). No behavior changes. Signed-off-by: Michel Lespinasse <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Kosaki Motohiro <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Theodore Tso <[email protected]> Cc: Michael Rubin <[email protected]> Cc: Suleiman Souhlal <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13do_wp_page: remove the 'reuse' flagMichel Lespinasse1-6/+5
mlocking a shared, writable vma currently causes the corresponding pages to be marked as dirty and queued for writeback. This seems rather unnecessary given that the pages are not being actually modified during mlock. It is understood that for non-shared mappings (file or anon) we want to use a write fault in order to break COW, but there is just no such need for shared mappings.

The first two patches in this series do not introduce any behavior change. The intent there is to make it obvious that dirtying file pages is only done in the (writable, shared) case. I think this clarifies the code, but I wouldn't mind dropping these two patches if there is no consensus about them.

The last patch is where we actually avoid dirtying shared mappings during mlock. Note that as a side effect of this, we won't call page_mkwrite() for the mappings that define it, and won't be pre-allocating data blocks at the FS level if the mapped file was sparsely allocated. My understanding is that mlock does not need to provide such a guarantee, as evidenced by the fact that it never did for the filesystems that don't define page_mkwrite() - including some common ones like ext3. However, I would like to gather feedback on this from filesystem people as a precaution. If this turns out to be a showstopper, maybe block preallocation can be added back on using a different interface.

Large shared mlocks are getting significantly (>2x) faster in my tests, as the disk can be fully used for reading the file instead of having to share between this and writeback.

This patch: Reorganize the code to remove the 'reuse' flag. No behavior changes.

Signed-off-by: Michel Lespinasse <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Kosaki Motohiro <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Theodore Tso <[email protected]> Cc: Michael Rubin <[email protected]> Cc: Suleiman Souhlal <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: clear PageError bit in msync & fsyncRik van Riel1-1/+1
Temporary IO failures, e.g. due to loss of both multipath paths, can permanently leave the PageError bit set on a page, resulting in msync or fsync returning -EIO over and over again, even if IO is now getting to the disk correctly. We already clear the AS_ENOSPC and AS_EIO bits in mapping->flags in the filemap_fdatawait_range function. Also clearing the PageError bit on the page allows subsequent msync or fsync calls on this file to return without an error, if the subsequent IO succeeds. Unfortunately data written out in the msync or fsync call that returned -EIO can still get lost, because the page dirty bit appears to not get restored on IO error. However, the alternative could be potentially all of memory filling up with uncleanable dirty pages, hanging the system, so there is no nice choice here... Signed-off-by: Rik van Riel <[email protected]> Acked-by: Valerie Aurora <[email protected]> Acked-by: Jeff Layton <[email protected]> Cc: Theodore Ts'o <[email protected]> Acked-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
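Conceptually the per-page wait loop changes along these lines (hedged sketch with a hypothetical wrapper; the exact hunk in filemap_fdatawait_range() may differ):

  static int wait_and_collect_writeback_error(struct page *page)
  {
          int ret = 0;

          wait_on_page_writeback(page);
          /*
           * TestClearPageError() reports the error for this call while also
           * clearing the sticky bit, so a later fsync()/msync() can succeed
           * once IO recovers.
           */
          if (TestClearPageError(page))
                  ret = -EIO;

          return ret;
  }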
2011-01-13mm: unify module_alloc code for vmallocDavid Rientjes1-21/+29
Four architectures (arm, mips, sparc, x86) use __vmalloc_area() for module_alloc(). Much of the code is duplicated and can be generalized in a globally accessible function, __vmalloc_node_range(). __vmalloc_node() now calls into __vmalloc_node_range() with a range of [VMALLOC_START, VMALLOC_END) for functionally equivalent behavior. Each architecture may then use __vmalloc_node_range() directly to remove the duplication of code. Signed-off-by: David Rientjes <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Russell King <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
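The shape of the unified interface, and how an architecture's module_alloc() might use it, is roughly the following (illustrative sketch; parameter order and flags may differ from the committed version):

  void *__vmalloc_node_range(unsigned long size, unsigned long align,
                             unsigned long start, unsigned long end,
                             gfp_t gfp_mask, pgprot_t prot,
                             int node, void *caller);

  void *module_alloc(unsigned long size)
  {
          if (!size)
                  return NULL;
          return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
                                      GFP_KERNEL, PAGE_KERNEL_EXEC, -1,
                                      __builtin_return_address(0));
  }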
2011-01-13mm: remove gfp mask from pcpu_get_vm_areasDavid Rientjes2-13/+10
pcpu_get_vm_areas() only uses GFP_KERNEL allocations, so remove the gfp_t formal and use the mask internally. Signed-off-by: David Rientjes <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: remove unused get_vm_area_nodeDavid Rientjes1-7/+0
get_vm_area_node() is unused in the kernel and can thus be removed. Signed-off-by: David Rientjes <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: vmscan: rename lumpy_mode to reclaim_modeMel Gorman1-35/+35
With compaction being used instead of lumpy reclaim, the name lumpy_mode and the associated variables are a bit misleading. Rename lumpy_mode to reclaim_mode, which is a better fit. There is no functional change. Signed-off-by: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Rik van Riel <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Andy Whitcroft <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: compaction: perform a faster migration scan when migrating asynchronouslyMel Gorman1-0/+15
try_to_compact_pages() is initially called to only migrate pages asynchronously and kswapd always compacts asynchronously. Both are being optimistic so it is important to complete the work as quickly as possible to minimise stalls. This patch alters the scanner when asynchronous to only consider MIGRATE_MOVABLE pageblocks as migration candidates. This reduces stalls when allocating huge pages while not impairing allocation success rates as a full scan will be performed if necessary after direct reclaim. Signed-off-by: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Rik van Riel <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Andy Whitcroft <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
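Conceptually the migration scanner gains a filter like this when running asynchronously (hedged sketch; it assumes a sync flag in struct compact_control, and names may differ from the committed code):

  static bool suitable_migration_candidate(struct compact_control *cc,
                                           struct page *page)
  {
          /*
           * Asynchronous compaction only takes candidates from
           * MIGRATE_MOVABLE pageblocks; a synchronous pass after direct
           * reclaim will do the full scan if it is still needed.
           */
          if (!cc->sync &&
              get_pageblock_migratetype(page) != MIGRATE_MOVABLE)
                  return false;

          return true;
  }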
2011-01-13mm: migration: cleanup migrate_pages API by matching types for offlining and syncMel Gorman4-8/+10
With the introduction of the boolean sync parameter, the API looks a little inconsistent as offlining is still an int. Convert offlining to a bool for the sake of being tidy. Signed-off-by: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Rik van Riel <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Andy Whitcroft <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: migration: allow migration to operate asynchronously and avoid synchronous compaction in the faster pathMel Gorman7-26/+49
Migration synchronously waits for writeback if the initial passes fail. Callers of memory compaction do not necessarily want this behaviour if the caller is latency sensitive or expects that synchronous migration is not going to have a significantly better success rate. This patch adds a sync parameter to migrate_pages() allowing the caller to indicate if wait_on_page_writeback() is allowed within migration or not. For reclaim/compaction, try_to_compact_pages() is first called asynchronously, direct reclaim runs and then try_to_compact_pages() is called synchronously as there is a greater expectation that it'll succeed. [[email protected]: build/merge fix] Signed-off-by: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Rik van Riel <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Andy Whitcroft <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
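The resulting API shape is roughly the following (sketch; argument names are illustrative, and the offlining type reflects the separate cleanup in the entry above):

  /* 'sync' controls whether migration may call wait_on_page_writeback(). */
  int migrate_pages(struct list_head *from, new_page_t get_new_page,
                    unsigned long private, bool offlining, bool sync);

  /*
   * Reclaim/compaction then tries the cheap pass first:
   *   try_to_compact_pages(..., sync == false)  - optimistic, no writeback waits
   *   <direct reclaim>
   *   try_to_compact_pages(..., sync == true)   - higher expectation of success
   */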
2011-01-13mm: vmscan: reclaim order-0 and use compaction instead of lumpy reclaimMel Gorman4-49/+175
Lumpy reclaim is disruptive. It reclaims a large number of pages and ignores the age of the pages it reclaims. This can incur significant stalls and potentially increase the number of major faults. Compaction has reached the point where it is considered reasonably stable (meaning it has passed a lot of testing) and is a potential candidate for displacing lumpy reclaim. This patch introduces an alternative to lumpy reclaim when compaction is available, called reclaim/compaction. The basic operation is very simple - instead of selecting a contiguous range of pages to reclaim, a number of order-0 pages are reclaimed and then compaction is applied later by either kswapd (compact_zone_order()) or direct compaction (__alloc_pages_direct_compact()). [[email protected]: fix build] [[email protected]: use conventional task_struct naming] Signed-off-by: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Rik van Riel <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Andy Whitcroft <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: vmscan: convert lumpy_mode into a bitmaskMel Gorman1-18/+28
Currently lumpy_mode is an enum and determines if lumpy reclaim is off, synchronous or asynchronous. In preparation for using compaction instead of lumpy reclaim, this patch converts the flags into a bitmap. Signed-off-by: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Andy Whitcroft <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: compaction: add trace events for memory compaction activityMel Gorman1-1/+13
In preparation for patches promoting the use of memory compaction over lumpy reclaim, this patch adds trace points for memory compaction activity. Using them, we can monitor the scanning activity of the migration and free page scanners as well as the number and success rates of pages passed to page migration. Signed-off-by: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Andy Whitcroft <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: convert sprintf_symbol to %pSJoe Perches2-14/+6
Signed-off-by: Joe Perches <[email protected]> Acked-by: Pekka Enberg <[email protected]> Cc: Jiri Kosina <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
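The conversion is of this general form (illustrative before/after; function names are hypothetical):

  #include <linux/kallsyms.h>
  #include <linux/kernel.h>

  /* Before: resolve the symbol into a stack buffer, then print it. */
  static void report_caller_old(unsigned long addr)
  {
          char symbol[KSYM_SYMBOL_LEN];

          sprint_symbol(symbol, addr);
          printk(KERN_ERR "last caller: %s\n", symbol);
  }

  /* After: let printk resolve the symbol directly via %pS. */
  static void report_caller_new(unsigned long addr)
  {
          printk(KERN_ERR "last caller: %pS\n", (void *)addr);
  }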
2011-01-13mm: find_get_pages_contig fixletNick Piggin1-3/+10
Testing ->mapping and ->index without a ref is not stable as the page may have been reused at this point. Signed-off-by: Nick Piggin <[email protected]> Reviewed-by: Wu Fengguang <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13vmscan: factor out kswapd sleeping logic from kswapd()KOSAKI Motohiro1-46/+46
Currently, kswapd() has deep nesting and is slightly hard to read. Clean this up. Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm/page-writeback.c: fix __set_page_dirty_no_writeback() return valueBob Liu1-1/+1
__set_page_dirty_no_writeback() should return true if it actually transitioned the page from a clean to dirty state although it seems nobody uses its return value at present. Signed-off-by: Bob Liu <[email protected]> Acked-by: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
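A sketch of the corrected helper, matching the behaviour described above (the exact committed code may differ slightly):

  static int __set_page_dirty_no_writeback(struct page *page)
  {
          /*
           * TestSetPageDirty() returns the old bit, so the function now
           * returns non-zero only when this call transitioned the page
           * from clean to dirty.
           */
          if (!PageDirty(page))
                  return !TestSetPageDirty(page);
          return 0;
  }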
2011-01-13mm: vmstat: use a single setter function and callback for adjusting percpu thresholdsMel Gorman2-31/+24
reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid errors due to counter drift. The functions duplicate some code so this patch replaces them with a single set_pgdat_percpu_threshold() that takes a callback function to calculate the desired threshold as a parameter. [[email protected]: readability tweak] [[email protected]: set_pgdat_percpu_threshold(): don't use for_each_online_cpu] Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
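The consolidated setter takes a callback that computes the per-zone threshold, along these lines (hedged sketch; the real prototype and per-cpu plumbing may differ):

  void set_pgdat_percpu_threshold(pg_data_t *pgdat,
                                  int (*calculate_pressure)(struct zone *))
  {
          struct zone *zone;
          int cpu, threshold, i;

          for (i = 0; i < pgdat->nr_zones; i++) {
                  zone = &pgdat->node_zones[i];
                  if (!zone->percpu_drift_mark)
                          continue;

                  threshold = calculate_pressure(zone);
                  for_each_possible_cpu(cpu)
                          per_cpu_ptr(zone->pageset, cpu)->stat_threshold
                                  = threshold;
          }
  }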
2011-01-13mm: page allocator: adjust the per-cpu counter threshold when memory is lowMel Gorman4-40/+107
Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory is low") noted that watermarks were based on the vmstat NR_FREE_PAGES. To avoid synchronization overhead, these counters are maintained on a per-cpu basis and drained both periodically and when the per-cpu delta rises above a threshold. On large CPU systems, the difference between the estimate and real value of NR_FREE_PAGES can be very high. The system can get into a case where pages are allocated far below the min watermark, potentially causing livelock issues. The commit solved the problem by taking a better reading of NR_FREE_PAGES when memory was low.

Unfortunately, as reported by Shaohua Li, this accurate reading can consume a large amount of CPU time on systems with many sockets due to cache line bouncing. This patch takes a different approach. For large machines where counter drift might be unsafe and while kswapd is awake, the per-cpu thresholds for the target pgdat are reduced to limit the level of drift to what should be a safe level. This incurs a performance penalty under heavy memory pressure by a factor that depends on the workload and the machine, but the machine should function correctly without accidentally exhausting all memory on a node. There is an additional cost when kswapd wakes and sleeps but the event is not expected to be frequent - in Shaohua's test case, there was one recorded sleep and wake event at least.

To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is introduced that takes a more accurate reading of NR_FREE_PAGES when called from wakeup_kswapd, when deciding whether it is really safe to go back to sleep in sleeping_prematurely() and when deciding if a zone is really balanced or not in balance_pgdat(). We are still using an expensive function but limiting how often it is called.

When the test case is reproduced, the time spent in the watermark functions is reduced. The following report is on the percentage of time cumulatively spent in the functions zone_nr_free_pages(), zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot(), zone_page_state().

vanilla              11.6615%
disable-threshold     0.2584%

David said:

  We had to pull aa454840 "mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36 internally because tests showed that it would cause the machine to stall as the result of heavy kswapd activity. I merged it back with this fix as it is pending in the -mm tree and it solves the issue we were seeing, so I definitely think this should be pushed to -stable (and I would seriously consider it for 2.6.37 inclusion even at this late date).

Signed-off-by: Mel Gorman <[email protected]> Reported-by: Shaohua Li <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Tested-by: Nicolas Bareil <[email protected]> Cc: David Rientjes <[email protected]> Cc: Kyle McMartin <[email protected]> Cc: <[email protected]> [2.6.37.1, 2.6.36.x] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
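The safe watermark check falls back to the drift-corrected counter only when the cheap estimate is close to the watermark, roughly as sketched below (based on the description above; the committed version may differ in detail):

  bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
                              int classzone_idx, int alloc_flags)
  {
          long free_pages = zone_page_state(z, NR_FREE_PAGES);

          /*
           * Only pay for the expensive snapshot when per-cpu drift could
           * actually push us below the watermark.
           */
          if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
                  free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);

          return __zone_watermark_ok(z, order, mark, classzone_idx,
                                     alloc_flags, free_pages);
  }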
2011-01-13Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds1-3/+4
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
  block: ensure that completion error gets properly traced
  blktrace: add missing probe argument to block_bio_complete
  block cfq: don't use atomic_t for cfq_group
  block cfq: don't use atomic_t for cfq_queue
  block: trace event block fix unassigned field
  block: add internal hd part table references
  block: fix accounting bug on cross partition merges
  kref: add kref_test_and_get
  bio-integrity: mark kintegrityd_wq highpri and CPU intensive
  block: make kblockd_workqueue smarter
  Revert "sd: implement sd_check_events()"
  block: Clean up exit_io_context() source code.
  Fix compile warnings due to missing removal of a 'ret' variable
  fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
  block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
  cfq-iosched: don't check cfqg in choose_service_tree()
  fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
  cdrom: export cdrom_check_events()
  sd: implement sd_check_events()
  sr: implement sr_check_events()
  ...
2011-01-13Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivialLinus Torvalds5-5/+5
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
  Documentation/trace/events.txt: Remove obsolete sched_signal_send.
  writeback: fix global_dirty_limits comment runtime -> real-time
  ppc: fix comment typo singal -> signal
  drivers: fix comment typo diable -> disable.
  m68k: fix comment typo diable -> disable.
  wireless: comment typo fix diable -> disable.
  media: comment typo fix diable -> disable.
  remove doc for obsolete dynamic-printk kernel-parameter
  remove extraneous 'is' from Documentation/iostats.txt
  Fix spelling milisec -> ms in snd_ps3 module parameter description
  Fix spelling mistakes in comments
  Revert conflicting V4L changes
  i7core_edac: fix typos in comments
  mm/rmap.c: fix comment
  sound, ca0106: Fix assignment to 'channel'.
  hrtimer: fix a typo in comment
  init/Kconfig: fix typo
  anon_inodes: fix wrong function name in comment
  fix comment typos concerning "consistent"
  poll: fix a typo in comment
  ...

Fix up trivial conflicts in:
  - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
  - fs/ext4/ext4.h

Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.
2011-01-12Merge branch 'apei' into releaseLen Brown1-0/+1
2011-01-12ACPI, APEI, Generic Hardware Error Source POLL/IRQ/NMI notification type supportHuang Ying1-0/+1
Generic Hardware Error Source provides a way to report platform hardware errors (such as those from the chipset). It works in so-called "Firmware First" mode, that is, hardware errors are reported to firmware first, then reported to Linux by firmware. This way, some non-standard hardware error registers or non-standard hardware links can be checked by firmware to produce more valuable hardware error information for Linux.

This patch adds support for the POLL/IRQ/NMI notification types. Because the memory area used to transfer hardware error information from BIOS to Linux can be determined only in the NMI, IRQ or timer handler, and general ioremap can not be used in atomic context, a special atomic version of ioremap is implemented for that.

Known issue:
- Error information can not be printed for recoverable errors notified via NMI, because printk is not NMI-safe. Will fix this via delaying printing to IRQ context via irq_work or making printk NMI-safe.

v2:
- adjust printk format per comments.

Signed-off-by: Huang Ying <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Signed-off-by: Len Brown <[email protected]>
2011-01-11slub: Avoid use of slub_lock in show_slab_objects()Christoph Lameter1-2/+2
The purpose of the locking is to prevent removal and additions of nodes when statistics are gathered for a slab cache. So we need to avoid racing with memory hotplug functionality. It is enough to take the memory hotplug locks there instead of the slub_lock. online_pages() currently does not acquire the memory_hotplug lock. Another patch will be submitted by the memory hotplug authors to take the memory hotplug lock and describe the uses of the memory hotplug lock to protect against adding and removal of nodes from non hotplug data structures. Cc: <[email protected]> # 2.6.37 Reported-and-tested-by: Bart Van Assche <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Christoph Lameter <[email protected]> Signed-off-by: Pekka Enberg <[email protected]>
2011-01-11memory hotplug: one more lock on memory hotplugKAMEZAWA Hiroyuki1-0/+4
Now, memory_hotplug_(un)lock() is used for adding/removing/offlining pages to avoid races with hibernation, but it should be held in online_pages(), too; the current situation is asymmetric. There are cases where one has to avoid a race between the memory hotplug notifier and one's own local code, and hotplug vs. hotplug. This adds a generic solution for avoiding such races. Viewed another way, having the lock here has no big impact: onlining pages tends to be done by a udev script or similar against each memory section one by one, so it's better to take the lock here, too. Cc: <[email protected]> # 2.6.37 Reviewed-by: Christoph Lameter <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Pekka Enberg <[email protected]>
2011-01-10Merge branch 'for-linus' of ↵Linus Torvalds2-22/+46
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  slub: Fix a crash during slabinfo -v
  tracing/slab: Move kmalloc tracepoint out of inline code
  slub: Fix slub_lock down/up imbalance
  slub: Fix build breakage in Documentation/vm
  slub tracing: move trace calls out of always inlined functions to reduce kernel code size
  slub: move slabinfo.c to tools/slub/slabinfo.c
2011-01-09Merge branch 'slab/next' into for-linusPekka Enberg2-22/+46
2011-01-07Merge branch 'for-2.6.38' of ↵Linus Torvalds3-45/+118
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits)
  gameport: use this_cpu_read instead of lookup
  x86: udelay: Use this_cpu_read to avoid address calculation
  x86: Use this_cpu_inc_return for nmi counter
  x86: Replace uses of current_cpu_data with this_cpu ops
  x86: Use this_cpu_ops to optimize code
  vmstat: User per cpu atomics to avoid interrupt disable / enable
  irq_work: Use per cpu atomics instead of regular atomics
  cpuops: Use cmpxchg for xchg to avoid lock semantics
  x86: this_cpu_cmpxchg and this_cpu_xchg operations
  percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support
  percpu,x86: relocate this_cpu_add_return() and friends
  connector: Use this_cpu operations
  xen: Use this_cpu_inc_return
  taskstats: Use this_cpu_ops
  random: Use this_cpu_inc_return
  fs: Use this_cpu_inc_return in buffer.c
  highmem: Use this_cpu_xx_return() operations
  vmstat: Use this_cpu_inc_return for vm statistics
  x86: Support for this_cpu_add, sub, dec, inc_return
  percpu: Generic support for this_cpu_add, sub, dec, inc_return
  ...

Fixed up conflicts: in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c} as per Tejun.
2011-01-07Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds2-2/+2
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits)
  usb: don't use flush_scheduled_work()
  speedtch: don't abuse struct delayed_work
  media/video: don't use flush_scheduled_work()
  media/video: explicitly flush request_module work
  ioc4: use static work_struct for ioc4_load_modules()
  init: don't call flush_scheduled_work() from do_initcalls()
  s390: don't use flush_scheduled_work()
  rtc: don't use flush_scheduled_work()
  mmc: update workqueue usages
  mfd: update workqueue usages
  dvb: don't use flush_scheduled_work()
  leds-wm8350: don't use flush_scheduled_work()
  mISDN: don't use flush_scheduled_work()
  macintosh/ams: don't use flush_scheduled_work()
  vmwgfx: don't use flush_scheduled_work()
  tpm: don't use flush_scheduled_work()
  sonypi: don't use flush_scheduled_work()
  hvsi: don't use flush_scheduled_work()
  xen: don't use flush_scheduled_work()
  gdrom: don't use flush_scheduled_work()
  ...

Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c as per Tejun.
2011-01-07fs: icache RCU free inodesNick Piggin1-1/+8
RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want to take i_lock no longer need to take sb_inode_list_lock to walk the list in the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code.
- Could potentially simplify things a bit in VM land. Do not need to take the page lock to follow page->mapping.

The downside of this is the performance cost of using RCU. In a simple creat/unlink microbenchmark, performance drops by about 10% due to inability to reuse cache-hot slab objects. As iterations increase and RCU freeing starts kicking over, this increases to about 20%. In cases where inode lifetimes are longer (i.e. many inodes may be allocated during the average life span of a single inode), a lot of this cache reuse is not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU, however this adds some complexity to list walking and store-free path walking, so I prefer to implement this at a later date, if it is shown to be a win in real situations. I haven't found a regression in any non-micro benchmark so I doubt it will be a problem.

Signed-off-by: Nick Piggin <[email protected]>
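The mechanism is the usual call_rcu() pattern for the inode slab, approximately as below (a sketch: it assumes an i_rcu head in struct inode and refers to the inode_cachep slab cache in fs/inode.c; details differ in the actual patch):

  static void inode_free_rcu(struct rcu_head *head)
  {
          struct inode *inode = container_of(head, struct inode, i_rcu);

          kmem_cache_free(inode_cachep, inode);
  }

  static void destroy_inode_rcu(struct inode *inode)
  {
          /* Defer the actual free until an RCU grace period has elapsed. */
          call_rcu(&inode->i_rcu, inode_free_rcu);
  }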
2011-01-07fs: dcache remove dcache_lockNick Piggin1-3/+0
dcache_lock no longer protects anything. Remove it. Signed-off-by: Nick Piggin <[email protected]>
2011-01-07kernel: kmem_ptr_validate considered harmfulNick Piggin4-97/+1
This is a nasty and error prone API. It is no longer used, remove it. Signed-off-by: Nick Piggin <[email protected]>
2011-01-04writeback: fix global_dirty_limits comment runtime -> real-timeMinchan Kim1-1/+1
Change "runtime" to "real-time". Cc: Wu Fengguang <[email protected]> Signed-off-by: Minchan Kim <[email protected]> Signed-off-by: Jiri Kosina <[email protected]>
2010-12-30memcg: fix wrong VM_BUG_ON() in try_charge()'s mm->owner checkKAMEZAWA Hiroyuki1-10/+9
At __mem_cgroup_try_charge(), VM_BUG_ON(!mm->owner) is checked. But as commented in mem_cgroup_from_task(), mm->owner can be NULL in some racy case. This check of VM_BUG_ON() is bad. A possible story to hit this is at swapoff()->try_to_unuse(). It passes mm_struct to mem_cgroup_try_charge_swapin() while mm->owner is NULL. If we can't get proper mem_cgroup from swap_cgroup information, mm->owner is used as charge target and we see NULL. Cc: Daisuke Nishimura <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Reported-by: Hugh Dickins <[email protected]> Reported-by: Thomas Meyer <[email protected]> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Reviewed-by: Balbir Singh <[email protected]> Signed-off-by: Hugh Dickins <[email protected]> Cc: [email protected] Signed-off-by: Linus Torvalds <[email protected]>
2010-12-29memblock: Make find_memory_core_early() find from top-downYinghai Lu1-1/+33
That function is used to find RAM in a node or of bootmem type. We should make it top-down so it will be consistent with memblock_find, and to avoid allocating potentially valuable low memory before we actually need it. Suggested-by: Jeremy Fitzhardinge <[email protected]> Signed-off-by: Yinghai Lu <[email protected]> LKML-Reference: <[email protected]> Signed-off-by: H. Peter Anvin <[email protected]>
2010-12-27Merge branch 'nommu-fixes-for-linus' of ↵Linus Torvalds1-1/+27
git://git.kernel.org/pub/scm/linux/kernel/git/lethal/nommu-2.6 * 'nommu-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/nommu-2.6: nommu: Provide stubbed alloc/free_vm_area() implementation. nommu: Fix up vmalloc_node() symbol export regression.
2010-12-27mm/rmap.c: fix commentFigo.zhang1-1/+1
clean up comment. Signed-off-by: Figo.zhang <[email protected]> Acked-by: Rik van Riel <[email protected]> Signed-off-by: Jiri Kosina <[email protected]>
2010-12-24Merge branch 'for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: print out alloc information with KERN_DEBUG instead of KERN_INFO kthread_work: make lockdep happy
2010-12-24nommu: Provide stubbed alloc/free_vm_area() implementation.Paul Mundt1-1/+26
Now that these have been introduced in to the vmalloc API, sync up the nommu side of things. At present we don't deal with VMAs as such, so for the time being these will simply BUG() out. In the future it should be possible to support this interface by layering on top of the vm_regions. Signed-off-by: Paul Mundt <[email protected]>
2010-12-24nommu: Fix up vmalloc_node() symbol export regression.Paul Mundt1-0/+1
Commit e1ca778 ("mm: add vzalloc() and vzalloc_node() helpers") ended up accidentally deleting the vmalloc_node() symbol export, resulting in:

  "vmalloc_node" [net/core/pktgen.ko] undefined!
  "vmalloc_node" [net/netfilter/x_tables.ko] undefined!

regressions. Signed-off-by: Paul Mundt <[email protected]>