path: root/mm
Age | Commit message | Author | Files | Lines
2013-09-11 | mm: correct the comment about the value for buddy _mapcount | Wang Sheng-Hui | 1 | -4/+7
Set _mapcount to PAGE_BUDDY_MAPCOUNT_VALUE to mark the page as a buddy, rather than referring to the magic number -2 directly. Signed-off-by: Wang Sheng-Hui <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
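For reference, a minimal sketch of how the buddy allocator tags a free page, mirroring the helpers of this era (the exact VM_BUG_ON placement is illustrative):

    /* PAGE_BUDDY_MAPCOUNT_VALUE (-128) marks a page as owned by the buddy allocator */
    static inline void __SetPageBuddy(struct page *page)
    {
        VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
        atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
    }

    static inline void __ClearPageBuddy(struct page *page)
    {
        VM_BUG_ON(!PageBuddy(page));
        atomic_set(&page->_mapcount, -1);   /* back to the "no mappings" state */
    }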
2013-09-11 | mm/page-writeback.c: add strictlimit feature | Maxim Patlasov | 1 | -61/+202
The feature prevents untrusted filesystems (e.g. FUSE mounts created by unprivileged users) from growing a large number of dirty pages before throttling. For such filesystems balance_dirty_pages always checks bdi counters against bdi limits. I.e. even if the global "nr_dirty" is under "freerun", it's not allowed to skip the bdi checks. The only use case for now is fuse: it sets bdi max_ratio to 1% by default and system administrators are supposed to expect that this limit won't be exceeded. The feature is on if a BDI is marked by the BDI_CAP_STRICTLIMIT flag. A filesystem may set the flag when it initializes its BDI. The problematic scenario comes from the fact that nobody pays attention to the NR_WRITEBACK_TEMP counter (i.e. the number of pages under fuse writeback). The implementation of fuse writeback releases the original page (by calling end_page_writeback) almost immediately. A fuse request queued for real processing bears a copy of the original page. Hence, if the userspace fuse daemon doesn't finalize write requests in a timely manner, an aggressive mmap writer can pollute virtually all memory with those temporary fuse page copies. They are carefully accounted in NR_WRITEBACK_TEMP, but nobody cares. To make further explanations shorter, let me use "NR_WRITEBACK_TEMP problem" as a shortcut for "the possibility of uncontrolled growth of the amount of RAM consumed by temporary pages allocated by kernel fuse to process writeback". The problem was very easy to reproduce. There is a trivial example filesystem implementation in the fuse userspace distribution: fusexmp_fh.c. I added "sleep(1);" to the write methods, then recompiled and mounted it. Then I created a huge file on the mount point and ran a simple program which mmap-ed the file to a memory region, then wrote data to the region. An hour later I observed almost all RAM consumed by fuse writeback. Since then some unrelated changes in kernel fuse made it more difficult to reproduce, but it is still possible now. Putting this theoretical happens-in-the-lab thing aside, there is another thing that really hurts real world (FUSE) users. This is the write-through page cache policy FUSE currently uses. I.e. when handling write(2), kernel fuse populates the page cache and flushes user data to the server synchronously. This is excessively suboptimal. Pavel Emelyanov's patches ("writeback cache policy") solve the problem, but they also make resolving the NR_WRITEBACK_TEMP problem absolutely necessary. Otherwise, simply copying a huge file to a fuse mount would result in memory starvation. Miklos, the maintainer of FUSE, believes the strictlimit feature is the way to go. And finally, putting FUSE topics aside, there is one more use-case for the strictlimit feature. Using a slow USB stick (mass storage) in a machine with a huge amount of RAM installed is a well-known pain. Let's make some simple computations. Assuming 64GB of RAM installed, the existing implementation of balance_dirty_pages will start throttling only after 9.6GB of RAM becomes dirty (freerun == 15% of total RAM). So, the command "cp 9GB_file /media/my-usb-storage/" may return in a few seconds, but a subsequent "umount /media/my-usb-storage/" will take more than two hours if the effective throughput of the storage is, say, 1MB/sec. After inclusion of the strictlimit feature, it will be trivial to add a knob (e.g. /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on demand. Manually or via a udev rule.
Maybe I'm wrong, but it seems quite a natural desire to limit the amount of dirty memory for devices we do not fully trust (in the sense of sustainable throughput). [[email protected]: fix warning in page-writeback.c] Signed-off-by: Maxim Patlasov <[email protected]> Cc: Jan Kara <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Wu Fengguang <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: James Bottomley <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
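A minimal sketch of how a filesystem might opt in while setting up its BDI; the fill_super shape and error handling are illustrative, and fuse's actual initialization path may differ:

    /* during filesystem mount / BDI initialization */
    err = bdi_setup_and_register(&fc->bdi, "fuse", BDI_CAP_MAP_COPY);
    if (err)
        return err;

    /* never let this BDI exceed 1% of the global dirty threshold... */
    bdi_set_max_ratio(&fc->bdi, 1);
    /* ...and enforce per-bdi limits even below the global freerun point */
    fc->bdi.capabilities |= BDI_CAP_STRICTLIMIT;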
2013-09-11 | mm/backing-dev.c: check user buffer length before copying data to the related user buffer | Chen Gang | 1 | -1/+1
related user buffer '*lenp' may be less than "sizeof(kbuf)" so we must check this before the next copy_to_user(). pdflush_proc_obsolete() is called by sysctl which 'procname' is "nr_pdflush_threads", if the user passes buffer length less than "sizeof(kbuf)", it will cause issue. Signed-off-by: Chen Gang <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Jeff Moyer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm/mremap.c: call pud_free() when pmd_alloc() fails | Chen Gang | 1 | -1/+4
In alloc_new_pmd(), if pud_alloc() was called successfully, but pmd_alloc() fails, avoid leaking `pud'. Signed-off-by: Chen Gang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
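A sketch of the corrected error path in alloc_new_pmd() (surrounding context elided):

    pud = pud_alloc(mm, pgd, addr);
    if (!pud)
        return NULL;

    pmd = pmd_alloc(mm, pud, addr);
    if (!pmd) {
        pud_free(mm, pud);   /* don't leak the pud when pmd_alloc() fails */
        return NULL;
    }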
2013-09-11 | mm/vmalloc: use wrapper function get_vm_area_size to calculate size of vm area | Wanpeng Li | 1 | -6/+6
Use wrapper function get_vm_area_size to calculate size of vm area. Signed-off-by: Wanpeng Li <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Fengguang Wu <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Wanpeng Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
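The wrapper itself is a one-liner; for context, its definition as found in mm/vmalloc.c around this time:

    static inline size_t get_vm_area_size(const struct vm_struct *area)
    {
        /* the trailing guard page is not counted as part of the usable area */
        return area->size - PAGE_SIZE;
    }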
2013-09-11 | mm/sparse: introduce alloc_usemap_and_memmap | Wanpeng Li | 1 | -76/+57
After commit 9bdac9142407 ("sparsemem: Put mem map for one node together."), vmemmap for one node will be allocated together; its logic is similar to the memory allocation for pageblock flags. This patch introduces alloc_usemap_and_memmap to extract the common logic of memory allocation for pageblock flags and vmemmap. Signed-off-by: Wanpeng Li <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Fengguang Wu <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm: vmscan: fix do_try_to_free_pages() livelock | Lisa Du | 6 | -40/+43
This patch is based on KOSAKI's work and I add a little more description; please refer to https://lkml.org/lkml/2012/6/14/74. Currently, I found the system can enter a state where there are lots of free pages in a zone but only order-0 and order-1 pages, which means the zone is heavily fragmented; then a high-order allocation could make the direct reclaim path stall for a long time (e.g. 60 seconds), especially in a no-swap and no-compaction environment. This problem happened on v3.4, but it seems the issue still lives in the current tree. The reason is that do_try_to_free_pages() enters a livelock: kswapd will go to sleep if the zones have been fully scanned and are still not balanced, as kswapd thinks there's little point in trying all over again and wants to avoid an infinite loop. Instead it changes the order from high-order to 0-order because kswapd thinks order-0 is the most important. Look at 73ce02e9 in detail. If watermarks are ok, kswapd will go back to sleep and may leave zone->all_unreclaimable = 0. It assumes high-order users can still perform direct reclaim if they wish. Direct reclaim continues to reclaim for a high order which is not a COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on zone->all_unreclaimable. This is to avoid a too-early oom-kill. So it means direct reclaim depends on kswapd to break this loop. In the worst case, direct reclaim may continue page reclaim forever while kswapd sleeps forever, until someone like a watchdog detects it and finally kills the process. As described in: http://thread.gmane.org/gmane.linux.kernel.mm/103737 We can't turn on zone->all_unreclaimable from the direct reclaim path, because the direct reclaim path doesn't take any lock and this way is racy. Thus this patch removes the zone->all_unreclaimable field completely and recalculates the zone reclaimable state every time. Note: we can't take the approach of having direct reclaim look at zone->pages_scanned directly while kswapd continues to use zone->all_unreclaimable, because that is racy. Commit 929bea7c71 ("vmscan: all_unreclaimable() use zone->all_unreclaimable as a name") describes the detail. [[email protected]: uninline zone_reclaimable_pages() and zone_reclaimable()] Cc: Aaditya Kumar <[email protected]> Cc: Ying Han <[email protected]> Cc: Nick Piggin <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Bob Liu <[email protected]> Cc: Neil Zhang <[email protected]> Cc: Russell King - ARM Linux <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Acked-by: Minchan Kim <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Lisa Du <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
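A sketch of the recalculated state this patch introduces; the 6x multiplier mirrors the historical all_unreclaimable heuristic (details per the description, exact bodies may vary):

    unsigned long zone_reclaimable_pages(struct zone *zone)
    {
        int nr = zone_page_state(zone, NR_ACTIVE_FILE) +
                 zone_page_state(zone, NR_INACTIVE_FILE);

        if (get_nr_swap_pages() > 0)
            nr += zone_page_state(zone, NR_ACTIVE_ANON) +
                  zone_page_state(zone, NR_INACTIVE_ANON);
        return nr;
    }

    /* a zone counts as unreclaimable once it has been scanned
     * six times over without freeing anything */
    bool zone_reclaimable(struct zone *zone)
    {
        return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
    }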
2013-09-11 | mm: munlock: manual pte walk in fast path instead of follow_page_mask() | Vlastimil Babka | 1 | -31/+79
Currently munlock_vma_pages_range() calls follow_page_mask() to obtain each individual struct page. This entails repeated full page table translations and the page table lock being taken for each page separately. This patch avoids the costly follow_page_mask() where possible, by iterating over ptes within a single pmd under a single page table lock. The first pte is obtained by get_locked_pte() for the non-THP page acquired by the initial follow_page_mask(). The rest of the on-stack pagevec for munlock is filled up using the pte walk as long as pte_present() and vm_normal_page() are sufficient to obtain the struct page. After this patch, a 14% speedup was measured for munlocking a 56GB large memory area with THP disabled. Signed-off-by: Vlastimil Babka <[email protected]> Cc: Jörn Engel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
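A condensed sketch of the pte-walk fill loop (names follow the description; the real fill helper carries additional zone/node checks):

    /* walk ptes within one pmd under a single lock, filling the pagevec */
    pte = get_locked_pte(vma->vm_mm, start, &ptl);
    start += PAGE_SIZE;                  /* first page is already pinned */
    while (start < end) {
        struct page *page = NULL;

        pte++;
        if (pte_present(*pte))
            page = vm_normal_page(vma, start, *pte);
        /* stop the walk as soon as a pte cannot trivially yield a page */
        if (!page)
            break;

        get_page(page);
        start += PAGE_SIZE;
        if (pagevec_add(&pvec, page) == 0)
            break;                       /* pagevec full */
    }
    pte_unmap_unlock(pte, ptl);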
2013-09-11 | mm: munlock: remove redundant get_page/put_page pair on the fast path | Vlastimil Babka | 1 | -12/+14
The performance of the fast path in munlock_vma_range() can be further improved by avoiding the atomic ops of a redundant get_page()/put_page() pair. When calling get_page() during page isolation, we already have the pin from follow_page_mask(). This pin will then be returned by __pagevec_lru_add(), after which we do not reference the pages anymore. After this patch, an 8% speedup was measured for munlocking a 56GB large memory area with THP disabled. Signed-off-by: Vlastimil Babka <[email protected]> Reviewed-by: Jörn Engel <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm: munlock: bypass per-cpu pvec for putback_lru_page | Vlastimil Babka | 1 | -4/+69
After introducing batching by pagevecs into munlock_vma_range(), we can further improve performance by bypassing the copying into the per-cpu pagevec and the get_page/put_page pair associated with that. Instead we perform LRU putback directly from our pagevec. However, this is possible only for single-mapped pages that are evictable after munlock. Unevictable pages require rechecking after putting them on the unevictable list, so for those we fall back to putback_lru_page(), which handles that. After this patch, a 13% speedup was measured for munlocking a 56GB large memory area with THP disabled. [[email protected]: clarify comment] Signed-off-by: Vlastimil Babka <[email protected]> Reviewed-by: Jörn Engel <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm: munlock: batch NR_MLOCK zone state updates | Vlastimil Babka | 1 | -3/+3
Building on the previous patch, which introduced batched isolation in munlock_vma_range(), we can also batch the updates of the NR_MLOCK page stats. After the whole pagevec is processed for page isolation, the stats are updated only once with the number of successful isolations. There were however no measurable performance gains. Signed-off-by: Vlastimil Babka <[email protected]> Reviewed-by: Jörn Engel <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm: munlock: batch non-THP page isolation and munlock+putback using pagevec | Vlastimil Babka | 1 | -40/+156
Currently, munlock_vma_range() calls munlock_vma_page() on each page in a loop, which results in repeated taking and releasing of the lru_lock spinlock for isolating pages one by one. This patch batches the munlock operations using an on-stack pagevec, so that isolation is done under a single lru_lock. For THP pages, the old behavior is preserved as they might be split while being put into the pagevec. After this patch, a 9% speedup was measured for munlocking a 56GB large memory area with THP disabled. A new function __munlock_pagevec() is introduced that takes a pagevec and: 1) clears PageMlocked and isolates all pages under the lru_lock; zone page stats can also be updated using the variant which assumes disabled interrupts; 2) finishes the munlock and lru putback on all pages under their lock_page. Note that previously, lock_page covered also the PageMlocked clearing and page isolation, but it is not needed for those operations. Signed-off-by: Vlastimil Babka <[email protected]> Reviewed-by: Jörn Engel <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm: munlock: remove unnecessary call to lru_add_drain() | Vlastimil Babka | 1 | -1/+0
In munlock_vma_range(), lru_add_drain() is currently called in a loop before each munlock_vma_page() call. This is suboptimal for performance when munlocking many pages. The benefits of the per-cpu pagevec for batching the LRU putback are removed since the pagevec only holds at most one page from the previous loop's iteration. The lru_add_drain() call also does not serve any purpose for correctness - it does not even drain the pagevecs of all CPUs. The munlock code already expects and handles situations where a page cannot be isolated from the LRU (e.g. because it is on some per-cpu pagevec). The history of the (uncommented) call also suggests that it appears there as an oversight rather than intentionally. Before commit ff6a6da6 ("mm: accelerate munlock() treatment of THP pages") the call happened only once upon entering the function. That commit moved the call into the while loop. So while the other changes in the commit improved munlock performance for THP pages, it introduced the above-mentioned suboptimal per-cpu pagevec usage. Further back in history, before commit 408e82b7 ("mm: munlock use follow_page"), munlock_vma_pages_range() was just a wrapper around __mlock_vma_pages_range which performed both mlock and munlock depending on a flag. However, before ba470de4 ("mmap: handle mlocked pages during map, remap, unmap") the function handled only mlock, not munlock. The lru_add_drain call thus comes from the implementation in commit b291f000 ("mlock: mlocked pages are unevictable") and was intended only for mlocking, not munlocking. The original intention of draining the LRU pagevec at mlock time was to ensure the pages were on the LRU before the lock operation so that they could be placed on the unevictable list immediately. There is very little motivation to do the same in the munlock path, particularly for every single page. This patch therefore removes the call completely. After removing the call, a 10% speedup was measured for munlock() of a 56GB large memory area with THP disabled. Signed-off-by: Vlastimil Babka <[email protected]> Reviewed-by: Jörn Engel <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm: putback_lru_page: remove unnecessary call to page_lru_base_type() | Vlastimil Babka | 1 | -6/+6
The goal of this patch series is to improve performance of munlock() of large mlocked memory areas on systems without THP. This is motivated by reported very long times of crash recovery of processes with such areas, where munlock() can take several seconds. See http://lwn.net/Articles/548108/ The work was driven by a simple benchmark (to be included in mmtests) that mmaps() e.g. 56GB with MAP_LOCKED | MAP_POPULATE and measures the time of munlock(). Profiling was performed by attaching operf --pid to the process and sending a signal to trigger the munlock() part and then notify back the monitoring wrapper to stop operf, so that only munlock() appears in the profile. The profiles have shown that CPU time is spent mostly on atomic operations and repeated locking per single page. This series aims to reduce both, starting from simpler to more complex changes. Patch 1 performs a simple cleanup in putback_lru_page() so that the page lru base type is not determined without being actually needed. Patch 2 removes an unnecessary call to lru_add_drain() which drains the per-cpu pagevec after each munlocked page is put there. Patch 3 changes munlock_vma_range() to use an on-stack pagevec for isolating multiple non-THP pages under a single lru_lock instead of locking and processing each page separately. Patch 4 changes the NR_MLOCK accounting to be done only once per pagevec introduced by the previous patch. Patch 5 uses the introduced pagevec to batch also the work of putback_lru_page when possible, bypassing the per-cpu pvec and associated overhead. Patch 6 removes a redundant get_page/put_page pair which saves costly atomic operations. Patch 7 avoids calling follow_page_mask() on each individual page, and obtains multiple page references under a single page table lock where possible. Measurements were made using 3.11-rc3 as a baseline. The first set of measurements shows the possibly ideal conditions where batching should help the most. All memory is allocated from a single NUMA node and THP is disabled.

    timedmunlock     3.11-rc3       3.11-rc3        3.11-rc3        3.11-rc3        3.11-rc3        3.11-rc3        3.11-rc3        3.11-rc3
                            0              1               2               3               4               5               6               7
    Elapsed min     3.38 ( 0.00%)  3.39 ( -0.13%)  3.00 ( 11.33%)  2.70 ( 20.20%)  2.67 ( 21.11%)  2.37 ( 29.88%)  2.20 ( 34.91%)  1.91 ( 43.59%)
    Elapsed mean    3.39 ( 0.00%)  3.40 ( -0.23%)  3.01 ( 11.33%)  2.70 ( 20.26%)  2.67 ( 21.21%)  2.38 ( 29.88%)  2.21 ( 34.93%)  1.92 ( 43.46%)
    Elapsed stddev  0.01 ( 0.00%)  0.01 (-43.09%)  0.01 ( 15.42%)  0.01 ( 23.42%)  0.00 ( 89.78%)  0.01 ( -7.15%)  0.00 ( 76.69%)  0.02 (-91.77%)
    Elapsed max     3.41 ( 0.00%)  3.43 ( -0.52%)  3.03 ( 11.29%)  2.72 ( 20.16%)  2.67 ( 21.63%)  2.40 ( 29.50%)  2.21 ( 35.21%)  1.96 ( 42.39%)
    Elapsed range   0.03 ( 0.00%)  0.04 (-51.16%)  0.02 (  6.27%)  0.02 ( 14.67%)  0.00 ( 88.90%)  0.03 (-19.18%)  0.01 ( 73.70%)  0.06 (-113.35%)

The second set of measurements simulates the worst possible conditions for batching by using numactl --interleave, so that there is in fact only one page per pagevec. Even in this case the series seems to improve performance thanks to reduced atomic operations and the removal of lru_add_drain().
    timedmunlock     3.11-rc3       3.11-rc3        3.11-rc3        3.11-rc3        3.11-rc3        3.11-rc3        3.11-rc3        3.11-rc3
                            0              1               2               3               4               5               6               7
    Elapsed min     4.00 ( 0.00%)  4.04 ( -0.93%)  3.87 (  3.37%)  3.72 (  6.94%)  3.81 (  4.72%)  3.69 (  7.82%)  3.64 (  8.92%)  3.41 ( 14.81%)
    Elapsed mean    4.17 ( 0.00%)  4.15 (  0.51%)  4.03 (  3.49%)  3.89 (  6.84%)  3.86 (  7.48%)  3.89 (  6.69%)  3.70 ( 11.27%)  3.48 ( 16.59%)
    Elapsed stddev  0.16 ( 0.00%)  0.08 ( 50.76%)  0.10 ( 41.58%)  0.16 (  4.59%)  0.05 ( 72.38%)  0.19 (-12.91%)  0.05 ( 68.09%)  0.06 ( 66.03%)
    Elapsed max     4.34 ( 0.00%)  4.32 (  0.56%)  4.19 (  3.62%)  4.12 (  5.15%)  3.91 (  9.88%)  4.12 (  5.25%)  3.80 ( 12.58%)  3.56 ( 18.08%)
    Elapsed range   0.34 ( 0.00%)  0.28 ( 17.91%)  0.32 (  6.45%)  0.40 (-15.73%)  0.10 ( 70.06%)  0.43 (-24.84%)  0.15 ( 55.32%)  0.15 ( 56.16%)

For completeness, a third set of measurements shows the situation where THP is enabled and allocations are again done on a single NUMA node. Here munlock() is already very fast thanks to huge pages, and this series does not compromise that performance. It seems that the removal of the call to lru_add_drain() still helps a bit.

    timedmunlock     3.11-rc3       3.11-rc3        3.11-rc3        3.11-rc3        3.11-rc3        3.11-rc3        3.11-rc3        3.11-rc3
                            0              1               2               3               4               5               6               7
    Elapsed min     0.01 ( 0.00%)  0.01 ( -0.11%)  0.01 (  6.59%)  0.01 (  5.41%)  0.01 (  5.45%)  0.01 (  5.03%)  0.01 (  6.08%)  0.01 (  5.20%)
    Elapsed mean    0.01 ( 0.00%)  0.01 ( -0.27%)  0.01 (  6.39%)  0.01 (  5.30%)  0.01 (  5.32%)  0.01 (  5.03%)  0.01 (  5.97%)  0.01 (  5.22%)
    Elapsed stddev  0.00 ( 0.00%)  0.00 ( -9.59%)  0.00 ( 10.77%)  0.00 (  3.24%)  0.00 ( 24.42%)  0.00 ( 31.86%)  0.00 ( -7.46%)  0.00 (  6.11%)
    Elapsed max     0.01 ( 0.00%)  0.01 ( -0.01%)  0.01 (  6.83%)  0.01 (  5.42%)  0.01 (  5.79%)  0.01 (  5.53%)  0.01 (  6.08%)  0.01 (  5.26%)
    Elapsed range   0.00 ( 0.00%)  0.00 (  7.30%)  0.00 ( 24.38%)  0.00 (  6.10%)  0.00 ( 30.79%)  0.00 ( 42.52%)  0.00 (  6.11%)  0.00 ( 10.07%)

This patch (of 7): In putback_lru_page(), since commit c53954a092 ("mm: remove lru parameter from __lru_cache_add and lru_cache_add_lru") it is no longer needed to determine the lru list via page_lru_base_type(). This patch replaces it with a simple flag, is_unevictable, which says that the page was put on the unevictable list. This is the only information that matters in subsequent tests. Signed-off-by: Vlastimil Babka <[email protected]> Reviewed-by: Jörn Engel <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm: track vma changes with VM_SOFTDIRTY bit | Cyrill Gorcunov | 1 | -1/+11
Pavel reported that if a vma area gets unmapped and then mapped (or expanded) in-place, the soft dirty tracker won't be able to recognize this situation, since it works on the pte level and ptes get zapped on unmap, losing the soft dirty bit of course. So to resolve this situation we need to track actions on the vma level; that is where the VM_SOFTDIRTY flag comes in. When a new vma area is created (or an old one is expanded) we set this bit, and keep it there until the application asks for the soft dirty bit to be cleared. Thus when a user space application tracks memory changes, it can now detect whether a vma area has been renewed. Reported-by: Pavel Emelyanov <[email protected]> Signed-off-by: Cyrill Gorcunov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Matt Mackall <[email protected]> Cc: Xiao Guangrong <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Rob Landley <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
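A sketch of the vma-level tracking (the mmap_region() placement is illustrative; the flag lives alongside the other VM_* flags):

    /* in mmap_region(), after a new vma is created or an old one merged/expanded */
    vma->vm_flags |= VM_SOFTDIRTY;

    /* clear-refs side: clearing soft dirty also drops the vma-level bit */
    vma->vm_flags &= ~VM_SOFTDIRTY;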
2013-09-11 | mm: page_alloc: fix comment in get_page_from_freelist | SeungHun Lee | 1 | -1/+1
cpuset_zone_allowed is changed to cpuset_zone_allowed_softwall and the comment is moved to __cpuset_node_allowed_softwall. So fix this comment. Signed-off-by: SeungHun Lee <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm: fix aio performance regression for database caused by THP | Khalid Aziz | 1 | -25/+52
I am working with a tool that simulates oracle database I/O workload. This tool (orion to be specific - <http://docs.oracle.com/cd/E11882_01/server.112/e16638/iodesign.htm#autoId24>) allocates hugetlbfs pages using shmget() with the SHM_HUGETLB flag. It then does aio into these pages from flash disks using various common block sizes used by databases. I am looking at performance with two of the most common block sizes - 1M and 64K. aio performance with these two block sizes plunged after Transparent HugePages was introduced in the kernel. Here are performance numbers:

                   pre-THP     2.6.39      3.11-rc5
    1M read       8384 MB/s   5629 MB/s   6501 MB/s
    64K read      7867 MB/s   4576 MB/s   4251 MB/s

I have narrowed the performance impact down to the overheads introduced by THP in the __get_page_tail() and put_compound_page() routines. perf top shows >40% of cycles being spent in these two routines. Every time direct I/O to hugetlbfs pages starts, the kernel calls get_page() to grab a reference to the pages and calls put_page() when I/O completes to put the reference away. THP introduced a significant amount of locking overhead to get_page() and put_page() when dealing with compound pages, because hugepages can be split underneath get_page() and put_page(). It added this overhead irrespective of whether it is dealing with hugetlbfs pages or transparent hugepages. This resulted in a 20%-45% drop in aio performance when using hugetlbfs pages. Since hugetlbfs pages cannot be split, there is no reason to go through all the locking overhead for these pages from what I can see. I added code to __get_page_tail() and put_compound_page() to bypass all the locking code when working with hugetlbfs pages. This improved performance significantly. Performance numbers with this patch:

                   pre-THP     3.11-rc5    3.11-rc5 + Patch
    1M read       8384 MB/s   6501 MB/s   8371 MB/s
    64K read      7867 MB/s   4251 MB/s   6510 MB/s

Performance with 64K read is still lower than what it was before THP, but still a 53% improvement. It does mean there is more work to be done, but I will take a 53% improvement for now. Please take a look at the following patch and let me know if it looks reasonable. [[email protected]: tweak comments] Signed-off-by: Khalid Aziz <[email protected]> Cc: Pravin B Shelar <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Andi Kleen <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
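A sketch of the bypass idea (the surrounding logic of __get_page_tail()/put_compound_page() is elided; PageHuge() is the natural hugetlbfs test here because such pages can never be split):

    if (PageHuge(page)) {
        /* hugetlbfs compound pages cannot be split underneath us:
         * take a plain reference on the head page, no compound_lock */
        struct page *head = compound_head(page);
        atomic_inc(&head->_count);
        return true;
    }
    /* ...otherwise fall through to the THP-safe compound_lock slow path */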
2013-09-11 | mm: compaction: do not compact pgdat for order-0 | Mel Gorman | 1 | -0/+3
If kswapd was reclaiming for a high order and resets it to 0 due to fragmentation it will still call compact_pgdat. For the most part, this will fail a compaction_suitable() test and not compact but it is unnecessarily sloppy. It could be fixed in the caller but fix it in the API instead. [[email protected]: pointed out that it was a potential problem] Signed-off-by: Mel Gorman <[email protected]> Cc: Hillf Danton <[email protected]> Acked-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
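The fix is a guard at the top of the API; a sketch (the compact_control fields shown are illustrative context):

    void compact_pgdat(pg_data_t *pgdat, int order)
    {
        struct compact_control cc = {
            .order = order,
            .sync = false,
        };

        if (!order)
            return;      /* order-0: nothing to compact, bail out early */

        __compact_pgdat(pgdat, &cc);
    }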
2013-09-11 | kmemcg: don't allocate extra memory for root memcg_cache_params | Andrey Vagin | 1 | -3/+6
The memcg_cache_params structure contains the common part and a union, which represents two different types of data: one for root caches and another for child caches. The size of the child data is fixed. The size of the memcg_caches array is calculated at runtime. Currently the size of memcg_cache_params for root caches is calculated incorrectly, because it includes the size of the parameters for child caches:

    ssize_t size = memcg_caches_array_size(num_groups);
    size *= sizeof(void *);
    size += sizeof(struct memcg_cache_params);

v2: Fix a typo in calculations Signed-off-by: Andrey Vagin <[email protected]> Cc: Glauber Costa <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Balbir Singh <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
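A sketch of the corrected sizing, where offsetof drops the child-only union members from the root-cache allocation (names per the description):

    size_t size;

    if (!memcg) {
        /* root cache: common part + the runtime-sized memcg_caches array */
        size = offsetof(struct memcg_cache_params, memcg_caches);
        size += memcg_caches_array_size(num_groups) * sizeof(void *);
    } else {
        /* child cache: fixed size, no array needed */
        size = sizeof(struct memcg_cache_params);
    }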
2013-09-11 | memblock, numa: binary search node id | Yinghai Lu | 2 | -10/+27
Currently, early_pfn_to_nid() on arches that support memblock goes over memblock.memory one entry at a time, so it takes too many tries near the end. We can use the existing memblock_search() to find the node id for a given pfn; that could save some time on bigger systems that have many entries in the memblock.memory array. Here are the timing differences for several machines. In each case less time was spent in __early_pfn_to_nid() with the patch.

                          3.11-rc5   with patch   difference (%)
                          --------   ----------   --------------
    UV1: 256 nodes  9TB:    411.66       402.47   -9.19 (2.23%)
    UV2: 255 nodes 16TB:   1141.02      1138.12   -2.90 (0.25%)
    UV2:  64 nodes  2TB:    128.15       126.53   -1.62 (1.26%)
    UV2:  32 nodes  2TB:    121.87       121.07   -0.80 (0.66%)

Time in seconds. Signed-off-by: Yinghai Lu <[email protected]> Cc: Tejun Heo <[email protected]> Acked-by: Russ Anderson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mbind: add BUG_ON(!vma) in new_vma_page() | Naoya Horiguchi | 1 | -3/+5
new_vma_page() is called only by page migration invoked from do_mbind(), where pages to be migrated are queued into a pagelist by queue_pages_range(). queue_pages_range() confirms that a queued page belongs to some vma, so the !vma case is not supposed to happen. This patch adds BUG_ON() to catch this unexpected case. Signed-off-by: Naoya Horiguchi <[email protected]> Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm/mempolicy: rename check_*range to queue_pages_*range | Naoya Horiguchi | 1 | -18/+23
The function check_range() (and its family) is not well-named, because it does not only check something, but also moves pages from list to list to do page migration for them. So queue_pages_*range is a more suitable name. Signed-off-by: Naoya Horiguchi <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm: prepare to remove /proc/sys/vm/hugepages_treat_as_movable | Naoya Horiguchi | 1 | -18/+14
Now that hugepage migration is enabled, although restricted to pmd-based hugepages for now (due to lack of testing), we should allocate migratable hugepages from ZONE_MOVABLE if possible. This patch makes the GFP flags in hugepage allocation dependent on migration support, not only on the value of hugepages_treat_as_movable. It provides no change in behavior for architectures which do not support hugepage migration. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Andi Kleen <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm: migrate: check movability of hugepage in unmap_and_move_huge_page() | Naoya Horiguchi | 1 | -0/+10
Currently hugepage migration works well only for pmd-based hugepages (mainly due to lack of testing), so we had better not enable migration of other levels of hugepages until we are ready for it. Some users of hugepage migration (mbind, move_pages, and migrate_pages) do a page table walk and check pud/pmd_huge() there, so they are safe. But the other users (soft offline and memory hotremove) don't do this, so without this patch they can try to migrate unexpected types of hugepages. To prevent this, we introduce hugepage_migration_support() as an architecture-dependent check of whether hugepages are implemented on a pmd basis or not. And on some architectures multiple sizes of hugepages are available, so hugepage_migration_support() also checks the hugepage size. Signed-off-by: Naoya Horiguchi <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
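A sketch of the introduced check, following the description (architectures opt in via a config symbol whose exact spelling may differ; only pmd-sized hugepages qualify):

    static inline int hugepage_migration_support(struct hstate *h)
    {
    #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
        /* only pmd-based hugepages are migratable for now */
        return huge_page_shift(h) == PMD_SHIFT;
    #else
        return 0;
    #endif
    }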
2013-09-11 | mm: memory-hotplug: enable memory hotplug to handle hugepage | Naoya Horiguchi | 4 | -9/+129
Until now we couldn't offline memory blocks which contain hugepages, because a hugepage was considered an unmovable page. But now with this patch series a hugepage has become movable, so by using hugepage migration we can offline such memory blocks. What's different from other users of hugepage migration is that we need to decompose all the hugepages inside the target memory block into free buddy pages after hugepage migration, because otherwise free hugepages remaining in the memory block interfere with the memory offlining. For this reason we introduce the new functions dissolve_free_huge_page() and dissolve_free_huge_pages(). Other than that, what this patch does is straightforward: add hugepage migration code, that is, add hugepage code to the functions which scan over pfns and collect hugepages to be migrated, and add a hugepage allocation function to alloc_migrate_target(). As for larger hugepages (1GB for x86_64), it's not easy to do hotremove over them because they are larger than a memory block. So we now simply leave them to fail as is. [[email protected]: remove duplicated include] Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Andi Kleen <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Wei Yongjun <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm: mbind: add hugepage migration code to mbind() | Naoya Horiguchi | 2 | -1/+17
Extend do_mbind() to handle vma with VM_HUGETLB set. We will be able to migrate hugepage with mbind(2) after applying the enablement patch which comes later in this series. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Andi Kleen <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm: migrate: add hugepage migration code to move_pages() | Naoya Horiguchi | 2 | -4/+26
Extend move_pages() to handle vmas with VM_HUGETLB set. We will be able to migrate hugepages with move_pages(2) after applying the enablement patch which comes later in this series. We avoid getting a refcount on tail pages of a hugepage, because unlike thp, a hugepage is not split and we need not care about races with splitting. And migration of larger (1GB for x86_64) hugepages is not enabled. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Andi Kleen <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | migrate: add hugepage migration code to migrate_pages() | Naoya Horiguchi | 1 | -5/+39
Extend check_range() to handle vmas with VM_HUGETLB set. We will be able to migrate hugepages with migrate_pages(2) after applying the enablement patch which comes later in this series. Note that for larger hugepages (covered by pud entries, 1GB for x86_64 for example), we simply skip them now. Note that using pmd_huge/pud_huge assumes that hugepages are pointed to by pmds/puds. This is not true on some architectures implementing hugepages with other mechanisms, like ia64, but it's OK because pmd_huge/pud_huge simply return 0 on such architectures and the page walker simply ignores such hugepages. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Andi Kleen <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm: soft-offline: use migrate_pages() instead of migrate_huge_page() | Naoya Horiguchi | 2 | -29/+14
Currently migrate_huge_page() takes a pointer to a hugepage to be migrated as an argument, instead of taking a pointer to the list of hugepages to be migrated. This behavior was introduced in commit 189ebff28 ("hugetlb: simplify migrate_huge_page()"), and was OK because until now hugepage migration is enabled only for soft-offlining which migrates only one hugepage in a single call. But the situation will change in the later patches in this series which enable other users of page migration to support hugepage migration. They can kick migration for both of normal pages and hugepages in a single call, so we need to go back to original implementation which uses linked lists to collect the hugepages to be migrated. With this patch, soft_offline_huge_page() switches to use migrate_pages(), and migrate_huge_page() is not used any more. So let's remove it. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Andi Kleen <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm: migrate: make core migration code aware of hugepage | Naoya Horiguchi | 2 | -2/+31
Currently hugepage migration is available only for soft offlining, but it's also useful for some other users of page migration (clearly because users of hugepages can enjoy the benefits of mempolicy and memory hotplug). So this patchset tries to extend such users to support hugepage migration. The target of this patchset is to enable hugepage migration for NUMA related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and memory hotplug. This patchset does not add hugepage migration for memory compaction, because users of memory compaction mainly expect to construct thp by arranging raw pages, and there's little or no need to compact hugepages. CMA, another user of page migration, could benefit from hugepage migration, but is not enabled to support it for now (just because of lack of testing and expertise in CMA). Hugepage migration of non pmd-based hugepages (for example 1GB hugepages in x86_64, or hugepages in architectures like ia64) is not enabled for now (again, because of lack of testing). As for how these are achieved, I extended the API (migrate_pages()) to handle hugepages (with patches 1 and 2) and adjusted the code of each caller to check and collect movable hugepages (with patches 3-7). The remaining 2 patches are miscellaneous ones to avoid unexpected behavior. Patch 8 is about making sure that we only migrate pmd-based hugepages. And patch 9 is about choosing an appropriate zone for hugepage allocation. My testing is mainly functional: simply kicking hugepage migration via each entry point and confirming that migration is done correctly. Test code is available here: git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git And I always run the libhugetlbfs test when changing hugetlbfs's code. With this patchset, no regression was found in the test. This patch (of 9): Before enabling each user of page migration to support hugepages, this patch enables the list of pages for migration to link not only LRU pages, but also hugepages. As a result, putback_movable_pages() and migrate_pages() can handle both LRU pages and hugepages. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Andi Kleen <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm, hugetlb: return a reserved page to a reserved pool if failed | Joonsoo Kim | 1 | -1/+12
If we fail with a reserved page, just calling put_page() is not sufficient, because put_page() invokes free_huge_page() as its last step, and free_huge_page() doesn't know whether a page comes from a reserved pool or not. So it doesn't do anything related to the reserve count. This makes the reserve count lower than needed, because the reserve count was already decreased in dequeue_huge_page_vma(). This patch fixes this situation. Signed-off-by: Joonsoo Kim <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm, hugetlb: grab a page_table_lock after page_cache_release | Joonsoo Kim | 1 | -2/+3
We don't need to hold the page_table_lock when we release a page, so defer grabbing the page_table_lock. Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Aneesh Kumar <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm, hugetlb: remove useless check about mapping type | Joonsoo Kim | 1 | -2/+1
is_vma_resv_set(vma, HPAGE_RESV_OWNER) implies that this mapping is private. So we don't need to check whether this mapping is shared or not. This patch is just a clean-up. Signed-off-by: Joonsoo Kim <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Naoya Horiguchi <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm, hugetlb: fix subpool accounting handling | Joonsoo Kim | 1 | -4/+6
If we allocate a hugepage with avoid_reserve, we don't dequeue a reserved one, so we should also check the subpool counter in the avoid_reserve case. This patch implements that. Signed-off-by: Joonsoo Kim <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm, hugetlb: change variable name reservations to resv | Joonsoo Kim | 1 | -13/+13
'reservations' is too long a name for a variable, and we use 'resv_map' to represent 'struct resv_map' elsewhere. To reduce confusion and improve readability, change it. Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Aneesh Kumar <[email protected]> Cc: Naoya Horiguchi <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm, hugetlb: protect reserved pages when soft offlining a hugepage | Joonsoo Kim | 1 | -2/+3
Don't use the reserve pool when soft offlining a hugepage: check that we have free pages outside the reserve pool before we dequeue the huge page. Otherwise, we can steal another task's reserved page. Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Aneesh Kumar <[email protected]> Cc: Naoya Horiguchi <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm/hotplug: remove stop_machine() from try_offline_node() | Toshi Kani | 1 | -9/+22
lock_device_hotplug() serializes hotplug & online/offline operations. The lock is held in the common sysfs online/offline interfaces and in the ACPI hotplug code paths. Here are the code paths:

 - CPU & Mem online/offline via sysfs online: store_online()->lock_device_hotplug()
 - Mem online via sysfs state: store_mem_state()->lock_device_hotplug()
 - ACPI CPU & Mem hot-add: acpi_scan_bus_device_check()->lock_device_hotplug()
 - ACPI CPU & Mem hot-delete: acpi_scan_hot_remove()->lock_device_hotplug()

try_offline_node() off-lines a node if all memory sections and cpus have been removed from the node. It is called from the acpi_processor_remove() and acpi_memory_remove_memory()->remove_memory() paths, both of which are in the ACPI hotplug code. try_offline_node() calls stop_machine() to stop all cpus while checking all cpu status, on the assumption that the caller is not protected from CPU hotplug or CPU online/offline operations. However, the caller is always serialized with lock_device_hotplug(). Also, the code needs to be properly serialized with a lock, not by stopping all cpus at a random place with stop_machine(). This patch removes the use of stop_machine() in try_offline_node() and adds comments to try_offline_node() and remove_memory() that lock_device_hotplug() is required. Signed-off-by: Toshi Kani <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Tang Chen <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Wanpeng Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm/hotplug: verify hotplug memory range | Toshi Kani | 1 | -0/+23
add_memory() and remove_memory() can only handle a memory range aligned with sections. There are problems when an unaligned range is added and then deleted, as follows:

 - add_memory() with an unaligned range succeeds, but __add_pages() called from add_memory() adds a whole section of pages even though the given memory range is less than the section size.
 - remove_memory() on the added unaligned range hits BUG_ON() in __remove_pages().

This patch changes add_memory() and remove_memory() to check at the beginning whether a given memory range is aligned with a section. As a result, add_memory() fails with -EINVAL when a given range is unaligned, and does not add such a memory range. This prevents remove_memory() from being called with an unaligned range as well. Note that remove_memory() has to use BUG_ON() since this function cannot fail. [[email protected]: avoid printk warnings] Signed-off-by: Toshi Kani <[email protected]> Acked-by: KOSAKI Motohiro <[email protected]> Reviewed-by: Tang Chen <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
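A sketch of the added check (the pr_err text is illustrative; the section-mask arithmetic is the standard sparsemem alignment test):

    static int check_hotplug_memory_range(u64 start, u64 size)
    {
        u64 start_pfn = start >> PAGE_SHIFT;
        u64 nr_pages = size >> PAGE_SHIFT;

        /* memory range must be section-aligned and non-empty */
        if ((start_pfn & ~PAGE_SECTION_MASK) ||
            (nr_pages % PAGES_PER_SECTION) || (!nr_pages)) {
            pr_err("section-unaligned hotplug range: start 0x%llx, size 0x%llx\n",
                   (unsigned long long)start, (unsigned long long)size);
            return -EINVAL;
        }
        return 0;
    }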
2013-09-11 | readahead: make context readahead more conservative | Fengguang Wu | 1 | -4/+4
This helps performance on moderately dense random reads on SSD. Transactions-per-second numbers provided by Taobao:

    QPS    case
    -------------------------------------------------------
    7536   disable context readahead totally
    w/ patch:
    7129   slower size rampup and start RA on the 3rd read
    6717   slower size rampup
    w/o patch:
    5581   unmodified context readahead

Before, readahead would be started whenever page N+1 was read and page N happened to have been read recently. After the patch, we'll only start readahead when *three* random reads happen to access pages N, N+1, N+2. The probability of this happening is extremely low for pure random reads, unless they are very dense, which actually deserves some readahead. Also start with a smaller readahead window. The impact on interleaved sequential reads should be small, because for a long-running stream the small readahead window rampup phase is negligible. Context readahead actually benefits clustered random reads on HDD, whose seek cost is pretty high. However, as SSDs are increasingly used for random read workloads, it's better for the context readahead to concentrate on interleaved sequential reads. Another SSD random read test from Miao:

    # file size: 2GB
    # read IO amount: 625MB
    sysbench --test=fileio \
             --max-requests=10000 \
             --num-threads=1 \
             --file-num=1 \
             --file-block-size=64K \
             --file-test-mode=rndrd \
             --file-fsync-freq=0 \
             --file-fsync-end=off run

shows the performance of btrfs growing from 69MB/s to 121MB/s, and ext4 from 104MB/s to 121MB/s. Signed-off-by: Wu Fengguang <[email protected]> Tested-by: Tao Ma <[email protected]> Tested-by: Miao Xie <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm: use zone_is_initialized() instead of if(zone->wait_table) | Xishi Qiu | 1 | -1/+1
Use "zone_is_initialized()" instead of "if (zone->wait_table)". Simplify the code, no functional change. Signed-off-by: Xishi Qiu <[email protected]> Cc: Cody P Schafer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm: use zone_is_empty() instead of if(zone->spanned_pages) | Xishi Qiu | 2 | -4/+4
Use "zone_is_empty()" instead of "if (zone->spanned_pages)". Simplify the code, no functional change. Signed-off-by: Xishi Qiu <[email protected]> Cc: Cody P Schafer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm: use zone_end_pfn() instead of zone_start_pfn+spanned_pages | Xishi Qiu | 1 | -3/+4
Use "zone_end_pfn()" instead of "zone->zone_start_pfn + zone->spanned_pages". Simplify the code, no functional change. [[email protected]: fix build] Signed-off-by: Xishi Qiu <[email protected]> Cc: Cody P Schafer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm/zbud: fix some trivial typos in comments | Jianguo Wu | 1 | -2/+2
Signed-off-by: Jianguo Wu <[email protected]> Cc: Seth Jennings <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm/hotplug: remove unnecessary BUG_ON in __offline_pages() | Xishi Qiu | 1 | -1/+0
I think we can remove "BUG_ON(start_pfn >= end_pfn)" in __offline_pages(), because in memory_block_action() "nr_pages = PAGES_PER_SECTION * sections_per_block" is always greater than 0:

    memory_block_action()
      offline_pages()
        __offline_pages()
          BUG_ON(start_pfn >= end_pfn)

In v2.6.32, if info->length == 0, this path could hit the BUG_ON():

    acpi_memory_disable_device()
      remove_memory(info->start_addr, info->length)
        offline_pages()

A later Fujitsu patch renamed this function, and the BUG_ON() is unnecessary. Signed-off-by: Xishi Qiu <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Cc: Toshi Kani <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | mm, vmalloc: use well-defined find_last_bit() func | Joonsoo Kim | 1 | -9/+6
Our intention here is to find the last bit within the region to flush. There is a well-defined function, find_last_bit(), for this purpose, and its performance may be slightly better than the current implementation. So change to it. Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Zhang Yanfei <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
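A sketch of the replacement in the vmap-block dirty-range flush (variable names assume the mm/vmalloc.c shape of the era):

    /* replace an open-coded reverse scan with the bitops helper */
    i = find_last_bit(vb->dirty_map, VMAP_BBMAP_BITS);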
2013-09-11 | mm, vmalloc: remove useless variable in vmap_block | Joonsoo Kim | 1 | -2/+0
vbq in vmap_block isn't used. So remove it. Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Zhang Yanfei <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | vmstat: use this_cpu() to avoid irqon/off sequence in refresh_cpu_vm_stats | Christoph Lameter | 1 | -19/+16
Disabling interrupts repeatedly can be avoided in the inner loop if we use a this_cpu operation. Signed-off-by: Christoph Lameter <[email protected]> Cc: KOSAKI Motohiro <[email protected]> CC: Tejun Heo <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
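A sketch of the this_cpu-based inner loop per the description: this_cpu_xchg() atomically reads and clears the per-cpu diff without an irq-off section (names follow vmstat conventions):

    for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
        int v = this_cpu_xchg(p->vm_stat_diff[i], 0);

        if (v) {
            /* fold the per-cpu delta into the zone and global counters */
            atomic_long_add(v, &zone->vm_stat[i]);
            global_diff[i] += v;
        }
    }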
2013-09-11 | vmstat: create fold_diff | Christoph Lameter | 1 | -7/+11
Both functions that update global counters use the same mechanism. Create a function that contains the common code. Signed-off-by: Christoph Lameter <[email protected]> Cc: KOSAKI Motohiro <[email protected]> CC: Tejun Heo <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
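A sketch of the extracted helper, matching the description (both global-counter updaters now share this fold):

    static void fold_diff(int *diff)
    {
        int i;

        for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
            if (diff[i])
                atomic_long_add(diff[i], &vm_stat[i]);
    }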
2013-09-11 | vmstat: create separate function to fold per cpu diffs into local counters | Christoph Lameter | 2 | -7/+35
The main idea behind this patchset is to reduce the vmstat update overhead by avoiding interrupt enable/disable and the use of per cpu atomics. This patch (of 3): It is better to have a separate folding function because refresh_cpu_vm_stats() also does other things like expire pages in the page allocator caches. If we have a separate function then refresh_cpu_vm_stats() is only called from the local cpu which allows additional optimizations. The folding function is only called when a cpu is being downed and therefore no other processor will be accessing the counters. Also simplifies synchronization. [[email protected]: fix UP build] Signed-off-by: Christoph Lameter <[email protected]> Cc: KOSAKI Motohiro <[email protected]> CC: Tejun Heo <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11 | swap: clean-up #ifdef in page_mapping() | Joonsoo Kim | 1 | -4/+1
PageSwapCache() is always false when !CONFIG_SWAP, so the compiler properly discards the related code. Therefore, we don't need the #ifdef explicitly. Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
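A sketch of the cleaned-up function (the swap_address_space() detail is illustrative of page_mapping() in this era):

    struct address_space *page_mapping(struct page *page)
    {
        struct address_space *mapping = page->mapping;

        VM_BUG_ON(PageSlab(page));
        /* compile-time false when !CONFIG_SWAP, so no #ifdef needed */
        if (unlikely(PageSwapCache(page))) {
            swp_entry_t entry;

            entry.val = page_private(page);
            mapping = swap_address_space(entry);
        } else if ((unsigned long)mapping & PAGE_MAPPING_ANON)
            mapping = NULL;
        return mapping;
    }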