Age | Commit message | Author | Files, -/+
2014-12-13 | mm: unmapped page migration avoid unmap+remap overhead | Hugh Dickins | 1 file, -10/+18
Page migration's __unmap_and_move(), and rmap's try_to_unmap(), were created for use on pages almost certainly mapped into userspace. But nowadays compaction often applies them to unmapped page cache pages: which may exacerbate contention on i_mmap_rwsem quite unnecessarily, since try_to_unmap_file() makes no preliminary page_mapped() check. Now check page_mapped() in __unmap_and_move(); and avoid repeating the same overhead in rmap_walk_file() - don't remove_migration_ptes() when we never inserted any. (The PageAnon(page) comment blocks now look even sillier than before, but clean that up on some other occasion. And note in passing that try_to_unmap_one() does not use a migration entry when PageSwapCache, so remove_migration_ptes() will then not update that swap entry to newpage pte: not a big deal, but something else to clean up later.) Davidlohr remarked in "mm,fs: introduce helpers around the i_mmap_mutex" conversion to i_mmap_rwsem, that "The biggest winner of these changes is migration": a part of the reason might be all of that unnecessary taking of i_mmap_mutex in page migration; and it's rather a shame that I didn't get around to sending this patch in before his - this one is much less useful after Davidlohr's conversion to rwsem, but still good. Signed-off-by: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
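A sketch of the resulting shape of __unmap_and_move(), heavily simplified (locking and error handling omitted; the helper calls follow the era's signatures only approximately):

    int page_was_mapped = 0;

    if (page_mapped(page)) {
            /* only pay the unmap cost when the page is actually mapped */
            try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
            page_was_mapped = 1;
    }

    if (!page_mapped(page))
            rc = move_to_new_page(newpage, page, page_was_mapped, mode);

    /* and skip the rmap walk entirely when we never unmapped anything */
    if (rc != MIGRATEPAGE_SUCCESS && page_was_mapped)
            remove_migration_ptes(page, page);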
2014-12-13 | fs, seq_file: fallback to vmalloc instead of oom kill processes | David Rientjes | 1 file, -1/+5
Since commit 058504edd026 ("fs/seq_file: fallback to vmalloc allocation"), seq_buf_alloc() falls back to vmalloc() when the kmalloc() for contiguous memory fails. This was done to address order-4 slab allocations for reading /proc/stat on large machines; it was noticed because PAGE_ALLOC_COSTLY_ORDER < 4, so there is no infinite loop in the page allocator when allocating a new slab for such high-order allocations. Contiguous memory isn't necessary for callers of seq_buf_alloc(), however. Other GFP_KERNEL high-order allocations that are <= PAGE_ALLOC_COSTLY_ORDER will simply loop forever in the page allocator and oom kill processes as a result. We don't want to kill processes just so that we can allocate contiguous memory in situations where contiguous memory isn't necessary. This patch does the kmalloc() allocation with __GFP_NORETRY for high-order allocations. This still utilizes memory compaction and direct reclaim in the allocation path; the only difference is that it will fail immediately instead of oom killing processes when out of memory. [[email protected]: add comment] Signed-off-by: David Rientjes <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
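The resulting allocator is essentially the following (a sketch of seq_buf_alloc() in fs/seq_file.c after this change):

    static void *seq_buf_alloc(unsigned long size)
    {
            void *buf;

            /*
             * __GFP_NORETRY to avoid oom-killings with high-order allocations -
             * it's better to fall back to vmalloc() than to kill things.
             */
            buf = kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
            if (!buf && size > PAGE_SIZE)
                    buf = vmalloc(size);
            return buf;
    }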
2014-12-13 | mm: vmscan: invoke slab shrinkers from shrink_zone() | Johannes Weiner | 7 files, -149/+106
The slab shrinkers are currently invoked from the zonelist walkers in kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the eligible LRU pages and assemble a nodemask to pass to NUMA-aware shrinkers, which then again have to walk over the nodemask. This is redundant code, extra runtime work, and fairly inaccurate when it comes to the estimation of actually scannable LRU pages. The code duplication will only get worse when making the shrinkers cgroup-aware and requiring them to have out-of-band cgroup hierarchy walks as well. Instead, invoke the shrinkers from shrink_zone(), which is where all reclaimers end up, to avoid this duplication. Take the count for eligible LRU pages out of get_scan_count(), which considers many more factors than just the availability of swap space, like zone_reclaimable_pages() currently does. Accumulate the number over all visited lruvecs to get the per-zone value. Some nodes have multiple zones due to memory addressing restrictions. To avoid putting too much pressure on the shrinkers, only invoke them once for each such node, using the class zone of the allocation as the pivot zone. For now, this integrates the slab shrinking better into the reclaim logic and gets rid of duplicative invocations from kswapd, direct reclaim, and zone reclaim. It also prepares for cgroup-awareness, allowing memcg-capable shrinkers to be added at the lruvec level without much duplication of both code and runtime work. This changes kswapd behavior, which used to invoke the shrinkers for each zone, but with scan ratios gathered from the entire node, resulting in meaningless pressure quantities on multi-zone nodes. Zone reclaim behavior also changes. It used to shrink slabs until the same amount of pages were shrunk as were reclaimed from the LRUs. Now it merely invokes the shrinkers once with the zone's scan ratio, which makes the shrinkers go easier on caches that implement aging and would prefer feeding back pressure from recently used slab objects to unused LRU pages. [[email protected]: assure class zone is populated] Signed-off-by: Johannes Weiner <[email protected]> Cc: Dave Chinner <[email protected]> Signed-off-by: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | mm,vmacache: count number of system-wide flushes | Davidlohr Bueso | 3 files, -0/+4
These flushes deal with sequence number overflows, such as for long lived threads. These are rare, but interesting from a debugging PoV. As such, display the number of flushes when vmacache debugging is enabled. Signed-off-by: Davidlohr Bueso <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | Documentation: add new page_owner document | Joonsoo Kim | 1 file, -0/+81
The page owner feature tracks who allocated each page. This document explains what the page owner feature is and what its merits are, and includes a simple HOW-TO. See the document for detailed information. Signed-off-by: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Jungsoo Son <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | mm/page_owner: correct owner information for early allocated pages | Joonsoo Kim | 1 file, -2/+91
The extended memory that stores page owner information is initialized somewhat later than the page allocator starts. Until that initialization, many pages can be allocated with no owner information, which makes debugging with page owner harder, so some fixup is helpful. This patch fixes up the situation by setting fake owner information immediately after page extension is initialized. The information doesn't identify the real owner, but at least it tells more accurately whether a page is allocated or not. In my testing, this patch catches 13343 early allocated pages, although they are mostly allocated by the page extension feature itself. After that point, no page is left that is allocated yet has no page owner flag. Signed-off-by: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Jungsoo Son <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | mm/page_owner: keep track of page owners | Joonsoo Kim | 11 files, -3/+554
This is the page owner tracking code, which was introduced quite a while ago. It has been resident in Andrew's tree, but nobody tried to upstream it, so it remained as is. Our company actively uses this feature to debug memory leaks and to find memory hoggers, so I decided to upstream it. This functionality helps us to know who allocated each page. When allocating a page, we store some information about the allocation in extra memory. Later, if we need to know the status of all pages, we can get and analyze that stored information. In the previous version of this feature, the extra memory was statically defined in struct page; in this version, the extra memory is allocated outside of struct page. This enables us to turn the feature on/off at boot time without considerable memory waste. Although we already have tracepoints for tracing page allocation/free, using them to analyze page ownership is rather complex: we would need to enlarge the trace buffer to prevent it wrapping before the userspace program launches, and the launched program would have to continually dump out the trace buffer for later analysis, which would perturb system behaviour far more than just keeping the data in memory, so it is bad for debugging. Moreover, we can use the page_owner feature for various further purposes. For example, it backs the fragmentation statistics implemented in this patch, and I also plan to implement a CMA failure debugging feature on top of this interface. I'd like to give credit to all the developers who contributed to this feature, but that's not easy because I don't know the exact history. Sorry about that. Below are the people who have "Signed-off-by" in the patches in Andrew's tree. Contributors: Alexander Nyberg <[email protected]> Mel Gorman <[email protected]> Dave Hansen <[email protected]> Minchan Kim <[email protected]> Michal Nazarewicz <[email protected]> Andrew Morton <[email protected]> Jungsoo Son <[email protected]> Signed-off-by: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Jungsoo Son <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
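For reference, the per-page record kept in the extended memory looks roughly like this (a sketch; the trace depth of 8 entries is an assumption of this example):

    struct page_owner {
            unsigned int order;             /* allocation order */
            gfp_t gfp_mask;                 /* gfp flags at allocation time */
            unsigned int nr_entries;        /* valid slots in trace_entries */
            unsigned long trace_entries[8]; /* allocation call stack */
    };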
2014-12-13 | stacktrace: introduce snprint_stack_trace for buffer output | Joonsoo Kim | 2 files, -0/+37
The current stacktrace code only has a function for console output. page_owner, which will be introduced in a following patch, needs to print the stacktrace into a buffer in its own output format, so a new function, snprint_stack_trace(), is needed. Signed-off-by: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Jungsoo Son <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
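The new helper mirrors print_stack_trace() but writes into a caller-supplied buffer, snprintf-style:

    int snprint_stack_trace(char *buf, size_t size,
                            struct stack_trace *trace, int spaces);

A caller such as page_owner can then do, e.g., snprint_stack_trace(kbuf, count, &trace, 0) and copy the buffer out in its own format (kbuf and count are hypothetical names here).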
2014-12-13 | mm/nommu: use alloc_pages_exact() rather than its own implementation | Joonsoo Kim | 1 file, -22/+11
do_mmap_private() in nommu.c tries to allocate physically contiguous pages of arbitrary size in some cases, and we now have a good abstraction that does exactly the same thing, alloc_pages_exact(). So change the code to use it. There is no functional change. This is a preparation step for supporting the page owner feature accurately. Signed-off-by: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Jungsoo Son <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
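In outline, the open-coded allocation is replaced by the exact-size helpers (a sketch; the nommu error paths are abbreviated):

    void *base;

    /* allocate exactly 'len' bytes of physically contiguous memory */
    base = alloc_pages_exact(len, GFP_KERNEL);
    if (!base)
            return -ENOMEM;

    /* ... and the matching free on the teardown side: */
    free_pages_exact(base, len);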
2014-12-13 | mm/debug-pagealloc: make debug-pagealloc boottime configurable | Joonsoo Kim | 9 files, -7/+57
We are now prepared to avoid the debug-pagealloc overhead at boot time. So introduce a new kernel parameter that enables/disables debug-pagealloc at boot, and make the related functions no-ops when it is disabled. The only non-intuitive part is the change to the guard page functions: because guard pages are effective only if debug-pagealloc is enabled, turning them off along with debug-pagealloc is the reasonable thing to do. Signed-off-by: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Jungsoo Son <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
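The switch is wired up with early_param(); a minimal sketch, assuming only the parameter name, with the rest simplified:

    static bool _debug_pagealloc_enabled __read_mostly;

    static int __init early_debug_pagealloc(char *buf)
    {
            if (!buf)
                    return -EINVAL;

            if (strcmp(buf, "on") == 0)
                    _debug_pagealloc_enabled = true;

            return 0;
    }
    early_param("debug_pagealloc", early_debug_pagealloc);

    static inline bool debug_pagealloc_enabled(void)
    {
            return _debug_pagealloc_enabled;
    }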
2014-12-13 | mm/debug-pagealloc: prepare boottime configurable on/off | Joonsoo Kim | 8 files, -44/+106
Until now, debug-pagealloc needed extra flags in struct page, so we had to recompile the whole source tree whenever we decided to use it. This is really painful, because recompiling takes time and sometimes a rebuild is not even possible due to third party modules that depend on struct page, so we couldn't use this good feature in many cases. Now we have the page extension feature, which allows us to keep extra flags outside of struct page. This gets rid of the third party module issue mentioned above, and it also allows us to determine at boot time whether the extra memory for the page extension is needed. With these properties, a kernel built with CONFIG_DEBUG_PAGEALLOC can avoid the debug-pagealloc cost at boot time with low computational overhead. This will help our development process greatly. This patch is the preparation step to achieve the above goal. debug-pagealloc originally used an extra field of struct page; after this patch, it uses a field of struct page_ext instead. Because the memory for page_ext is allocated later than the initialization of the page allocator with CONFIG_SPARSEMEM, we must disable the debug-pagealloc feature temporarily until page_ext is initialized. This patch implements that. Signed-off-by: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Jungsoo Son <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | mm/page_ext: resurrect struct page extending code for debugging | Joonsoo Kim | 7 files, -0/+485
When we debug something, we'd like to attach some information to every page. For this purpose, we sometimes modify struct page itself, but this has drawbacks. First, it requires a recompile, which makes us hesitate to use such a powerful debug feature, so the development process is slowed down. Second, sometimes it is impossible to rebuild the kernel due to third party module dependencies. Third, system behaviour can be largely different after a recompile, because it changes the size of struct page greatly and this structure is accessed by every part of the kernel; keeping the structure as it is is better for reproducing an erroneous situation. This feature is intended to overcome the problems mentioned above. It allocates memory for extended data per page in a certain place rather than in struct page itself, and this memory can be accessed through the accessor functions provided by this code. During the boot process, it checks whether the allocation of this huge chunk of memory is needed at all; if not, it avoids allocating the memory entirely. With this advantage, we can include this feature in the kernel by default and still avoid the rebuild and the related problems. Until now, memcg used this technique, but memcg has since decided to embed its variable into struct page itself, and its code to extend struct page has been removed. I'd like to use this code to develop debug features, so this patch resurrects it. To make this work well, the patch introduces two callbacks for clients. One is the need callback, which is mandatory if the user wants to avoid useless memory allocation at boot time. The other, the init callback, is optional and is used to do proper initialization after the memory is allocated. A detailed explanation of the purpose of these functions is in the code comments; please refer to them. Everything else is exactly the same as the previous extension code in memcg. Signed-off-by: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Jungsoo Son <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
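The client interface consists of exactly the two callbacks described above:

    struct page_ext_operations {
            /*
             * Return true if this client needs the extended memory;
             * called during boot so the big allocation can be skipped
             * entirely when no client wants it.
             */
            bool (*need)(void);
            /* optional: initialize state once the memory is allocated */
            void (*init)(void);
    };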
2014-12-13 | mm, gfp: escalatedly define GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE | Jianyu Zhan | 1 file, -5/+2
GFP_USER, GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE are defined cumulatively, as their names imply: GFP_USER; GFP_USER + __GFP_HIGHMEM = GFP_HIGHUSER; GFP_USER + __GFP_HIGHMEM + __GFP_MOVABLE = GFP_HIGHUSER_MOVABLE. So define GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE each in terms of the previous level to reflect this fact. This also makes the definitions clearer and textually warns against any future break-up of this cumulative relationship. Signed-off-by: Jianyu Zhan <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
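The resulting definitions in include/linux/gfp.h:

    #define GFP_HIGHUSER            (GFP_USER | __GFP_HIGHMEM)
    #define GFP_HIGHUSER_MOVABLE    (GFP_HIGHUSER | __GFP_MOVABLE)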
2014-12-13 | include/linux/kmemleak.h: needs slab.h | Andrew Morton | 1 file, -0/+2
include/linux/kmemleak.h: In function 'kmemleak_alloc_recursive': include/linux/kmemleak.h:43: error: 'SLAB_NOLEAKTRACE' undeclared (first use in this function) Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
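The fix is simply to pull in the slab definitions at the top of include/linux/kmemleak.h (the annotation comment is mine):

    #include <linux/slab.h>         /* for SLAB_NOLEAKTRACE */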
2014-12-13 | mm/memcontrol.c: remove the unused arg in __memcg_kmem_get_cache() | Zhang Zhen | 2 files, -4/+3
The gfp was passed in but never used in this function. Signed-off-by: Zhang Zhen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | mm: move swp_entry_t definition to include/linux/mm_types.h | Tejun Heo | 2 files, -8/+8
swp_entry_t being defined in include/linux/swap.h instead of include/linux/mm_types.h causes cyclic include dependency later when include/linux/page_cgroup.h is included from writeback path. Move the definition to include/linux/mm_types.h. While at it, reformat the comment above it. Signed-off-by: Tejun Heo <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
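The type being moved is tiny; in include/linux/mm_types.h it reads roughly as follows:

    /*
     * A swap entry has to fit into a "unsigned long", as the entry is hidden
     * in the "index" field of the swapper address space.
     */
    typedef struct {
            unsigned long val;
    } swp_entry_t;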
2014-12-13 | memory-hotplug: remove redundant call of page_to_pfn | Zhang Zhen | 1 file, -2/+2
This is just a small optimization. The start_pfn can be obtained directly as phys_index << PFN_SECTION_SHIFT, so the call to page_to_pfn() is redundant; remove it. Signed-off-by: Zhang Zhen <[email protected]> Acked-by: Yasuaki Ishimatsu <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Wang Nan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | iommu/amd: use handle_mm_fault directly | Jesse Barnes | 1 file, -31/+53
This could be useful for debug in the future if we want to track major/minor faults more closely, and also avoids the put_page trick we used with gup. In order to do this, we also track the task struct in the PASID state structure. This lets us update the appropriate task stats after the fault has been handled, and may aid with debug in the future as well. Signed-off-by: Jesse Barnes <[email protected]> Tested-by: Oded Gabbay <[email protected]> Cc: Joerg Roedel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | mm: export find_extend_vma() and handle_mm_fault() for driver use | Jesse Barnes | 2 files, -0/+3
This lets drivers like the AMD IOMMUv2 driver handle faults a bit more simply, rather than doing tricks with page refs and get_user_pages(). Signed-off-by: Jesse Barnes <[email protected]> Cc: Oded Gabbay <[email protected]> Cc: Joerg Roedel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | hugetlb: hugetlb_register_all_nodes(): add __init marker | Luiz Capitulino | 1 file, -1/+1
This function is only called during initialization. Signed-off-by: Luiz Capitulino <[email protected]> Cc: Andi Kleen <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Davidlohr Bueso <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | hugetlb: alloc_bootmem_huge_page(): use IS_ALIGNED() | Luiz Capitulino | 1 file, -1/+1
No reason to duplicate the code of an existing macro. Signed-off-by: Luiz Capitulino <[email protected]> Cc: Andi Kleen <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Davidlohr Bueso <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
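IS_ALIGNED() is the stock helper from include/linux/kernel.h:

    #define IS_ALIGNED(x, a)        (((x) & ((typeof(x))(a) - 1)) == 0)

so an open-coded power-of-two check such as addr == ALIGN(addr, size) collapses to IS_ALIGNED(addr, size) (the before/after shapes here are illustrative, not quoted from the patch).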
2014-12-13 | hugetlb: fix hugepages= entry in kernel-parameters.txt | Luiz Capitulino | 1 file, -3/+1
The hugepages= entry in kernel-parameters.txt states that 1GB pages can only be allocated at boot time and not freed afterwards. This is not true since commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation at runtime"), at least for x86_64. Instead of adding arch-specific observations to the hugepages= entry, this commit just drops the out of date information. Further information about arch-specific support and available features can be obtained in the hugetlb documentation. Signed-off-by: Luiz Capitulino <[email protected]> Cc: Andi Kleen <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Davidlohr Bueso <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | memcg: turn memcg_kmem_skip_account into a bit field | Vladimir Davydov | 2 files, -35/+7
It isn't supposed to stack, so turn it into a bit-field to save 4 bytes on the task_struct. Also, remove the memcg_stop/resume_kmem_account helpers - it is clearer to set/clear the flag inline. Regarding the overwhelming comment to the helpers, which is removed by this patch too, we already have a compact yet accurate explanation in memcg_schedule_cache_create, no need in yet another one. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
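Sketched, the task_struct member becomes a one-bit field (its placement among neighbouring fields here is illustrative):

    #ifdef CONFIG_MEMCG_KMEM
            unsigned memcg_kmem_skip_account:1;
    #endif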
2014-12-13 | memcg: only check memcg_kmem_skip_account in __memcg_kmem_get_cache | Vladimir Davydov | 1 file, -28/+0
__memcg_kmem_get_cache can recurse if it calls kmalloc (which it does if the cgroup's kmem cache doesn't exist), because kmalloc may call __memcg_kmem_get_cache internally again. To avoid the recursion, we use the task_struct->memcg_kmem_skip_account flag. However, there's no need to check the flag in memcg_kmem_newpage_charge, because there is no way that function could result in recursion when called from memcg_kmem_get_cache. So let's remove the redundant code. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | memcg: zap kmem_account_flags | Vladimir Davydov | 1 file, -21/+10
The only such flag is KMEM_ACCOUNTED_ACTIVE, but it's set iff mem_cgroup->kmemcg_id is initialized, so we can check kmemcg_id instead of having a separate flags field. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
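With the flags field gone, the active test reduces to the id check; a sketch of the resulting helper in mm/memcontrol.c:

    static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
    {
            return memcg->kmemcg_id >= 0;
    }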
2014-12-13 | mm: mincore: add hwpoison page handle | Weijie Yang | 1 file, -2/+5
When the encountered pte is a swap entry, the current code handles two cases: migration and a normal swap entry, but there is a third case: a hwpoisoned page. This patch adds handling for hwpoisoned pages, considering a hwpoisoned page in core, the same as a migration entry. [[email protected]: coding-style fixes] Signed-off-by: Weijie Yang <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
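A sketch of the resulting swap-entry branch in mincore_pte_range() (simplified; non_swap_entry() is true for both migration and hwpoison entries, and the swap-cache probe follows the era's code only approximately):

    swp_entry_t entry = pte_to_swp_entry(pte);

    if (non_swap_entry(entry)) {
            /* migration or hwpoison entries are always uptodate */
            *vec = 1;
    } else {
            /* a real swap entry: probe the swap cache */
            *vec = mincore_page(swap_address_space(entry), entry.val);
    }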
2014-12-13 | mm/rmap: calculate page offset when needed | Davidlohr Bueso | 1 file, -2/+4
Call page_to_pgoff() to get the page offset only once we are sure we actually need it, after the very obvious initial function checks have passed. A trivial micro-optimization that potentially saves some cycles. Signed-off-by: Davidlohr Bueso <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | mm/debug-pagealloc: cleanup page guard code | Joonsoo Kim | 1 file, -19/+19
The page guard is used by the debug-pagealloc feature. Currently it is open-coded, but abstracting it makes the core page allocator code more readable. There is no functional difference. Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Gioh Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | mm/memblock.c: refactor functions to set/clear MEMBLOCK_HOTPLUG | Tony Luck | 1 file, -23/+20
There is a lot of duplication in the rubric around actually setting or clearing a mem region flag. Create a new helper function to do this and reduce each of memblock_mark_hotplug() and memblock_clear_hotplug() to a single line. This will be useful if someone were to add a new mem region flag - which I hope to be doing some day soon. But it looks like a plausible cleanup even without that - so I'd like to get it out of the way now. Signed-off-by: Tony Luck <[email protected]> Cc: Santosh Shilimkar <[email protected]> Cc: Tang Chen <[email protected]> Cc: Grygorii Strashko <[email protected]> Cc: Zhang Yanfei <[email protected]> Cc: Philipp Hachtmann <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Emil Medve <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
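The shape of the refactoring, sketched close to the patch (memblock_isolate_range() first splits regions at the range boundaries):

    static int __init_memblock memblock_setclr_flag(phys_addr_t base,
                                    phys_addr_t size, int set, int flag)
    {
            struct memblock_type *type = &memblock.memory;
            int i, ret, start_rgn, end_rgn;

            ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
            if (ret)
                    return ret;

            for (i = start_rgn; i < end_rgn; i++)
                    if (set)
                            memblock_set_region_flags(&type->regions[i], flag);
                    else
                            memblock_clear_region_flags(&type->regions[i], flag);

            memblock_merge_regions(type);
            return 0;
    }

    int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
    {
            return memblock_setclr_flag(base, size, 1, MEMBLOCK_HOTPLUG);
    }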
2014-12-13 | memcg: do not abuse memcg_kmem_skip_account | Vladimir Davydov | 1 file, -7/+0
task_struct->memcg_kmem_skip_account was initially introduced to avoid recursion during kmem cache creation: memcg_kmem_get_cache, which is called by kmem_cache_alloc to determine the per-memcg cache to account allocation to, may issue lazy cache creation if the needed cache doesn't exist, which means issuing yet another kmem_cache_alloc. We can't just pass a flag to the nested kmem_cache_alloc disabling kmem accounting, because there are hidden allocations, e.g. in INIT_WORK. So we introduced a flag on the task_struct, memcg_kmem_skip_account, making memcg_kmem_get_cache return immediately. By its nature, the flag may also be used to disable accounting for allocations shared among different cgroups, and currently it is used this way in memcg_activate_kmem. Using it like this looks like abusing it to me. If we want to disable accounting for some allocations (which we will definitely want one day), we should either add GFP_NO_MEMCG or GFP_MEMCG flag in order to blacklist/whitelist some allocations. For now, let's simply remove memcg_stop/resume_kmem_account from memcg_activate_kmem. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | memcg: don't check mm in __memcg_kmem_{get_cache,newpage_charge} | Vladimir Davydov | 1 file, -2/+2
We have already ensured that the current task has an mm in memcg_kmem_should_charge, so there is no need to double check. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | memcg: __mem_cgroup_free: remove stale disarm_static_keys comment | Vladimir Davydov | 1 file, -11/+0
cpuset code stopped using cgroup_lock in favor of cpuset_mutex long ago. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | mm: cma: align to physical address, not CMA region position | Gregory Fong | 1 file, -3/+16
The alignment in cma_alloc() was done with respect to the bitmap, not the physical address. This is a problem when, for example: a device requires 16M (order 12) alignment and the CMA region is not 16M aligned. In such a case the CMA region may start at, say, 0x2f800000, but any allocation you make from it will be aligned relative to that base rather than to a physical boundary. Requesting an allocation of 32M with 16M alignment will result in an allocation from 0x2f800000 to 0x31800000, which doesn't work very well if your device genuinely requires 16M alignment. Change the code to use bitmap_find_next_zero_area_off() to account for the difference in alignment between reserve-time and alloc-time. Signed-off-by: Gregory Fong <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Kukjin Kim <[email protected]> Cc: Laurent Pinchart <[email protected]> Cc: Laura Abbott <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | lib: bitmap: add alignment offset for bitmap_find_next_zero_area() | Michal Nazarewicz | 2 files, -16/+44
Add a bitmap_find_next_zero_area_off() function which works like the bitmap_find_next_zero_area() function, except that it allows an offset to be specified when alignment is checked. This lets the caller request a bit such that its number plus the offset is aligned according to the mask. [[email protected]: Retrieved from https://patchwork.linuxtv.org/patch/6254/ and updated documentation] Signed-off-by: Michal Nazarewicz <[email protected]> Signed-off-by: Kyungmin Park <[email protected]> Signed-off-by: Marek Szyprowski <[email protected]> Signed-off-by: Gregory Fong <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Kukjin Kim <[email protected]> Cc: Laurent Pinchart <[email protected]> Cc: Laura Abbott <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
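The new primitive's signature, with the old function becoming a trivial wrapper (as in include/linux/bitmap.h):

    extern unsigned long bitmap_find_next_zero_area_off(unsigned long *map,
                                                        unsigned long size,
                                                        unsigned long start,
                                                        unsigned int nr,
                                                        unsigned long align_mask,
                                                        unsigned long align_offset);

    static inline unsigned long
    bitmap_find_next_zero_area(unsigned long *map, unsigned long size,
                               unsigned long start, unsigned int nr,
                               unsigned long align_mask)
    {
            return bitmap_find_next_zero_area_off(map, size, start, nr,
                                                  align_mask, 0);
    }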
2014-12-13 | mm/memory.c: share the i_mmap_rwsem | Davidlohr Bueso | 1 file, -2/+2
The unmap_mapping_range family of functions do the unmapping of user pages (ultimately via zap_page_range_single) without touching the actual interval tree, thus share the lock. Signed-off-by: Davidlohr Bueso <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Srikar Dronamraju <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | mm/nommu: share the i_mmap_rwsem | Davidlohr Bueso | 1 file, -5/+4
Shrinking/truncate logic can call nommu_shrink_inode_mappings() to verify that any shared mappings of the inode in question aren't broken (dead zone). AFAICT the only user is ramfs, which uses it to handle the size-change attribute. Sharing the lock here is pretty much a no-brainer. Signed-off-by: Davidlohr Bueso <[email protected]> Acked-by: "Kirill A. Shutemov" <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Srikar Dronamraju <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | mm/memory-failure: share the i_mmap_rwsem | Davidlohr Bueso | 1 file, -2/+2
No brainer conversion: collect_procs_file() only schedules a process for later kill, share the lock, similarly to the anon vma variant. Signed-off-by: Davidlohr Bueso <[email protected]> Acked-by: "Kirill A. Shutemov" <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Srikar Dronamraju <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | mm/xip: share the i_mmap_rwsem | Davidlohr Bueso | 1 file, -14/+9
__xip_unmap() will remove the xip sparse page from the cache and take down pte mapping, without altering the interval tree, thus share the i_mmap_rwsem when searching for the ptes to unmap. Additionally, tidy up the function a bit and make variables only local to the interval tree walk loop. Signed-off-by: Davidlohr Bueso <[email protected]> Acked-by: "Kirill A. Shutemov" <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Srikar Dronamraju <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | uprobes: share the i_mmap_rwsem | Davidlohr Bueso | 1 file, -2/+2
Both register and unregister call build_map_info() in order to create the list of mappings before installing or removing breakpoints for every mm which maps file backed memory. As such, there is no reason to hold the i_mmap_rwsem exclusively, so share it and allow concurrent readers to build the mapping data. Signed-off-by: Davidlohr Bueso <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: "Kirill A. Shutemov" <[email protected]> Cc: Oleg Nesterov <[email protected]> Acked-by: Hugh Dickins <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Rik van Riel <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | mm/rmap: share the i_mmap_rwsem | Davidlohr Bueso | 2 files, -3/+13
Similarly to the anon memory counterpart, we can share the mapping's lock ownership, as the interval tree is not modified during the walk; only the file page is. Signed-off-by: Davidlohr Bueso <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: "Kirill A. Shutemov" <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Srikar Dronamraju <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
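The read-side helpers this adds next to the write-side ones (a sketch of include/linux/fs.h):

    static inline void i_mmap_lock_read(struct address_space *mapping)
    {
            down_read(&mapping->i_mmap_rwsem);
    }

    static inline void i_mmap_unlock_read(struct address_space *mapping)
    {
            up_read(&mapping->i_mmap_rwsem);
    }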
2014-12-13 | mm: convert i_mmap_mutex to rwsem | Davidlohr Bueso | 10 files, -29/+30
The i_mmap_mutex is a close cousin of the anon vma lock, both protecting similar data, one for file backed pages and the other for anon memory. To this end, this lock can also be a rwsem. In addition, there are some important opportunities to share the lock when there are no tree modifications. This conversion is straightforward. For now, all users take the write lock. [[email protected]: update fremap.c] Signed-off-by: Davidlohr Bueso <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Acked-by: "Kirill A. Shutemov" <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Srikar Dronamraju <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
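After the conversion, the i_mmap_lock_write()/i_mmap_unlock_write() helpers (introduced two commits below) simply wrap the rwsem:

    static inline void i_mmap_lock_write(struct address_space *mapping)
    {
            down_write(&mapping->i_mmap_rwsem);
    }

    static inline void i_mmap_unlock_write(struct address_space *mapping)
    {
            up_write(&mapping->i_mmap_rwsem);
    }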
2014-12-13 | mm: use new helper functions around the i_mmap_mutex | Davidlohr Bueso | 12 files, -40/+40
Convert all open coded mutex_lock/unlock calls to the i_mmap_[lock/unlock]_write() helpers. Signed-off-by: Davidlohr Bueso <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: "Kirill A. Shutemov" <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Srikar Dronamraju <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | mm,fs: introduce helpers around the i_mmap_mutex | Davidlohr Bueso | 1 file, -0/+10
This series is a continuation of the conversion of the i_mmap_mutex to a rwsem, following what we have for the anon memory counterpart, and incorporating Hugh's feedback from the first iteration. Ultimately, the most obvious paths that require exclusive ownership of the lock are those that modify the VMA interval tree, via the vma_interval_tree_insert() and vma_interval_tree_remove() families. Cases such as unmapping, where the pte contents are changed but the tree remains untouched, should make it safe to share the i_mmap_rwsem. As such, the code is of course straightforward; however, the devil is very much in the details. While it has been tested on a number of workloads without anything exploding, I would not be surprised if there are some less documented/known assumptions about the lock that could suffer from these changes. Or maybe I'm just missing something, but either way I believe it's at the point where it could use more eyes and hopefully some time in linux-next. Because the lock type conversion is the heart of this patchset, it's worth noting a few comparisons between mutex vs rwsem (xadd):

(i) Same size, no extra footprint.
(ii) Both have CONFIG_XXX_SPIN_ON_OWNER capabilities for exclusive lock ownership.
(iii) Both can be slightly unfair wrt exclusive ownership, with writer lock stealing properties, not necessarily respecting FIFO order for granting the lock when contended.
(iv) Mutexes can be slightly faster than rwsems when the lock is non-contended.
(v) Both suck at performance for debug (slowpaths), which shouldn't matter anyway.

Sharing the lock is obviously beneficial, and rwsem writer ownership is close enough to mutexes. The biggest winner of these changes is migration. As for concrete numbers, the following performance results are for a 4-socket 60-core IvyBridge-EX with 130Gb of RAM. Both the alltests and disk (xfs+ramdisk) workloads of the aim7 suite do quite well with this set, with a steady ~60% throughput (jpm) increase for alltests and up to ~30% for disk at high amounts of concurrency. Lower counts of workload users (< 100) do not show much difference at all, so at least no regressions.
                  3.18-rc1              3.18-rc1-i_mmap_rwsem
alltests-100    17918.72 (  0.00%)    28417.97 ( 58.59%)
alltests-200    16529.39 (  0.00%)    26807.92 ( 62.18%)
alltests-300    16591.17 (  0.00%)    26878.08 ( 62.00%)
alltests-400    16490.37 (  0.00%)    26664.63 ( 61.70%)
alltests-500    16593.17 (  0.00%)    26433.72 ( 59.30%)
alltests-600    16508.56 (  0.00%)    26409.20 ( 59.97%)
alltests-700    16508.19 (  0.00%)    26298.58 ( 59.31%)
alltests-800    16437.58 (  0.00%)    26433.02 ( 60.81%)
alltests-900    16418.35 (  0.00%)    26241.61 ( 59.83%)
alltests-1000   16369.00 (  0.00%)    26195.76 ( 60.03%)
alltests-1100   16330.11 (  0.00%)    26133.46 ( 60.03%)
alltests-1200   16341.30 (  0.00%)    26084.03 ( 59.62%)
alltests-1300   16304.75 (  0.00%)    26024.74 ( 59.61%)
alltests-1400   16231.08 (  0.00%)    25952.35 ( 59.89%)
alltests-1500   16168.06 (  0.00%)    25850.58 ( 59.89%)
alltests-1600   16142.56 (  0.00%)    25767.42 ( 59.62%)
alltests-1700   16118.91 (  0.00%)    25689.58 ( 59.38%)
alltests-1800   16068.06 (  0.00%)    25599.71 ( 59.32%)
alltests-1900   16046.94 (  0.00%)    25525.92 ( 59.07%)
alltests-2000   16007.26 (  0.00%)    25513.07 ( 59.38%)
disk-100         7582.14 (  0.00%)     7257.48 ( -4.28%)
disk-200         6962.44 (  0.00%)     7109.15 (  2.11%)
disk-300         6435.93 (  0.00%)     6904.75 (  7.28%)
disk-400         6370.84 (  0.00%)     6861.26 (  7.70%)
disk-500         6353.42 (  0.00%)     6846.71 (  7.76%)
disk-600         6368.82 (  0.00%)     6806.75 (  6.88%)
disk-700         6331.37 (  0.00%)     6796.01 (  7.34%)
disk-800         6324.22 (  0.00%)     6788.00 (  7.33%)
disk-900         6253.52 (  0.00%)     6750.43 (  7.95%)
disk-1000        6242.53 (  0.00%)     6855.11 (  9.81%)
disk-1100        6234.75 (  0.00%)     6858.47 ( 10.00%)
disk-1200        6312.76 (  0.00%)     6845.13 (  8.43%)
disk-1300        6309.95 (  0.00%)     6834.51 (  8.31%)
disk-1400        6171.76 (  0.00%)     6787.09 (  9.97%)
disk-1500        6139.81 (  0.00%)     6761.09 ( 10.12%)
disk-1600        4807.12 (  0.00%)     6725.33 ( 39.90%)
disk-1700        4669.50 (  0.00%)     5985.38 ( 28.18%)
disk-1800        4663.51 (  0.00%)     5972.99 ( 28.08%)
disk-1900        4674.31 (  0.00%)     5949.94 ( 27.29%)
disk-2000        4668.36 (  0.00%)     5834.93 ( 24.99%)

In addition, there is a 67.5% increase in successfully migrated NUMA pages, thus improving node locality. The patch layout is simple but designed for bisection (in case reversion is needed if the changes break upstream) and easier review:

o Patches 1-4 convert the i_mmap lock from mutex to rwsem.
o Patches 5-10 share the lock in specific paths; each patch details the rationale behind why it should be safe.

This patchset has been tested with: postgres 9.4 (with brand new hugetlb support), the hugetlbfs test suite (all tests pass; in fact more tests pass with these changes than with an upstream kernel), ltp, aim7 benchmarks, memcached, and iozone with the -B option for mmap'ing. *Untested* paths are nommu, memory-failure, uprobes and xip.

This patch (of 8): Various parts of the kernel acquire and release this mutex, so add i_mmap_lock_write() and i_mmap_unlock_write() helper functions that will encapsulate this logic. The next patch will make use of these. Signed-off-by: Davidlohr Bueso <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Acked-by: "Kirill A. Shutemov" <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Srikar Dronamraju <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | MAINTAINERS: update Xiubo's email address | Xiubo Li | 1 file, -1/+1
My current email address will be gone shortly, update my email to be a gmail one. Signed-off-by: Xiubo Li <[email protected]> Cc: Timur Tabi <[email protected]> Cc: Takashi Iwai <[email protected]> Acked-by: Nicolin Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | rtc: snvs: fix build with CONFIG_PM_SLEEP disabled | Guenter Roeck | 1 file, -2/+9
Commit 7654e9d4fd8f ("drivers/rtc/rtc-snvs: fix suspend/resume") replaces SIMPLE_DEV_PM_OPS with direct declaration of snvs_rtc_pm_ops, but does so outside #ifdef CONFIG_PM_SLEEP. This causes the driver build to fail if CONFIG_PM_SLEEP is not configured. Fixes: 7654e9d4fd8f ("drivers/rtc/rtc-snvs: fix suspend/resume") Signed-off-by: Guenter Roeck <[email protected]> Cc: Sanchayan Maity <[email protected]> Cc: Alessandro Zummo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13 | arm/arm64: KVM: Don't allow creating VCPUs after vgic_initialized | Christoffer Dall | 1 file, -0/+5
When the vgic initializes its internal state it does so based on the number of VCPUs available at the time. If we allow KVM to create more VCPUs after the VGIC has been initialized, we are likely to error out in unfortunate ways later, perform buffer overflows etc. Acked-by: Marc Zyngier <[email protected]> Reviewed-by: Eric Auger <[email protected]> Signed-off-by: Christoffer Dall <[email protected]>
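A sketch of the guard this puts in the VCPU creation path (simplified; the exact placement and error label follow the patch's intent, not its verbatim text):

    if (irqchip_in_kernel(kvm) && vgic_initialized(kvm)) {
            err = -EBUSY;
            goto out;
    }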
2014-12-13 | arm/arm64: KVM: Add (new) vgic_initialized macro | Christoffer Dall | 2 files, -1/+8
Some code paths will need to check whether the internal state of the vgic has been initialized (such as when creating new VCPUs), so introduce such a macro that checks the nr_cpus field, which is set when the vgic has been initialized. Also set nr_cpus = 0 in kvm_vgic_destroy, because the error path in vgic_init() will call this function, and code should never erroneously assume the vgic to be properly initialized after an error. Acked-by: Marc Zyngier <[email protected]> Reviewed-by: Eric Auger <[email protected]> Signed-off-by: Christoffer Dall <[email protected]>
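The macro keys off the nr_cpus field (a sketch):

    #define vgic_initialized(k)     (!!((k)->arch.vgic.nr_cpus))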
2014-12-13 | arm/arm64: KVM: Rename vgic_initialized to vgic_ready | Christoffer Dall | 3 files, -6/+6
The vgic_initialized() macro currently returns the state of the vgic->ready flag, which indicates if the vgic is ready to be used when running a VM, not specifically if its internal state has been initialized. Rename the macro accordingly in preparation for a more nuanced initialization flow. Acked-by: Marc Zyngier <[email protected]> Reviewed-by: Eric Auger <[email protected]> Signed-off-by: Christoffer Dall <[email protected]>
2014-12-13 | arm/arm64: KVM: vgic: move reset initialization into vgic_init_maps() | Peter Maydell | 3 files, -50/+37
VGIC initialization currently happens in three phases:

(1) kvm_vgic_create() (triggered by userspace GIC creation)
(2) vgic_init_maps() (triggered by userspace GIC register read/write requests, or from kvm_vgic_init() if not already run)
(3) kvm_vgic_init() (triggered by first VM run)

We were doing initialization of some state to correspond with the state of a freshly-reset GIC in kvm_vgic_init(); this is too late, since it will overwrite changes made by userspace using the register access APIs before the VM is run. Move this initialization earlier, into the vgic_init_maps() phase. This fixes a bug where QEMU could successfully restore a saved VM state snapshot into a VM that had already been run, but could not restore it "from cold" using the -loadvm command line option (the symptoms being that the restored VM would run but interrupts were ignored). Finally, rename vgic_init_maps to vgic_init and kvm_vgic_init to kvm_vgic_map_resources. [ This patch was originally written by Peter Maydell, but I have modified it somewhat heavily, renaming various bits and moving code around. If something is broken, I am to be blamed. - Christoffer ] Acked-by: Marc Zyngier <[email protected]> Reviewed-by: Eric Auger <[email protected]> Signed-off-by: Peter Maydell <[email protected]> Signed-off-by: Christoffer Dall <[email protected]>
2014-12-13 | arm/arm64: KVM: Introduce stage2_unmap_vm | Christoffer Dall | 4 files, -0/+74
Introduce a new function to unmap user RAM regions in the stage2 page tables. This is needed on reboot (or when the guest turns off the MMU) to ensure we fault in pages again and make the dcache, RAM, and icache coherent. Using unmap_stage2_range for the whole guest physical range does not work, because that unmaps IO regions (such as the GIC) which will not be recreated or in the best case faulted in on a page-by-page basis. Call this function on secondary and subsequent calls to the KVM_ARM_VCPU_INIT ioctl so that a reset VCPU will detect the guest Stage-1 MMU is off when faulting in pages and make the caches coherent. Acked-by: Marc Zyngier <[email protected]> Signed-off-by: Christoffer Dall <[email protected]>
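A sketch of the new function, close to the patch (stage2_unmap_memslot() does the per-slot page table walk):

    void stage2_unmap_vm(struct kvm *kvm)
    {
            struct kvm_memslots *slots;
            struct kvm_memory_slot *memslot;
            int idx;

            idx = srcu_read_lock(&kvm->srcu);
            spin_lock(&kvm->mmu_lock);

            slots = kvm_memslots(kvm);
            kvm_for_each_memslot(memslot, slots)
                    stage2_unmap_memslot(kvm, memslot);

            spin_unlock(&kvm->mmu_lock);
            srcu_read_unlock(&kvm->srcu, idx);
    }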