2018-10-26  tools/testing/selftests/vm/gup_benchmark.c: allow user specified file  (Keith Busch, 1 file, -4/+13)
Allow a user to specify a file to map by adding a new option, '-f', providing a means to test various file backings. If not specified, the benchmark will use a private mapping of /dev/zero, which produces an anonymous mapping as before. [[email protected]: avoid using comma operator] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Kirill Shutemov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
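For illustration, a minimal user-space sketch of the mapping choice described above (names, option handling and error paths are simplified and do not mirror the selftest verbatim):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /*
     * Illustrative sketch: map either a user-supplied file or /dev/zero
     * privately. A private mapping of /dev/zero behaves like an anonymous
     * mapping, which is the fallback the commit message describes.
     */
    static void *map_backing(const char *path, size_t len)
    {
            const char *file = path ? path : "/dev/zero";
            int fd = open(file, O_RDWR);
            void *p;

            if (fd < 0) {
                    perror("open");
                    exit(1);
            }
            p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    exit(1);
            }
            close(fd);      /* the mapping stays valid after close */
            return p;
    }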
2018-10-26  tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage  (Keith Busch, 1 file, -0/+1)
If the '-w' parameter was provided, the benchmark would exit due to a missing 'break'. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Keith Busch <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
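The bug class is the classic option-parsing fall-through; a generic, self-contained sketch (option letter is illustrative, not the selftest's exact table):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Sketch of the failure mode: without the 'break', a valid '-w' falls
     * through into the default case and the program exits. */
    int main(int argc, char **argv)
    {
            int opt, write_flag = 0;

            while ((opt = getopt(argc, argv, "w")) != -1) {
                    switch (opt) {
                    case 'w':
                            write_flag = 1;
                            break;  /* this break was the missing one */
                    default:
                            fprintf(stderr, "usage: %s [-w]\n", argv[0]);
                            exit(1);
                    }
            }
            printf("write_flag=%d\n", write_flag);
            return 0;
    }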
2018-10-26  mm/gup_benchmark.c: add additional pinning methods  (Keith Busch, 2 files, -4/+37)
Provide new gup benchmark ioctl commands to run different user page pinning methods, get_user_pages_longterm() and get_user_pages(), in addition to the existing get_user_pages_fast(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Keith Busch <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
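A kernel-side sketch of the dispatch this describes (ioctl command names follow the description and are assumptions here; the surrounding loop, locals and error handling are elided, and the 'write' handling mirrors that era's get_user_pages_fast() taking a plain int write flag):

    /* Sketch only: pick the pinning method based on the ioctl command. */
    switch (cmd) {
    case GUP_FAST_BENCHMARK:
            nr = get_user_pages_fast(addr, nr, write, pages + i);
            break;
    case GUP_LONGTERM_BENCHMARK:
            nr = get_user_pages_longterm(addr, nr,
                                         write ? FOLL_WRITE : 0,
                                         pages + i, NULL);
            break;
    case GUP_BENCHMARK:
            nr = get_user_pages(addr, nr, write ? FOLL_WRITE : 0,
                                pages + i, NULL);
            break;
    default:
            return -EINVAL;
    }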
2018-10-26  mm/gup_benchmark.c: time put_page()  (Keith Busch, 2 files, -4/+11)
We'd like to measure the time to unpin user pages, so this adds a second benchmark timer on put_page, separate from get_page. Adding the field breaks this ioctl ABI, but that should be okay since this is an in-tree kernel selftest. [[email protected]: add expansion to struct gup_benchmark for future use] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Keith Busch <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  mm: don't raise MEMCG_OOM event due to failed high-order allocation  (Roman Gushchin, 2 files, -2/+6)
It was reported that on some of our machines containers were restarted with OOM symptoms without an obvious reason. Although there was almost no memory pressure and plenty of page cache, the MEMCG_OOM event was raised occasionally, causing the container management software to think that an OOM had happened. However, no tasks had been killed.

The investigation showed that the problem is caused by a failing attempt to charge a high-order page. In such a case, the OOM killer is never invoked. As shown below, it can happen under conditions which are very far from a real OOM: e.g. there is plenty of clean page cache and no memory pressure. There is no sense in raising an OOM event in this case, as it might confuse a user and lead to wrong and excessive actions (e.g. restarting the workload, as in my case).

Let's look at the charging path in try_charge(). If the memory usage is close to memory.max, which is absolutely natural for most memory cgroups, we try to reclaim some pages. Even if we were able to reclaim enough memory for the allocation, the following check can fail due to a race with another concurrent allocation:

    if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
        goto retry;

For regular pages the following condition will save us from triggering the OOM:

    if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
        goto retry;

But for a high-order allocation this condition will intentionally fail. The reason behind this is that we'll likely fall back to regular pages anyway, so it's ok and even preferred to return ENOMEM. In this case the idea of raising MEMCG_OOM looks dubious.

Fix this by moving the raising of MEMCG_OOM into mem_cgroup_oom(), after the allocation order check, so that the event won't be raised for high-order allocations. This change doesn't affect regular page allocation and charging.

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
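A simplified sketch of where the event ends up relative to the order check (illustrative only: the real mem_cgroup_oom() has a different signature and return type, and invoke_oom() below is a hypothetical stand-in):

    /* Illustrative ordering only: a failed costly (high-order) charge bails
     * out before the event is raised, so MEMCG_OOM is accounted only when
     * the OOM path is really taken. */
    static bool mem_cgroup_oom_sketch(struct mem_cgroup *memcg, int order)
    {
            if (order > PAGE_ALLOC_COSTLY_ORDER)
                    return false;                   /* fall back, no OOM event */

            memcg_memory_event(memcg, MEMCG_OOM);   /* moved here from try_charge() */
            return invoke_oom(memcg);               /* hypothetical helper */
    }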
2018-10-26  mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock  (Dave Chinner, 1 file, -18/+15)
We've recently seen a workload on XFS filesystems with a repeatable deadlock between background writeback and a multi-process application doing concurrent writes and fsyncs to a small range of a file.

    range_cyclic writeback:
        xfs_vm_writepages
          write_cache_pages
            writeback_index = 2
            cycled = 0
            ....
            find page 2 dirty
            lock Page 2
            ->writepage
              page 2 writeback
              page 2 clean
              page 2 added to bio
            no more pages

    Process 1:
        write()
          locks page 1
          dirties page 1
          locks page 2
          dirties page 1
        fsync()
        ....
        xfs_vm_writepages
          write_cache_pages
            start index 0
            find page 1 towrite
            lock Page 1
            ->writepage
              page 1 writeback
              page 1 clean
              page 1 added to bio
            find page 2 towrite
            lock Page 2
            page 2 is writeback
            <blocks>

    Process 2:
        write()
          locks page 1
          dirties page 1
        fsync()
        ....
        xfs_vm_writepages
          write_cache_pages
            start index 0

    range_cyclic writeback:
            !done && !cycled
            sets index to 0, restarts lookup
            find page 1 dirty

    Process 2:
            find page 1 towrite
            lock Page 1
            page 1 is writeback
            <blocks>

    range_cyclic writeback:
            lock Page 1
            <blocks>

DEADLOCK because:

- process 1 needs page 2 writeback to complete to make enough progress to issue IO pending for page 1
- writeback needs page 1 writeback to complete so process 2 can progress and unlock the page it is blocked on, then it can issue the IO pending for page 2
- process 2 can't make progress until process 1 issues IO for page 1

The underlying cause of the problem here is that range_cyclic writeback is processing pages in descending index order while higher-index pages are held in a structure controlled from above write_cache_pages(). The write_cache_pages() caller needs to be able to submit these pages for IO before write_cache_pages restarts writeback at mapping index 0 to avoid wcp inverting the page lock/writeback wait order.

generic_writepages() is not susceptible to this bug as it has no private context held across write_cache_pages() - filesystems using this infrastructure always submit pages in ->writepage immediately and so there is no problem with range_cyclic going back to mapping index 0. However:

    mpage_writepages() has a private bio context
    exofs_writepages() has page_collect
    fuse_writepages() has fuse_fill_wb_data
    nfs_writepages() has nfs_pageio_descriptor
    xfs_vm_writepages() has xfs_writepage_ctx

All of these ->writepages implementations can hold pages under writeback in their private structures until write_cache_pages() returns, and hence they are all susceptible to this deadlock.

Also worth noting is that ext4 has its own bastardised version of write_cache_pages() and so it /may/ have an equivalent deadlock. I looked at the code long enough to understand that it has a similar retry loop for range_cyclic writeback reaching the end of the file and then promptly ran away before my eyes bled too much. I'll leave it for the ext4 developers to determine if their code actually has this deadlock and how to fix it if it does.

There are a few ways I can see to avoid this deadlock. There are probably more, but these are the first I've thought of:

1. get rid of range_cyclic altogether

2. range_cyclic always stops at EOF, and we start again from writeback index 0 on the next call into write_cache_pages()

2a. wcp also returns EAGAIN to ->writepages implementations to indicate range cyclic has hit EOF. writepages implementations can then flush the current context and call wcp again to continue. i.e. lift the retry into the ->writepages implementation

3. range_cyclic uses trylock_page() rather than lock_page(), and it skips pages it can't lock without blocking. It will already do this for pages under writeback, so this seems like a no-brainer

3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid blocking as per pages under writeback.

I don't think #1 is an option - range_cyclic prevents frequently dirtied lower file offsets from starving background writeback of rarely touched higher file offsets.

#2 is simple, and I don't think it will have any impact on performance as going back to the start of the file implies an immediate seek. We'll have exactly the same number of seeks if we switch writeback to another inode, and then come back to this one later and restart from index 0.

#2a is pretty much "status quo without the deadlock". Moving the retry loop up into the wcp caller means we can issue IO on the pending pages before calling wcp again, and so avoid locking or waiting on pages in the wrong order. I'm not convinced we need to do this given that we get the same thing from #2 on the next writeback call from the writeback infrastructure.

#3 is really just a band-aid - it doesn't fix the access/wait inversion problem, just prevents it from becoming a deadlock situation. I'd prefer we fix the inversion, not sweep it under the carpet like this.

#3a is really an optimisation that just so happens to include the band-aid fix of #3.

So it seems that the simplest way to fix this issue is to implement solution #2.

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Nicholas Piggin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
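A sketch of what option #2 amounts to inside write_cache_pages() (heavily simplified; the main loop and the rest of the function's state are elided, and the exact variable names are assumptions based on the description):

    /* Sketch of option #2: a range_cyclic pass resumes from the previous
     * writeback_index and scans to EOF only. If it runs out of pages before
     * running out of work, the wrap back to index 0 is deferred to the next
     * call, after the caller has submitted the pages it is holding. */
    if (wbc->range_cyclic) {
            index = mapping->writeback_index;   /* resume where we left off */
            end = -1;                           /* scan to EOF only */
    }

    /* ... main loop: lock and write back dirty pages in ascending index
     *     order until 'done' or index passes 'end' ... */

    if (wbc->range_cyclic && !done)
            done_index = 0;         /* wrap on the NEXT call, not in this one */
    mapping->writeback_index = done_index;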
2018-10-26  mm: move mirrored memory specific code outside of memmap_init_zone  (Pavel Tatashin, 1 file, -38/+33)
memmap_init_zone() is getting complex, because it is called from different contexts (hotplug and boot) and also because it must handle some architecture quirks. One of them is mirrored memory. Move the code that decides whether to skip mirrored memory outside of memmap_init_zone, into a separate function. [[email protected]: uninline overlap_memmap_init()] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Pavel Tatashin <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Abdul Haleem <[email protected]> Cc: Baoquan He <[email protected]> Cc: Daniel Jordan <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Souptick Joarder <[email protected]> Cc: Steven Sistare <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  mm: calculate deferred pages after skipping mirrored memory  (Pavel Tatashin, 1 file, -20/+25)
update_defer_init() should be called only when a struct page is about to be initialized, because it counts the number of initialized struct pages, but we may skip struct pages if there is some mirrored memory. So move update_defer_init() after the check for mirrored memory. Also, rename update_defer_init() to defer_init() and reverse the returned boolean to emphasize that this is a boolean function which tells whether the rest of memmap initialization should be deferred. Make this function self-contained: do not pass the number of already initialized pages in this zone; use static counters instead. I found this bug by reading the code. The effect is that fewer than expected struct pages are initialized early in boot, and it is possible that in some corner cases we may fail to boot when mirrored pages are used. The deferred-on-demand code should somewhat mitigate this. But this still brings some inconsistencies compared to booting without mirrored pages, so it is better to fix. [[email protected]: add comment about defer_init's lack of locking] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: make defer_init non-inline, __meminit] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Pavel Tatashin <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Abdul Haleem <[email protected]> Cc: Baoquan He <[email protected]> Cc: Daniel Jordan <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Souptick Joarder <[email protected]> Cc: Steven Sistare <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Pasha Tatashin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  mm: make memmap_init a proper function  (Pavel Tatashin, 2 files, -5/+5)
memmap_init is sometimes a macro sometimes a function based on __HAVE_ARCH_MEMMAP_INIT. It is only a function on ia64. Make memmap_init a weak function instead, and let ia64 redefine it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Pavel Tatashin <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Steven Sistare <[email protected]> Cc: Daniel Jordan <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Souptick Joarder <[email protected]> Cc: Baoquan He <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Abdul Haleem <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Pasha Tatashin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
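For reference, the generic weak definition ends up looking roughly like this (a sketch; the parameter list follows memmap_init_zone() as it was at the time, and an architecture such as ia64 can simply provide its own non-weak memmap_init()):

    /* Sketch: a __weak generic function replaces the __HAVE_ARCH_MEMMAP_INIT
     * macro dance; ia64 overrides it by defining its own memmap_init(). */
    void __meminit __weak memmap_init(unsigned long size, int nid,
                                      unsigned long zone,
                                      unsigned long start_pfn)
    {
            memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
    }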
2018-10-26  mm/memcontrol.c: convert mem_cgroup_id::ref to refcount_t type  (Kirill Tkhai, 2 files, -7/+5)
This will allow us to use the generic refcount_t interfaces to check for counter overflow instead of the currently existing VM_BUG_ON(). The only difference after the patch is that VM_BUG_ON() may cause BUG(), while refcount_t fires a WARN(). But this does not seem significant here, since such problems are usually caught by syzbot with panic-on-warn enabled. Link: http://lkml.kernel.org/r/153910718919.7006.13400779039257185427.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Andrea Parri <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
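The conversion boils down to swapping the atomic_t pattern for the refcount_t API, which WARNs on overflow/underflow by itself; a minimal generic sketch of the API usage (not the memcg code itself):

    #include <linux/refcount.h>
    #include <linux/slab.h>

    struct obj {
            refcount_t ref;
    };

    static void obj_init(struct obj *o)
    {
            refcount_set(&o->ref, 1);               /* initial reference */
    }

    static void obj_get(struct obj *o)
    {
            refcount_inc(&o->ref);                  /* WARNs instead of silently overflowing */
    }

    static void obj_put(struct obj *o)
    {
            if (refcount_dec_and_test(&o->ref))     /* true when the last reference is dropped */
                    kfree(o);
    }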
2018-10-26  mm/page_alloc.c: initialize num_movable in move_freepages()  (David Rientjes, 1 file, -4/+3)
If move_freepages_block() returns 0 because !zone_spans_pfn(), *num_movable can hold the value from the stack because it does not get initialized in move_freepages(). Move the initialization to move_freepages_block() to guarantee the value actually makes sense. This currently doesn't affect its only caller where num_movable != NULL, so this is not a bug fix, just added robustness. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Rientjes <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Greg Thelen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
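The shape of the change, as a sketch (the body of the real function is elided; the point is only where the initialization now happens):

    /* Sketch: make *num_movable well defined even when the function bails
     * out early because the range does not sit within a single zone. */
    int move_freepages_block(struct zone *zone, struct page *page,
                             int migratetype, int *num_movable)
    {
            if (num_movable)
                    *num_movable = 0;       /* initialize before any early return */

            /* ... range checks and the call down to move_freepages()
             *     follow, unchanged ... */
            return 0;
    }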
2018-10-26  mm/zsmalloc.c: fix fall-through annotation  (Gustavo A. R. Silva, 1 file, -1/+1)
Replace "fallthru" with a proper "fall through" annotation. This fix is part of the ongoing efforts to enabling -Wimplicit-fallthrough Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Gustavo A. R. Silva <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  userfaultfd: selftest: recycle lock threads first  (Peter Xu, 1 file, -5/+6)
Now we recycle the uffd servicing threads earlier than the lock threads. It might happen that the lock thread is still blocked on a pthread mutex while the servicing thread for that cpu has already quit, so the lock thread will be blocked forever and hang the test program. To fix the possible race, recycle the lock threads first. This never happens with the current missing-only tests, but when I start to run the write-protection tests (the feature is not yet posted upstream) it happens on every run, possibly because in that new test we need to service two page faults for each lock operation. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Acked-by: Mike Rapoport <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Zi Yan <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
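A minimal sketch of the join-order fix (array and function names are illustrative, not the selftest's; the real code also wakes/cancels the servicing threads before joining them):

    #include <pthread.h>

    /*
     * Sketch of the ordering: join the lock threads while the uffd
     * servicing threads are still running, so any page fault taken while
     * acquiring the mutex can still be resolved; only then recycle the
     * servicing threads.
     */
    static void recycle_threads(pthread_t *lock_threads,
                                pthread_t *uffd_threads, int nr_cpus)
    {
            int i;

            for (i = 0; i < nr_cpus; i++)
                    pthread_join(lock_threads[i], NULL);   /* first: lock threads */

            for (i = 0; i < nr_cpus; i++)
                    pthread_join(uffd_threads[i], NULL);   /* then: servicing threads */
    }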
2018-10-26  userfaultfd: selftest: generalize read and poll  (Peter Xu, 1 file, -34/+43)
We do very similar things in read and poll modes, but the code is duplicated. Share the code for reading the message and handling the page fault to make the code cleaner. This also resolves a previous mismatch of behaviors between the two modes, where the old code:

- did not check the EAGAIN case in read() mode
- ignored the BOUNCE_VERIFY check in read() mode

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Acked-by: Mike Rapoport <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Zi Yan <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  userfaultfd: selftest: cleanup help messages  (Peter Xu, 1 file, -18/+28)
Firstly, the help text in the comment region is obsolete; we now support three parameters. While at it, update it and move it into the program's help message. The help messages dumped here and there are obsolete too; use a single usage() helper. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Acked-by: Mike Rapoport <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Zi Yan <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  mm/vmstat.c: assert that vmstat_text is in sync with stat_items_size  (Jann Horn, 1 file, -0/+2)
Having two gigantic arrays that must manually be kept in sync, including ifdefs, isn't exactly robust. To make it easier to catch such issues in the future, add a BUILD_BUG_ON(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jann Horn <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Kemi Wang <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
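The added assertion is of this general form (a sketch; the real check lives where stat_items_size is computed):

    /* Sketch: fail the build if the text array and the counter layout
     * drift apart, instead of silently printing mismatched lines. */
    BUILD_BUG_ON(stat_items_size !=
                 ARRAY_SIZE(vmstat_text) * sizeof(unsigned long));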
2018-10-26  mm/memory.c: recheck page table entry with page table lock held  (Aneesh Kumar K.V, 1 file, -4/+30)
We clear the pte temporarily during a read/modify/write update of the pte. If we take a page fault while the pte is cleared, the application can get SIGBUS. One such case is remap_pfn_range without a backing vm_ops->fault callback; do_fault will return SIGBUS in that case.

    cpu 0                                       cpu 1
    mprotect()
      ptep_modify_prot_start() / pte cleared
      .
      .                                         page fault
      .
      ptep_modify_prot_commit()

Fix this by taking the page table lock and rechecking for pte_none.

[[email protected]: fix crash observed with syzkaller run] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Willem de Bruijn <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Ido Schimmel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
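A sketch of the recheck pattern in the fault path (simplified; mm, pmd and address are assumed from the surrounding fault context, and the real code carries additional bookkeeping):

    /* Sketch: only report SIGBUS if the pte is still none while holding
     * the page table lock; otherwise the fault raced with a transient
     * clear done by ptep_modify_prot_start()/_commit(). */
    pte_t *ptep;
    spinlock_t *ptl;
    vm_fault_t ret;

    ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
    if (pte_none(*ptep))
            ret = VM_FAULT_SIGBUS;  /* really unmapped */
    else
            ret = VM_FAULT_NOPAGE;  /* raced with mprotect(); nothing to do */
    pte_unmap_unlock(ptep, ptl);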
2018-10-26  mm: dax: add comment for PFN_SPECIAL  (Yang Shi, 1 file, -0/+2)
The comment for PFN_SPECIAL is missing in pfn_t.h. Add a comment to be consistent with the other pfn flags. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Suggested-by: Dan Williams <[email protected]> Reviewed-by: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  mm: brk: downgrade mmap_sem to read when shrinking  (Yang Shi, 1 file, -11/+35)
Besides munmap(), brk might also be used to shrink a memory mapping, so it may hold write mmap_sem for a long time when shrinking a large mapping, as the commit ("mm: mmap: zap pages with read mmap_sem in munmap") described. brk() will not manipulate vmas anymore after the __do_munmap() call for the mapping-shrink use case. But it may set mm->brk after __do_munmap(), which needs to hold write mmap_sem. However, a simple trick can work around this: set mm->brk before __do_munmap(), then restore the original value if __do_munmap() fails. With this trick, it is safe to downgrade to read mmap_sem. So the same optimization, which downgrades mmap_sem to read for zapping pages, is also feasible and reasonable for this case. The period of holding exclusive mmap_sem for shrinking a large mapping is reduced significantly with this optimization. [[email protected]: tweak comment] [[email protected]: fix unsigned compare against 0 issue] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Colin Ian King <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
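A sketch of the trick described above (heavily simplified: the 'downgrade' parameter of __do_munmap() and the "return 1 when downgraded" convention come from this patch series, and several locals and error paths are elided or assumed):

    /* Sketch of the brk-shrink path: set mm->brk before unmapping so no
     * vma manipulation is needed afterwards, then let __do_munmap()
     * downgrade mmap_sem to read for the page zapping. */
    LIST_HEAD(uf);
    unsigned long origbrk;
    bool downgraded = false;
    int ret;

    if (down_write_killable(&mm->mmap_sem))
            return -EINTR;

    origbrk = mm->brk;
    mm->brk = brk;                          /* set before __do_munmap() */

    ret = __do_munmap(mm, newbrk, oldbrk - newbrk, &uf, true /* downgrade */);
    if (ret < 0) {
            mm->brk = origbrk;              /* restore on failure */
            up_write(&mm->mmap_sem);
            return ret;
    }
    downgraded = (ret == 1);                /* 1: lock downgraded to read */

    if (downgraded)
            up_read(&mm->mmap_sem);
    else
            up_write(&mm->mmap_sem);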
2018-10-26  mm: mremap: downgrade mmap_sem to read when shrinking  (Yang Shi, 3 files, -6/+20)
Besides munmap, mremap might also be used to shrink a memory mapping, so it may hold write mmap_sem for a long time when shrinking a large mapping, as the commit ("mm: mmap: zap pages with read mmap_sem in munmap") described. mremap() will not manipulate vmas anymore after the __do_munmap() call for the mapping-shrink use case, so it is safe to downgrade to read mmap_sem. So the same optimization, which downgrades mmap_sem to read for zapping pages, is also feasible and reasonable for this case. The period of holding exclusive mmap_sem for shrinking a large mapping is reduced significantly with this optimization. MREMAP_FIXED and MREMAP_MAYMOVE are more complicated to adapt to this optimization since they need to manipulate vmas after do_munmap(), and downgrading mmap_sem there may create a race window. Simple mapping shrink is the low-hanging fruit, and together with munmap it may cover most unmap cases. [[email protected]: tweak comment] [[email protected]: fix unsigned compare against 0 issue] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Colin Ian King <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  mm/filemap.c: use vmf_error()  (Souptick Joarder, 1 file, -3/+1)
This code can be replaced with the new inline vmf_error(). Link: http://lkml.kernel.org/r/20180927171411.GA23331@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  hugetlb: introduce generic version of huge_ptep_get  (Alexandre Ghiti, 11 files, -37/+10)
ia64, mips, parisc, powerpc, sh, sparc, x86 architectures use the same version of huge_ptep_get, so move this generic implementation into asm-generic/hugetlb.h. [[email protected]: fix ARM 3level page tables] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexandre Ghiti <[email protected]> Reviewed-by: Luiz Capitulino <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Tested-by: Helge Deller <[email protected]> [parisc] Acked-by: Catalin Marinas <[email protected]> [arm64] Acked-by: Paul Burton <[email protected]> [MIPS] Acked-by: Ingo Molnar <[email protected]> [x86] Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David S. Miller <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James E.J. Bottomley <[email protected]> Cc: James Hogan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
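The pattern used throughout this series looks roughly like the sketch below: a generic fallback guarded by an __HAVE_ARCH_HUGE_* define, which an architecture sets when it needs its own version.

    /* asm-generic/hugetlb.h: generic fallback, used unless the arch
     * advertises its own implementation. */
    #ifndef __HAVE_ARCH_HUGE_PTEP_GET
    static inline pte_t huge_ptep_get(pte_t *ptep)
    {
            return *ptep;
    }
    #endif

    /* arch/<arch>/include/asm/hugetlb.h (sketch): an architecture that
     * needs special handling defines the macro and its own function. */
    #define __HAVE_ARCH_HUGE_PTEP_GET
    static inline pte_t huge_ptep_get(pte_t *ptep)
    {
            /* arch specific handling, e.g. contiguous hugepage layouts */
            return *ptep;
    }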
2018-10-26  hugetlb: introduce generic version of huge_ptep_set_access_flags()  (Alexandre Ghiti, 10 files, -28/+14)
arm, ia64, sh, x86 architectures use the same version of huge_ptep_set_access_flags, so move this generic implementation into asm-generic/hugetlb.h. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexandre Ghiti <[email protected]> Reviewed-by: Luiz Capitulino <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Tested-by: Helge Deller <[email protected]> [parisc] Acked-by: Catalin Marinas <[email protected]> [arm64] Acked-by: Paul Burton <[email protected]> [MIPS] Acked-by: Ingo Molnar <[email protected]> [x86] Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David S. Miller <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James E.J. Bottomley <[email protected]> Cc: James Hogan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  hugetlb: introduce generic version of huge_ptep_set_wrprotect()  (Alexandre Ghiti, 13 files, -42/+13)
arm, ia64, mips, powerpc, sh, x86 architectures use the same version of huge_ptep_set_wrprotect, so move this generic implementation into asm-generic/hugetlb.h. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexandre Ghiti <[email protected]> Reviewed-by: Luiz Capitulino <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Tested-by: Helge Deller <[email protected]> [parisc] Acked-by: Catalin Marinas <[email protected]> [arm64] Acked-by: Paul Burton <[email protected]> [MIPS] Acked-by: Ingo Molnar <[email protected]> [x86] Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David S. Miller <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James E.J. Bottomley <[email protected]> Cc: James Hogan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  hugetlb: introduce generic version of prepare_hugepage_range  (Alexandre Ghiti, 10 files, -68/+19)
arm, arm64, powerpc, sparc, x86 architectures use the same version of prepare_hugepage_range, so move this generic implementation into asm-generic/hugetlb.h. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexandre Ghiti <[email protected]> Reviewed-by: Luiz Capitulino <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Tested-by: Helge Deller <[email protected]> [parisc] Acked-by: Catalin Marinas <[email protected]> [arm64] Acked-by: Paul Burton <[email protected]> [MIPS] Acked-by: Ingo Molnar <[email protected]> [x86] Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David S. Miller <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James E.J. Bottomley <[email protected]> Cc: James Hogan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  hugetlb: introduce generic version of huge_pte_wrprotect  (Alexandre Ghiti, 10 files, -45/+7)
arm, arm64, ia64, mips, parisc, powerpc, sh, sparc, x86 architectures use the same version of huge_pte_wrprotect, so move this generic implementation into asm-generic/hugetlb.h. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexandre Ghiti <[email protected]> Reviewed-by: Luiz Capitulino <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Tested-by: Helge Deller <[email protected]> [parisc] Acked-by: Catalin Marinas <[email protected]> [arm64] Acked-by: Paul Burton <[email protected]> [MIPS] Acked-by: Ingo Molnar <[email protected]> [x86] Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David S. Miller <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James E.J. Bottomley <[email protected]> Cc: James Hogan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  hugetlb: introduce generic version of huge_pte_none()  (Alexandre Ghiti, 10 files, -40/+8)
arm, arm64, ia64, mips, parisc, powerpc, sh, sparc, x86 architectures use the same version of huge_pte_none, so move this generic implementation into asm-generic/hugetlb.h. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexandre Ghiti <[email protected]> Reviewed-by: Luiz Capitulino <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Tested-by: Helge Deller <[email protected]> [parisc] Acked-by: Catalin Marinas <[email protected]> [arm64] Acked-by: Paul Burton <[email protected]> [MIPS] Acked-by: Ingo Molnar <[email protected]> [x86] Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David S. Miller <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James E.J. Bottomley <[email protected]> Cc: James Hogan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  hugetlb: introduce generic version of huge_ptep_clear_flush  (Alexandre Ghiti, 10 files, -12/+15)
arm, x86 architectures use the same version of huge_ptep_clear_flush, so move this generic implementation into asm-generic/hugetlb.h. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexandre Ghiti <[email protected]> Reviewed-by: Luiz Capitulino <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Tested-by: Helge Deller <[email protected]> [parisc] Acked-by: Catalin Marinas <[email protected]> [arm64] Acked-by: Paul Burton <[email protected]> [MIPS] Acked-by: Ingo Molnar <[email protected]> [x86] Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David S. Miller <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James E.J. Bottomley <[email protected]> Cc: James Hogan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  hugetlb: introduce generic version of huge_ptep_get_and_clear()  (Alexandre Ghiti, 10 files, -24/+13)
arm, ia64, sh, x86 architectures use the same version of huge_ptep_get_and_clear, so move this generic implementation into asm-generic/hugetlb.h. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexandre Ghiti <[email protected]> Reviewed-by: Luiz Capitulino <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Tested-by: Helge Deller <[email protected]> [parisc] Acked-by: Catalin Marinas <[email protected]> [arm64] Acked-by: Paul Burton <[email protected]> [MIPS] Acked-by: Ingo Molnar <[email protected]> [x86] Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David S. Miller <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James E.J. Bottomley <[email protected]> Cc: James Hogan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  hugetlb: introduce generic version of set_huge_pte_at()  (Alexandre Ghiti, 10 files, -37/+10)
arm, ia64, mips, powerpc, sh, x86 architectures use the same version of set_huge_pte_at, so move this generic implementation into asm-generic/hugetlb.h. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexandre Ghiti <[email protected]> Reviewed-by: Luiz Capitulino <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Tested-by: Helge Deller <[email protected]> [parisc] Acked-by: Catalin Marinas <[email protected]> [arm64] Acked-by: Paul Burton <[email protected]> [MIPS] Acked-by: Ingo Molnar <[email protected]> [x86] Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David S. Miller <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James E.J. Bottomley <[email protected]> Cc: James Hogan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  hugetlb: introduce generic version of hugetlb_free_pgd_range  (Alexandre Ghiti, 10 files, -62/+26)
arm, arm64, mips, parisc, sh, x86 architectures use the same version of hugetlb_free_pgd_range, so move this generic implementation into asm-generic/hugetlb.h. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexandre Ghiti <[email protected]> Reviewed-by: Luiz Capitulino <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Tested-by: Helge Deller <[email protected]> [parisc] Acked-by: Catalin Marinas <[email protected]> [arm64] Acked-by: Paul Burton <[email protected]> [MIPS] Acked-by: Ingo Molnar <[email protected]> [x86] Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David S. Miller <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James E.J. Bottomley <[email protected]> Cc: James Hogan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  hugetlb: harmonize hugetlb.h arch specific defines with pgtable.h  (Alexandre Ghiti, 2 files, -2/+2)
In order to reduce copy/paste of functions across architectures and make the riscv hugetlb port (and future ports) simpler and smaller, this patchset intends to factorize the numerous hugetlb primitives that are defined across all the architectures. Except for prepare_hugepage_range, this patchset moves the versions that are just pass-through to standard pte primitives into asm-generic/hugetlb.h by using the same #ifdef semantic that can be found in asm-generic/pgtable.h, i.e. __HAVE_ARCH_***. The s390 architecture has not been tackled in this series since it does not use asm-generic/hugetlb.h at all. This patchset has been compiled on all addressed architectures with success (except for parisc, but the problem does not come from this series). This patch (of 11): asm-generic/hugetlb.h proposes generic implementations of hugetlb related functions: use __HAVE_ARCH_HUGE* defines in order to make arch specific implementations of hugetlb functions consistent with the pgtable.h scheme. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexandre Ghiti <[email protected]> Reviewed-by: Luiz Capitulino <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Acked-by: Catalin Marinas <[email protected]> [arm64] Cc: Russell King <[email protected]> Cc: Will Deacon <[email protected]> Cc: Tony Luck <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Paul Burton <[email protected]> Cc: James Hogan <[email protected]> Cc: James E.J. Bottomley <[email protected]> Cc: Helge Deller <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Rich Felker <[email protected]> Cc: David S. Miller <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Ingo Molnar <[email protected]> [x86] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  mm: remove unnecessary local variable addr in __get_user_pages_fast()  (Wei Yang, 1 file, -3/+2)
The local variable `addr' in __get_user_pages_fast() is just a shadow of `start'. Since `start' never changes after it is assigned to `addr', it is fine to drop `addr' and use `start' directly. Also the meaning of [start, end] is more obvious than [addr, end] when passed to gup_pgd_range(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  mm: defer ZONE_DEVICE page initialization to the point where we init pgmap  (Alexander Duyck, 4 files, -23/+108)
The ZONE_DEVICE pages were being initialized in two locations. One was with the memory_hotplug lock held and another was outside of that lock. The problem with this is that it was nearly doubling the memory initialization time. Instead of doing this twice, once while holding a global lock and once without, I am opting to defer the initialization to the one outside of the lock. This allows us to avoid serializing the overhead for memory init and we can instead focus on per-node init times. One issue I encountered is that devm_memremap_pages and hmm_devmem_pages_create were initializing only the pgmap field the same way. One wasn't initializing hmm_data, and the other was initializing it to a poison value. Since this is something that is exposed to the driver in the case of hmm I am opting for a third option and just initializing hmm_data to 0 since this is going to be exposed to unknown third party drivers. [[email protected]: fix reference count for pgmap in devm_memremap_pages] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexander Duyck <[email protected]> Reviewed-by: Pavel Tatashin <[email protected]> Tested-by: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  mm: create non-atomic version of SetPageReserved for init use  (Alexander Duyck, 2 files, -2/+8)
It doesn't make much sense to use the atomic SetPageReserved at init time when we are using memset to clear the memory and manipulating the page flags via simple "&=" and "|=" operations in __init_single_page. This patch adds a non-atomic version __SetPageReserved that can be used during page init and shows about a 10% improvement in initialization times on the systems I have available for testing. On those systems I saw initialization times drop from around 35 seconds to around 32 seconds to initialize a 3TB block of persistent memory. I believe the main advantage of this is that it allows for more compiler optimization as the __set_bit operation can be reordered whereas the atomic version cannot. I tried adding a bit of documentation based on f1dd2cd13c4 ("mm, memory_hotplug: do not associate hotadded memory to zones until online"). Ideally the reserved flag should be set earlier since there is a brief window where the page is initialized via __init_single_page and we have not set the PG_reserved flag. I'm leaving that for a future patch set as that will require a more significant refactor. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexander Duyck <[email protected]> Reviewed-by: Pavel Tatashin <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
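Conceptually the difference is just the locked vs. plain bit operation behind the page-flag helper; a two-line sketch of the two forms (the helpers are generated by the page-flags macros, so this is the effect, not the literal source):

    SetPageReserved(page);      /* atomic: set_bit(PG_reserved, &page->flags)       */
    __SetPageReserved(page);    /* non-atomic: __set_bit(PG_reserved, &page->flags) */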
2018-10-26  mm: provide kernel parameter to allow disabling page init poisoning  (Alexander Duyck, 5 files, -6/+69)
Patch series "Address issues slowing persistent memory initialization", v5. The main thing this patch set achieves is that it allows us to initialize each node worth of persistent memory independently. As a result we reduce page init time by about 2 minutes because instead of taking 30 to 40 seconds per node and going through each node one at a time, we process all 4 nodes in parallel in the case of a 12TB persistent memory setup spread evenly over 4 nodes. This patch (of 3): On systems with a large amount of memory it can take a significant amount of time to initialize all of the page structs with the PAGE_POISON_PATTERN value. I have seen it take over 2 minutes to initialize a system with over 12TB of RAM. In order to work around the issue I had to disable CONFIG_DEBUG_VM and then the boot time returned to something much more reasonable as the arch_add_memory call completed in milliseconds versus seconds. However in doing that I had to disable all of the other VM debugging on the system. In order to work around a kernel that might have CONFIG_DEBUG_VM enabled on a system that has a large amount of memory I have added a new kernel parameter named "vm_debug" that can be set to "-" in order to disable it. Link: http://lkml.kernel.org/r/[email protected] Reviewed-by: Pavel Tatashin <[email protected]> Signed-off-by: Alexander Duyck <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  memcg: remove memcg_kmem_skip_account  (Shakeel Butt, 2 files, -26/+1)
The flag memcg_kmem_skip_account was added during the era of opt-out kmem accounting. There is no need for such flag in the opt-in world as there aren't any __GFP_ACCOUNT allocations within memcg_create_cache_enqueue(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Greg Thelen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  mm/memory_hotplug.c: clean up node_states_check_changes_offline()  (Oscar Salvador, 1 file, -51/+29)
This patch, like the previous one, gets rid of the wrong if statements. While at it, I realized that the comments are sometimes very confusing, to say the least, and wrong. For example:

    ___
    zone_last = ZONE_MOVABLE;
    /*
     * check whether node_states[N_HIGH_MEMORY] will be changed
     * If we try to offline the last present @nr_pages from the node,
     * we can determind we will need to clear the node from
     * node_states[N_HIGH_MEMORY].
     */
    for (; zt <= zone_last; zt++)
            present_pages += pgdat->node_zones[zt].present_pages;
    if (nr_pages >= present_pages)
            arg->status_change_nid = zone_to_nid(zone);
    else
            arg->status_change_nid = -1;
    ___

In case the node gets empty, it must be removed from N_MEMORY. We already check N_HIGH_MEMORY a bit above within the CONFIG_HIGHMEM ifdef code. Not to mention that status_change_nid is for N_MEMORY, not for N_HIGH_MEMORY. So I re-wrote some of the comments to what I think is better.

[[email protected]: address feedback from Pavel] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Reviewed-by: Pavel Tatashin <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jonathan Cameron <[email protected]> Cc: <[email protected]> Cc: Mathieu Malaterre <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  mm/memory_hotplug.c: simplify node_states_check_changes_online  (Oscar Salvador, 1 file, -50/+7)
While looking at node_states_check_changes_online, I stumbled upon some confusing things. Right after entering the function, we find this:

    if (N_MEMORY == N_NORMAL_MEMORY)
            zone_last = ZONE_MOVABLE;

This is wrong. N_MEMORY cannot really be equal to N_NORMAL_MEMORY. My guess is that this wanted to be something like:

    if (N_NORMAL_MEMORY == N_HIGH_MEMORY)

to check whether we have CONFIG_HIGHMEM. Later on, in the CONFIG_HIGHMEM block, we have:

    if (N_MEMORY == N_HIGH_MEMORY)
            zone_last = ZONE_MOVABLE;

Again, this is wrong and will never evaluate to true. Besides removing these wrong if statements, I simplified the function a bit.

[[email protected]: address feedback from Pavel] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Reviewed-by: Pavel Tatashin <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jonathan Cameron <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Michal Hocko <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  mm/memory_hotplug.c: tidy up node_states_clear_node()  (Oscar Salvador, 1 file, -4/+2)
node_states_clear_node() has the following if statements:

    if ((N_MEMORY != N_NORMAL_MEMORY) &&
        (arg->status_change_nid_high >= 0))
            ...

    if ((N_MEMORY != N_HIGH_MEMORY) &&
        (arg->status_change_nid >= 0))
            ...

N_MEMORY can never be equal to either N_NORMAL_MEMORY or N_HIGH_MEMORY. A similar problem was found in [1]. Since this is wrong, let us get rid of it.

[1] https://patchwork.kernel.org/patch/10579155/

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Reviewed-by: Pavel Tatashin <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jonathan Cameron <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Michal Hocko <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  mm/memory_hotplug.c: spare unnecessary calls to node_set_state  (Oscar Salvador, 1 file, -1/+2)
In node_states_check_changes_online, we check if the node will have to be set for any of the N_*_MEMORY states after the pages have been onlined. Later on, we perform the activation in node_states_set_node. Currently, in node_states_set_node we set the node to N_MEMORY unconditionally. This means that we call node_set_state for N_MEMORY every time pages go online, but we only need to do it if the node has not yet been set for N_MEMORY. Fix this by checking status_change_nid. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Reviewed-by: Pavel Tatashin <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jonathan Cameron <[email protected]> Cc: <[email protected]> Cc: Mathieu Malaterre <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
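The change boils down to making the N_MEMORY activation conditional, mirroring what is already done for the other states; roughly (a sketch of the resulting helper):

    /* Sketch of node_states_set_node() after the change: only touch
     * N_MEMORY when node_states_check_changes_online() decided the node
     * actually needs it (status_change_nid >= 0). */
    static void node_states_set_node(int node, struct memory_notify *arg)
    {
            if (arg->status_change_nid_normal >= 0)
                    node_set_state(node, N_NORMAL_MEMORY);

            if (arg->status_change_nid_high >= 0)
                    node_set_state(node, N_HIGH_MEMORY);

            if (arg->status_change_nid >= 0)
                    node_set_state(node, N_MEMORY);
    }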
2018-10-26  mm/filemap.c: Use existing variable  (haiqing.shq, 1 file, -1/+1)
Use the variable write_len instead of iov_iter_count(from). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: haiqing.shq <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Jan Kara <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  mm: unmap VM_PFNMAP mappings with optimized path  (Yang Shi, 1 file, -9/+0)
When unmapping VM_PFNMAP mappings, vm flags need to be updated. Since the vmas have been detached, it sounds safe to update vm flags with read mmap_sem. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Matthew Wilcox <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  mm: unmap VM_HUGETLB mappings with optimized path  (Yang Shi, 1 file, -1/+1)
When unmapping VM_HUGETLB mappings, vm flags need to be updated. Since the vmas have been detached, it sounds safe to update vm flags with read mmap_sem. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Matthew Wilcox <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26  mm: mmap: zap pages with read mmap_sem in munmap  (Yang Shi, 1 file, -11/+48)
Patch series "mm: zap pages with read mmap_sem in munmap for large mapping", v11. Background: Recently, when we ran some vm scalability tests on machines with large memory, we ran into a couple of mmap_sem scalability issues when unmapping large memory space, please refer to https://lkml.org/lkml/2017/12/14/733 and https://lkml.org/lkml/2018/2/20/576. History: Then akpm suggested to unmap large mapping section by section and drop mmap_sem at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784). V1 patch series was submitted to the mailing list per Andrew's suggestion (see https://lkml.org/lkml/2018/3/20/786). Then I received a lot great feedback and suggestions. Then this topic was discussed on LSFMM summit 2018. In the summit, Michal Hocko suggested (also in the v1 patches review) to try "two phases" approach. Zapping pages with read mmap_sem, then doing via cleanup with write mmap_sem (for discussion detail, see https://lwn.net/Articles/753269/) Approach: Zapping pages is the most time consuming part, according to the suggestion from Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas. But, we can't call MADV_DONTNEED directly, since there are two major drawbacks: * The unexpected state from PF if it wins the race in the middle of munmap. It may return zero page, instead of the content or SIGSEGV. * Can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which is a showstopper from akpm But, some part may need write mmap_sem, for example, vma splitting. So, the design is as follows: acquire write mmap_sem lookup vmas (find and split vmas) deal with special mappings detach vmas downgrade_write zap pages free page tables release mmap_sem The vm events with read mmap_sem may come in during page zapping, but since vmas have been detached before, they, i.e. page fault, gup, etc, will not be able to find valid vma, then just return SIGSEGV or -EFAULT as expected. If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special mappings. They will be handled by falling back to regular do_munmap() with exclusive mmap_sem held in this patch since they may update vm flags. But, with the "detach vmas first" approach, the vmas have been detached when vm flags are updated, so it sounds safe to update vm flags with read mmap_sem for this specific case. So, VM_HUGETLB and VM_PFNMAP will be handled by using the optimized path in the following separate patches for bisectable sake. Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES). However it is fine to have false-positive MMF_RECALC_UPROBES according to uprobes developer. So, uprobe unmap will not be handled by the regular path. With the "detach vmas first" approach we don't have to re-acquire mmap_sem again to clean up vmas to avoid race window which might get the address space changed since downgrade_write() doesn't release the lock to lead regression, which simply downgrades to read lock. And, since the lock acquire/release cost is managed to the minimum and almost as same as before, the optimization could be extended to any size of mapping without incurring significant penalty to small mappings. For the time being, just do this in munmap syscall path. Other vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain intact due to some implementation difficulties since they acquire write mmap_sem from very beginning and hold it until the end, do_munmap() might be called in the middle. 
The optimized do_munmap, however, needs to be called without mmap_sem held so that the downgrade can happen inside it. If we want to do a similar optimization for the mmap/mremap paths, I'm afraid we would have to redesign them. mremap may be called on very large areas depending on the use case; optimizing it will be considered in the future.

This patch (of 3):

When running some mmap/munmap scalability tests with large memory (i.e. > 300GB), the hung task issue below may happen occasionally.

    INFO: task ps:14018 blocked for more than 120 seconds.
          Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    ps              D    0 14018      1 0x00000004
     ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
     ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
     00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
    Call Trace:
     [<ffffffff817154d0>] ? __schedule+0x250/0x730
     [<ffffffff817159e6>] schedule+0x36/0x80
     [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
     [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
     [<ffffffff81717db0>] down_read+0x20/0x40
     [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
     [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
     [<ffffffff81241d87>] __vfs_read+0x37/0x150
     [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
     [<ffffffff81242266>] vfs_read+0x96/0x130
     [<ffffffff812437b5>] SyS_read+0x55/0xc0
     [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

This happens because munmap holds mmap_sem exclusively from the very beginning all the way to the end and never releases it in the middle. Unmapping a large mapping may take a long time (~18 seconds to unmap a 320GB mapping with every single page mapped, on an idle machine).

Zapping pages is the most time consuming part. Following Michal Hocko's suggestion [1], pages can be zapped while holding read mmap_sem, the way MADV_DONTNEED does, and write mmap_sem re-acquired afterwards to clean up the vmas. But some parts still need write mmap_sem, for example vma splitting, so the design is as follows:

    acquire write mmap_sem
    lookup vmas (find and split vmas)
    deal with special mappings
    detach vmas
    downgrade_write

    zap pages
    free page tables
    release mmap_sem

VM events that take read mmap_sem (page fault, gup, etc.) may come in while pages are being zapped, but since the vmas have already been detached they will not find a valid vma and will simply return SIGSEGV or -EFAULT, as expected.

Vmas with VM_HUGETLB | VM_PFNMAP are considered special mappings. In this patch they are handled without downgrading mmap_sem, since they may update vm flags. However, with the "detach vmas first" approach the vmas have already been detached by the time the vm flags are updated, so it appears safe to update vm flags with read mmap_sem for this specific case. VM_HUGETLB and VM_PFNMAP are therefore switched to the optimized path in separate follow-up patches, for the sake of bisectability.

Unmapping uprobe areas may need to update mm flags (MMF_RECALC_UPROBES). According to the uprobes developer a false-positive MMF_RECALC_UPROBES is fine.

With the "detach vmas first" approach we don't have to re-acquire mmap_sem to clean up the vmas, which would open a race window in which the address space could change: downgrade_write() never releases the lock, it simply converts the write lock into a read lock, so no regression is introduced.
And since the lock acquire/release cost is kept to a minimum, almost the same as before, the optimization can be applied to mappings of any size without a significant penalty for small mappings.

For the time being, this is done only in the munmap syscall path. Other vm_munmap() or do_munmap() call sites (i.e. mmap, mremap, etc.) remain intact due to implementation difficulties: they acquire write mmap_sem at the very beginning and hold it until the end, and do_munmap() might be called in the middle. But the optimized do_munmap needs to be called without mmap_sem held so that the downgrade can happen inside it, so if we want to do a similar optimization for the mmap/mremap paths, I'm afraid we would have to redesign them. mremap may be called on very large areas depending on the use case; optimizing it will be considered in the future.

With the patches, the exclusive mmap_sem hold time when munmapping an 80GB address space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped from the second level to the microsecond level.

    munmap_test-15002 [008]   594.380138: funcgraph_entry: |               __vm_munmap() {
    munmap_test-15002 [008]   594.380146: funcgraph_entry: !2485684 us |     unmap_region();
    munmap_test-15002 [008]   596.865836: funcgraph_exit:  !2485692 us |   }

Here the execution time of unmap_region() is used to evaluate the time spent holding read mmap_sem; the remaining time is spent holding the exclusive lock.

[1] https://lwn.net/Articles/753269/

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Suggested-by: Michal Hocko <[email protected]> Suggested-by: Kirill A. Shutemov <[email protected]> Suggested-by: Matthew Wilcox <[email protected]> Reviewed-by: Matthew Wilcox <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
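To make the two-phase flow described above concrete, here is a minimal C sketch of the "detach vmas first, then downgrade" structure. It is illustrative only: find_split_and_detach_vmas() and zap_pages_and_free_page_tables() are hypothetical stand-ins for the steps named in the changelog, not functions from mm/mmap.c.

    #include <linux/errno.h>
    #include <linux/mm_types.h>
    #include <linux/rwsem.h>

    /* Hypothetical helpers standing in for the steps in the changelog. */
    void find_split_and_detach_vmas(struct mm_struct *mm,
                                    unsigned long start, size_t len);
    void zap_pages_and_free_page_tables(struct mm_struct *mm,
                                        unsigned long start, size_t len);

    static int munmap_downgrade_sketch(struct mm_struct *mm,
                                       unsigned long start, size_t len)
    {
            if (down_write_killable(&mm->mmap_sem))
                    return -EINTR;

            /*
             * Phase 1 (write lock): look up the vmas, split at the
             * boundaries, deal with special mappings, and unlink the vmas
             * so concurrent faults or gup can no longer find them.
             */
            find_split_and_detach_vmas(mm, start, len);

            /* Keep only the read lock for the expensive part. */
            downgrade_write(&mm->mmap_sem);

            /*
             * Phase 2 (read lock): zap the pages and free the page tables.
             * Racing faults see no valid vma and get SIGSEGV/-EFAULT.
             */
            zap_pages_and_free_page_tables(mm, start, len);

            up_read(&mm->mmap_sem);
            return 0;
    }

The key property is that the write lock is never dropped and re-taken: downgrade_write() converts it into a read lock atomically, so no other writer can slip in between the detach and the zap.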
2018-10-26vfree: add debug might_sleep()Andrey Ryabinin1-0/+2
Add might_sleep() call to vfree() to catch potential sleep-in-atomic bugs earlier. [[email protected]: drop might_sleep_if() from kvfree()] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Ryabinin <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
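As a rough illustration of where the annotation sits, the sketch below shows a vfree()-like function with the debug check added near the top; it is a simplified stand-in, not the real mm/vmalloc.c body, and the exact might_sleep_if(!in_interrupt()) form is inferred from the kvfree() note above rather than quoted from the patch.

    #include <linux/bug.h>
    #include <linux/kernel.h>
    #include <linux/preempt.h>

    /* Simplified stand-in for vfree(); only the debug annotation matters here. */
    static void vfree_sketch(const void *addr)
    {
            BUG_ON(in_nmi());

            /*
             * Complain about atomic callers (e.g. under a spinlock) even on
             * calls where nothing would actually sleep, so bugs are caught
             * early rather than surfacing as rare lockups in production.
             */
            might_sleep_if(!in_interrupt());

            if (!addr)
                    return;

            /* ... the real implementation releases the vmalloc'ed area here ... */
    }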
2018-10-26mm/vmalloc.c: improve vfree() kerneldocAndrey Ryabinin1-0/+2
vfree() might sleep if not called from interrupt context. Explain that in the kerneldoc comment. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Ryabinin <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
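A possible shape for such a note is sketched below; the wording is illustrative and not the exact kerneldoc text merged into mm/vmalloc.c.

    /**
     * vfree - release memory allocated by vmalloc()
     * @addr: memory base address
     *
     * Free the virtually continuous memory area starting at @addr, as obtained
     * from vmalloc(). If @addr is NULL, no operation is performed.
     *
     * May sleep if called *not* from interrupt context.
     * Must not be called in NMI context.
     */
    void vfree(const void *addr);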
2018-10-26kvfree(): fix misleading commentAndrey Ryabinin1-1/+1
vfree() might sleep if not called from interrupt context, and so does kvfree(). Fix kvfree()'s misleading comment about the allowed calling context. Link: http://lkml.kernel.org/r/[email protected] Fixes: 04b8e946075d ("mm/util.c: improve kvfree() kerneldoc") Signed-off-by: Andrey Ryabinin <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-10-26mm/mempolicy.c: use match_string() helper to simplify the codezhong jiang1-8/+3
match_string() returns the index of the array entry that matches a given string, and can be used instead of an open-coded search loop. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: zhong jiang <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
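The before/after sketch below shows the kind of simplification this enables; the array contents and function names are illustrative, not copied from mm/mempolicy.c.

    #include <linux/errno.h>
    #include <linux/kernel.h>
    #include <linux/string.h>

    static const char * const policy_modes[] = {
            "default", "prefer", "bind", "interleave", "local",
    };

    /* Before: open-coded search loop. */
    static int lookup_mode_open_coded(const char *str)
    {
            unsigned int i;

            for (i = 0; i < ARRAY_SIZE(policy_modes); i++) {
                    if (!strcmp(str, policy_modes[i]))
                            return i;
            }
            return -EINVAL;
    }

    /* After: match_string() returns the matching index, or -EINVAL if none. */
    static int lookup_mode_match_string(const char *str)
    {
            return match_string(policy_modes, ARRAY_SIZE(policy_modes), str);
    }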
2018-10-26mm/swap.c: remove duplicated includeYueHaibing1-1/+0
Remove duplicated include linux/memremap.h Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: YueHaibing <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>