Age | Commit message | Author | Files | Lines |
|
A current "lazy drain" model suffers from at least two issues.
First one is related to the unsorted list of vmap areas, thus in order to
identify the [min:max] range of areas to be drained, it requires a full
list scan. What is a time consuming if the list is too long.
Second one and as a next step is about merging all fragments with a free
space. What is also a time consuming because it has to iterate over
entire list which holds outstanding lazy areas.
See below the "preemptirqsoff" tracer that illustrates a high latency. It
is ~24676us. Our workloads like audio and video are effected by such long
latency:
<snip>
tracer: preemptirqsoff
preemptirqsoff latency trace v1.1.5 on 4.9.186-perf+
--------------------------------------------------------------------
latency: 24676 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 P:8)
-----------------
| task: crtc_commit:112-261 (uid:0 nice:0 policy:1 rt_prio:16)
-----------------
=> started at: __purge_vmap_area_lazy
=> ended at: __purge_vmap_area_lazy
_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| / delay
cmd pid ||||| time | caller
\ / ||||| \ | /
crtc_com-261 1...1 1us*: _raw_spin_lock <-__purge_vmap_area_lazy
[...]
crtc_com-261 1...1 24675us : _raw_spin_unlock <-__purge_vmap_area_lazy
crtc_com-261 1...1 24677us : trace_preempt_on <-__purge_vmap_area_lazy
crtc_com-261 1...1 24683us : <stack trace>
=> free_vmap_area_noflush
=> remove_vm_area
=> __vunmap
=> vfree
=> drm_property_free_blob
=> drm_mode_object_unreference
=> drm_property_unreference_blob
=> __drm_atomic_helper_crtc_destroy_state
=> sde_crtc_destroy_state
=> drm_atomic_state_default_clear
=> drm_atomic_state_clear
=> drm_atomic_state_free
=> complete_commit
=> _msm_drm_commit_work_cb
=> kthread_worker_fn
=> kthread
=> ret_from_fork
<snip>
To address those two issues we can redesign the purging of the outstanding
lazy areas. Instead of queuing vmap areas onto a list, we keep them in a
separate rb-tree, so an area is located in the tree/list in ascending
order. This gives us the advantages below:
a) Outstanding vmap areas are merged into bigger coalesced blocks,
so the space becomes less fragmented.
b) It is possible to calculate the flush range [min:max] without
scanning all elements, i.e. with O(1) access time (see the sketch
below).
c) The final merge of areas into the rb-tree that represents the free
space is faster because of (a). As a result, lock contention is
also reduced.
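A minimal sketch of advantage (b), assuming the outstanding areas end up
in address-sorted order (field/list names are illustrative, not the exact
kernel code):
<snip>
static void compute_flush_range(struct list_head *purge_list,
				unsigned long *start, unsigned long *end)
{
	struct vmap_area *first, *last;

	/* sorted order: the range is just the first and last entries */
	first = list_first_entry(purge_list, struct vmap_area, list);
	last  = list_last_entry(purge_list, struct vmap_area, list);

	*start = first->va_start;	/* O(1) instead of a full scan */
	*end   = last->va_end;
}
<snip>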
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: huang ying <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There is a dedicated and separate function that finds and removes a
contiguous kernel virtual area. As a final step it also releases the
"area", the descriptor of the corresponding vm_struct.
As a cleanup, use free_vmap_area() in __vmalloc_node_range() instead of
open-coded steps that do exactly the same thing.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
On a machine with more than 2 TB of memory (e.g. 3 TB), using vmalloc to
allocate more than 2 TB overflows array_size.
array_size is an unsigned int, so it can only describe allocations of
less than 2 TB. If you pass 2*1024*1024*1024*1024 = 2 * 2^40 bytes to
vmalloc, that is 2^29 4K pages, and the array of 8-byte page pointers for
them is 2^29 * 8 = 2^32 bytes. The value 2^32 cannot be stored in a
32-bit integer.
The fix is to change the type of array_size to unsigned long.
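A standalone illustration of the truncation (not kernel code; assumes a
64-bit host):
<snip>
#include <stdio.h>

int main(void)
{
	unsigned long nr_pages = 1UL << 29;		/* 2 TB of 4K pages */
	unsigned int bad = nr_pages * sizeof(void *);	/* 2^32 wraps to 0 */
	unsigned long good = nr_pages * sizeof(void *);	/* 2^32, correct */

	printf("unsigned int: %u, unsigned long: %lu\n", bad, good);
	return 0;
}
<snip>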
[[email protected]: rework for current mainline]
Link: https://bugzilla.kernel.org/show_bug.cgi?id=210023
Reported-by: <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since I butchered this, I figured it's better to make sure we have
testcases for it now. Since we only have a locking context for __GFP_FS,
that's the only thing we're testing right now.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Daniel Vetter <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Thomas Hellström (Intel) <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Cc: Christian König <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Waiman Long <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Extracted from slab.h, which seems to have the most complete version,
including the correct might_sleep() check. Roll it out to slob.c.
Motivated by a discussion with Paul about possibly changing call_rcu
behaviour to allocate memory, but only roughly every 500th call.
There are a lot fewer places in the kernel that care about whether
allocating memory is allowed or not (due to deadlocks with reclaim code)
than places that care whether sleeping is allowed. But debugging these
also tends to be a lot harder, so nice descriptive checks could come in
handy. I might have some use eventually for annotations in drivers/gpu.
Note that unlike fs_reclaim_acquire/release, gfpflags_allow_blocking() does
not consult the PF_MEMALLOC flags. But there is no flag equivalent for
GFP_NOWAIT, hence this check can't go wrong due to
memalloc_no*_save/restore contexts. Willy is working on a patch series
which might change this:
https://lore.kernel.org/linux-mm/[email protected]/
I think best would be if that updates gfpflags_allow_blocking(), since
there's a ton of callers all over the place for that already.
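For reference, a sketch of what the extracted check plausibly looks like
(helper name illustrative; based on the slab.h check described above):
<snip>
static inline void might_alloc(gfp_t gfp_mask)
{
	/* record the reclaim dependency for lockdep */
	fs_reclaim_acquire(gfp_mask);
	fs_reclaim_release(gfp_mask);

	/* and the descriptive sleeping check */
	might_sleep_if(gfpflags_allow_blocking(gfp_mask));
}
<snip>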
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Daniel Vetter <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Paul E. McKenney <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Christian König <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Cc: Thomas Hellström (Intel) <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
fs_reclaim_acquire/release nicely catch recursion issues when allocating
GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep
the excessive caches in check). For mmu notifier recursions we do have
lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep
map for invalidate_range_start/end").
But these only fire if a path actually results in some pte invalidation -
for most small allocations that's very rarely the case. The other trouble
is that pte invalidation can happen any time when __GFP_RECLAIM is set.
Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good
enough to avoid potential mmu notifier recursion.
I was pondering whether we should just do the general annotation, but
there's always the risk of false positives. Plus I'm assuming that the
core fs and io code is a lot better reviewed and tested than random mmu
notifier code in drivers. Hence I decided to annotate only that
specific case.
Furthermore, even if we'd create a lockdep map for direct reclaim, we'd
still need to explicitly pull in the mmu notifier map - there are a lot
more places that do pte invalidation than just direct reclaim, so these
two contexts aren't the same.
Note that the mmu notifiers needing their own independent lockdep map is
also the reason we can't hold them from fs_reclaim_acquire to
fs_reclaim_release - it would nest with the acquisition in the pte
invalidation code, causing a lockdep splat. And we can't remove the
annotations from pte invalidation and all the other places, since they're
called from many places other than page reclaim. Hence we can only do the
equivalent of might_lock, but on the raw lockdep map.
With this we can also remove the lockdep priming added in 66204f1d2d1b
("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly
more powerful.
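A sketch of the might_lock-equivalent on the raw lockdep map (roughly
what fs_reclaim_acquire() gains; assumes the map added by 23b68395c7c7):
<snip>
	/* record the reclaim -> mmu notifier dependency even when no
	 * pte invalidation actually happens on this path
	 */
	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
<snip>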
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Daniel Vetter <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Thomas Hellström (Intel) <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Cc: Christian König <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Don't allow splitting of vm_special_mappings. This affects the vdso/vvar
areas. Uprobes have only one page in xol_area, so they aren't affected.
Those restrictions were previously enforced by checks in .mremap()
callbacks. Restrict resizing with a generic .split() callback instead.
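A sketch of such a generic callback (function name illustrative;
returning an error from .split vetoes the split):
<snip>
static int special_mapping_split(struct vm_area_struct *vma,
				 unsigned long addr)
{
	/* special mappings (vdso/vvar, xol_area) must not be split */
	return -EINVAL;
}
<snip>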
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dmitry Safonov <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If the original VMA can't be split at the desired address, do_munmap()
will fail and leave both the newly copied VMA and the old VMA in place.
De facto that's MREMAP_DONTUNMAP behaviour, which is unexpected.
Currently, it may fail this way for hugetlbfs and DAX device mappings.
Minimize such unpleasant situations by checking .may_split() before
attempting to create a VMA copy, leaving OOM as the only remaining
failure mode.
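A sketch of the up-front probe (boundary arithmetic illustrative;
assumes the .may_split() hook referenced above):
<snip>
	/* fail early, before anything is copied or unmapped */
	if (vma->vm_ops && vma->vm_ops->may_split) {
		int err = 0;

		if (vma->vm_start != old_addr)
			err = vma->vm_ops->may_split(vma, old_addr);
		if (!err && vma->vm_end != old_addr + old_len)
			err = vma->vm_ops->may_split(vma, old_addr + old_len);
		if (err)
			return err;
	}
<snip>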
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dmitry Safonov <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Rename the callback to reflect that it's not called *on* or *after* split,
but rather some time before the splitting to check if it's possible.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dmitry Safonov <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As the kernel expects to see only one of such mappings, any further
operations on the VMA copy may be unexpected by it. Maybe it's erring on
the safe side, but there doesn't seem to be any expected use-case for
this, so restrict it now.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: commit e346b3813067 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
Signed-off-by: Dmitry Safonov <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently memory is accounted post-mremap() with MREMAP_DONTUNMAP, which
may break the overcommit policy. So, check whether there's enough memory
before doing the actual VMA copy.
Don't unset VM_ACCOUNT on MREMAP_DONTUNMAP. By its semantics, such an
mremap() is actually a memory allocation. That also simplifies the error
path a little.
Also, as it's a memory allocation on success, don't reset the hiwater_vm
value.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: commit e346b3813067 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
Signed-off-by: Dmitry Safonov <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mremap: move_vma() fixes".
This patch (of 6):
move_vma() copies the VMA without adding it to the accounting, then
unmaps the old part of the VMA. On failure it unmaps the new VMA. With
hacks, accounting in munmap is disabled, as the new VMA is a copy of an
existing one.
Account the memory on munmap() failure which was previously copied into
the new VMA.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: commit e2ea83742133 ("[PATCH] mremap: move_vma fixes and cleanup")
Signed-off-by: Dmitry Safonov <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Russell King <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Ralph Campbell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Code outside mm/ should not be calling free_unref_page(). Also move
free_unref_page_list().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The page has just been allocated, so its refcount is 1. free_unref_page()
is for use on pages which have a zero refcount. Use __free_page() like
the other implementations of pte_alloc_one().
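A sketch of the corrected error path in a pte_alloc_one() implementation
(overall shape illustrative):
<snip>
	struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

	if (!page)
		return NULL;
	if (!pgtable_pte_page_ctor(page)) {
		__free_page(page);	/* refcount is 1, not 0 */
		return NULL;
	}
<snip>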
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 1ae9ae5f7df7 ("sparc: handle pgtable_page_ctor() fail")
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: David Miller <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The goal of these tracepoints is to be able to debug lock contention
issues. This lock is acquired on most (all?) mmap / munmap / page fault
operations, so a multi-threaded process which does a lot of these can
experience significant contention.
We trace just before we start acquisition, when the acquisition returns
(whether it succeeded or not), and when the lock is released (or
downgraded). The events are broken out by lock type (read / write).
The events are also broken out by memcg path. For container-based
workloads, users often think of several processes in a memcg as a single
logical "task", so collecting statistics at this level is useful.
The end goal is to get latency information. This isn't directly included
in the trace events. Instead, users are expected to compute the time
between "start locking" and "acquire returned", using e.g. synthetic
events or BPF. The benefit we get from this is simpler code.
Because we use tracepoint_enabled() to decide whether or not to trace,
this patch has effectively no overhead unless tracepoints are enabled at
runtime. If tracepoints are enabled, there is a performance impact, but
how much depends on exactly what e.g. the BPF program does.
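A sketch of the tracepoint_enabled() pattern described above (wrapper
names illustrative; the static-key check keeps overhead near zero when
tracing is off):
<snip>
static inline void mmap_write_lock(struct mm_struct *mm)
{
	if (tracepoint_enabled(mmap_lock_start_locking))
		__mmap_lock_do_trace_start_locking(mm, true);

	down_write(&mm->mmap_lock);

	if (tracepoint_enabled(mmap_lock_acquire_returned))
		__mmap_lock_do_trace_acquire_returned(mm, true, true);
}
<snip>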
[[email protected]: fix use-after-free race and css ref leak in tracepoints]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: v3]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: in-depth examples of tracepoint_enabled() usage, and per-cpu-per-context buffer design]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Axel Rasmussen <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Chinwen Chang <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Laurent Dufour <[email protected]>
Cc: Yafang Shao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
check_pte() needs a correct colon for kernel-doc markup, otherwise gcc
emits the following warning with W=1:
mm/page_vma_mapped.c:86: warning: Function parameter or member 'pvmw' not described in 'check_pte'
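For illustration, kernel-doc only recognizes a parameter description
written in the '@name:' form (description text below is illustrative):
<snip>
/**
 * check_pte - check if @pvmw->page is mapped at the @pvmw->pte
 * @pvmw: page_vma_mapped_walk structure containing the pte and page to check
 *
 * Without the colon after "@pvmw", kernel-doc emits the W=1 warning above.
 */
<snip>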
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alex Shi <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add and change the parameter descriptions for wp_pte and clean_record_pte,
to avoid the following W=1 warnings:
mm/mapping_dirty_helpers.c:34: warning: Function parameter or member 'end' not described in 'wp_pte'
mm/mapping_dirty_helpers.c:88: warning: Function parameter or member 'end' not described in 'clean_record_pte'
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alex Shi <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Despite a comment that said that page fault accounting would be charged to
whatever task_struct* was passed into __access_remote_vm(), the tsk
argument was actually unused.
Making page fault accounting actually use this task struct is quite a
project, so there is no point in keeping the tsk argument.
Delete both the comment and the argument.
[[email protected]: changelog addition]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
HAVE_MOVE_PUD enables remapping pages at the PUD level if both the
source and destination addresses are PUD-aligned.
With HAVE_MOVE_PUD enabled it can be inferred that there is
approximately a 13x improvement in performance on x86. (See data
below).
------- Test Results ---------
The following results were obtained using a 5.4 kernel, by remapping
a PUD-aligned, 1GB sized region to a PUD-aligned destination.
The results from 10 iterations of the test are given below:
Total mremap times for 1GB data on x86. All times are in nanoseconds.
Control HAVE_MOVE_PUD
180394 15089
235728 14056
238931 25741
187330 13838
241742 14187
177925 14778
182758 14728
160872 14418
205813 15107
245722 13998
205721.5 15594 <-- Mean time in nanoseconds
A 1GB mremap completion time drops from ~205 microseconds
to ~15 microseconds on x86. (~13x speed up).
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kalesh Singh <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Hassan Naveed <[email protected]>
Cc: Jia He <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Mina Almasry <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Ram Pai <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Sandipan Das <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
HAVE_MOVE_PUD enables remapping pages at the PUD level if both the source
and destination addresses are PUD-aligned.
With HAVE_MOVE_PUD enabled it can be inferred that there is approximately
a 19x improvement in performance on arm64. (See data below).
------- Test Results ---------
The following results were obtained using a 5.4 kernel, by remapping a
PUD-aligned, 1GB sized region to a PUD-aligned destination. The results
from 10 iterations of the test are given below:
Total mremap times for 1GB data on arm64. All times are in nanoseconds.
Control HAVE_MOVE_PUD
1247761 74271
1219896 46771
1094792 59687
1227760 48385
1043698 76666
1101771 50365
1159896 52500
1143594 75261
1025833 61354
1078125 48697
1134312.6 59395.7 <-- Mean time in nanoseconds
A 1GB mremap completion time drops from ~1.1 milliseconds to ~59
microseconds on arm64. (~19x speed up).
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kalesh Singh <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Hassan Naveed <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jia He <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Mina Almasry <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Ram Pai <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Sandipan Das <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Android needs to move large memory regions for garbage collection. The GC
requires moving physical pages of a multi-gigabyte heap using mremap.
During this move, the application threads have to be paused for
correctness. It is critical to keep this pause as short as possible to
avoid jitter during user interaction.
Optimize mremap for >= 1GB-sized regions by moving at the PUD/PGD level
when the source and destination addresses are PUD-aligned. For
CONFIG_PGTABLE_LEVELS == 3, moving at the PUD level effectively moves PGD
entries, since the PUD entry is "folded back" onto the PGD entry. Add
HAVE_MOVE_PUD so that architectures where moving at the PUD level isn't
supported/tested can turn this off by not selecting the config.
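A rough sketch of what a PUD-level move looks like (simplified; TLB
flushing and rmap locking details abbreviated):
<snip>
static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
			    unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
{
	struct mm_struct *mm = vma->vm_mm;
	spinlock_t *old_ptl, *new_ptl;
	pud_t pud;

	/* refuse to overwrite anything at the destination */
	if (WARN_ON_ONCE(!pud_none(*new_pud)))
		return false;

	old_ptl = pud_lock(mm, old_pud);
	new_ptl = pud_lockptr(mm, new_pud);
	if (new_ptl != old_ptl)
		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

	/* one entry swap moves a whole 1GB worth of mappings */
	pud = *old_pud;
	pud_clear(old_pud);
	set_pud_at(mm, new_addr, new_pud, pud);
	flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);

	if (new_ptl != old_ptl)
		spin_unlock(new_ptl);
	spin_unlock(old_ptl);
	return true;
}
<snip>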
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kalesh Singh <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reported-by: kernel test robot <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Hassan Naveed <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jia He <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Mina Almasry <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Ram Pai <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Sandipan Das <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Speed up mremap on large regions", v4.
mremap time can be optimized by moving entries at the PMD/PUD level if the
source and destination addresses are PMD/PUD-aligned and PMD/PUD-sized.
Enable moving at the PMD and PUD levels on arm64 and x86. Other
architectures where this type of move is supported and known to be safe
can also opt-in to these optimizations by enabling HAVE_MOVE_PMD and
HAVE_MOVE_PUD.
Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
region on x86 and arm64:
- HAVE_MOVE_PMD is already enabled on x86 : N/A
- Enabling HAVE_MOVE_PUD on x86 : ~13x speed up
- Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
- Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
give a total of ~150x speed up on arm64.
This patch (of 4):
Test mremap on regions of various sizes and alignments and validate data
after remapping. Also provide total time for remapping the region which
is useful for performance comparison of the mremap optimizations that move
pages at the PMD/PUD levels if HAVE_MOVE_PMD and/or HAVE_MOVE_PUD are
enabled.
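A hedged sketch of the timing such a test performs (the real selftest
also validates the data after the move; names here are illustrative):
<snip>
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

/* remap src to dst and return elapsed nanoseconds, or -1 on failure */
static long long time_mremap(void *src, size_t len, void *dst)
{
	struct timespec t0, t1;
	void *ret;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	ret = mremap(src, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	if (ret == MAP_FAILED)
		return -1;
	return (t1.tv_sec - t0.tv_sec) * 1000000000LL +
	       (t1.tv_nsec - t0.tv_nsec);
}
<snip>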
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kalesh Singh <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Hassan Naveed <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Jia He <[email protected]>
Cc: Ram Pai <[email protected]>
Cc: Sandipan Das <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Mina Almasry <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Clean up fill_list() to keep all the pgmap manipulations in a single
location in the function, and update the exit/unwind path accordingly.
Link: http://lore.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/160272253442.3136502.16683842453317773487.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <[email protected]>
Reported-by: Boris Ostrovsky <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Stefano Stabellini <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
For many workloads, pagetable consumption is significant and it makes
sense to expose it in the memory.stat for the memory cgroups. However at
the moment, the pagetables are accounted per-zone. Converting them to
per-node and using the right interface will correctly account for the
memory cgroups as well.
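A sketch of the per-node (and thus per-memcg) accounting call this
enables (hook placement illustrative):
<snip>
	/* when a page-table page is constructed */
	mod_lruvec_page_state(page, NR_PAGETABLE, 1);

	/* and when it is destructed */
	mod_lruvec_page_state(page, NR_PAGETABLE, -1);
<snip>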
[[email protected]: export __mod_lruvec_page_state to modules for arch/mips/kvm/]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "memcg: add pagetable comsumption to memory.stat", v2.
Many workloads consumes significant amount of memory in pagetables. One
specific use-case is the user space network driver which mmaps the
application memory to provide zero copy transfer. This driver can consume
a large amount memory in page tables. This patch series exposes the
pagetable comsumption for each memory cgroup.
This patch (of 2):
This does not change any functionality and only move the functions which
update the lruvec stats to vmstat.h from memcontrol.h. The main reason
for this patch is to be able to use these functions in the page table
contructor function which is defined in mm.h and we can not include the
memcontrol.h in that file. Also this is a better place for this interface
in general. The lruvec abstraction, while invented for memcg, isn't
specific to memcg at all.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Swapcache readahead pages are charged before being used, so it is unlikely
that they will be migrated before charging. Remove the incorrect comment.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix the following coccinelle warnings:
mm/memcontrol.c:7341:2-22: WARNING: Assignment of 0/1 to bool variable
mm/memcontrol.c:7343:2-22: WARNING: Assignment of 0/1 to bool variable
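For illustration, the pattern these warnings flag (the variable name here
is hypothetical, not the one in memcontrol.c):
<snip>
	bool memsw_enabled;

	memsw_enabled = 1;	/* flagged: 0/1 assigned to a bool */
	memsw_enabled = true;	/* fixed */
<snip>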
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kaixu Xia <[email protected]>
Reported-by: Tosk Robot <[email protected]>
Acked-by: Souptick Joarder <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The *_lruvec_slab_state functions are also suitable for pages allocated
from the buddy allocator, not just for slab objects. But the function
names suggest that only slab objects are applicable. So rename the "slab"
keyword to "kmem".
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit 2ef1bf118c40 ("mm: memcg: deprecate the non-hierarchical mode")
removed the only use of memcg_has_children() in
mem_cgroup_hierarchy_write() as part of the feature deprecation.
Hence, since then, make CC=clang W=1 warns:
mm/memcontrol.c:3421:20: warning: unused function 'memcg_has_children' [-Wunused-function]
Simply remove this obsolete unused function.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lukas Bulwahn <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use page_counter_read() in page_counter_set_max().
Link: https://lkml.kernel.org/r/20201113141048.GA178922@rlk
Signed-off-by: Hui Su <[email protected]>
Reviewed-by: Pankaj Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
With the deprecation of the non-hierarchical mode of the memory controller
there are no more examples of broken hierarchies left.
Let's remove the cgroup core code which was supposed to print warnings
about the creation of broken hierarchies.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Update cgroup v1 docs after the deprecation of the non-hierarchical mode
of the memory controller.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: memcg: deprecate cgroup v1 non-hierarchical mode", v1.
The non-hierarchical cgroup v1 mode is a legacy of early days
of the memory controller and doesn't bring any value today.
However, it complicates the code and creates many edge cases
all over the memory controller code.
It's a good time to deprecate it completely. This patchset removes
the internal logic, adjusts the user interface and updates
the documentation. The last patch removes some bits of the cgroup
core code, which became obsolete.
Michal Hocko said:
"All that we know today is that we have a warning in place to complain
loudly when somebody relies on use_hierarchy=0 with a deeper
hierarchy. For all those years we have seen _zero_ reports that would
describe a sensible usecase.
Moreover we (SUSE) have backported this warning into old distribution
kernels (since 3.0 based kernels) to extend the coverage and didn't
hear even for users who adopt new kernels only very slowly. The only
report we have seen so far was a LTP test suite which doesn't really
reflect any real life usecase"
This patch (of 3):
The non-hierarchical cgroup v1 mode is a legacy of early days of the
memory controller and doesn't bring any value today. However, it
complicates the code and creates many edge cases all over the memory
controller code.
It's a good time to deprecate it completely.
Functionally, this patch enables hierarchical mode by default for all
cgroups and forbids switching it off. Nothing changes if cgroup v2 is
used: there, hierarchical mode has been enforced from the very beginning.
To protect the ABI, the memory.use_hierarchy interface is preserved with
limited functionality: reading always returns "1", writing "1" passes
silently, and writing any other value fails with -EINVAL and a warning to
dmesg (on the first occasion).
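Roughly, the preserved write handler then looks like this (a sketch;
exact warning text abbreviated):
<snip>
static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
				      struct cftype *cft, u64 val)
{
	if (val == 1)
		return 0;

	pr_warn_once("Non-hierarchical mode is deprecated. "
		     "Please report your usecase to [email protected] if "
		     "you depend on this functionality.\n");

	return -EINVAL;
}
<snip>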
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch fixes/removes some obsolete comments in the code related
to the kernel memory accounting:
- kmem_cache->memcg_params.memcg_caches has been removed by commit
9855609bde03 ("mm: memcg/slab: use a single set of kmem_caches for
all accounted allocations")
- memcg->kmemcg_id is not used as a gate for kmem accounting since
commit 0b8f73e10428 ("mm: memcontrol: clean up alloc, online,
offline, free functions")
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The page->mem_cgroup member has been replaced by memcg_data, and a helper
page_memcg() has been added for it. Update the comments accordingly to
avoid confusion.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alex Shi <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
v2"), the code to check the secondary MMU's page table access bit is
broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
secondary MMU's page table before the check. More specifically for those
secondary MMUs which unmap the memory in
mmu_notifier_invalidate_range_start() like kvm.
However, memory reclaim is the only user of !(TTU_IGNORE_ACCESS), i.e. of
the absence of TTU_IGNORE_ACCESS, and it explicitly performs the page
table access check before trying to unmap the page. So, at worst, reclaim
will miss accesses in a very short window if we remove the page table
access check from the unmapping code.
There is also an unintended consequence of !(TTU_IGNORE_ACCESS) for memcg
reclaim: in memcg reclaim, page_referenced() only accounts accesses from
processes in the same memcg as the target page, but the unmapping code
considers accesses from all processes, decreasing the effectiveness of
memcg reclaim.
The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
code.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
Signed-off-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The rcu_read_lock/unlock pair can only guarantee that the memcg will not
be freed; it cannot guarantee that css_get() on the memcg succeeds.
If the whole process of a cgroup offlining completes between reading the
objcg->memcg pointer and bumping the css reference on another CPU, and
there are exactly 0 external references to this memory cgroup (then how
do we get to obj_cgroup_charge()?), css_get() can change the ref counter
from 0 back to 1.
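A sketch of the safe pattern under these constraints (retry label
illustrative): use css_tryget(), which fails instead of resurrecting a
zero refcount:
<snip>
retry:
	rcu_read_lock();
	memcg = obj_cgroup_memcg(objcg);
	if (unlikely(!css_tryget(&memcg->css))) {
		/* memcg finished offlining; re-read the reparented objcg */
		rcu_read_unlock();
		goto retry;
	}
	rcu_read_unlock();
<snip>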
Link: https://lkml.kernel.org/r/[email protected]
Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Chris Down <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Consider the following memcg hierarchy.
        root
       /    \
      A      B
If we fail to get a reference on the objcg of memcg A,
get_obj_cgroup_from_current() can return the wrong objcg for the root
memcg.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Chris Down <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Eugene Syromiatnikov <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Adrian Reber <[email protected]>
Cc: Marco Elver <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The mz->usage_in_excess >= mz_node->usage_in_excess check is exactly the
else case of mz->usage_in_excess < mz_node->usage_in_excess. So we can
replace the else if (mz->usage_in_excess >= mz_node->usage_in_excess)
with a plain else. Also drop the comment, which doesn't really explain
much.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since commit 991e7673859e ("mm: memcontrol: account kernel stack per
node") there is no user of mod_memcg_obj_state(), so just remove it.
Also rework the type of the idx parameter of mod_objcg_state() from int
to enum node_stat_item.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: David Rientjes <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yafang Shao <[email protected]>
Cc: Chris Down <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As huge page usage in the page cache and for shmem files proliferates in
our production environment, the performance monitoring team has asked for
per-cgroup stats on those pages.
We already track and export anon_thp per cgroup. We already track file
THP and shmem THP per node, so making them per-cgroup is only a matter of
switching from node to lruvec counters. All callsites are in places where
the pages are charged and locked, so page->memcg is stable.
[[email protected]: add documentation]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Song Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix a typo, punctuation, use uppercase for CPUs, and limit
tmpfs to keeping only its files in virtual memory (phrasing).
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Chris Down <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
shmem_mapping() isn't worth an out-of-line call from any callsite.
So make it inline by:
- making shmem_aops global
- exporting shmem_aops
- inlining shmem_mapping()
and replace the direct use of shmem_aops with shmem_mapping()
in shmem.c.
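The resulting inline helper plausibly looks like:
<snip>
extern const struct address_space_operations shmem_aops;

static inline bool shmem_mapping(struct address_space *mapping)
{
	return mapping->a_ops == &shmem_aops;
}
<snip>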
Link: https://lkml.kernel.org/r/20201115165207.GA265355@rlk
Signed-off-by: Hui Su <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
With the merge of commit 2e1692966034 ("ceph: have ceph_writepages_start
call pagevec_lookup_range_tag"), nothing calls this anymore.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jeff Layton <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We can use the memset() helper to fill the swap_map with SWAP_HAS_CACHE
instead of a direct loop here, to simplify the code. This also lets us
remove the local variables i and map.
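A before/after sketch of the simplification (variable names and
surrounding context abbreviated):
<snip>
	/* before */
	unsigned char *map = si->swap_map + offset;
	unsigned long i;

	for (i = 0; i < nr_pages; i++)
		map[i] = SWAP_HAS_CACHE;

	/* after: SWAP_HAS_CACHE fits in a byte, so memset works */
	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
<snip>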
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When the code jumps to the out label, p must be NULL. So all the out
label really does is a redundant if check and a return of err. Remove
this unnecessary out label, since it does not handle any resource freeing
or similar.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
swap_ra_info() may leave ra_info untouched in the non_swap_entry() case,
as the page table lock is not held. In that case, we have
ra_info.nr_pte == 0 and it is meaningless to continue with swap cache
readahead. Skip such readahead by initializing ra_info.win to 1.
[[email protected]: clean up struct init]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit 570a335b8e22 ("swap_info: swap count continuations") introduced
add_swap_count_continuation() but forgot to use the helper function
swap_count() introduced by commit 355cfa73ddff ("mm: modify swap_map and
add SWAP_HAS_CACHE flag").
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
release_pages() is an optimized, inlined version of __put_page(), except
that zone device struct pages that are not page_is_devmap_managed() (i.e.,
memory_type MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA) fall
through to the code that could return the zone device page to the page
allocator instead of adjusting the pgmap reference count.
Clearly these types of pages are not having their reference count
decremented to zero via release_pages(), or page allocation problems
would be seen. Just to be safe, handle the 1-to-zero case in
release_pages() like __put_page() does.
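A sketch of the handling described (abbreviated; mirrors the general
shape of such a fix, not the exact hunk):
<snip>
		if (is_zone_device_page(page)) {
			if (page_is_devmap_managed(page)) {
				put_devmap_managed_page(page);
				continue;
			}
			/* DEVICE_GENERIC / PCI_P2PDMA: drop the pgmap ref
			 * instead of freeing to the page allocator
			 */
			if (put_page_testzero(page))
				put_dev_pagemap(page->pgmap);
			continue;
		}
<snip>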
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ralph Campbell <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
These functions accomplish the same thing but have different
implementations.
unpin_user_page() has a bug where it calls mod_node_page_state() after
calling put_page(), which creates a risk that the page could have been
hot-unplugged from the system.
Fix this by using put_compound_head() as the only implementation.
__unpin_devmap_managed_user_page() and related can be deleted as well in
favour of the simpler, but slower, version in put_compound_head() that has
an extra atomic page_ref_sub, but always calls put_page() which internally
contains the special devmap code.
Move put_compound_head() to be directly after try_grab_compound_head() so
people can find it in future.
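For illustration, the unsafe ordering the fix removes (the stat item is
taken from the Fixes: commit; the shape is illustrative):
<snip>
	/* buggy: the page (and its node) may be gone after put_page() */
	put_page(page);
	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);

	/* safe: update the stat while we still hold our reference */
	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
	put_page(page);
<snip>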
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 1970dc6f5226 ("mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting")
Signed-off-by: Jason Gunthorpe <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
CC: Joao Martins <[email protected]>
CC: Jonathan Corbet <[email protected]>
CC: Dan Williams <[email protected]>
CC: Dave Chinner <[email protected]>
CC: Christoph Hellwig <[email protected]>
CC: Jane Chu <[email protected]>
CC: "Kirill A. Shutemov" <[email protected]>
CC: Michal Hocko <[email protected]>
CC: Mike Kravetz <[email protected]>
CC: Shuah Khan <[email protected]>
CC: Muchun Song <[email protected]>
CC: Vlastimil Babka <[email protected]>
CC: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|