|
Minchan Kim asked the following question: what lock protects the
address_space from being destroyed when a race happens between inode
truncation and __isolate_lru_page()? Jan Kara clarified by describing
the race as follows:
CPU1                                            CPU2

truncate(inode)                                 __isolate_lru_page()
  ...
  truncate_inode_page(mapping, page);
    delete_from_page_cache(page)
      spin_lock_irqsave(&mapping->tree_lock, flags);
        __delete_from_page_cache(page, NULL)
          page_cache_tree_delete(..)
            ...                                   mapping = page_mapping(page);
            page->mapping = NULL;
          ...
      spin_unlock_irqrestore(&mapping->tree_lock, flags);
      page_cache_free_page(mapping, page)
        put_page(page)
          if (put_page_testzero(page)) -> false

- inode now has no pages and can be freed including the embedded
  address_space

                                                  if (mapping && !mapping->a_ops->migratepage)
- we've dereferenced mapping which is potentially already free.
The race is theoretically possible but unlikely. Before
delete_from_page_cache(), truncate_cleanup_page() is called, so the page
is likely to be !PageDirty or PageWriteback, which gets skipped by the
only caller that checks the mapping in __isolate_lru_page(). Even if the
race occurs, a substantial amount of work has to happen during a tiny
window with no preemption, but it could potentially be done using a
virtual machine to artificially slow one CPU or halt it during the
critical window.
This patch should eliminate the race with truncation by try-locking the
page before dereferencing the mapping and aborting if the lock is not
acquired. There was a suggestion from Huang Ying to use RCU as a
side-effect to prevent the mapping from being freed. However, I do not
like that solution as it's an unconventional means of preserving a
mapping, and it's not a context where rcu_read_lock() is obviously
protecting RCU data.
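The shape of the fix in __isolate_lru_page(), as a sketch (the flag and
callback names follow the 4.15-era code; the block runs only for dirty
or writeback pages under ISOLATE_ASYNC_MIGRATE):

	if (mode & ISOLATE_ASYNC_MIGRATE) {
		bool migrate_dirty;

		/*
		 * Pin the mapping by try-locking the page: truncation
		 * holds the page lock until the page is removed from the
		 * page cache, so holding it ourselves stabilises the
		 * mapping.  Bail if the lock is contended.
		 */
		if (!trylock_page(page))
			return ret;

		mapping = page_mapping(page);
		migrate_dirty = !mapping || mapping->a_ops->migratepage;
		unlock_page(page);
		if (!migrate_dirty)
			return ret;
	}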
Link: http://lkml.kernel.org/r/[email protected]
Fixes: c82449352854 ("mm: compaction: make isolate_lru_page() filter-aware again")
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Nothing actually calls userfaultfd_file_create() besides the
userfaultfd() system call itself. So simplify things by folding it into
the system call and using anon_inode_getfd() instead of
anon_inode_getfile(). Do the same in resolve_userfault_fork() as well.
This removes over 50 lines with no change in functionality.
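For illustration, the folded syscall body roughly takes this shape (a
sketch; ctx setup is elided, and the error-path cleanup shown here is an
assumption based on the existing userfaultfd context allocation):

	SYSCALL_DEFINE1(userfaultfd, int, flags)
	{
		struct userfaultfd_ctx *ctx;
		int fd;

		/* ... validate flags, allocate and initialize ctx ... */

		/*
		 * anon_inode_getfd() replaces the anon_inode_getfile() +
		 * get_unused_fd_flags() + fd_install() sequence and its
		 * error unwinding.
		 */
		fd = anon_inode_getfd("[userfaultfd]", &userfaultfd_fops, ctx,
				      O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
		if (fd < 0) {
			mmdrop(ctx->mm);
			kmem_cache_free(userfaultfd_ctx_cachep, ctx);
		}
		return fd;
	}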
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric Biggers <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Marc-André Lureau <[email protected]>
Suggested-by: Mike Kravetz <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Herrmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The memfd & fuse tests will share more common code in the following
commits to test hugetlb support.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Marc-André Lureau <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Herrmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Link: http://lkml.kernel.org/r/[email protected]
Suggested-by: Mike Kravetz <[email protected]>
Signed-off-by: Marc-André Lureau <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Herrmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove most of the special-casing of hugetlbfs now that sealing is
supported.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Marc-André Lureau <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Herrmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Adapt add_seals()/get_seals() to work with hugetlbfs-backed memory.
Teach memfd_create() to allow sealing operations on MFD_HUGETLB.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Marc-André Lureau <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Herrmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Implements memfd sealing, similar to shmem:
- WRITE: deny fallocate(PUNCH_HOLE). mmap() write is denied in
memfd_add_seals(). write() doesn't exist for hugetlbfs.
- SHRINK: added a check similar to shmem_setattr()
- GROW: added a check similar to shmem_setattr() & shmem_fallocate()
Except for the write() operation, which doesn't exist with hugetlbfs,
that should make sealing as close as it can be to the shmem support; a
sketch of the checks follows.
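A sketch of the checks, assuming a hugetlbfs inode info with a seals
field mirroring shmem (the field and helper names here are illustrative):

	/* in hugetlbfs_setattr(), before the size change */
	struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
	loff_t oldsize = inode->i_size;

	if (newsize < oldsize && (info->seals & F_SEAL_SHRINK))
		return -EPERM;
	if (newsize > oldsize && (info->seals & F_SEAL_GROW))
		return -EPERM;

	/* in hugetlbfs_fallocate(): the WRITE seal forbids hole punching */
	if ((mode & FALLOC_FL_PUNCH_HOLE) && (info->seals & F_SEAL_WRITE))
		return -EPERM;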
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Marc-André Lureau <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Herrmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
hugetlbfs inode information will need to be accessed by code in
mm/shmem.c for file sealing operations. Move inode information
definition from .c file to header for needed access.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Marc-André Lureau <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Herrmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Those functions are called for memfd files, backed by shmem or hugetlb
(the next patches will handle hugetlb).
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Marc-André Lureau <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Herrmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "memfd: add sealing to hugetlb-backed memory", v3.
Recently, Mike Kravetz added hugetlbfs support to memfd. However, he
didn't add sealing support. One of the reasons to use memfd is to have
sealed shared memory when doing IPC or sharing memory with another
process, for some extra safety. qemu uses shared memory & hugetlb with
vhost-user (used by dpdk), so it is reasonable for it to use memfd now,
for convenience and security reasons.
This patch (of 9):
The functions are called through shmem_fcntl() only, and there is no
danger in removing the EXPORTs as the routines only work with shmem file
structs.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Marc-André Lureau <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Herrmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
PG_buddy doesn't exist any more. It's called PageBuddy now.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Be really explicit about what bits / bytes are reserved for users that
want to store extra information about the pages they allocate.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Randy Dunlap <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Neither of these values gets even close to 256: compound_dtor is
currently at a maximum of 3, and compound_order can't be over 64. No
machine has had inefficient access to bytes since EV5, and while those
are still supported, we don't optimise for them any more. This does not
shrink struct page, but it removes an ifdef and frees up 2-6 bytes for
future use.
diff of pahole output:
		struct callback_head callback_head;      /*    32    16 */
		struct {
			long unsigned int compound_head; /*    32     8 */
-			unsigned int compound_dtor;      /*    40     4 */
-			unsigned int compound_order;     /*    44     4 */
+			unsigned char compound_dtor;     /*    40     1 */
+			unsigned char compound_order;    /*    41     1 */
		};                                       /*    32    16 */
	};                                               /*    32    16 */
	union {
[[email protected]: add comment]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Instead of putting the ifdef in the middle of the definition of struct
page, pull it forward to the rest of the ifdeffery around the SLUB
cmpxchg_double optimisation.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The comment on page->mapping is terse, and out of date (it does not
mention the possibility of PAGE_MAPPING_MOVABLE). Instead, point the
interested reader to page-flags.h where there is a much better comment.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The "third double word block" isn't on 32-bit systems. The layout looks
like this:
unsigned long flags;
struct address_space *mapping
pgoff_t index;
atomic_t _mapcount;
atomic_t _refcount;
which is 32 bytes on 64-bit, but 20 bytes on 32-bit. Nobody is trying to
use the fact that it's double-word aligned today, so just remove the
misleading claims.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
I found the struct { union { struct { union { struct { } } } } } layout
rather confusing. Fortunately, there is an easier way to write this.
The innermost union is of four things which are the size of an int, so
the ones which are used by slab/slob/slub can be pulled up two levels to
be in the outermost union with 'counters'. That leaves us with struct {
union { struct { atomic_t; atomic_t; } } } which has the same layout,
but is easier to read.
Output from the current git version of pahole, diffed with -uw to ignore
the whitespace changes from the indentation:
	};                                                 /*    16     8 */
	union {
		long unsigned int counters;                /*    24     8 */
-		struct {
-			union {
-				atomic_t _mapcount;        /*    24     4 */
		unsigned int active;                       /*    24     4 */
		struct {
			unsigned int inuse:16;             /*    24:16  4 */
@@ -21,7 +18,8 @@
			unsigned int frozen:1;             /*    24: 0  4 */
		};                                         /*    24     4 */
		int units;                                 /*    24     4 */
-		};                                         /*    24     4 */
+		struct {
+			atomic_t _mapcount;                /*    24     4 */
		atomic_t _refcount;                        /*    28     4 */
		};                                         /*    24     8 */
	};                                                 /*    24     8 */
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Restructure struct page", v2.
This series does not attempt any grand restructuring. Instead, it cures
the worst of the indentitis, fixes the documentation and reduces the
ifdeffery. The only layout change is compound_dtor and compound_order
are each reduced to one byte.
This patch (of 8):
Instead of an ifdef block at the end of the struct, which needed its own
comment, define _struct_page_alignment up at the top where it fits
nicely with the existing comment.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The zs_pool structure has a special flag to indicate the success of
shrinker initialization. unregister_shrinker() has been improved and can
now detect by itself whether actual deinitialization should be
performed, so the extra flag becomes redundant.
[[email protected]: update comment (Aliaksei), remove unneeded cast]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Aliaksei Karaliou <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This uses the new annotation to determine if an mm has mmu notifiers
with blockable invalidate range callbacks; if so, oom reaping is
avoided. Otherwise, the callbacks are used around unmap_page_range().
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Rientjes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Christian König <[email protected]>
Cc: Dimitri Sivanich <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Oded Gabbay <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Mike Marciniszyn <[email protected]>
Cc: Sean Hefty <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit 4d4bbd8526a8 ("mm, oom_reaper: skip mm structs with mmu
notifiers") prevented the oom reaper from unmapping private anonymous
memory when the oom victim mm had mmu notifiers registered.
The rationale is that doing mmu_notifier_invalidate_range_{start,end}()
around the unmap_page_range(), which is needed, can block and the oom
killer will stall forever waiting for the victim to exit, which may not
be possible without reaping.
That concern is real, but only true for mmu notifiers that have
blockable invalidate_range_{start,end}() callbacks. This patch adds a
"flags" field to mmu notifier ops that can set a bit to indicate that
these callbacks do not block.
The implementation is steered toward an expensive slowpath, such as
after the oom reaper has grabbed mm->mmap_sem of a still alive oom
victim.
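A sketch of the helper this adds, assuming the SRCU-protected notifier
list of the time and a MMU_INVALIDATE_DOES_NOT_BLOCK bit in the new ops
flags field:

	bool mm_has_blockable_invalidate_notifiers(struct mm_struct *mm)
	{
		struct mmu_notifier *mn;
		int id;
		bool ret = false;

		WARN_ON_ONCE(!rwsem_is_locked(&mm->mmap_sem));

		if (!mm_has_notifiers(mm))
			return ret;

		id = srcu_read_lock(&srcu);
		hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
			if (!(mn->ops->flags & MMU_INVALIDATE_DOES_NOT_BLOCK)) {
				ret = true;	/* at least one callback may block */
				break;
			}
		}
		srcu_read_unlock(&srcu, id);
		return ret;
	}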
[[email protected]: mmu_notifier_invalidate_range_end() can also call invalidate_range(), which must not block; fix comment]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: make mm_has_blockable_invalidate_notifiers() return bool, use rwsem_is_locked()]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Rientjes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Acked-by: Christian König <[email protected]>
Acked-by: Dimitri Sivanich <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Oded Gabbay <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Mike Marciniszyn <[email protected]>
Cc: Sean Hefty <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In the current design, khugepaged needs to acquire mmap_sem before
scanning an mm. But in some corner cases, khugepaged may scan a process
which is modifying its memory mapping, so khugepaged blocks in an
uninterruptible state. The process might hold mmap_sem for a long time
when modifying a huge memory space, which may trigger the khugepaged
hang below:
INFO: task khugepaged:270 blocked for more than 120 seconds.
Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
khugepaged D 0 270 2 0x00000000
ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440
ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64
0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0
Call Trace:
schedule+0x36/0x80
rwsem_down_read_failed+0xf0/0x150
call_rwsem_down_read_failed+0x18/0x30
down_read+0x20/0x40
khugepaged+0x476/0x11d0
kthread+0xe6/0x100
ret_from_fork+0x25/0x30
So it seems pointless to just block khugepaged waiting for the
semaphore. Replace down_read() with down_read_trylock() so khugepaged
moves on to scan the next mm quickly instead of blocking, giving other
processes more chances to install THP. khugepaged can then come back to
scan the skipped mm when it has finished the current round of full_scan.
The change also appears to improve khugepaged efficiency a little bit; a
sketch of the trylock change follows the results table below.
Below is the test result when running LTP on a 24-core, 4GB-memory,
2-node NUMA VM:

                              pristine    w/ trylock
  full_scan                        197           187
  pages_collapsed                   21            26
  thp_fault_alloc                40818         44466
  thp_fault_fallback             18413         16679
  thp_collapse_alloc                21           150
  thp_collapse_alloc_failed         14            16
  thp_file_alloc                   369           369
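The core of the change, as a sketch (names follow the 4.15-era
khugepaged_scan_mm_slot(); the surrounding loop is abbreviated):

	/*
	 * Don't wait for the semaphore: move on to the next mm on the
	 * list instead, and revisit this one on a later full scan.
	 */
	vma = NULL;
	if (unlikely(!down_read_trylock(&mm->mmap_sem)))
		goto breakouterloop_mmap_sem;
	if (likely(!khugepaged_test_exit(mm)))
		vma = find_vma(mm, khugepaged_scan.address);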
[[email protected]: coding-style fixes]
[[email protected]: tweak comment]
[[email protected]: avoid uninitialized variable use]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Instead of marking the pmd ready for split, invalidate the pmd. This
should take care of the powerpc requirement. The only side effect is
that we mark the pmd invalid early. This can result in us blocking
access to the page a bit longer if we race against a thp split.
[[email protected]: rebased, dirty THP once]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: David Daney <[email protected]>
Cc: David Miller <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use the modified pmdp_invalidate(), which returns the previous value of
the pmd, to transfer dirty and accessed bits.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: David Daney <[email protected]>
Cc: David Miller <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
dirty and access bits if the CPU sets them after the pmdp dereference,
but before set_pmd_at().
The patch changes pmdp_invalidate() to make the entry non-present
atomically and to return the previous value of the entry. This value
can be used to check if the CPU set dirty/accessed bits under us.
The race window is very small and I haven't seen any reports that can be
attributed to the bug. For this reason, I don't think backporting to
stable trees is needed.
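For reference, a sketch of the generic implementation this series ends
up with, built on the pmdp_establish() helper it introduces:

	pmd_t pmdp_invalidate(struct vm_area_struct *vma,
			      unsigned long address, pmd_t *pmdp)
	{
		/*
		 * Install the non-present entry atomically and fetch the
		 * old one, so dirty/accessed bits the CPU sets in between
		 * are carried over in the return value.
		 */
		pmd_t old = pmdp_establish(vma, address, pmdp,
					   pmd_mknotpresent(*pmdp));
		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
		return old;
	}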
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: Vlastimil Babka <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: David Daney <[email protected]>
Cc: David Miller <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vineet Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We need an atomic way to set up a pmd page table entry, avoiding races
with the CPU setting dirty/accessed bits. This is required to implement
pmdp_invalidate() that doesn't lose these bits.
On PAE we can avoid the expensive cmpxchg8b for cases when the new page
table entry is not present. If it's present, fall back to a cmpxchg
loop.
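A sketch of the PAE variant, assuming the usual split 32-bit halves of a
PAE pmd (the union name is illustrative):

	union split_pmd {
		struct {
			u32 pmd_low;
			u32 pmd_high;
		};
		pmd_t pmd;
	};

	static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
			unsigned long address, pmd_t *pmdp, pmd_t pmd)
	{
		pmd_t old;

		/*
		 * A non-present entry cannot have dirty/accessed bits set
		 * by the CPU, so the two halves can be written separately
		 * without the expensive cmpxchg8b.
		 */
		if (!(pmd_val(pmd) & _PAGE_PRESENT)) {
			union split_pmd *p = (union split_pmd *)pmdp;
			union split_pmd o, n;

			n.pmd = pmd;
			/* xchg acts as a barrier before the high half is set */
			o.pmd_low = xchg(&p->pmd_low, n.pmd_low);
			o.pmd_high = p->pmd_high;
			p->pmd_high = n.pmd_high;
			return o.pmd;
		}

		/* Present entry: fall back to a cmpxchg64 loop. */
		do {
			old = *pmdp;
		} while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);

		return old;
	}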
[[email protected]: add missing `do' to do-while loop]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It's required to avoid losing dirty and accessed bits.
[[email protected]: add a `do' to the do-while loop]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Nitin Gupta <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: David Miller <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It's required to avoid losing dirty and accessed bits.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Martin Schwidefsky <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It's required to avoid losing dirty and accessed bits.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
MIPS doesn't support hardware dirty/accessed bits.
generic_pmdp_establish() is suitable in this case.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: David Daney <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We need an atomic way to set up a pmd page table entry, avoiding races
with the CPU setting dirty/accessed bits. This is required to implement
pmdp_invalidate() that doesn't lose these bits.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
ARM LPAE doesn't have hardware dirty/accessed bits.
generic_pmdp_establish() is the right implementation of pmdp_establish
for this case.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
ARC doesn't support hardware dirty/accessed bits.
generic_pmdp_establish() is suitable in this case.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Do not lose dirty bit on THP pages", v4.
Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
dirty and access bits if the CPU sets them after the pmdp dereference,
but before set_pmd_at().
The bug can lead to data loss, but the race window is tiny and I haven't
seen any reports suggesting that it happens in reality. So I don't
think it is worth sending to stable.
Unfortunately, there's no way to address the issue in a generic way. We
need to fix all architectures that support THP one-by-one.
All architectures that support THP have to provide an atomic
pmdp_invalidate() that returns the previous value.
If the generic implementation of pmdp_invalidate() is used, the
architecture needs to provide an atomic pmdp_establish().
pmdp_establish() is not used outside the generic implementation of
pmdp_invalidate() so far, but I think this can change in the future.
This patch (of 12):
This is an implementation of pmdp_establish() that is only suitable for
an architecture that doesn't have hardware dirty/accessed bits. In this
case we can't race with a CPU which sets these bits, and a non-atomic
approach is fine.
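A minimal sketch of such a helper (non-atomic by design; safe only
because hardware never touches the entry behind our back):

	static inline pmd_t generic_pmdp_establish(struct vm_area_struct *vma,
			unsigned long address, pmd_t *pmdp, pmd_t pmd)
	{
		pmd_t old_pmd = *pmdp;
		set_pmd_at(vma->vm_mm, address, pmdp, pmd);
		return old_pmd;
	}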
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: David Daney <[email protected]>
Cc: David Miller <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vineet Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We don't have to use an entire 'long' for the number of elements in the
pagevec; we know it's a number between 0 and 14 (now 15). So we can
store it in a char, and then the bool packs next to it and we still have
two or six bytes of padding for more elements in the header. That gives
us space to cram in an extra page.
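A sketch of the resulting layout (the bool's exact name is incidental
and assumed here):

	#define PAGEVEC_SIZE	15

	struct pagevec {
		unsigned char nr;		/* 0..15 fits in a char */
		bool percpu_pvec_drained;	/* packs next to nr */
		struct page *pages[PAGEVEC_SIZE];
	};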
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Several users of unmap_mapping_range() would prefer to express their
range in pages rather than bytes. Unfortunately, on a 32-bit kernel, you
have to remember to cast your page number to a 64-bit type before
shifting it, and four places in the current tree didn't remember to do
that. That's a sign of a bad interface.
Conveniently, unmap_mapping_range() actually converts from bytes into
pages, so hoist the guts of unmap_mapping_range() into a new function
unmap_mapping_pages() and convert the callers which want to use pages.
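A sketch of the new entry point and the byte-based wrapper it leaves
behind (shapes follow the description above; overflow handling
abbreviated):

	void unmap_mapping_pages(struct address_space *mapping,
				 pgoff_t start, pgoff_t nr, bool even_cows);

	void unmap_mapping_range(struct address_space *mapping,
				 loff_t const holebegin, loff_t const holelen,
				 int even_cows)
	{
		pgoff_t hba = holebegin >> PAGE_SHIFT;
		pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;

		/* ... clamp hlen if the 64-bit byte range overflows pgoff_t ... */

		unmap_mapping_pages(mapping, hba, hlen, even_cows);
	}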
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Reported-by: "zhangyi (F)" <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If THP migration is enabled, for a VMA handled by userfaultfd, consider
the following situation:

  do_page_fault()
    __do_huge_pmd_anonymous_page()
      handle_userfault()
        userfault_msg()
                        /* a huge page is allocated and mapped at the
                           fault address */
                        /* the huge page is put under migration, leaving
                           a migration entry in the page table */
        userfaultfd_must_wait()
          /* returns true because !pmd_present() */
          /* may wait in a loop until a fatal signal */
That is, userfaultfd_must_wait() may encounter a PMD entry which is
!pmd_none() && !pmd_present(). In the current implementation, we will
wait for such PMD entries, which may cause unnecessary waiting and a
potential soft lockup.
This is fixed by avoiding the wait when !pmd_none() && !pmd_present(),
and only waiting when pmd_none().
This may not be a problem in practice, because userfaultfd_must_wait()
is always called with mm->mmap_sem read-locked, mremap() write-locks
mm->mmap_sem, and UFFDIO_COPY doesn't support copying a THP mapping.
But the change still makes the code more correct, and makes the PMD and
PTE code more consistent.
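A sketch of the adjusted PMD check in userfaultfd_must_wait() (shape per
the description above):

	pmd_t _pmd = READ_ONCE(*pmd);

	if (pmd_none(_pmd))
		ret = true;	/* not faulted in yet: wait */

	/*
	 * !pmd_none() && !pmd_present() is a migration entry: the fault
	 * was already resolved, so don't wait for it.
	 */
	if (!pmd_present(_pmd))
		goto out;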
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
pmd_trans_splitting() was removed by the THP refcounting redesign, so
the related comment should be updated.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yisheng Xie <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In register_page_bootmem_info_section() we call __nr_to_section() in
order to get the mem_section struct at the beginning of the function.
Since we already got it, there is no need for a second call to
__nr_to_section().
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If the start_code / end_code pointers are screwed up then "VmExe" can be
bigger than the total executable virtual memory and "VmLib" becomes
negative:

  VmExe:       294320 kB
  VmLib: 18446744073709327564 kB

VmExe and VmLib are documented as the text segment and shared library
code sizes. Now their sum will always equal mm->exec_vm, which sums the
sizes of executable, non-writable, non-stack areas.
I've seen this for a huge (>2Gb) statically linked binary which has the
whole world inside. For it, the start_code .. end_code range also
covers one of the rodata sections. Probably this is a bug in a
customized linker, the elf loader, or both.
Anyway, CONFIG_CHECKPOINT_RESTORE allows changing these pointers, thus
we cannot trust them without validation.
Link: http://lkml.kernel.org/r/150728955451.743749.11276392315459539583.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The comment describes a @fullmm argument, but the function has no such
parameter.
Update the comment to match the code and convert it to kernel-doc
markup.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When we call register_page_bootmem_info_section() having
CONFIG_SPARSEMEM_VMEMMAP enabled, we check if the pfn is valid.
This check is redundant as we already checked this in
register_page_bootmem_info_node() before calling
register_page_bootmem_info_section(), so let's get rid of it.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
hugepages_treat_as_movable was introduced by 396faf0303d2 ("Allow huge
page allocations to use GFP_HIGH_MOVABLE") to allow hugetlb allocations
from ZONE_MOVABLE even when hugetlb pages were not migratable. The
purpose of the movable zone was different at the time. It aimed at
reducing memory fragmentation, and hugetlb pages, being long lived and
large, were not contributing to the fragmentation, so it was acceptable
to use the zone back then.
Things have changed, though, and the primary purpose of the zone became
the migratability guarantee. If we allow non-migratable hugetlb pages in
ZONE_MOVABLE, memory hotplug might fail to offline the memory.
Remove the knob and rely only on hugepage_migration_supported() to allow
movable zones.
Mel said:
: Primarily it was aimed at allowing the hugetlb pool to safely shrink with
: the ability to grow it again. The use case was for batched jobs, some of
: which needed huge pages and others that did not but didn't want the memory
: uselessly pinned in the huge pages pool.
:
: I suspect that more users rely on THP than hugetlbfs for flexible use of
: huge pages with fallback options so I think that removing the option
: should be ok.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Reported-by: Alexandru Moise <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Alexandru Moise <[email protected]>
Cc: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove the unused function pgdat_reclaimable_pages(), and
node_page_state_snapshot(), which becomes unused as well.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use the vma_pages() function on the vma object instead of the explicit
computation.
mm/interval_tree.c:21:27-33: WARNING: Consider using vma_pages helper
Generated by: scripts/coccinelle/api/vma_pages.cocci
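The transformation is the usual one for this script, illustratively (the
variable name is hypothetical):

	/* before: open-coded page count */
	pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;

	/* after: the helper does exactly that computation */
	pages = vma_pages(vma);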
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vasyl Gomonovych <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Acked-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Architectures like PPC64 support mmap hint address based large address
space selection. This test can be run on those architectures too. Move
the test from the x86 selftests to selftest/vm so that other
architectures can use it too.
We also add a few new test scenarios in this patch. We do test a few
boundary conditions before we do a high address mmap. PPC64 uses the
address limit to validate the address in the fault path. We had bugs in
this area w.r.t. SLB fault handling before we updated the address limit.
We also touch the allocated space to make sure we don't have any bugs in
the fault handling path.
[[email protected]: restore tools/testing/selftests/vm/Makefile alpha ordering]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Shakeel Butt reported he has observed in production systems that the job
loader gets stuck for 10s of seconds while doing a mount operation. It
turns out that it was stuck in register_shrinker() because some
unrelated job was under memory pressure and was spending time in
shrink_slab(). Machines have a lot of shrinkers registered and jobs
under memory pressure have to traverse all of those memcg-aware
shrinkers and affect unrelated jobs which want to register their own
shrinkers.
To solve the issue, this patch simply bails out of slab shrinking if it
is found that someone wants to register a shrinker in parallel. A
downside is that it could cause unfair shrinking between shrinkers.
However, that should be rare and we can add more complicated logic if we
find it's not enough; a sketch of the bail-out follows.
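A minimal sketch of the bail-out at the top of shrink_slab(), assuming
the shrinker_rwsem naming of the time (internals abbreviated):

	static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
					 struct mem_cgroup *memcg,
					 unsigned long nr_scanned,
					 unsigned long nr_eligible)
	{
		unsigned long freed = 0;

		/*
		 * Bail out instead of making a register_shrinker() caller
		 * wait while we traverse every registered shrinker.
		 */
		if (!down_read_trylock(&shrinker_rwsem))
			goto out;

		/* ... walk shrinker_list and call do_shrink_slab() ... */

		up_read(&shrinker_rwsem);
	out:
		cond_resched();
		return freed;
	}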
[[email protected]: tweak code comment]
Link: http://lkml.kernel.org/r/20171115005602.GB23810@bbox
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Shakeel Butt <[email protected]>
Reported-by: Shakeel Butt <[email protected]>
Tested-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
__get_free_pages() returns a virtual address, but it is not just a
32-bit address; on a 64-bit system, for example, it is wider. The
comment really confuses new readers of mm.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiankang Chen <[email protected]>
Reported-by: Hanjun Guo <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Yisheng Xie <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix ptr_ret.cocci warnings:
mm/page_owner.c:639:1-3: WARNING: PTR_ERR_OR_ZERO can be used
Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR
Generated by: scripts/coccinelle/api/ptr_ret.cocci
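The transformation, illustratively (the variable name is hypothetical):

	/* before */
	if (IS_ERR(dentry))
		return PTR_ERR(dentry);
	return 0;

	/* after */
	return PTR_ERR_OR_ZERO(dentry);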
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vasyl Gomonovych <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|