Age | Commit message (Collapse) | Author | Files | Lines |
|
The hugetlb cgroup reservation test charge_reserved_hugetlb.sh assume
that no cgroup filesystems are mounted before running the test. That is
not true in many cases. As a result, the test fails to run. Fix that
by querying the current cgroup mount setting and using the existing
cgroup setup instead before attempting to freshly mount a cgroup
filesystem.
Similar change is also made for hugetlb_reparenting_test.sh as well,
though it still has problem if cgroup v2 isn't used.
The patched test scripts were run on a centos 8 based system to verify
that they ran properly.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 29750f71a9b4 ("hugetlb_cgroup: add hugetlb_cgroup reservation tests")
Signed-off-by: Waiman Long <[email protected]>
Acked-by: Mina Almasry <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There are interfaces to adjust max_ptes_none, max_ptes_swap,
max_ptes_shared values, see
/sys/kernel/mm/transparent_hugepage/khugepaged/.
But system administrator may not know which value is the best. So Add
those events to support adjusting max_ptes_* to suitable values.
For example, if default max_ptes_swap value causes too much failures,
and system uses zram whose IO is fast, administrator could increase
max_ptes_swap until THP_SCAN_EXCEED_SWAP_PTE not increase anymore.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Yang <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Saravanan D <[email protected]>
Cc: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The hugetlb vma mremap() test currently maps 1GB of memory to trigger
pmd sharing and make sure that 'unshare' path in mremap code works. The
test originally only mapped 10MB of memory (as specified by the header
comment) but was later modified to 1GB to tackle this case.
However, not all machines will have 1GB of memory to spare for this
test. Adding a mapping size arg will allow run_vmtest.sh to pass an
adequate mapping size, while allowing users to run the test
independently with arbitrary size mappings.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Mina Almasry <[email protected]>
Cc: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
For hugetlb backed jobs/VMs it's critical to understand the numa
information for the memory backing these jobs to deliver optimal
performance.
Currently this technically can be queried from /proc/self/numa_maps, but
there are significant issues with that. Namely:
1. Memory can be mapped or unmapped.
2. numa_maps are per process and need to be aggregated across all
processes in the cgroup. For shared memory this is more involved as
the userspace needs to make sure it doesn't double count shared
mappings.
3. I believe querying numa_maps needs to hold the mmap_lock which adds
to the contention on this lock.
For these reasons I propose simply adding hugetlb.*.numa_stat file,
which shows the numa information of the cgroup similarly to
memory.numa_stat.
On cgroup-v2:
cat /sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat
total=2097152 N0=2097152 N1=0
On cgroup-v1:
cat /sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat
total=2097152 N0=2097152 N1=0
hierarichal_total=2097152 N0=2097152 N1=0
This patch was tested manually by allocating hugetlb memory and querying
the hugetlb.*.numa_stat file of the cgroup and its parents.
[[email protected]: fix spelling mistake "hierarichal" -> "hierarchical"]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix copy/paste array assignment]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mina Almasry <[email protected]>
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Jue Wang <[email protected]>
Cc: Yang Yao <[email protected]>
Cc: Joanna Li <[email protected]>
Cc: Cannon Matthews <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In kdump kernel of x86_64, page allocation failure is observed:
kworker/u2:2: page allocation failure: order:0, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 0 PID: 55 Comm: kworker/u2:2 Not tainted 5.16.0-rc4+ #5
Hardware name: AMD Dinar/Dinar, BIOS RDN1505B 06/05/2013
Workqueue: events_unbound async_run_entry_fn
Call Trace:
<TASK>
dump_stack_lvl+0x48/0x5e
warn_alloc.cold+0x72/0xd6
__alloc_pages_slowpath.constprop.0+0xc69/0xcd0
__alloc_pages+0x1df/0x210
new_slab+0x389/0x4d0
___slab_alloc+0x58f/0x770
__slab_alloc.constprop.0+0x4a/0x80
kmem_cache_alloc_trace+0x24b/0x2c0
sr_probe+0x1db/0x620
......
device_add+0x405/0x920
......
__scsi_add_device+0xe5/0x100
ata_scsi_scan_host+0x97/0x1d0
async_run_entry_fn+0x30/0x130
process_one_work+0x1e8/0x3c0
worker_thread+0x50/0x3b0
? rescuer_thread+0x350/0x350
kthread+0x16b/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>
Mem-Info:
......
The above failure happened when calling kmalloc() to allocate buffer with
GFP_DMA. It requests to allocate slab page from DMA zone while no managed
pages at all in there.
sr_probe()
--> get_capabilities()
--> buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
Because in the current kernel, dma-kmalloc will be created as long as
CONFIG_ZONE_DMA is enabled. However, kdump kernel of x86_64 doesn't have
managed pages on DMA zone since commit 6f599d84231f ("x86/kdump: Always
reserve the low 1M when the crashkernel option is specified"). The
failure can be always reproduced.
For now, let's mute the warning of allocation failure if requesting pages
from DMA zone while no managed pages.
[[email protected]: fix warning]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
Signed-off-by: Baoquan He <[email protected]>
Acked-by: John Donnelly <[email protected]>
Reviewed-by: Hyeonggon Yoo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Laight <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently three dma atomic pools are initialized as long as the relevant
kernel codes are built in. While in kdump kernel of x86_64, this is not
right when trying to create atomic_pool_dma, because there's no managed
pages in DMA zone. In the case, DMA zone only has low 1M memory
presented and locked down by memblock allocator. So no pages are added
into buddy of DMA zone. Please check commit f1d4d47c5851 ("x86/setup:
Always reserve the first 1M of RAM").
Then in kdump kernel of x86_64, it always prints below failure message:
DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.13.0-0.rc5.20210611git929d931f2b40.42.fc35.x86_64 #1
Hardware name: Dell Inc. PowerEdge R910/0P658H, BIOS 2.12.0 06/04/2018
Call Trace:
dump_stack+0x7f/0xa1
warn_alloc.cold+0x72/0xd6
__alloc_pages_slowpath.constprop.0+0xf29/0xf50
__alloc_pages+0x24d/0x2c0
alloc_page_interleave+0x13/0xb0
atomic_pool_expand+0x118/0x210
__dma_atomic_pool_init+0x45/0x93
dma_atomic_pool_init+0xdb/0x176
do_one_initcall+0x67/0x320
kernel_init_freeable+0x290/0x2dc
kernel_init+0xa/0x111
ret_from_fork+0x22/0x30
Mem-Info:
......
DMA: failed to allocate 128 KiB GFP_KERNEL|GFP_DMA pool for atomic allocation
DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations
Here, let's check if DMA zone has managed pages, then create
atomic_pool_dma if yes. Otherwise just skip it.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
Signed-off-by: Baoquan He <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: John Donnelly <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Laight <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Handle warning of allocation failure on DMA zone w/o
managed pages", v4.
**Problem observed:
On x86_64, when crash is triggered and entering into kdump kernel, page
allocation failure can always be seen.
---------------------------------
DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 0 PID: 1 Comm: swapper/0
Call Trace:
dump_stack+0x7f/0xa1
warn_alloc.cold+0x72/0xd6
......
__alloc_pages+0x24d/0x2c0
......
dma_atomic_pool_init+0xdb/0x176
do_one_initcall+0x67/0x320
? rcu_read_lock_sched_held+0x3f/0x80
kernel_init_freeable+0x290/0x2dc
? rest_init+0x24f/0x24f
kernel_init+0xa/0x111
ret_from_fork+0x22/0x30
Mem-Info:
------------------------------------
***Root cause:
In the current kernel, it assumes that DMA zone must have managed pages
and try to request pages if CONFIG_ZONE_DMA is enabled. While this is not
always true. E.g in kdump kernel of x86_64, only low 1M is presented and
locked down at very early stage of boot, so that this low 1M won't be
added into buddy allocator to become managed pages of DMA zone. This
exception will always cause page allocation failure if page is requested
from DMA zone.
***Investigation:
This failure happens since below commit merged into linus's tree.
1a6a9044b967 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
23721c8e92f7 x86/crash: Remove crash_reserve_low_1M()
f1d4d47c5851 x86/setup: Always reserve the first 1M of RAM
7c321eb2b843 x86/kdump: Remove the backup region handling
6f599d84231f x86/kdump: Always reserve the low 1M when the crashkernel option is specified
Before them, on x86_64, the low 640K area will be reused by kdump kernel.
So in kdump kernel, the content of low 640K area is copied into a backup
region for dumping before jumping into kdump. Then except of those firmware
reserved region in [0, 640K], the left area will be added into buddy
allocator to become available managed pages of DMA zone.
However, after above commits applied, in kdump kernel of x86_64, the low
1M is reserved by memblock, but not released to buddy allocator. So any
later page allocation requested from DMA zone will fail.
At the beginning, if crashkernel is reserved, the low 1M need be locked
down because AMD SME encrypts memory making the old backup region
mechanims impossible when switching into kdump kernel.
Later, it was also observed that there are BIOSes corrupting memory
under 1M. To solve this, in commit f1d4d47c5851, the entire region of
low 1M is always reserved after the real mode trampoline is allocated.
Besides, recently, Intel engineer mentioned their TDX (Trusted domain
extensions) which is under development in kernel also needs to lock down
the low 1M. So we can't simply revert above commits to fix the page allocation
failure from DMA zone as someone suggested.
***Solution:
Currently, only DMA atomic pool and dma-kmalloc will initialize and
request page allocation with GFP_DMA during bootup.
So only initializ DMA atomic pool when DMA zone has available managed
pages, otherwise just skip the initialization.
For dma-kmalloc(), for the time being, let's mute the warning of
allocation failure if requesting pages from DMA zone while no manged
pages. Meanwhile, change code to use dma_alloc_xx/dma_map_xx API to
replace kmalloc(GFP_DMA), or do not use GFP_DMA when calling kmalloc() if
not necessary. Christoph is posting patches to fix those under
drivers/scsi/. Finally, we can remove the need of dma-kmalloc() as people
suggested.
This patch (of 3):
In some places of the current kernel, it assumes that dma zone must have
managed pages if CONFIG_ZONE_DMA is enabled. While this is not always
true. E.g in kdump kernel of x86_64, only low 1M is presented and locked
down at very early stage of boot, so that there's no managed pages at all
in DMA zone. This exception will always cause page allocation failure if
page is requested from DMA zone.
Here add function has_managed_dma() and the relevant helper functions to
check if there's DMA zone with managed pages. It will be used in later
patches.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
Signed-off-by: Baoquan He <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: John Donnelly <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Laight <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Clarify that the alloc_contig_pages() allocated range will always be
aligned to the requested nr_pages.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
kmalloc(..., GFP_DMA32) does not return DMA32 memory because the DMA32
kmalloc cache array is not implemented. (Reason: there is no such user
in kernel).
Put a short comment about this so people can understand this by reading
the comment.
[1] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miles Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
alloc_pages_vma is meant to allocate a page with a vma specific memory
policy. The initial node parameter is always a local node so it is
pointless to waste a function argument for this. Drop the parameter.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Ben Widawsky <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Arthur Marsh reported we would hit the error below when building kernel
with gcc-12:
CC mm/page_alloc.o
mm/page_alloc.c: In function `mem_init_print_info':
mm/page_alloc.c:8173:27: error: comparison between two arrays [-Werror=array-compare]
8173 | if (start <= pos && pos < end && size > adj) \
|
In C++20, the comparision between arrays should be warned.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Xiongwei Song <[email protected]>
Reported-by: Arthur Marsh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Return statements in functions returning bool should use true/false
instead of 1/0.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Changcheng Deng <[email protected]>
Reported-by: Zeal Robot <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
For embedded systems with low total memory, having to run applications
with relatively large memory requirements, 10% max limitation for
watermark_scale_factor poses an issue of triggering direct reclaim every
time such application is started. This results in slow application
startup times and bad end-user experience.
By increasing watermark_scale_factor max limit we allow vendors more
flexibility to choose the right level of kswapd aggressiveness for their
device and workload requirements.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Lukas Middendorf <[email protected]>
Cc: Antti Palosaari <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Zhang Yi <[email protected]>
Cc: Fengfei Xi <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Various places in the kernel - largely in filesystems - respond to a
memory allocation failure by looping around and re-trying. Some of
these cannot conveniently use __GFP_NOFAIL, for reasons such as:
- a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
- a need to check for the process being signalled between failures
- the possibility that other recovery actions could be performed
- the allocation is quite deep in support code, and passing down an
extra flag to say if __GFP_NOFAIL is wanted would be clumsy.
Many of these currently use congestion_wait() which (in almost all
cases) simply waits the given timeout - congestion isn't tracked for
most devices.
It isn't clear what the best delay is for loops, but it is clear that
the various filesystems shouldn't be responsible for choosing a timeout.
This patch introduces memalloc_retry_wait() with takes on that
responsibility. Code that wants to retry a memory allocation can call
this function passing the GFP flags that were used. It will wait
however is appropriate.
For now, it only considers __GFP_NORETRY and whatever
gfpflags_allow_blocking() tests. If blocking is allowed without
__GFP_NORETRY, then alloc_page either made some reclaim progress, or
waited for a while, before failing. So there is no need for much
further waiting. memalloc_retry_wait() will wait until the current
jiffie ends. If this condition is not met, then alloc_page() won't have
waited much if at all. In that case memalloc_retry_wait() waits about
200ms. This is the delay that most current loops uses.
linux/sched/mm.h needs to be included in some files now,
but linux/backing-dev.h does not.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: NeilBrown <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: Chuck Lever <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
sl?b and vmalloc allocators reduce the given gfp mask for their internal
needs. For that they use GFP_RECLAIM_MASK to preserve the reclaim
behavior and constrains.
__GFP_NOLOCKDEP is not a part of that mask because it doesn't really
control the reclaim behavior strictly speaking. On the other hand it
tells the underlying page allocator to disable reclaim recursion
detection so arguably it should be part of the mask.
Having __GFP_NOLOCKDEP in the mask will not alter the behavior in any
form so this change is safe pretty much by definition. It also adds a
support for this flag to SL?B and vmalloc allocators which will in turn
allow its use to kvmalloc as well. A lack of the support has been
noticed recently in
http://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Reported-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Dave Chinner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Ilya Dryomov <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented by
previous patches so we can allow the support for kvmalloc. This will
allow some external users to simplify or completely remove their
helpers.
GFP_NOWAIT semantic hasn't been supported so far but it hasn't been
explicitly documented so let's add a note about that.
ceph_kvmalloc is the first helper to be dropped and changed to kvmalloc.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Ilya Dryomov <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit b7d90e7a5ea8 ("mm/vmalloc: be more explicit about supported gfp
flags") has been merged prematurely without the rest of the series and
without addressed review feedback from Neil. Fix that up now. Only
wording is changed slightly.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Ilya Dryomov <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Dave Chinner has mentioned that some of the xfs code would benefit from
kvmalloc support for __GFP_NOFAIL because they have allocations that
cannot fail and they do not fit into a single page.
The large part of the vmalloc implementation already complies with the
given gfp flags so there is no work for those to be done. The area and
page table allocations are an exception to that. Implement a retry loop
for those.
Add a short sleep before retrying. 1 jiffy is a completely random
timeout. Ideally the retry would wait for an explicit event - e.g. a
change to the vmalloc space change if the failure was caused by the
space fragmentation or depletion. But there are multiple different
reasons to retry and this could become much more complex. Keep the
retry simple for now and just sleep to prevent from hogging CPUs.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Ilya Dryomov <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "extend vmalloc support for constrained allocations", v2.
Based on a recent discussion with Dave and Neil [1] I have tried to
implement NOFS, NOIO, NOFAIL support for the vmalloc to make life of
kvmalloc users easier.
A requirement for NOFAIL support for kvmalloc was new to me but this
seems to be really needed by the xfs code.
NOFS/NOIO was a known and a long term problem which was hoped to be
handled by the scope API. Those scope should have been used at the
reclaim recursion boundaries both to document them and also to remove
the necessity of NOFS/NOIO constrains for all allocations within that
scope. Instead workarounds were developed to wrap a single allocation
instead (like ceph_kvmalloc).
First patch implements NOFS/NOIO support for vmalloc. The second one
adds NOFAIL support and the third one bundles all together into kvmalloc
and drops ceph_kvmalloc which can use kvmalloc directly now.
[1] http://lkml.kernel.org/r/[email protected]
This patch (of 4):
vmalloc historically hasn't supported GFP_NO{FS,IO} requests because
page table allocations do not support externally provided gfp mask and
performed GFP_KERNEL like allocations.
Since few years we have scope (memalloc_no{fs,io}_{save,restore}) APIs
to enforce NOFS and NOIO constrains implicitly to all allocators within
the scope. There was a hope that those scopes would be defined on a
higher level when the reclaim recursion boundary starts/stops (e.g.
when a lock required during the memory reclaim is required etc.). It
seems that not all NOFS/NOIO users have adopted this approach and
instead they have taken a workaround approach to wrap a single
[k]vmalloc allocation by a scope API.
These workarounds do not serve the purpose of a better reclaim recursion
documentation and reduction of explicit GFP_NO{FS,IO} usege so let's
just provide them with the semantic they are asking for without a need
for workarounds.
Add support for GFP_NOFS and GFP_NOIO to vmalloc directly. All internal
allocations already comply with the given gfp_mask. The only current
exception is vmap_pages_range which maps kernel page tables. Infer the
proper scope API based on the given gfp mask.
[[email protected]: mm/vmalloc.c needs linux/sched/mm.h]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Ilya Dryomov <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This reverts commit 2618c60b8b5836 ("dma: make dma pool to use
kmalloc_node").
While working myself into the dmapool code I've found this little odd
kmalloc_node().
What basically happens here is that we allocate the housekeeping
structure on the numa node where the device is attached to. Since the
device is never doing DMA to or from that memory this doesn't seem to
make sense at all.
So while this doesn't seem to cause much harm it's probably cleaner to
revert the change for consistency.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christian König <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Greg KH <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
All callers pass NULL, so we can stop calculating the value we would
store in it.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now that we don't report it to the caller of reuse_swap_page(), we don't
need to request it from page_trans_huge_map_swapcount().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
None of the callers care about the total_map_swapcount() any more.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add page table check hooks into routines that modify user page tables.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pasha Tatashin <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Check user page table entries at the time they are added and removed.
Allows to synchronously catch memory corruption issues related to double
mapping.
When a pte for an anonymous page is added into page table, we verify
that this pte does not already point to a file backed page, and vice
versa if this is a file backed page that is being added we verify that
this page does not have an anonymous mapping
We also enforce that read-only sharing for anonymous pages is allowed
(i.e. cow after fork). All other sharing must be for file pages.
Page table check allows to protect and debug cases where "struct page"
metadata became corrupted for some reason. For example, when refcnt or
mapcount become invalid.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pasha Tatashin <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We have ptep_get_and_clear() and ptep_get_and_clear_full() helpers to
clear PTE from user page tables, but there is no variant for simple
clear of a present PTE from user page tables without using a low level
pte_clear() which can be either native or para-virtualised.
Add a new ptep_clear() that can be used in common code to clear PTEs
from page table. We will need this call later in order to add a hook
for page table check.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pasha Tatashin <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "page table check", v3.
Ensure that some memory corruptions are prevented by checking at the
time of insertion of entries into user page tables that there is no
illegal sharing.
We have recently found a problem [1] that existed in kernel since 4.14.
The problem was caused by broken page ref count and led to memory
leaking from one process into another. The problem was accidentally
detected by studying a dump of one process and noticing that one page
contains memory that should not belong to this process.
There are some other page->_refcount related problems that were recently
fixed: [2], [3] which potentially could also lead to illegal sharing.
In addition to hardening refcount [4] itself, this work is an attempt to
prevent this class of memory corruption issues.
It uses a simple state machine that is independent from regular MM logic
to check for illegal sharing at time pages are inserted and removed from
page tables.
[1] https://lore.kernel.org/all/[email protected]
[2] https://lore.kernel.org/all/[email protected]
[3] https://lore.kernel.org/all/[email protected]
[4] https://lore.kernel.org/all/[email protected]
This patch (of 4):
There are a few places where we first update the entry in the user page
table, and later change the struct page to indicate that this is
anonymous or file page.
In most places, however, we first configure the page metadata and then
insert entries into the page table. Page table check, will use the
information from struct page to verify the type of entry is inserted.
Change the order in all places to first update struct page, and later to
update page table.
This means that we first do calls that may change the type of page (anon
or file):
page_move_anon_rmap
page_add_anon_rmap
do_page_add_anon_rmap
page_add_new_anon_rmap
page_add_file_rmap
hugepage_add_anon_rmap
hugepage_add_new_anon_rmap
And after that do calls that add entries to the page table:
set_huge_pte_at
set_pte_at
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pasha Tatashin <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a new document to explain Virtually Mapped Kernel Stack Support.
This is a compilation of information from the code and original patch
series that introduced the Virtually Mapped Kernel Stacks feature.
This document summarizes the feature and provides details on allocation,
free, and stack overflow handling. Provides reference to available
tests.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shuah Khan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
With exit_mmap holding mmap_write_lock during free_pgtables call,
process_mrelease does not need to elevate mm->mm_users in order to
prevent exit_mmap from destrying pagetables while __oom_reap_task_mm is
walking the VMA tree. The change prevents process_mrelease from calling
the last mmput, which can lead to waiting for IO completion in exit_aio.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: Jan Engelhardt <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tim Murray <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add comments for vm_operations_struct::close documenting locking
requirements for this callback and its callers.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: Jan Engelhardt <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tim Murray <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
oom-reaper and process_mrelease system call should protect against races
with exit_mmap which can destroy page tables while they walk the VMA
tree. oom-reaper protects from that race by setting MMF_OOM_VICTIM and
by relying on exit_mmap to set MMF_OOM_SKIP before taking and releasing
mmap_write_lock. process_mrelease has to elevate mm->mm_users to
prevent such race.
Both oom-reaper and process_mrelease hold mmap_read_lock when walking
the VMA tree. The locking rules and mechanisms could be simpler if
exit_mmap takes mmap_write_lock while executing destructive operations
such as free_pgtables.
Change exit_mmap to hold the mmap_write_lock when calling unlock_range,
free_pgtables and remove_vma. Note also that because oom-reaper checks
VM_LOCKED flag, unlock_range() should not be allowed to race with it.
Before this patch, remove_vma used to be called with no locks held,
however with fput being executed asynchronously and vm_ops->close not
being allowed to hold mmap_lock (it is called from __split_vma with
mmap_sem held for write), changing that should be fine.
In most cases this lock should be uncontended. Previously, Kirill
reported ~4% regression caused by a similar change [1]. We reran the
same test and although the individual results are quite noisy, the
percentiles show lower regression with 1.6% being the worst case [2].
The change allows oom-reaper and process_mrelease to execute safely
under mmap_read_lock without worries that exit_mmap might destroy page
tables from under them.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: Jan Engelhardt <[email protected]>
Cc: Tim Murray <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
linux/mm_types.h should only define structure definitions, to make it
cheap to include elsewhere. The atomic_t helper function definitions
are particularly large, so it's better to move the helpers using those
into the existing linux/mm_inline.h and only include that where needed.
As a follow-up, we may want to go through all the indirect includes in
mm_types.h and reduce them as much as possible.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Eric Biederman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The patch to add anonymous vma names causes a build failure in some
configurations:
include/linux/mm_types.h: In function 'is_same_vma_anon_name':
include/linux/mm_types.h:924:37: error: implicit declaration of function 'strcmp' [-Werror=implicit-function-declaration]
924 | return name && vma_name && !strcmp(name, vma_name);
| ^~~~~~
include/linux/mm_types.h:22:1: note: 'strcmp' is defined in header '<string.h>'; did you forget to '#include <string.h>'?
This should not really be part of linux/mm_types.h in the first place,
as that header is meant to only contain structure defintions and need a
minimum set of indirect includes itself.
While the header clearly includes more than it should at this point,
let's not make it worse by including string.h as well, which would pull
in the expensive (compile-speed wise) fortify-string logic.
Move the new functions into a separate header that only needs to be
included in a couple of locations.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: "mm: add a field to store names for private anonymous memory"
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
While forking a process with high number (64K) of named anonymous vmas
the overhead caused by strdup() is noticeable. Experiments with ARM64
Android device show up to 40% performance regression when forking a
process with 64k unpopulated anonymous vmas using the max name lengths
vs the same process with the same number of anonymous vmas having no
name.
Introduce anon_vma_name refcounted structure to avoid the overhead of
copying vma names during fork() and when splitting named anonymous vmas.
When a vma is duplicated, instead of copying the name we increment the
refcount of this structure. Multiple vmas can point to the same
anon_vma_name as long as they increment the refcount. The name member
of anon_vma_name structure is assigned at structure allocation time and
is never changed. If vma name changes then the refcount of the original
structure is dropped, a new anon_vma_name structure is allocated to hold
the new name and the vma pointer is updated to point to the new
structure.
With this approach the fork() performance regressions is reduced 3-4x
times and with usecases using more reasonable number of VMAs (a few
thousand) the regressions is not measurable.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jan Glauber <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Cc: Shaohua Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In many userspace applications, and especially in VM based applications
like Android uses heavily, there are multiple different allocators in
use. At a minimum there is libc malloc and the stack, and in many cases
there are libc malloc, the stack, direct syscalls to mmap anonymous
memory, and multiple VM heaps (one for small objects, one for big
objects, etc.). Each of these layers usually has its own tools to
inspect its usage; malloc by compiling a debug version, the VM through
heap inspection tools, and for direct syscalls there is usually no way
to track them.
On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages
mapped in userspace and slice their usage by process, shared (COW) vs.
unique mappings, backing, etc. This can account for real physical
memory usage even in cases like fork without exec (which Android uses
heavily to share as many private COW pages as possible between
processes), Kernel SamePage Merging, and clean zero pages. It produces
a measurement of the pages that only exist in that process (USS, for
unique), and a measurement of the physical memory usage of that process
with the cost of shared pages being evenly split between processes that
share them (PSS).
If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap
walking tool that can understand the heap debugging of every layer, or
for every layer's heap debugging tools to implement the pagemap walking
logic, in which case it is hard to get a consistent view of memory
across the whole system.
Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that somebody
needs to clean up on crashes. It needs to be readable while the process
is still running, so it has to have some sort of synchronization with
every layer of userspace. Efficiently tracking the ranges requires
reimplementing something like the kernel vma trees, and linking to it
from every layer of userspace. It requires more memory, more syscalls,
more runtime cost, and more complexity to separately track regions that
the kernel is already tracking.
This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas. The names of named
anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
[anon:<name>].
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
Setting the name to NULL clears it. The name length limit is 80 bytes
including NUL-terminator and is checked to contain only printable ascii
characters (including space), except '[',']','\','$' and '`'.
Ascii strings are being used to have a descriptive identifiers for vmas,
which can be understood by the users reading /proc/pid/maps or
/proc/pid/smaps. Names can be standardized for a given system and they
can include some variable parts such as the name of the allocator or a
library, tid of the thread using it, etc.
The name is stored in a pointer in the shared union in vm_area_struct
that points to a null terminated string. Anonymous vmas with the same
name (equivalent strings) and are otherwise mergeable will be merged.
The name pointers are not shared between vmas even if they contain the
same name. The name pointer is stored in a union with fields that are
only used on file-backed mappings, so it does not increase memory usage.
CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
feature. It keeps the feature disabled by default to prevent any
additional memory overhead and to avoid confusing procfs parsers on
systems which are not ready to support named anonymous vmas.
The patch is based on the original patch developed by Colin Cross, more
specifically on its latest version [1] posted upstream by Sumit Semwal.
It used a userspace pointer to store vma names. In that design, name
pointers could be shared between vmas. However during the last
upstreaming attempt, Kees Cook raised concerns [2] about this approach
and suggested to copy the name into kernel memory space, perform
validity checks [3] and store as a string referenced from
vm_area_struct.
One big concern is about fork() performance which would need to strdup
anonymous vma names. Dave Hansen suggested experimenting with
worst-case scenario of forking a process with 64k vmas having longest
possible names [4]. I ran this experiment on an ARM64 Android device
and recorded a worst-case regression of almost 40% when forking such a
process.
This regression is addressed in the followup patch which replaces the
pointer to a name with a refcounted structure that allows sharing the
name pointer between vmas of the same name. Instead of duplicating the
string during fork() or when splitting a vma it increments the refcount.
[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
[3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
[4] https://lore.kernel.org/linux-mm/[email protected]/
Changes for prctl(2) manual page (in the options section):
PR_SET_VMA
Sets an attribute specified in arg2 for virtual memory areas
starting from the address specified in arg3 and spanning the
size specified in arg4. arg5 specifies the value of the attribute
to be set. Note that assigning an attribute to a virtual memory
area might prevent it from being merged with adjacent virtual
memory areas due to the difference in that attribute's value.
Currently, arg2 must be one of:
PR_SET_VMA_ANON_NAME
Set a name for anonymous virtual memory areas. arg5 should
be a pointer to a null-terminated string containing the
name. The name length including null byte cannot exceed
80 bytes. If arg5 is NULL, the name of the appropriate
anonymous virtual memory areas will be reset. The name
can contain only printable ascii characters (including
space), except '[',']','\','$' and '`'.
This feature is available only if the kernel is built with
the CONFIG_ANON_VMA_NAME option enabled.
[[email protected]: docs: proc.rst: /proc/PID/maps: fix malformed table]
Link: https://lkml.kernel.org/r/[email protected]
[surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
work here was done by Colin Cross, therefore, with his permission, keeping
him as the author]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Colin Cross <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jan Glauber <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Cc: Shaohua Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: rearrange madvise code to allow for reuse", v11.
Avoid performance regression of the new anon vma name field refcounting it.
I checked the image sizes with allnoconfig builds:
unpatched Linus' ToT
text data bss dec hex filename
1324759 32 73928 1398719 1557bf vmlinux
After the first patch is applied (madvise refactoring)
text data bss dec hex filename
1322346 32 73928 1396306 154e52 vmlinux
>>> 2413 bytes decrease vs ToT <<<
After all patches applied with CONFIG_ANON_VMA_NAME=n
text data bss dec hex filename
1322337 32 73928 1396297 154e49 vmlinux
>>> 2422 bytes decrease vs ToT <<<
After all patches applied with CONFIG_ANON_VMA_NAME=y
text data bss dec hex filename
1325228 32 73928 1399188 155994 vmlinux
>>> 469 bytes increase vs ToT <<<
This patch (of 3):
Refactor the madvise syscall to allow for parts of it to be reused by a
prctl syscall that affects vmas.
Move the code that walks vmas in a virtual address range into a function
that takes a function pointer as a parameter. The only caller for now
is sys_madvise, which uses it to call madvise_vma_behavior on each vma,
but the next patch will add an additional caller.
Move handling all vma behaviors inside madvise_behavior, and rename it
to madvise_vma_behavior.
Move the code that updates the flags on a vma, including splitting or
merging the vma as necessary, into a new function called
madvise_update_vma. The next patch will add support for updating a new
anon_name field as well.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Colin Cross <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Jan Glauber <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since commit 4064b9827063 ("mm: allow VM_FAULT_RETRY for multiple
times") allowed VM_FAULT_RETRY for multiple times, the
FAULT_FLAG_ALLOW_RETRY bit of fault_flag will not be changed in the page
fault path, so the following check is no longer needed:
flags & FAULT_FLAG_ALLOW_RETRY
So just remove it.
[[email protected]: coding style fixes]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Qi Zheng <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kirill Shutemov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Chengming Zhou <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix the following coccicheck REVIEW:
tools/testing/selftests/vm/userfaultfd.c:1531:21-22:use swap() to make code cleaner
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: chiminghao <[email protected]>
Reported-by: Zeal Robot <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The kvmalloc* allocation functions can fallback to vmalloc allocations
and more often on long running machines. In addition the kernel does
have __GFP_ACCOUNT kvmalloc* calls. So, often on long running machines,
the memory.stat does not tell the complete picture which type of memory
is charged to the memcg. So add a per-memcg vmalloc stat.
[[email protected]: page_memcg() within rcu lock, per Muchun]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: remove cast, per Muchun]
[[email protected]: remove area->page[0] checks and move to page by page accounting per Michal]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Make use of the struct_size() helper instead of an open-coded version,
in order to avoid any potential type mistakes or integer overflows that,
in the worst scenario, could lead to heap overflows.
Link: https://github.com/KSPP/linux/issues/160
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wang Weiyang <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit 11192d9c124d ("memcg: flush stats only if updated") added
tracking of memcg stats updates which is used by the readers to flush
only if the updates are over a certain threshold. However each
individual update can correspond to a large value change for a given
stat. For example adding or removing a hugepage to an LRU changes the
stat by thp_nr_pages (512 on x86_64).
Treating the update related to THP as one can keep the stat off, in
theory, by (thp_nr_pages * nr_cpus * CHARGE_BATCH) before flush.
To handle such scenarios, this patch adds consideration of the stat
update value as well instead of just the update event. In addition let
the asyn flusher unconditionally flush the stats to put time limit on
the stats skew and hopefully a lot less readers would need to flush.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Michal Koutný" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Our container agent wants to know when a container exits if it was OOM
killed or not to report to the user. We use memory.oom.group = 1 to
ensure that OOM kills within the container's cgroup kill everything.
Existing memory.events are insufficient for knowing if this triggered:
1) Our current approach reads memory.events oom_kill and reports the
container was killed if the value is non-zero. This is erroneous in
some cases where containers create their children cgroups with
memory.oom.group=1 as such OOM kills will get counted against the
container cgroup's oom_kill counter despite not actually OOM killing
the entire container.
2) Reading memory.events.local will fail to identify OOM kills in leaf
cgroups (that don't set memory.oom.group) within the container
cgroup.
This patch adds a new oom_group_kill event when memory.oom.group
triggers to allow userspace to cleanly identify when an entire cgroup is
oom killed.
[[email protected]: changes from Johannes and Chris]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dan Schatzberg <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Chris Down <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
propagate_protected_usage() is called to propagate the usage change in
the page_counter structure. But there is a call to this function from
page_counter_try_charge() when there is actually no usage change. Hence
this call should be removed.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Donghai Qiao <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit 494c1dfe855e ("mm: memcg/slab: create a new set of kmalloc-cg-<n>
caches") makes cgroup_memory_nokmem global, however, it is unnecessary
because there is already a function mem_cgroup_kmem_disabled() which
exports it.
Just make it static and replace it with mem_cgroup_kmem_disabled() in
mm/slab_common.c.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Chris Down <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The 'a' and 'b' bitmaps are local to this function, so no concurrent
access can occur. So the non-atomic '__set_bit()' can be used to save a
few cycles.
Link: https://lkml.kernel.org/r/e52476da5cee57151745c5c3c934a69798dc6fa4.1638132190.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix a data race in commit 779750d20b93 ("shmem: split huge pages beyond
i_size under memory pressure").
Here are call traces causing race:
Call Trace 1:
shmem_unused_huge_shrink+0x3ae/0x410
? __list_lru_walk_one.isra.5+0x33/0x160
super_cache_scan+0x17c/0x190
shrink_slab.part.55+0x1ef/0x3f0
shrink_node+0x10e/0x330
kswapd+0x380/0x740
kthread+0xfc/0x130
? mem_cgroup_shrink_node+0x170/0x170
? kthread_create_on_node+0x70/0x70
ret_from_fork+0x1f/0x30
Call Trace 2:
shmem_evict_inode+0xd8/0x190
evict+0xbe/0x1c0
do_unlinkat+0x137/0x330
do_syscall_64+0x76/0x120
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
A simple explanation:
Image there are 3 items in the local list (@list). In the first
traversal, A is not deleted from @list.
1) A->B->C
^
|
pos (leave)
In the second traversal, B is deleted from @list. Concurrently, A is
deleted from @list through shmem_evict_inode() since last reference
counter of inode is dropped by other thread. Then the @list is corrupted.
2) A->B->C
^ ^
| |
evict pos (drop)
We should make sure the inode is either on the global list or deleted from
any local list before iput().
Fixed by moving inodes back to global list before we put them.
[[email protected]: coding style fixes]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
Signed-off-by: Gang Li <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The current behavior of memory failure is to truncate the page cache
regardless of dirty or clean. If the page is dirty the later access
will get the obsolete data from disk without any notification to the
users. This may cause silent data loss. It is even worse for shmem
since shmem is in-memory filesystem, truncating page cache means
discarding data blocks. The later read would return all zero.
The right approach is to keep the corrupted page in page cache, any
later access would return error for syscalls or SIGBUS for page fault,
until the file is truncated, hole punched or removed. The regular
storage backed filesystems would be more complicated so this patch is
focused on shmem. This also unblock the support for soft offlining
shmem THP.
[[email protected]: coding style fixes]
[[email protected]: fix uninitialized variable use in me_pagecache_clean()]
Link: https://lkml.kernel.org/r/[email protected]
[Fix invalid pointer dereference in shmem_read_mapping_page_gfp() with a
slight different implementation from what Ajay Garg <[email protected]>
and Muchun Song <[email protected]> proposed and reworked the
error handling of shmem_write_begin() suggested by Linus]
Link: https://lore.kernel.org/linux-mm/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ajay Garg <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Andy Lavr <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When BUG_ON check for THP migration entry, the existing code only check
thp_migration_supported case, but not for !thp_migration_supported case.
If !thp_migration_supported() and !pmd_present(), the original code may
dead loop in theory. To make the BUG_ON check consistent, we need catch
both cases.
Move the BUG_ON check one step earlier, because if the bug happen we
should know it instead of depend on FOLL_MIGRATION been used by caller.
Because pmdval instead of *pmd is read by the is_pmd_migration_entry()
check, the existing code don't help to avoid useless locking within
pmd_migration_entry_wait(), so remove that check.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Li Xinhai <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
fault_in_readable() and fault_in_writeable() perform __get_user() and
__put_user() in a loop, implying multiple user access locking/unlocking.
To avoid that, use user access blocks.
Link: https://lkml.kernel.org/r/720dcf79314acca1a78fae56d478cc851952149d.1637084492.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <[email protected]>
Reviewed-by: Andreas Gruenbacher <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Return value directly instead of taking this in another redundant
variable.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: chiminghao <[email protected]>
Reported-by: Zeal Robot <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Pankaj Gupta <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|