Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull nfsd updates from Chuck Lever:
"Bruce has announced he is leaving Red Hat at the end of the month and
is stepping back from his role as NFSD co-maintainer. As a result,
this includes a patch removing him from the MAINTAINERS file.
There is one patch in here that Jeff Layton was carrying in the locks
tree. Since he had only one for this cycle, he asked us to send it to
you via the nfsd tree.
There continues to be 0-day reports from Robert Morris @MIT. This time
we include a fix for a crash in the COPY_NOTIFY operation.
Highlights:
- Bruce steps down as NFSD maintainer
- Prepare for dynamic nfsd thread management
- More work on supporting re-exporting NFS mounts
- One fs/locks patch on behalf of Jeff Layton
Notable bug fixes:
- Fix zero-length NFSv3 WRITEs
- Fix directory cinfo on FS's that do not support iversion
- Fix WRITE verifiers for stable writes
- Fix crash on COPY_NOTIFY with a special state ID"
* tag 'nfsd-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (51 commits)
SUNRPC: Fix sockaddr handling in svcsock_accept_class trace points
SUNRPC: Fix sockaddr handling in the svc_xprt_create_error trace point
fs/locks: fix fcntl_getlk64/fcntl_setlk64 stub prototypes
nfsd: fix crash on COPY_NOTIFY with special stateid
MAINTAINERS: remove bfields
NFSD: Move fill_pre_wcc() and fill_post_wcc()
Revert "nfsd: skip some unnecessary stats in the v4 case"
NFSD: Trace boot verifier resets
NFSD: Rename boot verifier functions
NFSD: Clean up the nfsd_net::nfssvc_boot field
NFSD: Write verifier might go backwards
nfsd: Add a tracepoint for errors in nfsd4_clone_file_range()
NFSD: De-duplicate net_generic(nf->nf_net, nfsd_net_id)
NFSD: De-duplicate net_generic(SVC_NET(rqstp), nfsd_net_id)
NFSD: Clean up nfsd_vfs_write()
nfsd: Replace use of rwsem with errseq_t
NFSD: Fix verifier returned in stable WRITEs
nfsd: Retry once in nfsd_open on an -EOPENSTALE return
nfsd: Add errno mapping for EREMOTEIO
nfsd: map EBADF
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull memblock cleanup from Mike Rapoport:
"Remove #ifdef __KERNEL__ from memblock.h
memblock.h is not a uAPI header, so __KERNEL__ guard can be deleted"
* tag 'memblock-v5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
memblock: Remove #ifdef __KERNEL__ from memblock.h
|
|
Merge misc updates from Andrew Morton:
"146 patches.
Subsystems affected by this patch series: kthread, ia64, scripts,
ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak,
dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap,
memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb,
userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp,
ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and
damon)"
* emailed patches from Andrew Morton <[email protected]>: (146 commits)
mm/damon: hide kernel pointer from tracepoint event
mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log
mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging
mm/damon/dbgfs: remove an unnecessary variable
mm/damon: move the implementation of damon_insert_region to damon.h
mm/damon: add access checking for hugetlb pages
Docs/admin-guide/mm/damon/usage: update for schemes statistics
mm/damon/dbgfs: support all DAMOS stats
Docs/admin-guide/mm/damon/reclaim: document statistics parameters
mm/damon/reclaim: provide reclamation statistics
mm/damon/schemes: account how many times quota limit has exceeded
mm/damon/schemes: account scheme actions that successfully applied
mm/damon: remove a mistakenly added comment for a future feature
Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts
Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning
Docs/admin-guide/mm/damon/usage: remove redundant information
Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks
mm/damon: convert macro functions to static inline functions
mm/damon: modify damon_rand() macro to static inline function
mm/damon: move damon_rand() definition into damon.h
...
|
|
bitmap_for_each_{set,clear}_region() are similar to for_each_bit()
macros in include/linux/find.h, but interface and implementation
of them are different.
This patch adds for_each_bitrange() macros and drops unused
bitmap_*_region() API in sake of unification.
Signed-off-by: Yury Norov <[email protected]>
Tested-by: Wolfram Sang <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Acked-by: Ulf Hansson <[email protected]> # For MMC
|
|
The macros iterate thru all set/clear bits in a bitmap. They search a
first bit using find_first_bit(), and the rest bits using find_next_bit().
Since find_next_bit() is called shortly after find_first_bit(), we can
save few lines of I-cache by not using find_first_bit().
Signed-off-by: Yury Norov <[email protected]>
Tested-by: Wolfram Sang <[email protected]>
|
|
for_each_bit() macros depend on find_bit() machinery, and so the
proper place for them is the find.h header.
Signed-off-by: Yury Norov <[email protected]>
Tested-by: Wolfram Sang <[email protected]>
|
|
cpumask_first() is a more effective analogue of 'next' version if n == -1
(which means start == 0). This patch replaces 'next' with 'first' where
things look trivial.
There's no cpumask_first_zero() function, so create it.
Signed-off-by: Yury Norov <[email protected]>
Tested-by: Wolfram Sang <[email protected]>
|
|
Now we have an efficient implementation for find_first_and_bit(),
so switch cpumask to use it where appropriate.
Signed-off-by: Yury Norov <[email protected]>
Tested-by: Wolfram Sang <[email protected]>
|
|
Currently find_first_and_bit() is an alias to find_next_and_bit(). However,
it is widely used in cpumask, so it worth to optimize it. This patch adds
its own implementation for find_first_and_bit().
On x86_64 find_bit_benchmark says:
Before (#define find_first_and_bit(...) find_next_and_bit(..., 0):
Start testing find_bit() with random-filled bitmap
[ 140.291468] find_first_and_bit: 46890919 ns, 32671 iterations
Start testing find_bit() with sparse bitmap
[ 140.295028] find_first_and_bit: 7103 ns, 1 iterations
After:
Start testing find_bit() with random-filled bitmap
[ 162.574907] find_first_and_bit: 25045813 ns, 32846 iterations
Start testing find_bit() with sparse bitmap
[ 162.578458] find_first_and_bit: 4900 ns, 1 iterations
(Thanks to Alexey Klimov for thorough testing.)
Signed-off-by: Yury Norov <[email protected]>
Tested-by: Wolfram Sang <[email protected]>
Tested-by: Alexey Klimov <[email protected]>
|
|
In 5.12 cycle we enabled GENERIC_FIND_FIRST_BIT config option for ARM64
and MIPS. It increased performance and shrunk .text size; and so far
I didn't receive any negative feedback on the change.
https://lore.kernel.org/linux-arch/[email protected]/
Now I think it's a good time to switch all architectures to use
find_{first,last}_bit() unconditionally, and so remove corresponding
config option.
The patch does't introduce functioal changes for arc, arm, arm64, mips,
m68k, s390 and x86, for other architectures I expect improvement both in
performance and .text size.
Signed-off-by: Yury Norov <[email protected]>
Tested-by: Alexander Lobakin <[email protected]> (mips)
Reviewed-by: Alexander Lobakin <[email protected]> (mips)
Reviewed-by: Andy Shevchenko <[email protected]>
Acked-by: Will Deacon <[email protected]>
Tested-by: Wolfram Sang <[email protected]>
|
|
find_bit API and bitmap API are closely related, but inclusion paths
are different - include/asm-generic and include/linux, correspondingly.
In the past it made a lot of troubles due to circular dependencies
and/or undefined symbols. Fix this by moving find.h under include/linux.
Signed-off-by: Yury Norov <[email protected]>
Tested-by: Wolfram Sang <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]>
|
|
Usually, inline function is declared static since it should sit between
storage and type. And implement it in a header file if used by multiple
files.
And this change also fixes compile issue when backport damon to 5.10.
mm/damon/vaddr.c: In function `damon_va_evenly_split_region':
./include/linux/damon.h:425:13: error: inlining failed in call to `always_inline' `damon_insert_region': function body not available
425 | inline void damon_insert_region(struct damon_region *r,
| ^~~~~~~~~~~~~~~~~~~
mm/damon/vaddr.c:86:3: note: called from here
86 | damon_insert_region(n, r, next, t);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Guoqing Jiang <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If the time/space quotas of a given DAMON-based operation scheme is too
small, the scheme could show unexpectedly slow progress. However, there
is no good way to notice the case in runtime. This commit extends the
DAMOS stat to provide how many times the quota limits exceeded so that
the users can easily notice the case and tune the scheme.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm/damon/schemes: Extend stats for better online analysis and tuning".
To help online access pattern analysis and tuning of DAMON-based
Operation Schemes (DAMOS), DAMOS provides simple statistics for each
scheme. Introduction of DAMOS time/space quota further made the tuning
easier by making the risk management easier. However, that also made
understanding of the working schemes a little bit more difficult.
For an example, progress of a given scheme can now be throttled by not
only the aggressiveness of the target access pattern, but also the
time/space quotas. So, when a scheme is showing unexpectedly slow
progress, it's difficult to know by what the progress of the scheme is
throttled, with currently provided statistics.
This patchset extends the statistics to contain some metrics that can be
helpful for such online schemes analysis and tuning (patches 1-2),
exports those to users (patches 3 and 5), and add documents (patches 4
and 6).
This patch (of 6):
DAMON-based operation schemes (DAMOS) stats provide only the number and
the amount of regions that the action of the scheme has tried to be
applied. Because the action could be failed for some reasons, the
currently provided information is sometimes not useful or convenient
enough for schemes profiling and tuning. To improve this situation,
this commit extends the DAMOS stats to provide the number and the amount
of regions that the action has successfully applied.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Due to a mistake in patches reordering, a comment for a future feature
called 'arbitrary monitoring target support'[1], which is still under
development, has added. Because it only introduces confusion and we
don't have a plan to post the patches soon, this commit removes the
mistakenly added part.
[1] https://lore.kernel.org/linux-mm/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 1f366e421c8f ("mm/damon/core: implement DAMON-based Operation Schemes (DAMOS)")
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm/damon: Misc cleanups".
This patchset contains miscellaneous cleanups for DAMON's macro
functions and documentation.
This patch (of 6):
This commit converts macro functions in DAMON to static inline functions,
for better type checking, code documentation, etc[1].
[1] https://lore.kernel.org/linux-mm/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
damon_rand() cannot be implemented as a macro.
Example:
damon_rand(a++, b);
The value of 'a' will be incremented twice, This is obviously
unreasonable, So there fix it.
Link: https://lkml.kernel.org/r/110ffcd4e420c86c42b41ce2bc9f0fe6a4f32cd3.1638795127.git.xhao@linux.alibaba.com
Fixes: b9a6ac4e4ede ("mm/damon: adaptively adjust regions")
Signed-off-by: Xin Hao <[email protected]>
Reported-by: Andrew Morton <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
damon_rand() is called in three files:damon/core.c, damon/ paddr.c,
damon/vaddr.c, i think there is no need to redefine this twice, So move
it to damon.h will be a good choice.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Xin Hao <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In damon.h some func definitions about VA & PA can only be used in its
own file, so there no need to define in the header file, and the header
file will look cleaner.
If other files later need these functions, the prototypes can be added
to damon.h at that time.
[[email protected]: remove unnecessary function prototype position changes]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/45fd5b3ef6cce8e28dbc1c92f9dc845ccfc949d7.1636989871.git.xhao@linux.alibaba.com
Signed-off-by: Xin Hao <[email protected]>
Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
"page_idle_ops" as a global var, but its scope of use within this
document. So it should be static.
"page_ext_ops" is a var used in the kernel initial phase. And other
functions are aslo used in the kernel initial phase. So they should be
__init or __initdata to reclaim memory.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ting Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In theory, the following race is possible for batched TLB flushing.
CPU0 CPU1
---- ----
shrink_page_list()
unmap
zap_pte_range()
flush_tlb_batched_pending()
flush_tlb_mm()
try_to_unmap()
set_tlb_ubc_flush_pending()
mm->tlb_flush_batched = true
mm->tlb_flush_batched = false
After the TLB is flushed on CPU1 via flush_tlb_mm() and before
mm->tlb_flush_batched is set to false, some PTE is unmapped on CPU0 and
the TLB flushing is pended. Then the pended TLB flushing will be lost.
Although both set_tlb_ubc_flush_pending() and
flush_tlb_batched_pending() are called with PTL locked, different PTL
instances may be used.
Because the race window is really small, and the lost TLB flushing will
cause problem only if a TLB entry is inserted before the unmapping in
the race window, the race is only theoretical. But the fix is simple
and cheap too.
Syzbot has reported this too as follows:
==================================================================
BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one
write to 0xffff8881072cfbbc of 1 bytes by task 17406 on cpu 1:
flush_tlb_batched_pending+0x5f/0x80 mm/rmap.c:691
madvise_free_pte_range+0xee/0x7d0 mm/madvise.c:594
walk_pmd_range mm/pagewalk.c:128 [inline]
walk_pud_range mm/pagewalk.c:205 [inline]
walk_p4d_range mm/pagewalk.c:240 [inline]
walk_pgd_range mm/pagewalk.c:277 [inline]
__walk_page_range+0x981/0x1160 mm/pagewalk.c:379
walk_page_range+0x131/0x300 mm/pagewalk.c:475
madvise_free_single_vma mm/madvise.c:734 [inline]
madvise_dontneed_free mm/madvise.c:822 [inline]
madvise_vma mm/madvise.c:996 [inline]
do_madvise+0xe4a/0x1140 mm/madvise.c:1202
__do_sys_madvise mm/madvise.c:1228 [inline]
__se_sys_madvise mm/madvise.c:1226 [inline]
__x64_sys_madvise+0x5d/0x70 mm/madvise.c:1226
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
write to 0xffff8881072cfbbc of 1 bytes by task 71 on cpu 0:
set_tlb_ubc_flush_pending mm/rmap.c:636 [inline]
try_to_unmap_one+0x60e/0x1220 mm/rmap.c:1515
rmap_walk_anon+0x2fb/0x470 mm/rmap.c:2301
try_to_unmap+0xec/0x110
shrink_page_list+0xe91/0x2620 mm/vmscan.c:1719
shrink_inactive_list+0x3fb/0x730 mm/vmscan.c:2394
shrink_list mm/vmscan.c:2621 [inline]
shrink_lruvec+0x3c9/0x710 mm/vmscan.c:2940
shrink_node_memcgs+0x23e/0x410 mm/vmscan.c:3129
shrink_node+0x8f6/0x1190 mm/vmscan.c:3252
kswapd_shrink_node mm/vmscan.c:4022 [inline]
balance_pgdat+0x702/0xd30 mm/vmscan.c:4213
kswapd+0x200/0x340 mm/vmscan.c:4473
kthread+0x2c7/0x2e0 kernel/kthread.c:327
ret_from_fork+0x1f/0x30
value changed: 0x01 -> 0x00
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 71 Comm: kswapd0 Not tainted 5.16.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
==================================================================
[[email protected]: tweak comments]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Reported-by: [email protected]
Cc: Nadav Amit <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Marco Elver <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
After recent soft-offline rework, error pages can be taken off from
buddy allocator, but the existing unpoison_memory() does not properly
undo the operation. Moreover, due to the recent change on
__get_hwpoison_page(), get_page_unless_zero() is hardly called for
hwpoisoned pages. So __get_hwpoison_page() highly likely returns -EBUSY
(meaning to fail to grab page refcount) and unpoison just clears
PG_hwpoison without releasing a refcount. That does not lead to a
critical issue like kernel panic, but unpoisoned pages never get back to
buddy (leaked permanently), which is not good.
To (partially) fix this, we need to identify "taken off" pages from
other types of hwpoisoned pages. We can't use refcount or page flags
for this purpose, so a pseudo flag is defined by hacking ->private
field. Someone might think that put_page() is enough to cancel
taken-off pages, but the normal free path contains some operations not
suitable for the current purpose, and can fire VM_BUG_ON().
Note that unpoison_memory() is now supposed to be cancel hwpoison events
injected only by madvise() or
/sys/devices/system/memory/{hard,soft}_offline_page, not by MCE
injection, so please don't try to use unpoison when testing with MCE
injection.
[[email protected]: report build failure for ARCH=i386]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Ding Hui <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
These action_page_types are no longer used, so remove them.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Naoya Horiguchi <[email protected]>
Acked-by: Yang Shi <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Ding Hui <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Tony Luck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Cc: Ben Widawsky <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This syscall can be used to set a home node for the MPOL_BIND and
MPOL_PREFERRED_MANY memory policy. Users should use this syscall after
setting up a memory policy for the specified range as shown below.
mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
new_nodes->size + 1, 0);
sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
home_node, 0);
The syscall allows specifying a home node/preferred node from which
kernel will fulfill memory allocation requests first.
For address range with MPOL_BIND memory policy, if nodemask specifies
more than one node, page allocations will come from the node in the
nodemask with sufficient free memory that is closest to the home
node/preferred node.
For MPOL_PREFERRED_MANY if the nodemask specifies more than one node,
page allocation will come from the node in the nodemask with sufficient
free memory that is closest to the home node/preferred node. If there
is not enough memory in all the nodes specified in the nodemask, the
allocation will be attempted from the closest numa node to the home node
in the system.
This helps applications to hint at a memory allocation preference node
and fallback to _only_ a set of nodes if the memory is not available on
the preferred node. Fallback allocation is attempted from the node
which is nearest to the preferred node.
This helps applications to have control on memory allocation numa nodes
and avoids default fallback to slow memory NUMA nodes. For example a
system with NUMA nodes 1,2 and 3 with DRAM memory and 10, 11 and 12 of
slow memory
new_nodes = numa_bitmask_alloc(nr_nodes);
numa_bitmask_setbit(new_nodes, 1);
numa_bitmask_setbit(new_nodes, 2);
numa_bitmask_setbit(new_nodes, 3);
p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp, new_nodes->size + 1, 0);
sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);
This will allocate from nodes closer to node 2 and will make sure the
kernel will only allocate from nodes 1, 2, and 3. Memory will not be
allocated from slow memory nodes 10, 11, and 12. This differs from
default MPOL_BIND behavior in that with default MPOL_BIND the allocation
will be attempted from node closer to the local node. One of the
reasons to specify a home node is to allow allocations from cpu less
NUMA node and its nearby NUMA nodes.
With MPOL_PREFERRED_MANY on the other hand will first try to allocate
from the closest node to node 2 from the node list 1, 2 and 3. If those
nodes don't have enough memory, kernel will allocate from slow memory
node 10, 11 and 12 which ever is closer to node 2.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Cc: Ben Widawsky <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
drop_slab_node is only used in drop_slab. So remove it's declaration
from header file and add keyword static for it's definition.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gang Li <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There are interfaces to adjust max_ptes_none, max_ptes_swap,
max_ptes_shared values, see
/sys/kernel/mm/transparent_hugepage/khugepaged/.
But system administrator may not know which value is the best. So Add
those events to support adjusting max_ptes_* to suitable values.
For example, if default max_ptes_swap value causes too much failures,
and system uses zram whose IO is fast, administrator could increase
max_ptes_swap until THP_SCAN_EXCEED_SWAP_PTE not increase anymore.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Yang <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Saravanan D <[email protected]>
Cc: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
For hugetlb backed jobs/VMs it's critical to understand the numa
information for the memory backing these jobs to deliver optimal
performance.
Currently this technically can be queried from /proc/self/numa_maps, but
there are significant issues with that. Namely:
1. Memory can be mapped or unmapped.
2. numa_maps are per process and need to be aggregated across all
processes in the cgroup. For shared memory this is more involved as
the userspace needs to make sure it doesn't double count shared
mappings.
3. I believe querying numa_maps needs to hold the mmap_lock which adds
to the contention on this lock.
For these reasons I propose simply adding hugetlb.*.numa_stat file,
which shows the numa information of the cgroup similarly to
memory.numa_stat.
On cgroup-v2:
cat /sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat
total=2097152 N0=2097152 N1=0
On cgroup-v1:
cat /sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat
total=2097152 N0=2097152 N1=0
hierarichal_total=2097152 N0=2097152 N1=0
This patch was tested manually by allocating hugetlb memory and querying
the hugetlb.*.numa_stat file of the cgroup and its parents.
[[email protected]: fix spelling mistake "hierarichal" -> "hierarchical"]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix copy/paste array assignment]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mina Almasry <[email protected]>
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Jue Wang <[email protected]>
Cc: Yang Yao <[email protected]>
Cc: Joanna Li <[email protected]>
Cc: Cannon Matthews <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Handle warning of allocation failure on DMA zone w/o
managed pages", v4.
**Problem observed:
On x86_64, when crash is triggered and entering into kdump kernel, page
allocation failure can always be seen.
---------------------------------
DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 0 PID: 1 Comm: swapper/0
Call Trace:
dump_stack+0x7f/0xa1
warn_alloc.cold+0x72/0xd6
......
__alloc_pages+0x24d/0x2c0
......
dma_atomic_pool_init+0xdb/0x176
do_one_initcall+0x67/0x320
? rcu_read_lock_sched_held+0x3f/0x80
kernel_init_freeable+0x290/0x2dc
? rest_init+0x24f/0x24f
kernel_init+0xa/0x111
ret_from_fork+0x22/0x30
Mem-Info:
------------------------------------
***Root cause:
In the current kernel, it assumes that DMA zone must have managed pages
and try to request pages if CONFIG_ZONE_DMA is enabled. While this is not
always true. E.g in kdump kernel of x86_64, only low 1M is presented and
locked down at very early stage of boot, so that this low 1M won't be
added into buddy allocator to become managed pages of DMA zone. This
exception will always cause page allocation failure if page is requested
from DMA zone.
***Investigation:
This failure happens since below commit merged into linus's tree.
1a6a9044b967 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
23721c8e92f7 x86/crash: Remove crash_reserve_low_1M()
f1d4d47c5851 x86/setup: Always reserve the first 1M of RAM
7c321eb2b843 x86/kdump: Remove the backup region handling
6f599d84231f x86/kdump: Always reserve the low 1M when the crashkernel option is specified
Before them, on x86_64, the low 640K area will be reused by kdump kernel.
So in kdump kernel, the content of low 640K area is copied into a backup
region for dumping before jumping into kdump. Then except of those firmware
reserved region in [0, 640K], the left area will be added into buddy
allocator to become available managed pages of DMA zone.
However, after above commits applied, in kdump kernel of x86_64, the low
1M is reserved by memblock, but not released to buddy allocator. So any
later page allocation requested from DMA zone will fail.
At the beginning, if crashkernel is reserved, the low 1M need be locked
down because AMD SME encrypts memory making the old backup region
mechanims impossible when switching into kdump kernel.
Later, it was also observed that there are BIOSes corrupting memory
under 1M. To solve this, in commit f1d4d47c5851, the entire region of
low 1M is always reserved after the real mode trampoline is allocated.
Besides, recently, Intel engineer mentioned their TDX (Trusted domain
extensions) which is under development in kernel also needs to lock down
the low 1M. So we can't simply revert above commits to fix the page allocation
failure from DMA zone as someone suggested.
***Solution:
Currently, only DMA atomic pool and dma-kmalloc will initialize and
request page allocation with GFP_DMA during bootup.
So only initializ DMA atomic pool when DMA zone has available managed
pages, otherwise just skip the initialization.
For dma-kmalloc(), for the time being, let's mute the warning of
allocation failure if requesting pages from DMA zone while no manged
pages. Meanwhile, change code to use dma_alloc_xx/dma_map_xx API to
replace kmalloc(GFP_DMA), or do not use GFP_DMA when calling kmalloc() if
not necessary. Christoph is posting patches to fix those under
drivers/scsi/. Finally, we can remove the need of dma-kmalloc() as people
suggested.
This patch (of 3):
In some places of the current kernel, it assumes that dma zone must have
managed pages if CONFIG_ZONE_DMA is enabled. While this is not always
true. E.g in kdump kernel of x86_64, only low 1M is presented and locked
down at very early stage of boot, so that there's no managed pages at all
in DMA zone. This exception will always cause page allocation failure if
page is requested from DMA zone.
Here add function has_managed_dma() and the relevant helper functions to
check if there's DMA zone with managed pages. It will be used in later
patches.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
Signed-off-by: Baoquan He <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: John Donnelly <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Laight <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
kmalloc(..., GFP_DMA32) does not return DMA32 memory because the DMA32
kmalloc cache array is not implemented. (Reason: there is no such user
in kernel).
Put a short comment about this so people can understand this by reading
the comment.
[1] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miles Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
alloc_pages_vma is meant to allocate a page with a vma specific memory
policy. The initial node parameter is always a local node so it is
pointless to waste a function argument for this. Drop the parameter.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Ben Widawsky <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Return statements in functions returning bool should use true/false
instead of 1/0.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Changcheng Deng <[email protected]>
Reported-by: Zeal Robot <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Various places in the kernel - largely in filesystems - respond to a
memory allocation failure by looping around and re-trying. Some of
these cannot conveniently use __GFP_NOFAIL, for reasons such as:
- a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
- a need to check for the process being signalled between failures
- the possibility that other recovery actions could be performed
- the allocation is quite deep in support code, and passing down an
extra flag to say if __GFP_NOFAIL is wanted would be clumsy.
Many of these currently use congestion_wait() which (in almost all
cases) simply waits the given timeout - congestion isn't tracked for
most devices.
It isn't clear what the best delay is for loops, but it is clear that
the various filesystems shouldn't be responsible for choosing a timeout.
This patch introduces memalloc_retry_wait() with takes on that
responsibility. Code that wants to retry a memory allocation can call
this function passing the GFP flags that were used. It will wait
however is appropriate.
For now, it only considers __GFP_NORETRY and whatever
gfpflags_allow_blocking() tests. If blocking is allowed without
__GFP_NORETRY, then alloc_page either made some reclaim progress, or
waited for a while, before failing. So there is no need for much
further waiting. memalloc_retry_wait() will wait until the current
jiffie ends. If this condition is not met, then alloc_page() won't have
waited much if at all. In that case memalloc_retry_wait() waits about
200ms. This is the delay that most current loops uses.
linux/sched/mm.h needs to be included in some files now,
but linux/backing-dev.h does not.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: NeilBrown <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: Chuck Lever <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented by
previous patches so we can allow the support for kvmalloc. This will
allow some external users to simplify or completely remove their
helpers.
GFP_NOWAIT semantic hasn't been supported so far but it hasn't been
explicitly documented so let's add a note about that.
ceph_kvmalloc is the first helper to be dropped and changed to kvmalloc.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Ilya Dryomov <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
All callers pass NULL, so we can stop calculating the value we would
store in it.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
None of the callers care about the total_map_swapcount() any more.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Check user page table entries at the time they are added and removed.
Allows to synchronously catch memory corruption issues related to double
mapping.
When a pte for an anonymous page is added into page table, we verify
that this pte does not already point to a file backed page, and vice
versa if this is a file backed page that is being added we verify that
this page does not have an anonymous mapping
We also enforce that read-only sharing for anonymous pages is allowed
(i.e. cow after fork). All other sharing must be for file pages.
Page table check allows to protect and debug cases where "struct page"
metadata became corrupted for some reason. For example, when refcnt or
mapcount become invalid.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pasha Tatashin <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We have ptep_get_and_clear() and ptep_get_and_clear_full() helpers to
clear PTE from user page tables, but there is no variant for simple
clear of a present PTE from user page tables without using a low level
pte_clear() which can be either native or para-virtualised.
Add a new ptep_clear() that can be used in common code to clear PTEs
from page table. We will need this call later in order to add a hook
for page table check.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pasha Tatashin <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add comments for vm_operations_struct::close documenting locking
requirements for this callback and its callers.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: Jan Engelhardt <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tim Murray <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
linux/mm_types.h should only define structure definitions, to make it
cheap to include elsewhere. The atomic_t helper function definitions
are particularly large, so it's better to move the helpers using those
into the existing linux/mm_inline.h and only include that where needed.
As a follow-up, we may want to go through all the indirect includes in
mm_types.h and reduce them as much as possible.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Eric Biederman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The patch to add anonymous vma names causes a build failure in some
configurations:
include/linux/mm_types.h: In function 'is_same_vma_anon_name':
include/linux/mm_types.h:924:37: error: implicit declaration of function 'strcmp' [-Werror=implicit-function-declaration]
924 | return name && vma_name && !strcmp(name, vma_name);
| ^~~~~~
include/linux/mm_types.h:22:1: note: 'strcmp' is defined in header '<string.h>'; did you forget to '#include <string.h>'?
This should not really be part of linux/mm_types.h in the first place,
as that header is meant to only contain structure defintions and need a
minimum set of indirect includes itself.
While the header clearly includes more than it should at this point,
let's not make it worse by including string.h as well, which would pull
in the expensive (compile-speed wise) fortify-string logic.
Move the new functions into a separate header that only needs to be
included in a couple of locations.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: "mm: add a field to store names for private anonymous memory"
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
While forking a process with high number (64K) of named anonymous vmas
the overhead caused by strdup() is noticeable. Experiments with ARM64
Android device show up to 40% performance regression when forking a
process with 64k unpopulated anonymous vmas using the max name lengths
vs the same process with the same number of anonymous vmas having no
name.
Introduce anon_vma_name refcounted structure to avoid the overhead of
copying vma names during fork() and when splitting named anonymous vmas.
When a vma is duplicated, instead of copying the name we increment the
refcount of this structure. Multiple vmas can point to the same
anon_vma_name as long as they increment the refcount. The name member
of anon_vma_name structure is assigned at structure allocation time and
is never changed. If vma name changes then the refcount of the original
structure is dropped, a new anon_vma_name structure is allocated to hold
the new name and the vma pointer is updated to point to the new
structure.
With this approach the fork() performance regressions is reduced 3-4x
times and with usecases using more reasonable number of VMAs (a few
thousand) the regressions is not measurable.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jan Glauber <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Cc: Shaohua Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In many userspace applications, and especially in VM based applications
like Android uses heavily, there are multiple different allocators in
use. At a minimum there is libc malloc and the stack, and in many cases
there are libc malloc, the stack, direct syscalls to mmap anonymous
memory, and multiple VM heaps (one for small objects, one for big
objects, etc.). Each of these layers usually has its own tools to
inspect its usage; malloc by compiling a debug version, the VM through
heap inspection tools, and for direct syscalls there is usually no way
to track them.
On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages
mapped in userspace and slice their usage by process, shared (COW) vs.
unique mappings, backing, etc. This can account for real physical
memory usage even in cases like fork without exec (which Android uses
heavily to share as many private COW pages as possible between
processes), Kernel SamePage Merging, and clean zero pages. It produces
a measurement of the pages that only exist in that process (USS, for
unique), and a measurement of the physical memory usage of that process
with the cost of shared pages being evenly split between processes that
share them (PSS).
If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap
walking tool that can understand the heap debugging of every layer, or
for every layer's heap debugging tools to implement the pagemap walking
logic, in which case it is hard to get a consistent view of memory
across the whole system.
Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that somebody
needs to clean up on crashes. It needs to be readable while the process
is still running, so it has to have some sort of synchronization with
every layer of userspace. Efficiently tracking the ranges requires
reimplementing something like the kernel vma trees, and linking to it
from every layer of userspace. It requires more memory, more syscalls,
more runtime cost, and more complexity to separately track regions that
the kernel is already tracking.
This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas. The names of named
anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
[anon:<name>].
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
Setting the name to NULL clears it. The name length limit is 80 bytes
including NUL-terminator and is checked to contain only printable ascii
characters (including space), except '[',']','\','$' and '`'.
Ascii strings are being used to have a descriptive identifiers for vmas,
which can be understood by the users reading /proc/pid/maps or
/proc/pid/smaps. Names can be standardized for a given system and they
can include some variable parts such as the name of the allocator or a
library, tid of the thread using it, etc.
The name is stored in a pointer in the shared union in vm_area_struct
that points to a null terminated string. Anonymous vmas with the same
name (equivalent strings) and are otherwise mergeable will be merged.
The name pointers are not shared between vmas even if they contain the
same name. The name pointer is stored in a union with fields that are
only used on file-backed mappings, so it does not increase memory usage.
CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
feature. It keeps the feature disabled by default to prevent any
additional memory overhead and to avoid confusing procfs parsers on
systems which are not ready to support named anonymous vmas.
The patch is based on the original patch developed by Colin Cross, more
specifically on its latest version [1] posted upstream by Sumit Semwal.
It used a userspace pointer to store vma names. In that design, name
pointers could be shared between vmas. However during the last
upstreaming attempt, Kees Cook raised concerns [2] about this approach
and suggested to copy the name into kernel memory space, perform
validity checks [3] and store as a string referenced from
vm_area_struct.
One big concern is about fork() performance which would need to strdup
anonymous vma names. Dave Hansen suggested experimenting with
worst-case scenario of forking a process with 64k vmas having longest
possible names [4]. I ran this experiment on an ARM64 Android device
and recorded a worst-case regression of almost 40% when forking such a
process.
This regression is addressed in the followup patch which replaces the
pointer to a name with a refcounted structure that allows sharing the
name pointer between vmas of the same name. Instead of duplicating the
string during fork() or when splitting a vma it increments the refcount.
[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
[3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
[4] https://lore.kernel.org/linux-mm/[email protected]/
Changes for prctl(2) manual page (in the options section):
PR_SET_VMA
Sets an attribute specified in arg2 for virtual memory areas
starting from the address specified in arg3 and spanning the
size specified in arg4. arg5 specifies the value of the attribute
to be set. Note that assigning an attribute to a virtual memory
area might prevent it from being merged with adjacent virtual
memory areas due to the difference in that attribute's value.
Currently, arg2 must be one of:
PR_SET_VMA_ANON_NAME
Set a name for anonymous virtual memory areas. arg5 should
be a pointer to a null-terminated string containing the
name. The name length including null byte cannot exceed
80 bytes. If arg5 is NULL, the name of the appropriate
anonymous virtual memory areas will be reset. The name
can contain only printable ascii characters (including
space), except '[',']','\','$' and '`'.
This feature is available only if the kernel is built with
the CONFIG_ANON_VMA_NAME option enabled.
[[email protected]: docs: proc.rst: /proc/PID/maps: fix malformed table]
Link: https://lkml.kernel.org/r/[email protected]
[surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
work here was done by Colin Cross, therefore, with his permission, keeping
him as the author]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Colin Cross <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jan Glauber <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Cc: Shaohua Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The kvmalloc* allocation functions can fallback to vmalloc allocations
and more often on long running machines. In addition the kernel does
have __GFP_ACCOUNT kvmalloc* calls. So, often on long running machines,
the memory.stat does not tell the complete picture which type of memory
is charged to the memcg. So add a per-memcg vmalloc stat.
[[email protected]: page_memcg() within rcu lock, per Muchun]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: remove cast, per Muchun]
[[email protected]: remove area->page[0] checks and move to page by page accounting per Michal]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Our container agent wants to know when a container exits if it was OOM
killed or not to report to the user. We use memory.oom.group = 1 to
ensure that OOM kills within the container's cgroup kill everything.
Existing memory.events are insufficient for knowing if this triggered:
1) Our current approach reads memory.events oom_kill and reports the
container was killed if the value is non-zero. This is erroneous in
some cases where containers create their children cgroups with
memory.oom.group=1 as such OOM kills will get counted against the
container cgroup's oom_kill counter despite not actually OOM killing
the entire container.
2) Reading memory.events.local will fail to identify OOM kills in leaf
cgroups (that don't set memory.oom.group) within the container
cgroup.
This patch adds a new oom_group_kill event when memory.oom.group
triggers to allow userspace to cleanly identify when an entire cgroup is
oom killed.
[[email protected]: changes from Johannes and Chris]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dan Schatzberg <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Chris Down <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
dump_mapping() is a big chunk of dump_page(), and it'd be handy to be
able to call it when we don't have a struct page. Split it out and move
it to fs/inode.c. Take the opportunity to simplify some of the debug
messages a little.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a new @vmemmap_shift property for struct dev_pagemap which specifies
that a devmap is composed of a set of compound pages of order
@vmemmap_shift, instead of base pages. When a compound page devmap is
requested, all but the first page are initialised as tail pages instead
of order-0 pages.
For certain ZONE_DEVICE users like device-dax which have a fixed page
size, this creates an opportunity to optimize GUP and GUP-fast walkers,
treating it the same way as THP or hugetlb pages.
Additionally, commit 7118fc2906e2 ("hugetlb: address ref count racing in
prep_compound_gigantic_page") removed set_page_count() because the
setting of page ref count to zero was redundant. devmap pages don't
come from page allocator though and only head page refcount is used for
compound pages, hence initialize tail page count to zero.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joao Martins <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Yongqiang reports a kmemleak panic when module insmod/rmmod with KASAN
enabled(without KASAN_VMALLOC) on x86[1].
When the module area allocates memory, it's kmemleak_object is created
successfully, but the KASAN shadow memory of module allocation is not
ready, so when kmemleak scan the module's pointer, it will panic due to
no shadow memory with KASAN check.
module_alloc
__vmalloc_node_range
kmemleak_vmalloc
kmemleak_scan
update_checksum
kasan_module_alloc
kmemleak_ignore
Note, there is no problem if KASAN_VMALLOC enabled, the modules area
entire shadow memory is preallocated. Thus, the bug only exits on ARCH
which supports dynamic allocation of module area per module load, for
now, only x86/arm64/s390 are involved.
Add a VM_DEFER_KMEMLEAK flags, defer vmalloc'ed object register of
kmemleak in module_alloc() to fix this issue.
[1] https://lore.kernel.org/all/[email protected]/
[[email protected]: fix build]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: simplify ifdefs, per Andrey]
Link: https://lkml.kernel.org/r/CA+fCnZcnwJHUQq34VuRxpdoY6_XbJCDJ-jopksS5Eia4PijPzw@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 793213a82de4 ("s390/kasan: dynamic shadow mem allocation for modules")
Fixes: 39d114ddc682 ("arm64: add KASAN support")
Fixes: bebf56a1b176 ("kasan: enable instrumentation of global variables")
Signed-off-by: Kefeng Wang <[email protected]>
Reported-by: Yongqiang Liu <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Kefeng Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a new helper function kthread_run_on_cpu(), which includes
kthread_create_on_cpu/wake_up_process().
In some cases, use kthread_run_on_cpu() directly instead of
kthread_create_on_node/kthread_bind/wake_up_process() or
kthread_create_on_cpu/wake_up_process() or
kthreadd_create/kthread_bind/wake_up_process() to simplify the code.
[[email protected]: export kthread_create_on_cpu to modules]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Cai Huoqing <[email protected]>
Cc: Bernard Metzler <[email protected]>
Cc: Cai Huoqing <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: "Paul E . McKenney" <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Avoid the wrapper holding cf_mutex since it is not protecting anything.
To avoid confusion and unnecessary overhead incurred by it, remove.
Fixes: f489f27bc0ab ("vdpa: Sync calls set/get config/status with cf_mutex")
Signed-off-by: Eli Cohen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Si-Wei Liu<[email protected]>
Acked-by: Jason Wang <[email protected]>
|