When the checked address is illegal, the corresponding shadow address
returned by kasan_mem_to_shadow() may have no mapping in the MMU table.
Accessing such a shadow address causes a kernel oops. Here is a sample
oops on arm64 (39-bit VA) with KASAN_SW_TAGS and KASAN_OUTLINE enabled:
[ffffffb80aaaaaaa] pgd=000000005d3ce003, p4d=000000005d3ce003,
pud=000000005d3ce003, pmd=0000000000000000
Internal error: Oops: 0000000096000006 [#1] PREEMPT SMP
Modules linked in:
CPU: 3 PID: 100 Comm: sh Not tainted 6.6.0-rc1-dirty #43
Hardware name: linux,dummy-virt (DT)
pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __hwasan_load8_noabort+0x5c/0x90
lr : do_ib_ob+0xf4/0x110
ffffffb80aaaaaaa is the shadow address for efffff80aaaaaaaa.
The problem is that kasan_check_range() reads an invalid shadow address.
Generic KASAN oopses in a similar way. In both cases, only the shadow
address that caused the oops is reported, not the original address.
Commit 2f004eea0fc8 ("x86/kasan: Print original address on #GP")
introduced kasan_non_canonical_hook() but limited it to KASAN_INLINE.
This patch extends it to KASAN_OUTLINE mode.
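For reference, the shadow translation that produces such an unmapped
address is the usual linear mapping (the generic helper from
include/linux/kasan.h is shown; the scale and offset are
configuration-dependent):

    /* shadow = (addr >> scale) + offset; nothing guarantees that the
     * result is backed by a page table entry */
    static inline void *kasan_mem_to_shadow(const void *addr)
    {
            return (void *)((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
                    + KASAN_SHADOW_OFFSET;
    }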
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 2f004eea0fc8 ("x86/kasan: Print original address on #GP")
Signed-off-by: Haibo Li <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: AngeloGioacchino Del Regno <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Haibo Li <[email protected]>
Cc: Matthias Brugger <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Malloc libraries, like jemalloc and tcmalloc, make decisions on when to
call madvise independently from the code in the main application.
This sometimes results in the application page faulting on an address
right after the malloc library has shot down the backing memory with
MADV_DONTNEED.
Usually this is harmless, because there are always some 4kB pages
sitting around to satisfy a page fault. However, with hugetlbfs,
systems often allocate only the exact number of huge pages that the
application wants.
Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
any lock taken on the page fault path, which can open up the following
race condition:
    CPU 1                           CPU 2
    MADV_DONTNEED
    unmap page
    shoot down TLB entry
                                    page fault
                                    fail to allocate a huge page
                                    killed with SIGBUS
    free page
Fix that race by pulling the locking from __unmap_hugepage_range_final
into helper functions called from zap_page_range_single. This ensures
page faults stay locked out of the MADV_DONTNEED VMA until the huge
pages have actually been freed.
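A minimal sketch of the resulting call shape, assuming the helpers
introduced here are named hugetlb_zap_begin()/hugetlb_zap_end() (the
details live in mm/memory.c and mm/hugetlb.c):

    void zap_page_range_single(struct vm_area_struct *vma, ...)
    {
            ...
            hugetlb_zap_begin(vma, &range.start, &range.end); /* take VMA lock */
            unmap_single_vma(...);
            tlb_finish_mmu(&tlb);          /* TLB flush and page freeing */
            hugetlb_zap_end(vma, details); /* only now may faults proceed */
    }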
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 04ada095dcfc ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
Signed-off-by: Rik van Riel <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Extend the locking scheme used to protect shared hugetlb mappings from
truncate vs page fault races, in order to protect private hugetlb mappings
(with resv_map) against MADV_DONTNEED.
Add a read-write semaphore to the resv_map data structure, and use that
from the hugetlb_vma_(un)lock_* functions, in preparation for closing the
race between MADV_DONTNEED and page faults.
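A sketch of the shape of the change, reusing the helper names of the
existing shared-mapping path (__vma_shareable_lock(), vma_resv_map())
and assuming a new __vma_private_lock() predicate:

    struct resv_map {
            ...
            struct rw_semaphore rw_sema;    /* new: faults vs MADV_DONTNEED */
    };

    static void hugetlb_vma_lock_read(struct vm_area_struct *vma)
    {
            if (__vma_shareable_lock(vma)) {
                    struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

                    down_read(&vma_lock->rw_sema);
            } else if (__vma_private_lock(vma)) {
                    struct resv_map *resv_map = vma_resv_map(vma);

                    down_read(&resv_map->rw_sema);
            }
    }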
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 04ada095dcfc ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
Signed-off-by: Rik van Riel <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "hugetlbfs: close race between MADV_DONTNEED and page fault", v7.
Malloc libraries, like jemalloc and tcmalloc, make decisions on when to
call madvise independently from the code in the main application.
This sometimes results in the application page faulting on an address
right after the malloc library has shot down the backing memory with
MADV_DONTNEED.
Usually this is harmless, because there are always some 4kB pages
sitting around to satisfy a page fault. However, with hugetlbfs,
systems often allocate only the exact number of huge pages that the
application wants.
Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
any lock taken on the page fault path, which can open up the following
race condition:
    CPU 1                           CPU 2
    MADV_DONTNEED
    unmap page
    shoot down TLB entry
                                    page fault
                                    fail to allocate a huge page
                                    killed with SIGBUS
    free page
Fix that race by extending the hugetlb_vma_lock locking scheme to also
cover private hugetlb mappings (with resv_map), and pulling the locking
from __unmap_hugepage_range_final into helper functions called from
zap_page_range_single. This ensures page faults stay locked out of the
MADV_DONTNEED VMA until the huge pages have actually been freed.
This patch (of 3):
Hugetlbfs leaves a dangling pointer in the VMA if mmap fails. This has
not been a problem so far, but other code in this patch series tries to
follow that pointer.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 04ada095dcfc ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When a zswap store fails due to the limit, it acquires a pool reference
and queues the shrinker. When the shrinker runs, it drops the reference.
However, there can be multiple store attempts before the shrinker wakes up
and runs once. This results in reference leaks and eventual saturation
warnings for the pool refcount.
Fix this by dropping the reference again when the shrinker is already
queued. This ensures one reference per shrinker run.
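The fix relies on queue_work() returning false when the work item is
already pending. A sketch of the store-side logic (names as in
mm/zswap.c, to the best of my reading):

    if (zswap_pool_reached_full) {
            ...
            pool = zswap_pool_last_get();
            if (pool && !queue_work(shrink_wq, &pool->shrink_work))
                    zswap_pool_put(pool); /* already queued: drop extra ref */
            goto reject;
    }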
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: Chris Mason <[email protected]>
Acked-by: Nhat Pham <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: <[email protected]> [5.6+]
Signed-off-by: Andrew Morton <[email protected]>
|
|
This adds documentation for the new metric pages_skipped.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This adds documentation for the smart scan mode of KSM.
[[email protected]: fix typo]
[[email protected]: document that smart_scan defaults to on]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This change adds the "pages skipped" metric. To be able to evaluate how
successful smart page scanning is, the pages skipped metric can be
compared to the pages scanned metric.
The pages skipped metric is a cumulative counter. The counter is stored
under /sys/kernel/mm/ksm/pages_skipped.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Smart scanning mode for KSM", v3.
This patch series adds "smart scanning" for KSM.
What is smart scanning?
=======================
KSM evaluates all the candidate pages for each scan. It does not use historic
information from previous scans. This has the effect that candidate pages that
couldn't be used for KSM de-duplication continue to be evaluated for each scan.
The idea of "smart scanning" is to keep historic information. With the historic
information we can temporarily skip the candidate page for one or several scans.
Details:
========
"Smart scanning" is to keep two small counters to store if the page has been
used for KSM. One counter stores how often we already tried to use the page for
KSM and the other counter stores how often we skip a page.
How often we skip the candidate page depends how often a page failed KSM
de-duplication. The code skips a maximum of 8 times. During testing this has
shown to be a good compromise for different workloads.
New sysfs knob:
===============
Smart scanning is not enabled by default. With /sys/kernel/mm/ksm/smart_scan
smart scanning can be enabled.
Monitoring:
===========
To monitor how effective smart scanning is, a new sysfs knob has been
introduced. /sys/kernel/mm/ksm/pages_skipped reports how many pages
have been skipped by smart scanning.
Results:
========
- Various workloads have shown a 20% - 25% reduction in page scans.
  For the Instagram workload, for instance, the number of pages scanned
  has been reduced from over 20M pages per scan to less than 15M pages.
- Fewer page scans also resulted in an overall higher de-duplication
  rate, as some shorter-lived pages could additionally be de-duplicated.
- Fewer pages scanned allows the pages_to_scan parameter to be reduced,
  which resulted in a 25% reduction in CPU.
- The improvements have been observed for workloads that enable KSM
  with madvise as well as prctl.
This patch (of 4):
This change adds a "smart" page scanning mode for KSM. So far all the
candidate pages are continuously scanned to find candidates for
de-duplication. There is a considerable number of pages that cannot be
de-duplicated, which is costly in terms of CPU. By using smart
scanning, considerable CPU savings can be achieved.
This change takes the scanning history of pages into account and skips
scanning certain pages for a while if de-duplication of the page has
not been successful in the past.
To do this it introduces two new fields in the ksm_rmap_item structure:
age and remaining_skips. age is the KSM age and remaining_skips
determines how often scanning of this page is skipped. The age field
is incremented each time the page is scanned and cannot be
de-duplicated; its updates are capped at U8_MAX.
How often a page is skipped depends on how often de-duplication has
been tried so far; the number of skips is currently limited to 8. This
value has proven to be effective with different workloads.
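A sketch of the resulting skip decision (field and function names
follow the description above; the real code lives in mm/ksm.c):

    static bool should_skip_rmap_item(struct page *page,
                                      struct ksm_rmap_item *rmap_item)
    {
            /* pages that already merged are always worth scanning */
            if (PageKsm(page))
                    return false;

            /* still inside a skip window granted earlier */
            if (rmap_item->remaining_skips) {
                    rmap_item->remaining_skips--;
                    return true;
            }

            /* skip longer the more often merging failed, capped at 8 */
            rmap_item->remaining_skips = min_t(u8, rmap_item->age, 8);
            return rmap_item->remaining_skips != 0;
    }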
The feature is currently disabled by default and can be enabled with
the new smart_scan knob.
The feature has proven to be very effective: up to 25% of the page
scans can be eliminated; the pages_to_scan rate can be reduced by
40 - 50% and a similar de-duplication rate can be maintained.
[[email protected]: make ksm_smart_scan default true, for testing]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Stefan Roesch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE was
used for the slow memory type in the kmem driver. This limits the
usage of the kmem driver; for example, it cannot be used for HBM (high
bandwidth memory).
So, use the general abstract distance calculation mechanism in the kmem
driver to get a more accurate abstract distance on systems with proper
support. The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as a
fallback only.
Now, multiple memory types may be managed by kmem. These memory types are
put into the "kmem_memory_types" list and protected by
kmem_memory_type_lock.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Tested-by: Bharata B Rao <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Rafael J Wysocki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
A memory tiering abstract distance calculation algorithm based on ACPI
HMAT is implemented. The basic idea is as follows.
The performance attributes of the system's default DRAM nodes are
recorded as the baseline, whose abstract distance is
MEMTIER_ADISTANCE_DRAM. Then, the abstract distance of a memory node
(target) is scaled relative to MEMTIER_ADISTANCE_DRAM based on the
ratio of the performance attributes of the node to those of the default
DRAM nodes.
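Illustratively, the scaling amounts to something like the following
simplified sketch (the real calculation in drivers/acpi/numa/hmat.c
combines read/write latency and bandwidth ratios):

    /* worse latency and lower bandwidth than default DRAM both push
     * the result above MEMTIER_ADISTANCE_DRAM */
    adist = MEMTIER_ADISTANCE_DRAM * target_latency / dram_latency;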
The functions to record the read/write latency/bandwidth of the default
DRAM nodes and calculate abstract distance according to read/write
latency/bandwidth ratio will be used by CXL CDAT (Coherent Device
Attribute Table) and other memory device drivers. So, they are put in
memory-tiers.c.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Tested-by: Bharata B Rao <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Rafael J Wysocki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Previously, hmat_register_target_initiators() calculated the
performance attributes and also created the corresponding sysfs links
and files; it is called during memory onlining.
But now, to calculate the abstract distance of a memory target before
memory onlining, we need to calculate the performance attributes for a
memory target without creating sysfs links and files.
To do that, hmat_register_target_initiators() is refactored to make it
possible to calculate performance attributes separately.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Tested-by: Alistair Popple <[email protected]>
Tested-by: Bharata B Rao <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Rafael J Wysocki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "memory tiering: calculate abstract distance based on ACPI
HMAT", v4.
We have the explicit memory tiers framework to manage systems with
multiple types of memory, e.g., DRAM in DIMM slots and CXL memory devices.
There, memory devices of the same kind are grouped into memory types,
which are then put into memory tiers. To describe the performance of a
memory type, an abstract distance is defined, which is directly
proportional to the memory latency and inversely proportional to the
memory bandwidth. To keep the code as simple as possible, a fixed
abstract distance is used in dax/kmem to describe slow memory such as
Optane DCPMM.
To support more memory types, this series adds an abstract distance
calculation algorithm management mechanism, provides an algorithm
implementation based on ACPI HMAT, and uses the general abstract
distance calculation interface in the dax/kmem driver. With that,
dax/kmem can support HBM (high bandwidth memory) in addition to the
original Optane DCPMM.
This patch (of 4):
The abstract distance may be calculated by various drivers, such as
ACPI HMAT, CXL CDAT, etc., while it may be used by various code which
hot-adds memory nodes, such as dax/kmem. To decouple the algorithm
users from the providers, this patch implements a management mechanism
for abstract distance calculation algorithms. It provides an interface
for the providers to register their implementations, and an interface
for the users. Multiple algorithm implementations can cooperate by
calculating the abstract distance for different memory nodes. The
preference among algorithm implementations can be specified via a
priority (notifier_block.priority).
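A sketch of the resulting interfaces, built on a blocking notifier
chain (see mm/memory-tiers.c):

    static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);

    /* provider side: register an algorithm; priority orders the chain */
    int register_mt_adistance_algorithm(struct notifier_block *nb)
    {
            return blocking_notifier_chain_register(&mt_adistance_algorithms,
                                                    nb);
    }

    /* user side: let registered algorithms refine *adist for a node */
    int mt_calc_adistance(int node, int *adist)
    {
            return blocking_notifier_call_chain(&mt_adistance_algorithms,
                                                node, adist);
    }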
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Tested-by: Bharata B Rao <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Rafael J Wysocki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Use folio_ref_freeze() in hugetlb_folio_init_vmemmap().
No functional difference; folio_ref_freeze() is currently a wrapper for
page_ref_freeze().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Usama Arif <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Remove special cased hugetlb handling code within the page cache by
changing the granularity of ->index to the base page size rather than the
huge page size. The motivation of this patch is to reduce complexity
within the filemap code while also increasing performance by removing
branches that are evaluated on every page cache lookup.
To support the change in index, new wrappers for hugetlb page cache
interactions are added. These wrappers perform the conversion to a linear
index which is now expected by the page cache for huge pages.
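After this change, hugetlb uses the same base-page-granularity index
computation as everything else. A sketch of the non-huge path of the
existing linear_page_index() helper (include/linux/pagemap.h), which
the new wrappers now funnel hugetlb lookups through:

    /* base-page granularity index of @address within @vma's file */
    static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
                                            unsigned long address)
    {
            pgoff_t pgoff = (address - vma->vm_start) >> PAGE_SHIFT;

            pgoff += vma->vm_pgoff;
            return pgoff;
    }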
========================= PERFORMANCE ======================================
Perf was used to check the performance differences after the patch.
Overall the performance is similar to mainline, with a small additional
overhead in __filemap_add_folio() and hugetlb_add_to_page_cache(). This
comes from the larger overhead in xa_load() and xa_store(), as the
xarray now uses more entries to store hugetlb folios in the page cache.
Timing
aarch64
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
real 1m49.568s
user 0m0.000s
sys 1m49.461s
6.5-rc3:
[root]# time fallocate -l 700GB test.txt
real 1m47.495s
user 0m0.000s
sys 1m47.370s
1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m47.024s
user 0m0.000s
sys 1m46.921s
6.5-rc3:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m44.551s
user 0m0.000s
sys 1m44.438s
x86
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
real 0m22.383s
user 0m0.000s
sys 0m22.255s
6.5-rc3:
[opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
real 0m22.735s
user 0m0.038s
sys 0m22.567s
1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
real 0m25.786s
user 0m0.001s
sys 0m25.589s
6.5-rc3:
[root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
real 0m33.454s
user 0m0.001s
sys 0m33.193s
aarch64:
workload - fallocate a 700GB file backed by huge pages
6.5-rc3 + this patch:
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--95.04%--__pi_clear_page
|
|--3.57%--clear_huge_page
| |
| |--2.63%--rcu_all_qs
| |
| --0.91%--__cond_resched
|
--0.67%--__cond_resched
0.17% 0.00% 0 fallocate [kernel.vmlinux] [k] hugetlb_add_to_page_cache
0.14% 0.10% 11 fallocate [kernel.vmlinux] [k] __filemap_add_folio
6.5-rc3
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--94.91%--__pi_clear_page
|
|--4.11%--clear_huge_page
| |
| |--3.00%--rcu_all_qs
| |
| --1.10%--__cond_resched
|
--0.59%--__cond_resched
0.08% 0.01% 1 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.05% 0.03% 3 fallocate [kernel.kallsyms] [k] __filemap_add_folio
x86
workload - fallocate a 100GB file backed by huge pages
6.5-rc3 + this patch:
2MB Page Size:
hugetlbfs_fallocate
|
--99.57%--clear_huge_page
|
--98.47%--clear_page_erms
|
--0.53%--asm_sysvec_apic_timer_interrupt
0.04% 0.04% 1 fallocate [kernel.kallsyms] [k] xa_load
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] __filemap_add_folio
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] xas_store
6.5-rc3
2MB Page Size:
--99.93%--__x64_sys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
--99.38%--clear_huge_page
|
|--98.40%--clear_page_erms
|
--0.59%--__cond_resched
0.03% 0.03% 1 fallocate [kernel.kallsyms] [k] __filemap_add_folio
========================= TESTING ======================================
This patch passes libhugetlbfs tests and LTP hugetlb tests
********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 110 113
* Skipped: 0 0
* PASS: 107 113
* FAIL: 0 0
* Killed by signal: 3 0
* Bad configuration: 0 0
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********
Done executing testcases.
LTP Version: 20220527-178-g2761a81c4
Page migration was also tested using Mike Kravetz's test program. [8]
[[email protected]: fix a NULL vs IS_ERR() bug]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Signed-off-by: Dan Carpenter <[email protected]>
Reported-and-tested-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This adds a new test case to the ksm functional tests to make sure that
the KSM setting is inherited by the child process when doing a fork/exec.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Carl Klemm <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm/ksm: add fork-exec support for prctl", v4.
A process can enable KSM with the prctl system call. When the process is
forked the KSM flag is inherited by the child process. However if the
process is executing an exec system call directly after the fork, the KSM
setting is cleared. This patch series addresses this problem.
1) Change the mask in coredump.h for execing a new process
2) Add a new test case in ksm_functional_tests
This patch (of 2):
Today we have two ways to enable KSM:
1) madvise system call
   This allows KSM to be enabled for a memory region for a long time.
2) prctl system call
   This is a recent addition to enable KSM for the complete process.
   In addition, when a process is forked, the KSM setting is inherited.
This change only affects the second case.
One of the use cases for (2) was to support the ability to enable
KSM for cgroups. This allows systemd to enable KSM for the seed
process. By enabling it in the seed process all child processes inherit
the setting.
This works correctly when the process is forked. However, it doesn't
support the fork/exec workflow.
From the previous cover letter:
....
Use case 3:
With the madvise call, sharing opportunities are only enabled for the
current process: it is a workload-local decision. A considerable
number of sharing opportunities may exist across multiple workloads or
jobs (if they are part of the same security domain). Only a
higher-level entity like a job scheduler or container can know for
certain if it is running one or more instances of a job. That job
scheduler however doesn't have the necessary internal workload
knowledge to make targeted madvise calls.
....
In addition it can also be a bit surprising that fork keeps the KSM
setting and fork/exec does not.
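A sketch of the direction of the fix, illustrative only (the real mask
and flag names are in include/linux/sched/coredump.h; MMF_VM_MERGE_ANY
is the per-process KSM flag set by PR_SET_MEMORY_MERGE):

    /* mm flags carried into a new mm, now also across exec for KSM */
    #define MMF_INIT_MASK   (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK | \
                             MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK | \
                             MMF_VM_MERGE_ANY_MASK)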
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Stefan Roesch <[email protected]>
Fixes: d7597f59d1d3 ("mm: add new api to enable ksm per process")
Reviewed-by: David Hildenbrand <[email protected]>
Reported-by: Carl Klemm <[email protected]>
Tested-by: Carl Klemm <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
si_meminfo() reads and assigns more information than just the free and
total RAM pages. For the DAMOS_WMARK_FREE_MEM_RATE use case, fetching
only the free and total RAM pages is enough and saves CPU.
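A sketch of the lighter-weight computation (the helper name is
illustrative; global_zone_page_state() and totalram_pages() are the
existing accessors):

    /* free memory rate in per-thousand, without filling struct sysinfo */
    static unsigned long damos_get_free_mem_rate(void)
    {
            return global_zone_page_state(NR_FREE_PAGES) * 1000 /
                    totalram_pages();
    }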
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Huan Yang <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The cpupid (or access time) is stored in the head page for THP, so it
is safe to make should_numa_migrate_memory() and
numa_hint_fault_latency() take a folio. This is in preparation for
large folio NUMA balancing.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
In preparation for large folio NUMA balancing, make mpol_misplaced()
take a folio; no functional change intended.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
In preparation for large folio NUMA balancing, make numa_migrate_prep()
take a folio; no functional change intended.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
NUMA balancing only tries to migrate non-compound pages in
do_numa_page(), so use a folio in it to save several compound_head()
calls. Note that we use folio_estimated_sharers(): it is enough to
check the folio sharers since only normal pages are handled here; if
large folio NUMA balancing is supported later, a precise folio sharers
check will be used. No functional change intended.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Use a folio in do_huge_pmd_numa_page(), reducing three page_folio()
calls to one; no functional change intended.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm: convert numa balancing functions to use a folio", v2.
do_numa_page() only handles non-compound pages, and only PMD-mapped
THPs are handled in do_huge_pmd_numa_page(). But large, PTE-mapped
folios will be supported, so let's convert more NUMA balancing
functions to use/take a folio in preparation for that; no functional
change intended for now.
This patch (of 6):
The new vm_normal_folio_pmd() wrapper is similar to vm_normal_folio()
and allows callers to completely replace their struct page variables
with struct folio variables.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using
the existing helper folio_next_index() in filemap_map_pages().
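For illustration, the pattern being replaced (not the exact hunk):

    /* before */
    end = folio->index + folio_nr_pages(folio);
    /* after: folio_next_index() computes exactly this */
    end = folio_next_index(folio);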
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Minjie Du <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Yin Fengwei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add some tests to cover the new PR_MDWE_NO_INHERIT flag of the
PR_SET_MDWE prctl.
Check that:
- it can't be set without PR_SET_MDWE
- MDWE flags can't be unset
- when set, PR_SET_MDWE doesn't propagate to children
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Florent Revest <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Alexey Izbyshev <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Ayush Jain <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Joey Gouly <[email protected]>
Cc: KP Singh <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Szabolcs Nagy <[email protected]>
Cc: Topi Miettinen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This extends the current PR_SET_MDWE prctl arg with a bit to indicate that
the process doesn't want MDWE protection to propagate to children.
To implement this no-inherit mode, the tag in current->mm->flags must be
absent from MMF_INIT_MASK. This means that the encoding for "MDWE but
without inherit" is different in the prctl than in the mm flags. This
leads to a bit of bit-mangling in the prctl implementation.
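A sketch of the idea (the exact encoding lives in kernel/sys.c and
include/linux/sched/coredump.h):

    /*
     * MDWE-without-inherit is tracked with a flag that is deliberately
     * left out of MMF_INIT_MASK, so it is not copied into a child's
     * new mm:
     */
    if (bits & PR_MDWE_NO_INHERIT)
            set_bit(MMF_HAS_MDWE_NO_INHERIT, &current->mm->flags);
    else
            set_bit(MMF_HAS_MDWE, &current->mm->flags);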
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Florent Revest <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Alexey Izbyshev <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Ayush Jain <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Joey Gouly <[email protected]>
Cc: KP Singh <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Szabolcs Nagy <[email protected]>
Cc: Topi Miettinen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Defining a prctl flag as an int is a footgun because on a 64-bit
machine, with a variadic implementation of prctl (like in musl and
glibc), when used directly as a prctl argument, it can get cast to long
with garbage upper bits, which would result in unexpected behaviors.
This patch changes the constant to an unsigned long to eliminate that
possibility. This does not break UAPI.
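To illustrate the footgun from userspace: glibc's prctl() is variadic,
so there is no prototype forcing a conversion of the flag argument,
while the kernel reads an unsigned long. Defining the constant as
unsigned long removes that mismatch:

    #include <sys/prctl.h>

    /* safe now that PR_MDWE_REFUSE_EXEC_GAIN is an unsigned long */
    prctl(PR_SET_MDWE, PR_MDWE_REFUSE_EXEC_GAIN, 0UL, 0UL, 0UL);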
I think that a stable backport would be "nice to have": to reduce the
chances that users build binaries that could end up with garbage bits
in their MDWE prctl arguments. We are not aware of anyone having yet
encountered this corner case with MDWE prctls, but a backport would
reduce the likelihood it happens, since this sort of issue has happened
with other prctls. But if this is perceived as a backporting burden, I
suppose we could also live without a stable backport.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: b507808ebce2 ("mm: implement memory-deny-write-execute as a prctl")
Signed-off-by: Florent Revest <[email protected]>
Suggested-by: Alexey Izbyshev <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Ayush Jain <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Joey Gouly <[email protected]>
Cc: KP Singh <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Szabolcs Nagy <[email protected]>
Cc: Topi Miettinen <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Invalid prctls return a negative code and set errno. It's good practice
to check that errno is set as expected.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Florent Revest <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Alexey Izbyshev <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Ayush Jain <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Joey Gouly <[email protected]>
Cc: KP Singh <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Szabolcs Nagy <[email protected]>
Cc: Topi Miettinen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
I checked with the original author: the mmap_FIXED test case wasn't
properly tested and fails. Currently, it maps two consecutive
(non-overlapping) pages and expects the second mapping to be denied by
MDWE, but these two pages have nothing to do with each other, so MDWE
is actually out of the picture here.
What the test actually intended to do was to remap a virtual address
using MAP_FIXED. However, this operation unmaps the existing mapping
and creates a new one, so the VA is backed by a new page and MDWE is
again out of the picture; all remappings should succeed.
This patch keeps the test case to make it clear that this situation is
expected to work: MDWE shouldn't block a MAP_FIXED replacement.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 4cf1fe34fd18 ("kselftest: vm: add tests for memory-deny-write-execute")
Signed-off-by: Florent Revest <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Tested-by: Ryan Roberts <[email protected]>
Tested-by: Ayush Jain <[email protected]>
Cc: Alexey Izbyshev <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Joey Gouly <[email protected]>
Cc: KP Singh <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Szabolcs Nagy <[email protected]>
Cc: Topi Miettinen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "MDWE without inheritance", v4.
Joey recently introduced a Memory-Deny-Write-Execute (MDWE) prctl which
tags current with a flag that prevents pages that were previously not
executable from becoming executable. This tag always gets inherited by
child tasks (it's in MMF_INIT_MASK).
At Google, we've been using a somewhat similar downstream patch for a
few years now. To make the adoption of this feature easier, we've had
it support a mode in which the W^X flag does not propagate to children.
For example, this is handy if a C process which wants W^X protection
suspects it could start child processes that would use a JIT.
I'd like to align our features with the upstream prctl. This series
proposes a new NO_INHERIT flag to the MDWE prctl to make this kind of
adoption easier. It sets a different flag in current that is not in
MMF_INIT_MASK and which does not propagate.
As part of looking into MDWE, I also fixed a couple of things in the MDWE
test.
The background for this was discussed in these threads:
v1: https://lore.kernel.org/all/[email protected]/
v2: https://lore.kernel.org/all/[email protected]/
This patch (of 6):
Fix tabs/spaces inconsistency in the mdwe test.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Florent Revest <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Alexey Izbyshev <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Ayush Jain <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Joey Gouly <[email protected]>
Cc: KP Singh <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Szabolcs Nagy <[email protected]>
Cc: Topi Miettinen <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The current memory reclaim delay statistics only count the direct
memory reclaim of a task in do_try_to_free_pages(). On systems with
NUMA enabled, some tasks occasionally experience slower response times,
yet the total count of reclaim does not increase; using ftrace shows
that node_reclaim has occurred.
The memory reclaim occurring in get_page_from_freelist() is also due to
heavy memory load. To capture the impact of memory reclaim on tasks,
this patch adds memory reclaim delay statistics for __node_reclaim().
Link: https://lkml.kernel.org/r/181C946095F0252B+7cc60eca-1abf-4502-aad3-ffd8ef89d910@ex.bilibili.com
Signed-off-by: Wen Yu Li <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Document what mmu_notifier_invalidate_range_start_nonblock() is for. Also
add a __must_check annotation to signal that callers must bail out if a
notifier vetoes the operation.
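The signature in question (include/linux/mmu_notifier.h):

    static inline int __must_check
    mmu_notifier_invalidate_range_start_nonblock(struct mmu_notifier_range *range);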
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jann Horn <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Since commit b25806dcd3d5 ("mm: memcontrol: deprecate swapaccounting=0
mode"), do_memsw_account() is synonymous with
!cgroup_subsys_on_dfl(memory_cgrp_subsys); it always evaluates to true
in memcg1_stat_format(). Remove the unused code.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liu Shixin <[email protected]>
Suggested-by: Michal Koutný <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Expose swapcache stat for memcg v1", v2.
Since commit b6038942480e ("mm: memcg: add swapcache stat for memcg
v2") added a swapcache stat for cgroup v2, there seems to be no reason
to hide it in memcg v1. Moreover, with the swapcache reported, the
available memory for a memcg can be evaluated more accurately.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liu Shixin <[email protected]>
Suggested-by: Yosry Ahmed <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Recently, we found that cross-die access to pagetable pages on ARM64
machines can cause performance fluctuations in our business.
Currently, there are no PMU events available to track this situation
on our ARM64 machines, so accurate pagetable accounting can help to
analyze this issue, but PUD-level pagetable accounting is currently
missing.
So introduce pagetable_pud_ctor/dtor() to help to get accurate PUD
pagetable accounting, as well as converting the architectures which use
generic PUD pagetable allocation to add corresponding PUD pagetable
accounting. Moreover this patch will mark the PUD level pagetable with
PG_table flag, which will help to do sanity validation in
unpoison_memory().
On my testing machine, I can see more pagetable statistics after the
patch with the page-types tool:
Before patch:
        flags      page-count   MB  symbolic-flags                                long-symbolic-flags
0x0000000004000000      27326  106  __________________________g_________________  pgtable
After patch:
0x0000000004000000      27541  107  __________________________g_________________  pgtable
Link: https://lkml.kernel.org/r/876c71c03a7e69c17722a690e3225a4f7b172fb2.1695017383.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Acked-by: Vishal Moola (Oracle) <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
First found this typo while reviewing the memory tier code. Fix it by sed like:
$ sed -i 's/sibiling/sibling/g' $(git grep -l sibiling)
so the acpi one will be corrected as well.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Li Zhijian <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Len Brown <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
do_pages_move() does not handle compat pointers for the page list
correctly. Add an in_compat_syscall() check and an appropriate
get_user() fetch when iterating the page list.
It makes the syscall in compat mode (32-bit userspace, 64-bit kernel)
work the same way as the native 32-bit syscall again, restoring the
behavior before my broken commit 5b1b561ba73c ("mm: simplify
compat_sys_move_pages").
More specifically, my patch moved the parsing of the 'pages' array from
the main entry point into do_pages_stat(), which left the syscall
working correctly for the 'stat' operation (nodes = NULL), while the
'move' operation (nodes != NULL) is now missing the conversion and
interprets 'pages' as an array of 64-bit pointers instead of the
intended 32-bit userspace pointers.
It is possible that nobody noticed this bug because the few
applications that actually call move_pages are unlikely to run in
compat mode because of their large memory requirements, but this
clearly fixes a user-visible regression and should have been caught by
LTP.
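The shape of the fix, as a sketch (in_compat_syscall(), compat_uptr_t
and compat_ptr() are the standard compat helpers):

    const void __user *p;

    if (in_compat_syscall()) {
            compat_uptr_t cp;

            if (get_user(cp, (compat_uptr_t __user *)pages + i))
                    return -EFAULT;
            p = compat_ptr(cp);
    } else {
            if (get_user(p, pages + i))
                    return -EFAULT;
    }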
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 5b1b561ba73c ("mm: simplify compat_sys_move_pages")
Signed-off-by: Gregory Price <[email protected]>
Reported-by: Arnd Bergmann <[email protected]>
Co-developed-by: Arnd Bergmann <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
We used to determine the number of page table entries to set for a
NAPOT hugepage by using the pte value, which actually fails when the
pte to set is a swap entry.
So take advantage of a recent fix for arm64 reported in [1] which
introduces the size of the mapping as an argument of set_huge_pte_at(): we
can then use this size to compute the number of page table entries to set
for a NAPOT region.
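A simplified sketch of the resulting logic (the real code maps sz to a
NAPOT order):

    void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
                         pte_t *ptep, pte_t pte, unsigned long sz)
    {
            int i, pte_num;

            /* derive the entry count from the mapping size, which is
             * valid even when @pte is a swap/poison entry */
            pte_num = sz >> PAGE_SHIFT;
            for (i = 0; i < pte_num; i++, ptep++, addr += PAGE_SIZE)
                    set_pte_at(mm, addr, ptep, pte);
    }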
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 82a1a1f3bfb6 ("riscv: mm: support Svnapot in hugetlb page")
Signed-off-by: Alexandre Ghiti <[email protected]>
Reported-by: Ryan Roberts <[email protected]>
Closes: https://lore.kernel.org/linux-arm-kernel/[email protected]/ [1]
Reviewed-by: Andrew Jones <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Qinglin Pan <[email protected]>
Cc: Conor Dooley <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Fix set_huge_pte_at()".
A recent report [1] from Ryan for arm64 revealed that we do not handle
swap entries when setting a hugepage backed by a NAPOT region (the contpte
riscv equivalent).
As explained in [1], the issue was discovered by a new test in
kselftest which uses poison entries, though the symptoms differ from
arm64:
- the riscv kernel bugs because we do not handle VM_FAULT_HWPOISON*,
this is fixed by patch 1,
- after that, the test passes because the first pte_napot() fails (the
poison entry does not have the N bit set), and then we only set the
first page table entry covering the NAPOT hugepage, which is enough
for hugetlb_fault() to correctly raise a VM_FAULT_HWPOISON wherever we
write in this mapping since only this first page table entry is
checked
(see https://elixir.bootlin.com/linux/v6.6-rc3/source/mm/hugetlb.c#L6071).
But this seems fragile so patch 2 sets all page table entries of a
NAPOT mapping.
[1]: https://lore.kernel.org/linux-arm-kernel/[email protected]/
This patch (of 2):
We used to panic when such faults were encountered but we should handle
those faults gracefully for userspace by sending a SIGBUS to the process,
like most architectures do.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Acked-by: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Andrew Jones <[email protected]>
Cc: Conor Dooley <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Qinglin Pan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When the calling function fails after dup_anon_vma(), the duplication
of the anon_vma is not undone. Add the necessary unlink_anon_vmas()
call to the error paths that are missing it.
This issue showed up during inspection of the error path in vma_merge()
for an unrelated vma iterator issue.
Users may experience increased memory usage, which may be problematic as
the failure would likely be caused by a low memory situation.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
Signed-off-by: Liam R. Howlett <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
During the error path, the vma iterator may not be correctly positioned or
set to the correct range. Undo the vma_prev() call by resetting to the
passed in address. Re-walking to the same range will fix the range to the
area previously passed in.
Users would notice increased cycles as vma_merge() would be called an
extra time with vma == prev, and thus would fail to merge and return.
Link: https://lore.kernel.org/linux-mm/CAG48ez12VN1JAOtTNMY+Y2YnsU45yL5giS-Qn=ejtiHpgJAbdQ@mail.gmail.com/
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 18b098af2890 ("vma_merge: set vma iterator to correct position.")
Signed-off-by: Liam R. Howlett <[email protected]>
Reported-by: Jann Horn <[email protected]>
Closes: https://lore.kernel.org/linux-mm/CAG48ez12VN1JAOtTNMY+Y2YnsU45yL5giS-Qn=ejtiHpgJAbdQ@mail.gmail.com/
Reviewed-by: Lorenzo Stoakes <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Calling vm_brk_flags() with flags other than VM_EXEC set will exit the
function without releasing the mmap_write_lock.
Just do the sanity check before the lock is acquired. This doesn't fix
an actual issue, since no caller sets a flag other than VM_EXEC.
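A sketch of the reordering (check first, then lock, so every return
after the lock goes through the unlock path):

    int vm_brk_flags(unsigned long addr, unsigned long request,
                     unsigned long flags)
    {
            struct mm_struct *mm = current->mm;
            ...
            /* until we support vm_flags, only VM_EXEC may be passed in */
            if ((flags & (~VM_EXEC)) != 0)
                    return -EINVAL;

            if (mmap_write_lock_killable(mm))
                    return -EINTR;
            ...
    }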
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 2e7ce7d354f2 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()")
Signed-off-by: Sebastian Ott <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The two users of mbind_range() are expecting that mbind_range() will
update the pointer to the previous VMA, or return an error. However,
set_mempolicy_home_node() does not call mbind_range() if there is no VMA
policy. The fix is to update the pointer to the previous VMA prior to
continuing iterating the VMAs when there is no policy.
Users may experience a WARN_ON() during VMA policy updates when updating
a range of VMAs on the home node.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/linux-mm/CALcu4rbT+fMVNaO_F2izaCT+e7jzcAciFkOvk21HGJsmLcUuwQ@mail.gmail.com/
Fixes: f4e9e0e69468 ("mm/mempolicy: fix use-after-free of VMA iterator")
Signed-off-by: Liam R. Howlett <[email protected]>
Reported-by: Yikebaer Aizezi <[email protected]>
Closes: https://lore.kernel.org/linux-mm/CALcu4rbT+fMVNaO_F2izaCT+e7jzcAciFkOvk21HGJsmLcUuwQ@mail.gmail.com/
Reviewed-by: Lorenzo Stoakes <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When guard page debugging is enabled and set_page_guard() returns
success, we fail to advance 'page' to the start of the next split
range, so we end up splitting unexpectedly in a page range that does
not contain the target page. Move the start page update before
set_page_guard() to fix this.
Because we split around the wrong target page, the split pages cannot
merge back to the original order when the target page is put back, and
the split pages other than the target page are unusable. To be
specific:
Consider the target page is the third page in a buddy page of order 2:
| buddy-2 | Page | Target | Page |
After breaking down to the target page, we only set the first page to
Guard because of the bug:
| Guard | Page | Target | Page |
When we try put_page_back_buddy() with the target page, the buddy of
the target page is neither a guard nor a buddy page, so the original
order-2 page cannot be reconstructed:
| Guard | Page | buddy-0 | Page |
All pages except the target page are not in the free list and are
unusable.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages")
Signed-off-by: Kemeng Shi <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The current kernel only locks base-size folios during the mlock
syscall.
Add large folio support with the following rules:
- Only mlock a large folio when it is in a VM_LOCKED VMA range
  and fully mapped to the page table.
  A fully mapped folio is required because, if the folio is not
  fully mapped to a VM_LOCKED VMA and the system is under memory
  pressure, page reclaim is allowed to pick up this folio, split
  it and reclaim the pages which are not in the VM_LOCKED VMA.
- munlock applies to a large folio which is in the VMA range
  or crosses the VMA boundary.
  This is required to handle the case where the large folio is
  mlocked and the VMA is later split in the middle of the large
  folio.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yin Fengwei <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
If a large folio is in the range of a VM_LOCKED VMA, it should be
mlocked to avoid being picked by page reclaim, which may otherwise
split the large folio and then mlock each page again.
For a large folio which crosses the boundary of a VM_LOCKED VMA or is
not fully mapped to a VM_LOCKED VMA, we'd better not mlock it, so that
if the system is under memory pressure this kind of large folio will
be split and the pages outside the VM_LOCKED VMA can be reclaimed.
Ideally, for a large folio, we should mlock it when the large folio is
fully mapped to the VMA and munlock it if any page is unmapped from
the VMA. But it's not easy to detect whether a large folio is fully
mapped to the VMA in some cases (like adding/removing rmap). So we
update mlock_vma_folio() and munlock_vma_folio() to mlock/munlock the
folio according to vma->vm_flags, and let the callers decide whether
they should call these two functions.
For adding rmap, only mlock normal 4K folios and postpone large folio
handling to the page reclaim phase. It is possible to reuse the page
table iterator to detect whether a folio is fully mapped or not during
the page reclaim phase.
For removing rmap, invoke munlock_vma_folio() to munlock the folio
unconditionally, because removing rmap makes the folio not fully
mapped to the VMA.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yin Fengwei <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "support large folio for mlock", v3.
Yu mentioned at [1] that mlock() can't be applied to large folios. I
learned the related code and here is my understanding:
- For RLIMIT_MEMLOCK, there is no problem, because the RLIMIT_MEMLOCK
  statistics are not related to the underlying pages. That means
  mlocking or munlocking the underlying pages doesn't impact the
  RLIMIT_MEMLOCK statistics collection, which is always correct.
- For keeping the page in RAM, there is no problem either. At least
  during try_to_unmap_one(), once the VMA is detected to have the
  VM_LOCKED bit set in vm_flags, the folio will be kept whether it is
  mlocked or not.
So the function of mlock for large folios works. But it's not
optimized, because page reclaim needs to scan these large folios and
may split them.
This series classifies large folios for mlock into four types:
- The large folio is in the VM_LOCKED range and fully mapped to the
  range
- The large folio is in the VM_LOCKED range but not fully mapped to
  the range
- The large folio crosses a VM_LOCKED VMA boundary
- The large folio crosses a last-level page table boundary
For the first type, we mlock the large folio so page reclaim will skip
it.
For the second/third type, we don't mlock the large folio. As the
pages not mapped to the VM_LOCKED range are mapped to a non-VM_LOCKED
range, if the system is under memory pressure, the large folio can be
picked by page reclaim and split; then the pages not mapped to the
VM_LOCKED range can be reclaimed.
For the fourth type, we don't mlock the large folio because holding
one page table lock can't prevent the part in another last-level page
table from being unmapped. Thanks to Ryan for pointing this out.
To check whether a folio is fully mapped to the range, the PTEs need to
be checked to see which pages of the folio are mapped. That requires
taking the page table lock and is a heavy operation. So far, the only
places that need this check are madvise and page reclaim, and those
functions already have their own PTE iterators.
Patch 1 introduces an API to check whether a large folio is in a VMA
range.
Patch 2 makes page reclaim, mlock_vma_folio() and munlock_vma_folio()
support large folio mlock/munlock.
Patch 3 makes the mlock()/munlock() syscalls support large folios.
Yu also mentioned, during the RFC v2 discussion [3], a race which can
make a folio unevictable after munlock.
We decided that race issue didn't block this series because:
- That race issue was not introduced by this series.
- We had a looks-ok fix for that race issue, which needs to wait for
  the mlock_count fixing patch, as Yosry Ahmed suggested [4].
[1] https://lore.kernel.org/linux-mm/CAOUHufbtNPkdktjt_5qM45GegVO-rCFOMkSh0HQminQ12zsV8Q@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/CAOUHufZ6=9P_=CAOQyw0xw-3q707q-1FVV09dBNDC-hpcpj2Pg@mail.gmail.com/
This patch (of 3):
folio_in_range() will be used to check whether a folio is mapped to a
specific VMA and whether the mapping address of the folio is in the
range. Also add a helper function, folio_within_vma(), to check whether
a folio is within the range of a VMA, based on folio_in_range().
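A minimal sketch of such helpers, using the names from this commit
message (the merged code may differ in detail; folio_pgoff(),
vma_pages() and folio_size() are existing kernel helpers):
	static inline bool folio_in_range(struct folio *folio,
					  struct vm_area_struct *vma,
					  unsigned long start,
					  unsigned long end)
	{
		pgoff_t pgoff = folio_pgoff(folio);
		unsigned long addr;

		/* Clamp the range to the VMA. */
		if (start < vma->vm_start)
			start = vma->vm_start;
		if (end > vma->vm_end)
			end = vma->vm_end;

		/* The folio's first page must be mapped by this VMA. */
		if (pgoff < vma->vm_pgoff ||
		    pgoff - vma->vm_pgoff >= vma_pages(vma))
			return false;

		/* Recover the folio's mapping address from its pgoff. */
		addr = vma->vm_start +
		       ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);

		/* Fully inside [start, end)? */
		return addr >= start && addr < end &&
		       end - addr >= folio_size(folio);
	}

	static inline bool folio_within_vma(struct folio *folio,
					    struct vm_area_struct *vma)
	{
		return folio_in_range(folio, vma,
				      vma->vm_start, vma->vm_end);
	}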
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yin Fengwei <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When CONFIG_DAMON_KUNIT_TEST=y, CONFIG_DEBUG_KMEMLEAK=y and
CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y, the below memory leak is detected.
The damon_ctx structures which are allocated by kzalloc() in
damon_new_ctx(), in damon_test_ops_registration() and
damon_test_set_attrs(), are not freed. So use damon_destroy_ctx() to
free them. After applying this patch, the following memory leak is no
longer detected (a sketch of the fix pattern follows the leak report
below).
unreferenced object 0xffff2b49c6968800 (size 512):
comm "kunit_try_catch", pid 350, jiffies 4294895294 (age 557.028s)
hex dump (first 32 bytes):
88 13 00 00 00 00 00 00 a0 86 01 00 00 00 00 00 ................
00 87 93 03 00 00 00 00 0a 00 00 00 00 00 00 00 ................
backtrace:
[<0000000088e71769>] slab_post_alloc_hook+0xb8/0x368
[<0000000073acab3b>] __kmem_cache_alloc_node+0x174/0x290
[<00000000b5f89cef>] kmalloc_trace+0x40/0x164
[<00000000eb19e83f>] damon_new_ctx+0x28/0xb4
[<00000000daf6227b>] damon_test_ops_registration+0x34/0x328
[<00000000559c4801>] kunit_try_run_case+0x50/0xac
[<000000003932ed49>] kunit_generic_run_threadfn_adapter+0x20/0x2c
[<000000003c3e9211>] kthread+0x124/0x130
[<0000000028f85bdd>] ret_from_fork+0x10/0x20
unreferenced object 0xffff2b49c1a9cc00 (size 512):
comm "kunit_try_catch", pid 356, jiffies 4294895306 (age 557.000s)
hex dump (first 32 bytes):
88 13 00 00 00 00 00 00 a0 86 01 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 0a 00 00 00 00 00 00 00 ................
backtrace:
[<0000000088e71769>] slab_post_alloc_hook+0xb8/0x368
[<0000000073acab3b>] __kmem_cache_alloc_node+0x174/0x290
[<00000000b5f89cef>] kmalloc_trace+0x40/0x164
[<00000000eb19e83f>] damon_new_ctx+0x28/0xb4
[<00000000058495c4>] damon_test_set_attrs+0x30/0x1a8
[<00000000559c4801>] kunit_try_run_case+0x50/0xac
[<000000003932ed49>] kunit_generic_run_threadfn_adapter+0x20/0x2c
[<000000003c3e9211>] kthread+0x124/0x130
[<0000000028f85bdd>] ret_from_fork+0x10/0x20
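A hedged sketch of the fix pattern (the test body is abbreviated and
assumed, not copied from the tree):
	static void damon_test_ops_registration(struct kunit *test)
	{
		struct damon_ctx *c = damon_new_ctx();

		/* ... register/select ops and assert on the results ... */

		damon_destroy_ctx(c);	/* the fix: was previously leaked */
	}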
Link: https://lkml.kernel.org/r/[email protected]
Fixes: d1836a3b2a9a ("mm/damon/core-test: initialise context before test in damon_test_set_attrs()")
Fixes: 4f540f5ab4f2 ("mm/damon/core-test: add a kunit test case for ops registration")
Signed-off-by: Jinjie Ruan <[email protected]>
Reviewed-by: Feng Tang <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Cc: Brendan Higgins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm/damon/core-test: Fix memory leaks in core-test", v3.
There are a few memory leaks in core-test which are detected by kmemleak.
This patchset fixes the issues.
This patch (of 2):
When CONFIG_DAMON_KUNIT_TEST=y, CONFIG_DEBUG_KMEMLEAK=y and
CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y, the below memory leak is detected.
The damon_region structures which are allocated by kmem_cache_alloc() in
damon_new_region(), in damon_test_regions() and
damon_test_update_monitoring_result(), are not freed.
So for damon_test_regions(), replace the damon_del_region() call with
damon_destroy_region(), so that it calls both damon_del_region() and
damon_free_region(); the latter frees the damon_region. For
damon_test_update_monitoring_result(), call damon_free_region() to free
it. After applying this patch, the following memory leak is no longer
detected (a sketch of both fixes follows the leak report below).
unreferenced object 0xffff2b49c3edc000 (size 56):
comm "kunit_try_catch", pid 338, jiffies 4294895280 (age 557.084s)
hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 49 2b ff ff ............I+..
backtrace:
[<0000000088e71769>] slab_post_alloc_hook+0xb8/0x368
[<00000000b528f67c>] kmem_cache_alloc+0x168/0x284
[<000000008603f022>] damon_new_region+0x28/0x54
[<00000000a3b8c64e>] damon_test_regions+0x38/0x270
[<00000000559c4801>] kunit_try_run_case+0x50/0xac
[<000000003932ed49>] kunit_generic_run_threadfn_adapter+0x20/0x2c
[<000000003c3e9211>] kthread+0x124/0x130
[<0000000028f85bdd>] ret_from_fork+0x10/0x20
unreferenced object 0xffff2b49c5b20000 (size 56):
comm "kunit_try_catch", pid 354, jiffies 4294895304 (age 556.988s)
hex dump (first 32 bytes):
03 00 00 00 00 00 00 00 07 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 96 00 00 00 49 2b ff ff ............I+..
backtrace:
[<0000000088e71769>] slab_post_alloc_hook+0xb8/0x368
[<00000000b528f67c>] kmem_cache_alloc+0x168/0x284
[<000000008603f022>] damon_new_region+0x28/0x54
[<00000000ca019f80>] damon_test_update_monitoring_result+0x18/0x34
[<00000000559c4801>] kunit_try_run_case+0x50/0xac
[<000000003932ed49>] kunit_generic_run_threadfn_adapter+0x20/0x2c
[<000000003c3e9211>] kthread+0x124/0x130
[<0000000028f85bdd>] ret_from_fork+0x10/0x20
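A hedged sketch of both fixes (test bodies abbreviated and assumed):
	static void damon_test_regions(struct kunit *test)
	{
		struct damon_target *t = damon_new_target();
		struct damon_region *r = damon_new_region(1, 2);

		damon_add_region(r, t);
		KUNIT_EXPECT_EQ(test, 1u, damon_nr_regions(t));

		/* was damon_del_region(r, t), which unlinks but never frees */
		damon_destroy_region(r, t);	/* del + free */
		damon_free_target(t);
	}

	static void damon_test_update_monitoring_result(struct kunit *test)
	{
		struct damon_region *r = damon_new_region(3, 7);

		/* ... damon_update_monitoring_result() assertions ... */

		damon_free_region(r);	/* the fix: was previously leaked */
	}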
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests")
Fixes: f4c978b6594b ("mm/damon/core-test: add a test for damon_update_monitoring_results()")
Signed-off-by: Jinjie Ruan <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Cc: Brendan Higgins <[email protected]>
Cc: Feng Tang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|