|
For now, FAULT_FLAG_UNSHARE only applies to anonymous pages, which
implies a COW mapping. Let's hide FAULT_FLAG_UNSHARE early if we're not
dealing with a COW mapping, such that we treat it like a read fault as
documented and don't have to worry about the flag throughout all fault
handlers.
While at it, centralize the check for mutual exclusion of
FAULT_FLAG_UNSHARE and FAULT_FLAG_WRITE and just drop the check that
either flag is set in the WP handler.
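A minimal sketch of the resulting logic, as it could look in a fault-flag
sanitization helper (simplified illustration, not the literal patch):
  if (unlikely(*flags & FAULT_FLAG_UNSHARE)) {
          /* FAULT_FLAG_UNSHARE and FAULT_FLAG_WRITE are mutually exclusive */
          if (WARN_ON_ONCE(*flags & FAULT_FLAG_WRITE))
                  return VM_FAULT_SIGSEGV;
          /* Without a COW mapping, treat unsharing like an ordinary read */
          if (!is_cow_mapping(vma->vm_flags))
                  *flags &= ~FAULT_FLAG_UNSHARE;
  }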
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Let's test whether R/O long-term pinning is reliable for non-anonymous
memory: when R/O long-term pinning a page, the expectation is that we
break COW early before pinning, such that actual write access via the
page tables won't break COW later and end up replacing the R/O-pinned
page in the page table.
Consequently, R/O long-term pinning in private mappings would only target
exclusive anonymous pages.
For now, all tests fail:
# [RUN] R/O longterm GUP pin ... with shared zeropage
not ok 151 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd
not ok 152 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with tmpfile
not ok 153 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with huge zeropage
not ok 154 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB)
not ok 155 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB)
not ok 156 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with shared zeropage
not ok 157 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd
not ok 158 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with tmpfile
not ok 159 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with huge zeropage
not ok 160 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB)
not ok 161 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (1048576 kB)
not ok 162 Longterm R/O pin is reliable
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Let's add basic tests for COW with non-anonymous pages in private
mappings: write access should properly trigger COW and result in the
private changes not being visible through other page mappings.
Especially, add tests for:
* Zeropage
* Huge zeropage
* Ordinary pagecache pages via memfd and tmpfile()
* Hugetlb pages via memfd
Fortunately, all tests pass.
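The gist of such a test can be sketched in plain C (a simplified,
self-contained illustration using a memfd; error handling elided, not the
actual selftest code):
  #define _GNU_SOURCE
  #include <assert.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          size_t pagesize = getpagesize();
          int mfd = memfd_create("cow-test", 0);
          char *shared, *priv;

          ftruncate(mfd, pagesize);
          /* One shared (pagecache) view, one private (COW) view. */
          shared = mmap(NULL, pagesize, PROT_READ, MAP_SHARED, mfd, 0);
          priv = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE, mfd, 0);

          priv[0] = 0x42;          /* write must COW the pagecache page */
          assert(shared[0] == 0);  /* ... so the shared view stays clean */
          return 0;
  }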
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm/gup: remove FOLL_FORCE usage from drivers (reliable R/O
long-term pinning)".
So far, we have not supported reliable R/O long-term pinning in COW
mappings. That means, if we were to trigger R/O long-term pinning in a
MAP_PRIVATE mapping, we could end up pinning the (R/O-mapped) shared
zeropage or a pagecache page.
The next write access would trigger a write fault and replace the pinned
page by an exclusive anonymous page in the process page table; whatever
the process would write to that private page copy would not be visible to
the owner of the previous page pin: for example, RDMA could read stale
data. The end result is essentially an unexpected and hard-to-debug
memory corruption.
Some drivers tried working around that limitation by using
"FOLL_FORCE|FOLL_WRITE|FOLL_LONGTERM" for R/O long-term pinning for now.
FOLL_WRITE would trigger a write fault, if required, and break COW before
pinning the page. FOLL_FORCE is required because the VMA might lack write
permissions, and drivers wanted to make that work as well, just like
one would expect (no write access, but still triggering a write access to
break COW).
However, that is not a practical solution, because
(1) Drivers that don't stick to that undocumented and debatable pattern
would still run into that issue. For example, VFIO only uses
FOLL_LONGTERM for R/O long-term pinning.
(2) Using FOLL_WRITE just to work around a COW mapping + page pinning
limitation is unintuitive. FOLL_WRITE would, for example, mark the
page softdirty or trigger uffd-wp, even though there actually isn't
going to be any write access.
(3) The purpose of FOLL_FORCE is debug access, not bypassing missing
VMA permissions in arbitrary drivers.
So instead, make R/O long-term pinning work as expected, by breaking COW
in a COW mapping early, such that we can remove any FOLL_FORCE usage from
drivers and make FOLL_FORCE ptrace-specific (renaming it to FOLL_PTRACE).
More details in patch #8.
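To illustrate the intended end state for drivers (a sketch; actual call
sites vary):
  /* Old workaround: fake a "write" just to break COW before pinning */
  pin_user_pages_fast(addr, 1, FOLL_WRITE | FOLL_FORCE | FOLL_LONGTERM,
                      pages);

  /* With this series: plain R/O long-term pinning breaks COW reliably */
  pin_user_pages_fast(addr, 1, FOLL_LONGTERM, pages);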
This patch (of 19):
Originally, the plan was to have separate tests for testing COW of
non-anonymous (e.g., shared zeropage) pages.
Turns out that we'd need a lot of similar functionality and that there
isn't a really good reason to separate it. So let's prepare for non-anon
tests by renaming to "cow".
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Walls <[email protected]>
Cc: Anton Ivanov <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Bernard Metzler <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Benvenuti <[email protected]>
Cc: Christian Gmeiner <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Airlie <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Dennis Dalessandro <[email protected]>
Cc: "Eric W . Biederman" <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Inki Dae <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kentaro Takeda <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Kyungmin Park <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Nelson Escobar <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Oded Gabbay <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul Moore <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Russell King <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Seung-Woo Kim <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tomasz Figa <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Commit 6a108a14fa35 ("kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT")
introduces CONFIG_EXPERT to carry the previous intent of CONFIG_EMBEDDED
and just gives that intent a much better name. That was clearly a good
and long overdue renaming, and an improvement that has helped manage the
kernel build configuration over the last decade.
However, rather than bravely and radically just deleting CONFIG_EMBEDDED,
this commit gives CONFIG_EMBEDDED a new intended semantics, but keeps it
open for future contributors to implement that intended semantics:
A new CONFIG_EMBEDDED option is added that automatically selects
CONFIG_EXPERT when enabled and can be used in the future to isolate
options that should only be considered for embedded systems (RISC
architectures, SLOB, etc).
Since then, this CONFIG_EMBEDDED implicitly had two purposes:
- It can make even more options visible beyond what CONFIG_EXPERT makes
visible. In other words, it may introduce another level of enabling the
visibility of configuration options: always visible, visible with
CONFIG_EXPERT and visible with CONFIG_EMBEDDED.
- Set certain default values of some configurations differently,
following the assumption that configuring a kernel build for an
embedded system generally starts with a different set of default values
compared to kernel builds for all other kind of systems.
Considering the second purpose, note that even arguing that a kernel
build for an embedded system would choose some values differently is
tricky: the set of embedded systems running Linux is quite diverse.
Many embedded systems have powerful CPUs, and it is not clear that all
embedded systems optimize towards one specific aspect, e.g., a smaller
kernel image size. So it is unclear whether the "one set of default
configuration" induced by CONFIG_EMBEDDED is a good offer for developers
configuring their kernels.
Also, the differences of needed user-space features in an embedded system
compared to a non-embedded system are probably difficult or even
impossible to name in some generic way.
So it is not surprising that in the last decade hardly anyone has
contributed changes to make something default differently in case of
CONFIG_EMBEDDED=y.
Currently, in v6.0-rc4, SECRETMEM is the only config switched off if
CONFIG_EMBEDDED=y.
As long as that is actually the only option that currently is selected or
deselected, it is better to just make SECRETMEM configurable at build time
by experts using menuconfig instead.
Make SECRETMEM configurable when EXPERT is set and otherwise default to
yes. Further, SECRETMEM needs ARCH_HAS_SET_DIRECT_MAP.
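The resulting Kconfig entry would look roughly like this (a sketch of the
described change):
  config SECRETMEM
          default y
          bool "Enable memfd_secret() system call" if EXPERT
          depends on ARCH_HAS_SET_DIRECT_MAP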
This allows us to remove CONFIG_EMBEDDED in the near future.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lukas Bulwahn <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Reviewed-by: Masahiro Yamada <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This restriction was created because FOLL_LONGTERM used to scan the vma
list, so it could not tolerate becoming unlocked. That was fixed in
commit 52650c8b466b ("mm/gup: remove the vma allocation from
gup_longterm_locked()") and the restriction on !vma was removed.
However, the locked restriction remained, even though it isn't necessary
anymore.
Adjust __gup_longterm_locked() so it can handle the mmap_read_lock()
becoming unlocked while it is looping for migration. Migration does not
require the mmap lock because it is only handling struct pages. If we
had to unlock then ensure the whole thing returns unlocked.
Remove __get_user_pages_remote() and __gup_longterm_unlocked(). These
cases can now just directly call other functions.
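A heavily simplified sketch of the resulting loop structure (names and
details differ from the real code):
  do {
          /* May drop the mmap lock; that is reported back via *locked */
          rc = __get_user_pages_locked(mm, start, nr_pages, pages,
                                       locked, gup_flags);
          if (rc <= 0)
                  break;
          /* Operates on struct pages only; no mmap lock required */
          rc = check_and_migrate_movable_pages(rc, pages);
  } while (rc == -EAGAIN);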
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jason Gunthorpe <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: John Hubbard <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When testing overflow and overread, there is no need to keep the
resulting compilation warnings around; we should simply ignore them.
The motivation for this patch is to eliminate these warnings: maybe one
day we will compile the kernel with "-Werror -Wall", at which point they
would turn into compilation errors, so we should fix them in advance.
How to reproduce the problem (with gcc-11.3.1):
$ make -C tools/testing/selftests/
...
warning: `write' reading 4294967295 bytes from a region of size 1
[-Wstringop-overread]
warning: `read' writing 4294967295 bytes into a region of size 25
overflows the destination [-Wstringop-overflow=]
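One way to silence such a deliberate-misuse warning locally is a GCC
diagnostic pragma around the offending call (a sketch; the actual fix may
use a different mechanism):
  #pragma GCC diagnostic push
  #pragma GCC diagnostic ignored "-Wstringop-overread"
          count = read(fd, buf, count);  /* intentional overread under test */
  #pragma GCC diagnostic pop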
"-Wno-stringop-overread" is supported at least in gcc-11.1.0.
Link: https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=d14c547abd484d3540b692bb8048c4a6efe92c8b
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Rong Tao <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The assignment to the ei pointer does not need an explicit type cast.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Li zeming <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Currently, drop_caches reclaims node-by-node, looping on each node until
reclaim can make no further progress. This can however leave quite some
slab entries (such as filesystem inodes) unreclaimed if, say, objects on
node 1 keep objects on node 0 pinned. So move the "loop until no
progress" logic out to the node-by-node iteration, so that reclaim is
retried on other nodes whenever reclaim on some nodes made progress.
This fixes a problem where drop_caches would fail to reclaim lots of
otherwise perfectly reclaimable inodes.
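A simplified sketch of the restructured loop (illustrative only):
  void drop_slab(void)
  {
          int nid;
          unsigned long freed;

          do {
                  freed = 0;
                  /* Retry all nodes while any node still makes progress */
                  for_each_online_node(nid)
                          freed += drop_slab_node(nid);
          } while (freed > 0);
  }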
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Reported-by: You Zhou <[email protected]>
Reported-by: Pengfei Xu <[email protected]>
Tested-by: Pengfei Xu <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Since commit 9a10064f5625 ("mm: add a field to store names for private
anonymous memory"), a name can be set for private anonymous memory, but
not for shared anonymous memory. However, naming shared anonymous memory
is just as useful for tracking purposes.
Extend the functionality to be able to set names for shared anon.
There are two ways to create anonymous shared memory, using memfd or
directly via mmap():
1. fd = memfd_create(...)
mem = mmap(..., MAP_SHARED, fd, ...)
2. mem = mmap(..., MAP_SHARED | MAP_ANONYMOUS, -1, ...)
In both cases the anonymous shared memory is created the same way by
mapping an unlinked file on tmpfs.
The memfd way allows giving a name to anonymous shared memory, but it is
not useful when parts of the shared memory need distinct names.
Example use case: The VMM maps VM memory as anonymous shared memory (not
private because VMM is sandboxed and drivers are running in their own
processes). However, the VM tells back to the VMM how parts of the memory
are actually used by the guest, how each of the segments should be backed
(i.e. 4K pages, 2M pages), and some other information about the segments.
The naming allows us to monitor the effective memory footprint for each
of these segments from the host without looking inside the guest.
Example:
/* Create shared anonymous segment */
anon_shmem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
/* Name the segment: "MY-NAME" */
rv = prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
anon_shmem, SIZE, "MY-NAME");
cat /proc/<pid>/maps (and smaps):
7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 [anon_shmem:MY-NAME]
If the segment is not named, the output is:
7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 /dev/zero (deleted)
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pasha Tatashin <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Bagas Sanjaya <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Vincent Whitchurch <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: xu xin <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The shrinker.h header depends on a user including other headers before it
for types used by shrinker.h. Fix this by including the appropriate
headers in shrinker.h.
./include/linux/shrinker.h:13:9: error: unknown type name `gfp_t'
13 | gfp_t gfp_mask;
| ^~~~~
./include/linux/shrinker.h:71:26: error: field `list' has incomplete type
71 | struct list_head list;
| ^~~~
./include/linux/shrinker.h:82:9: error: unknown type name `atomic_long_t'
82 | atomic_long_t *nr_deferred;
| ^~~~~~~~~~~~~
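The fix boils down to making shrinker.h pull in what it uses, along these
lines (a sketch):
  #include <linux/atomic.h>  /* atomic_long_t */
  #include <linux/types.h>   /* gfp_t */
  #include <linux/list.h>    /* struct list_head */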
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 83aeeada7c69 ("vmscan: use atomic-long for shrinker batching")
Fixes: b0d40c92adaf ("superblock: introduce per-sb cache shrinker infrastructure")
Signed-off-by: T.J. Mercier <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The common hugetlb unmap routine __unmap_hugepage_range performs mmu
notification calls. However, in the case where __unmap_hugepage_range is
called via __unmap_hugepage_range_final, mmu notification calls are
performed earlier in other calling routines.
Remove mmu notification calls from __unmap_hugepage_range. Add
notification calls to the only other caller: unmap_hugepage_range.
unmap_hugepage_range is called for truncation and hole punch, so change
notification type from UNMAP to CLEAR as this is more appropriate.
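The shape of the change in unmap_hugepage_range() is roughly (a
simplified sketch):
  mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
                          start, end);
  mmu_notifier_invalidate_range_start(&range);
  __unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
  mmu_notifier_invalidate_range_end(&range);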
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Kravetz <[email protected]>
Suggested-by: Peter Xu <[email protected]>
Cc: Wei Chen <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mina Almasry <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Let's add a CONFIG_DEBUG_VM sanity check on the write bit at whatever
chance we have when walking through the pgtables. It can surface the
error earlier, even before the app notices that data was corrupted in the
snapshot. It also helps us identify that this is a wrong pgtable setup,
which is hopefully great information to have for debugging too.
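Conceptually, the check asserts that a uffd-wp protected pte must never
be writable; a sketch of such a CONFIG_DEBUG_VM check:
  if (IS_ENABLED(CONFIG_DEBUG_VM))
          /* A uffd-wp protected pte must never have the write bit set */
          WARN_ON_ONCE(pte_uffd_wp(pte) && pte_write(pte));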
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nadav Amit <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
I noticed a typo in a code comment and I fixed it.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yixuan Cao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
MADV_FREE pages have been moved into the LRU_INACTIVE_FILE list by commit
f7ad2a6cb9f7 ("mm: move MADV_FREE pages into LRU_INACTIVE_FILE list").
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jian Wen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
alloc_memory_type() returns error pointers on error instead of NULL. Use
IS_ERR() to check the return value to fix this.
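The corrected check then reads along these lines (a sketch):
  default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
  if (IS_ERR(default_dram_type))
          panic("%s() failed to allocate default DRAM tier\n", __func__);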
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 7b88bda3761b ("mm/demotion/dax/kmem: set node's abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE")
Signed-off-by: Miaoqian Lin <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Wei Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Quite straightforward: the page functions are converted to the
corresponding folio functions, and the same goes for the comments.
THP-specific code is converted to use large folios.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Tested-by: Baolin Wang <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "migrate: convert migrate_pages()/unmap_and_move() to use
folios", v2.
The conversion is quite straightforward: just replace the page API with
the corresponding folio API. migrate_pages() and unmap_and_move() mostly work
with folios (head pages) only.
This patch (of 2):
Quite straightforward: the page functions are converted to the
corresponding folio functions, and the same goes for the comments.
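The flavor of the conversion (an illustrative fragment, not from the
patch):
  struct folio *folio = page_folio(page);

  folio_lock(folio);                    /* was: lock_page(page)      */
  if (folio_test_writeback(folio))      /* was: PageWriteback(page)  */
          folio_wait_writeback(folio);  /* was: wait_on_page_writeback */
  folio_unlock(folio);                  /* was: unlock_page(page)    */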
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Revert commit 831568214883 ("mm: migration: fix the FOLL_GET failure on
following huge page"): now that commit 1a6baaa0db73 ("s390/hugetlb:
switch to generic version of follow_huge_pud()") and commit 57a196a58421
("hugetlb: simplify hugetlb handling in follow_page_mask") have been
merged, all the huge-page-following routines support the FOLL_GET
operation.
Link: https://lkml.kernel.org/r/496786039852aba90ffa68f10d0df3f4236a990b.1667983080.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Acked-by: Haiyue Wang <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
commit fdf756f71271 ("sched: Fix more TASK_state comparisons") makes
hung_task not monitor TASK_IDLE tasks. The special handling to work
around hung_task warnings is not required anymore.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavankumar Kondeti <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Document zram re-compression sysfs knobs.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add a new flag to zram block state that shows if the page is
incompressible: that none of the algorithms (including secondary ones)
could compress it.
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Minchan Kim <[email protected]>
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add support for incompressible pages writeback:
echo incompressible > /sys/block/zramX/writeback
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Document user-space visible device attributes that are enabled by
ZRAM_MULTI_COMP.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Recompression iterates through all the registered secondary compression
algorithms in order of their priorities so that we have higher chances of
finding the algorithm that compresses a particular page. This, however,
may not always be the best approach, and sometimes we may want to limit
recompression to only one particular algorithm. For instance, when a
higher priority algorithm uses too much power and the device has a
relatively low battery level, we may want to limit recompression to a
lower priority algorithm, which uses less power.
Introduce algo= parameter support to the recompression sysfs knob so
that user-space can request recompression with a particular algorithm
only:
echo "type=idle algo=zstd" > /sys/block/zramX/recompress
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Size class index comparison is sufficient, so we can remove the object
size comparisons.
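A sketch of the resulting check in the recompression path (simplified;
assumes a zs_lookup_class_index()-style helper):
  class_index_old = zs_lookup_class_index(zram->mem_pool, comp_len_old);
  class_index_new = zs_lookup_class_index(zram->mem_pool, comp_len_new);

  /* Same or bigger size class: recompression brings no memory gain */
  if (class_index_new >= class_index_old)
          continue;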
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
It makes no sense for us to recompress the object if it will land in the
same size class: we don't gain any memory, yet we pay a CPU time overhead
for inserting this object into the zspage and decompressing it
afterwards.
[senozhatsky: rebased and fixed conflicts]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Romanov <[email protected]>
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Avoid typecasts that are needed for IS_ERR() and use IS_ERR_VALUE()
instead.
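I.e., roughly:
  /* before: a cast is needed just to use IS_ERR() on the unsigned long */
  if (IS_ERR((void *)handle))
          return PTR_ERR((void *)handle);

  /* after: IS_ERR_VALUE() works on the unsigned long directly */
  if (IS_ERR_VALUE(handle))
          return PTR_ERR((void *)handle);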
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Andrew Morton <[email protected]>
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Re-phrase writeback BIO error comment.
Link: https://lkml.kernel.org/r/[email protected]
Reported-by: Andrew Morton <[email protected]>
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add a new flag to zram block state that shows if the page was
recompressed (using an alternative compression algorithm).
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Allow zram to recompress pages using secondary compression streams.
Re-compression algorithms (we support up to 3 at this stage)
are selected via recomp_algorithm:
echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm
Please read documentation for more details.
We support several recompression modes:
1) IDLE pages recompression is activated by `idle` mode
echo "type=idle" > /sys/block/zram0/recompress
2) Since there may be many idle pages, user-space may pass a size
threshold value (in bytes), and we will recompress only pages of
equal or greater size:
echo "threshold=888" > /sys/block/zram0/recompress
3) HUGE pages recompression is activated by `huge` mode
echo "type=huge" > /sys/block/zram0/recompress
4) HUGE_IDLE pages recompression is activated by `huge_idle` mode
echo "type=huge_idle" > /sys/block/zram0/recompress
[[email protected]: we should always zero out the err variable in the recompress loop]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
We will use the non-WB variant in the ZRAM page recompression path.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Introduce recomp_algorithm sysfs knob that controls secondary algorithm
selection used for recompression.
We will support up to 3 secondary compression algorithms, which are
sorted in order of their priority. To select an algorithm, the user has
to provide its name and priority:
echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm
echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm
During recompression zram iterates through the list of registered
secondary algorithms in order of their priorities.
We also have a short version for cases when there is only
one secondary compression algorithm:
echo "algo=zstd" > /sys/block/zramX/recomp_algorithm
This will register zstd as the secondary algorithm with priority 1.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Alexey Romanov <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "zram: Support multiple compression streams", v5.
This series adds support for multiple compression streams. The main idea
is that different compression algorithms have different characteristics
and zram may benefit when it uses a combination of algorithms: a default
algorithm that is faster but has a lower compression ratio, and a
secondary algorithm that achieves a higher compression ratio at the price
of slower compression/decompression.
There are several use-cases for this functionality:
- huge pages re-compression: zstd or deflate can successfully compress
huge pages (~50% of huge pages on my synthetic ChromeOS tests), IOW
pages that lzo was not able to compress.
- idle pages re-compression: idle/cold pages sit in the memory and we
may reduce zsmalloc memory usage if we recompress those idle pages.
Userspace has a number of ways to control the behavior and impact of zram
recompression: what type of pages should be recompressed, size watermarks,
etc. Please refer to the documentation patch.
This patch (of 13):
The patch turns the compression streams and compressor algorithm name
members of struct zram into arrays, so that we can support multiple
compression streams (in the next patches).
The patch uses a rather explicit API for compressor selection:
- Get primary (default) compression stream
zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP])
- Get secondary compression stream
zcomp_stream_get(zram->comps[ZRAM_SECONDARY_COMP])
We use a similar API for compression stream put().
At this point we always have just one compression stream,
since CONFIG_ZRAM_MULTI_COMP is not yet defined.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Alexey Romanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The delayed_rmap flag of 'struct mmu_gather' is rather a private member,
but it is still accessed directly. Instead, let only the TLB gather code
access the flag.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Gordeev <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When we remove a page table entry, we are very careful to only free the
page after we have flushed the TLB, because other CPUs could still be
using the page through stale TLB entries until after the flush.
However, we have removed the rmap entry for that page early, which means
that functions like folio_mkclean() would end up not serializing with the
page table lock because the page had already been made invisible to rmap.
And that is a problem, because while the TLB entry exists, we could end up
with the following situation:
(a) one CPU could come in and clean it, never seeing our mapping of the
page
(b) another CPU could continue to use the stale and dirty TLB entry and
continue to write to said page
resulting in a page that has been dirtied, but then marked clean again,
all while another CPU might have dirtied it some more.
End result: possibly lost dirty data.
This extends our current TLB gather infrastructure to optionally track a
"should I do a delayed page_remove_rmap() for this page after flushing the
TLB". It uses the newly introduced 'encoded page pointer' to do that
without having to keep separate data around.
Note, this is complicated by a couple of issues:
- we want to delay the rmap removal, but not past the page table lock,
because that simplifies the memcg accounting
- only SMP configurations want to delay TLB flushing, since on UP
there are obviously no remote TLBs to worry about, and the page
table lock means there are no preemption issues either
- s390 has its own mmu_gather model that doesn't delay TLB flushing,
and as a result also does not want the delayed rmap. As such, we can
treat S390 like the UP case and use a common fallback for the "no
delays" case.
- we can track an enormous number of pages in our mmu_gather structure,
with MAX_GATHER_BATCH_COUNT batches of MAX_TABLE_BATCH pages each,
all set up to be approximately 10k pending pages.
We do not want to have a huge number of batched pages that we then
need to check for delayed rmap handling inside the page table lock.
Particularly that last point results in a noteworthy detail, where the
normal page batch gathering is limited once we have delayed rmaps pending,
in such a way that only the last batch (the so-called "active batch") in
the mmu_gather structure can have any delayed entries.
NOTE! While the "possibly lost dirty data" sounds catastrophic, for this
all to happen you need to have a user thread doing either madvise() with
MADV_DONTNEED or a full re-mmap() of the area concurrently with another
thread continuing to use said mapping.
So arguably this is about user space doing crazy things, but from a VM
consistency standpoint it's better if we track the dirty bit properly even
when user space goes off the rails.
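A condensed sketch of the zap-side logic described above (names
approximate):
  delay_rmap = 0;
  if (!PageAnon(page) && pte_dirty(ptent)) {
          /* Keep the rmap until after the TLB flush, see above */
          set_page_dirty(page);
          delay_rmap = 1;
          force_flush = 1;
  }
  if (!delay_rmap)
          page_remove_rmap(page, vma, false);
  /* Batch full? Flush the TLB and process pending delayed rmaps now */
  if (unlikely(__tlb_remove_page(tlb, page, delay_rmap)))
          force_flush = 1;
After the flush, the deferred page_remove_rmap() calls are performed
while still holding the page table lock.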
[[email protected]: fix UP build, per Linus]
Link: https://lore.kernel.org/all/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Reported-by: Nadav Amit <[email protected]>
Tested-by: Nadav Amit <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This is purely a preparatory patch that makes all the data structures
ready for encoding flags with the mmu_gather page pointers.
The code currently always sets the flag to zero and doesn't use it yet,
but the state is now tracked in the type. The next step will be to
actually start using it.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
release_pages() already could take either an array of page pointers, or an
array of folio pointers. Expand it to also accept an array of encoded
page pointers, which is what both the existing mlock() use and the
upcoming mmu_gather use of encoded page pointers want.
Note that release_pages() won't actually use, or react to, any extra
encoded bits. Instead, this is very much a case of "I have walked the
array of encoded pages and done everything the extra bits tell me to do,
now release it all".
Also, while the "either page or folio pointers" dual use was handled with
a cast of the pointer in "release_folios()", this takes a slightly
different approach and uses the "transparent union" attribute to describe
the set of arguments to the function:
https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html
which has been supported by gcc forever, but the kernel hasn't used
before.
That allows us to avoid using various wrappers with casts, and just use
the same function regardless of use.
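The declaration then looks roughly like:
  typedef union {
          struct page **pages;
          struct folio **folios;
          struct encoded_page **encoded_pages;
  } release_pages_arg __attribute__ ((__transparent_union__));

  void release_pages(release_pages_arg, int nr);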
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
We already have this notion in parts of the MM code (see the mlock code
with the LRU_PAGE and NEW_PAGE bits), but I'm going to introduce a new
case, and I refuse to do the same thing we've done before where we just
put bits in the raw pointer and say it's still a normal pointer.
So this introduces a 'struct encoded_page' pointer that cannot be used for
anything else than to encode a real page pointer and a couple of extra
bits in the low bits. That way the compiler can trivially track the state
of the pointer and you just explicitly encode and decode the extra bits.
Note that this makes the alignment of 'struct page' explicit even for the
case where CONFIG_HAVE_ALIGNED_STRUCT_PAGE is not set. That is entirely
redundant in almost all cases, since the page structure already contains
several word-sized entries.
However, on m68k, the alignment of even 32-bit data is just 16 bits, and
as such in theory the alignment of 'struct page' could be too. So let's
just make it very very explicit that the alignment needs to be at least 32
bits, giving us a guarantee of two unused low bits in the pointer.
Now, in practice, our page struct array is aligned much more than that
anyway, even on m68k, and our existing code in mm/mlock.c obviously
already depended on that. But since the whole point of this change is to
be careful about the type system when hiding extra bits in the pointer,
let's also be explicit about the assumptions we make.
NOTE! This is being very careful in another way too: it has a build-time
assertion that the 'flags' added to the page pointer actually fit in the
two bits. That means that this helper must be inlined, and can only be
used in contexts where the compiler can statically determine that the
value fits in the available bits.
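The helpers then boil down to something like this (a sketch):
  #define ENCODE_PAGE_BITS 3ul

  static __always_inline struct encoded_page *
  encode_page(struct page *page, unsigned long flags)
  {
          /* Build-time assertion: the flags must fit in the low bits */
          BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);
          return (struct encoded_page *)(flags | (unsigned long)page);
  }

  static inline unsigned long encoded_page_flags(struct encoded_page *page)
  {
          return ENCODE_PAGE_BITS & (unsigned long)page;
  }

  static inline struct page *encoded_page_ptr(struct encoded_page *page)
  {
          return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
  }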
[[email protected]: kerneldoc on a forward-declared struct confuses htmldocs]
Link: https://lore.kernel.org/all/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]> [s390]
Cc: Nadav Amit <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Let's extend the test to cover the possible mprotect() optimization when
removing write-protection. mprotect() must not allow write-access to a
COW-shared page by accident.
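The essence of the added check, sketched in plain C (assuming mem points
at a COW-shared anonymous page, e.g. shared with a child after fork();
simplified from the real selftest):
  /* Drop and re-add write permission ... */
  mprotect(mem, pagesize, PROT_READ);
  mprotect(mem, pagesize, PROT_READ | PROT_WRITE);

  /* ... the write must still COW; the sharer's view must not change */
  mem[0] = 0xff;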
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
NUMA hinting no longer uses savedwrite, let's rip it out.
... and while at it, drop __pte_write() and __pmd_write() on ppc64.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
commit b191f9b106ea ("mm: numa: preserve PTE write permissions across a
NUMA hinting fault") added remembering write permissions using ordinary
pte_write() for PROT_NONE mapped pages to avoid write faults when
remapping the page !PROT_NONE on NUMA hinting faults.
That commit noted:
The patch looks hacky but the alternatives looked worse. The tidiest was
to rewalk the page tables after a hinting fault but it was more complex
than this approach and the performance was worse. It's not generally
safe to just mark the page writable during the fault if it's a write
fault as it may have been read-only for COW so that approach was
discarded.
Later, commit 288bc54949fc ("mm/autonuma: let architecture override how
the write bit should be stashed in a protnone pte.") introduced a family
of savedwrite PTE functions that didn't necessarily improve the whole
situation.
One confusing thing is that nowadays, if a page is pte_protnone()
and pte_savedwrite(), then pte_write() is also true. Another source of
confusion is that there is only a single pte_mk_savedwrite() call in the
kernel. All other write-protection code seems to silently rely on
pte_wrprotect().
Ever since PageAnonExclusive was introduced and we started using it in
mprotect context via commit 64fe24a3e05e ("mm/mprotect: try avoiding write
faults for exclusive anonymous pages when changing protection"), we do
have machinery in place to avoid write faults when changing protection,
which is exactly what we want to do here.
Let's similarly do what ordinary mprotect() does nowadays when upgrading
write permissions and reuse can_change_pte_writable() and
can_change_pmd_writable() to detect if we can upgrade PTE permissions to be
writable.
For anonymous pages there should be absolutely no change: if an
anonymous page is not exclusive, it could not have been mapped writable --
because only exclusive anonymous pages can be mapped writable.
However, there *might* be a change for writable shared mappings that
require writenotify: if they are not dirty, we cannot map them writable.
While it might not matter in practice, we'd need a different way to
identify whether writenotify is actually required -- and ordinary mprotect
would benefit from that as well.
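In do_numa_page(), the recomputation is then roughly (a sketch):
  pte = pte_modify(old_pte, vma->vm_page_prot);
  writable = pte_write(pte);
  if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
      can_change_pte_writable(vma, vmf->address, pte))
          writable = true;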
Note that we don't optimize for the actual migration case:
(1) When migration succeeds the new PTE will not be writable because the
source PTE was not writable (protnone); in the future we
might just optimize that case similarly by reusing
can_change_pte_writable()/can_change_pmd_writable() when removing
migration PTEs.
(2) When migration fails, we'd have to recalculate the "writable" flag
because we temporarily dropped the PT lock; for now keep it simple and
set "writable=false".
We'll remove all savedwrite leftovers next.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Let's factor the check out into vma_wants_manual_pte_write_upgrade(), to be
reused in NUMA hinting fault context soon.
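The factored-out helper is essentially (a sketch):
  static inline bool
  vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
  {
          /*
           * For shared mappings, writenotify decides; for private
           * mappings, VMA write permission is all that matters.
           */
          if (vma->vm_flags & VM_SHARED)
                  return vma_wants_writenotify(vma, vma->vm_page_prot);
          return !!(vma->vm_flags & VM_WRITE);
  }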
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Let's replicate what we have for PTEs in can_change_pte_writable() also
for PMDs.
While this might look like a pure performance improvement, we'll use this
to get rid of savedwrite handling in do_huge_pmd_numa_page() next; the
new code is placed strategically for that purpose.
Note that MM_CP_TRY_CHANGE_WRITABLE is currently only set when we come
via mprotect_fixup().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
We want to replicate this code for handling PMDs soon.
(1) No need to crash the kernel, warning and rejecting is good enough. As
this will no longer get optimized out, drop the pte_write() check: no
harm would be done.
(2) Add a comment why PROT_NONE mapped pages are excluded.
(3) Add a comment regarding MAP_SHARED handling and why we rely on the
dirty bit in the PTE.
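After these changes, the helper reads roughly like this (condensed
sketch):
  static bool can_change_pte_writable(struct vm_area_struct *vma,
                                      unsigned long addr, pte_t pte)
  {
          struct page *page;

          if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
                  return false;
          /* (2) Don't touch entries that are not even readable */
          if (pte_protnone(pte))
                  return false;
          /* Write faults still needed for softdirty/uffd-wp tracking? */
          if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
                  return false;
          if (userfaultfd_pte_wp(vma, pte))
                  return false;
          if (!(vma->vm_flags & VM_SHARED)) {
                  /* Only exclusive anon pages can be mapped writable */
                  page = vm_normal_page(vma, addr, pte);
                  return page && PageAnon(page) && PageAnonExclusive(page);
          }
          /* (3) MAP_SHARED: the PTE dirty bit tells us writenotify ran */
          return pte_dirty(pte);
  }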
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm/autonuma: replace savedwrite infrastructure", v2.
As discussed in my talk at LPC, we can reuse the same mechanism for
deciding whether to map a pte writable when upgrading permissions via
mprotect() -- e.g., PROT_READ -> PROT_READ|PROT_WRITE -- to replace the
savedwrite infrastructure used for NUMA hinting faults (e.g., PROT_NONE ->
PROT_READ|PROT_WRITE).
Instead of maintaining previous write permissions for a pte/pmd, we
re-determine if the pte/pmd can be writable. The big benefit is that we
have a common logic for deciding whether we can map a pte/pmd writable on
protection changes.
For private mappings, there should be no difference -- from what I
understand, that is what autonuma benchmarks care about.
I ran autonumabench for v1 on a system with 2 NUMA nodes, 96 GiB each via:
perf stat --null --repeat 10
The numa01 benchmark is quite noisy in my environment and I failed to
reduce the noise so far.
numa01:
mm-unstable: 146.88 +- 6.54 seconds time elapsed ( +- 4.45% )
mm-unstable++: 147.45 +- 13.39 seconds time elapsed ( +- 9.08% )
numa02:
mm-unstable: 16.0300 +- 0.0624 seconds time elapsed ( +- 0.39% )
mm-unstable++: 16.1281 +- 0.0945 seconds time elapsed ( +- 0.59% )
It is worth noting that for shared writable mappings that require
writenotify, we will only avoid write faults if the pte/pmd is dirty
(inherited from the older mprotect logic). If we ever care about
optimizing that further, we'd need a different mechanism to identify
whether the FS still needs to get notified on the next write access.
In any case, such an optimization will then not be autonuma-specific, but
mprotect() permission upgrades would similarly benefit from it.
This patch (of 7):
Anonymous pages might have the dirty bit clear, but this should not
prevent mprotect from making them writable if they are exclusive.
Therefore, skip the test whether the page is dirty in this case.
Note that there are already other ways to get a writable PTE mapping an
anonymous page that is clean: for example, via MADV_FREE. In an ideal
world, we'd have a different indication from the FS whether writenotify is
still required.
[[email protected]: return directly; update description]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nadav Amit <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The page_owner_sort tool has existed since commit 48c96a368579
("mm/page_owner: keep track of page owners"); its built binary should be
git-ignored.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Rong Tao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
It's hard to add a page_add_anon_rmap() into __split_huge_pmd_locked()'s
HPAGE_PMD_NR set_pte_at() loop, without wincing at the "freeze" case's
HPAGE_PMD_NR page_remove_rmap() loop below it.
It's just a mistake to add rmaps in the "freeze" (insert migration entries
prior to splitting huge page) case: the pmd_migration case already avoids
doing that, so just follow its lead. page_add_ref() versus put_page()
likewise. But why is one more put_page() needed in the "freeze" case?
Because it's removing the pmd rmap, already removed when pmd_migration
(and freeze and pmd_migration are mutually exclusive cases).
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mina Almasry <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Can the lock_compound_mapcount() bit_spin_lock apparatus be removed now?
Yes. Not by atomic64_t or cmpxchg games, those get difficult on 32-bit;
but if we slightly abuse subpages_mapcount by additionally demanding that
one bit be set there when the compound page is PMD-mapped, then a cascade
of two atomic ops is able to maintain the stats without bit_spin_lock.
This is harder to reason about than when bit_spin_locked, but I believe
safe; and no drift in stats detected when testing. When there are racing
removes and adds, of course the sequence of operations is less well-
defined; but each operation on subpages_mapcount is atomically good. What
might be disastrous, is if subpages_mapcount could ever fleetingly appear
negative: but the pte lock (or pmd lock) these rmap functions are called
under, ensures that a last remove cannot race ahead of a first add.
Continue to make an exception for hugetlb (PageHuge) pages, though that
exception can be easily removed by a further commit if necessary: leave
subpages_mapcount 0, don't bother with COMPOUND_MAPPED in its case, just
carry on checking compound_mapcount too in folio_mapped(), page_mapped().
Evidence is that this way goes slightly faster than the previous
implementation in all cases (pmds after ptes now taking around 103ms); and
relieves us of worrying about contention on the bit_spin_lock.
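The scheme, sketched (constants and helper names approximate):
  /* One bit in subpages_mapcount says "compound page is PMD-mapped" */
  #define COMPOUND_MAPPED         0x800000
  #define SUBPAGES_MAPPED         (COMPOUND_MAPPED - 1)

  /* Adding a PMD mapping: a cascade of two atomic ops, no spinlock */
  first = atomic_inc_and_test(compound_mapcount_ptr(page));
  if (first)
          nr = atomic_add_return_relaxed(COMPOUND_MAPPED,
                                         subpages_mapcount_ptr(page));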
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mina Almasry <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm,thp,rmap: rework the use of subpages_mapcount", v2.
This patch (of 3):
Following suggestion from Linus, instead of counting every PTE map of a
compound page in subpages_mapcount, just count how many of its subpages
are PTE-mapped: this yields the exact number needed for NR_ANON_MAPPED and
NR_FILE_MAPPED stats, without any need for a locked scan of subpages; and
requires updating the count less often.
This does then revert total_mapcount() and folio_mapcount() to needing a
scan of subpages; but they are inherently racy, and need no locking, so
Linus is right that the scans are much better done there. Plus (unlike in
6.1 and previous) subpages_mapcount lets us avoid the scan in the common
case of no PTE maps. And page_mapped() and folio_mapped() remain scanless
and just as efficient with the new meaning of subpages_mapcount: those are
the functions which I most wanted to remove the scan from.
The updated page_dup_compound_rmap() is no longer suitable for use by anon
THP's __split_huge_pmd_locked(); but page_add_anon_rmap() can be used for
that, so long as its VM_BUG_ON_PAGE(!PageLocked) is deleted.
Evidence is that this way goes slightly faster than the previous
implementation for most cases; but significantly faster in the (now
scanless) pmds after ptes case, which started out at 870ms and was brought
down to 495ms by the previous series, now takes around 105ms.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mina Almasry <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|