|
We need to kfree() the secondary algorithms' names when resetting a zram
device that had multiple streams; otherwise we leak memory.
[[email protected]: kfree(NULL) is legal]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 001d92735701 ("zram: add recompression algorithm sysfs knob")
Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Now that xol_mapping has its own ->fault() method we no longer need the
xol_area->pages[1] == NULL terminator; a single page is enough.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sven Schnelle <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Currently each xol_area has its own instance of vm_special_mapping; this
is suboptimal and ugly. Kill xol_area->xol_mapping and add a single
global instance of vm_special_mapping; the ->fault() method can use
area->pages rather than xol_mapping->pages.
As a side effect this fixes the problem introduced by the recent commit
223febc6e557 ("mm: add optional close() to struct vm_special_mapping"):
if special_mapping_close() is called from the __mmput() paths, it will use
vma->vm_private_data = &area->xol_mapping, which has been freed by
uprobe_clear_state().
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 223febc6e557 ("mm: add optional close() to struct vm_special_mapping")
Signed-off-by: Oleg Nesterov <[email protected]>
Reported-by: Sven Schnelle <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This reverts commit 08e28de1160a712724268fd33d77b32f1bc84d1c.
A malicious application can munmap() its "[uprobes]" vma, in which case
xol_mapping.close == uprobe_clear_state() will free memory that can still
be used by another thread, or by the same thread when it hits the uprobe
breakpoint afterwards.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sven Schnelle <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Currently, we have mTHP features, but unfortunately, without support for
large folio swap-ins, once these large folios are swapped out, they are
lost because mTHP swap is a one-way process. The lack of mTHP swap-in
functionality prevents mTHP from being used on platforms like Android that
heavily rely on swap.
This patch introduces mTHP swap-in support, starting with synchronous
devices such as zRAM. This is probably the simplest and most common use case,
benefiting billions of Android phones and similar devices with minimal
implementation cost. In this straightforward scenario, large folios are
always exclusive, eliminating the need to handle complex rmap and
swapcache issues.
It offers several benefits:
1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
swap-out and swap-in. Large folios in the buddy system are also
preserved as much as possible, rather than being fragmented due
to swap-in.
2. Eliminates fragmentation in swap slots and supports successful
THP_SWPOUT.
w/o this patch (refer to the data from Chris's and Kairui's latest
swap allocator optimization, gathered while running
./thp_swap_allocator_test without the "-a" option [1]):
./thp_swap_allocator_test
Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%
w/ this patch (always 0%):
Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
...
3. With both mTHP swap-out and swap-in supported, we offer the option to enable
zsmalloc compression/decompression at a larger granularity [2]. The upcoming
optimization in zsmalloc will significantly increase swap speed and improve
compression efficiency. In a test running 100 iterations of swapping 100MiB
of anon memory, the swap speed improved dramatically:
               time consumption of swapin(ms)   time consumption of swapout(ms)
  lz4 4k                   45274                            90540
  lz4 64k                  22942                            55667
  zstdn 4k                 85035                           186585
  zstdn 64k                46558                           118533
The compression ratio also improved, as evaluated with 1 GiB of data:
  granularity     orig_data_size    compr_data_size
  4KiB-zstd       1048576000        246876055
  64KiB-zstd      1048576000        199763892
Without mTHP swap-in, the potential optimizations in zsmalloc cannot be
realized.
4. Even mTHP swap-in itself can reduce swap-in page faults by a factor
of nr_pages. Swapping in content filled with the same data 0x11, w/o
and w/ the patch for five rounds (since the content is identical,
decompression is very fast; this primarily assesses the impact of
reduced page faults):
  swp in bandwidth(bytes/ms)      w/o         w/
  round1                       624152    1127501
  round2                       631672    1127501
  round3                       620459    1139756
  round4                       606113    1139756
  round5                       624152    1152281
  avg                          621310    1137359   (+83%)
5. With both mTHP swap-out and swap-in supported, we offer the option to enable
hardware accelerators (Intel IAA) to do parallel decompression, with which
Kanchana reported a 7X improvement in zRAM read latency [3].
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gao Xiang <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: Kanchana P Sridhar <[email protected]>
Cc: Kairui Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
With large folio swap-in, we might need to uncharge multiple entries all
together, so add an nr argument to mem_cgroup_swapin_uncharge_swap().
For the existing two users, just pass nr=1.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Barry Song <[email protected]>
Acked-by: Chris Li <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gao Xiang <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Chuanhua Han <[email protected]>
Cc: Kanchana P Sridhar <[email protected]>
Cc: Usama Arif <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm: enable large folios swap-in support", v9.
Currently, we support mTHP swapout but not swapin. This means that once
mTHP is swapped out, it will come back as small folios when swapped in.
This is particularly detrimental on platforms like Android, where more than
half of the memory is in swap.
The lack of mTHP swapin functionality makes mTHP a showstopper in
scenarios that heavily rely on swap. This patchset introduces mTHP
swap-in support. It starts with synchronous devices similar to zRAM,
aiming to benefit as many users as possible with minimal changes.
This patch (of 3):
There could be a corner case where the first entry is non-zeromap, but a
subsequent entry is zeromap. In this case, we must not let
swap_read_folio_zeromap() simply return false, since we would then read
corrupted data.
Additionally, the iteration of test_bit() is unnecessary and can be
replaced with bitmap operations, which are more efficient.
We can adopt the style of swap_pte_batch() and folio_pte_batch() to
introduce swap_zeromap_batch() which seems to provide the greatest
flexibility for the caller. This approach allows the caller to either
check if the zeromap status of all entries is consistent or determine the
number of contiguous entries with the same status.
Since swap_read_folio() can't handle reading a large folio that's
partially zeromap and partially non-zeromap, we've moved the code to
mm/swap.h so that others, like those working on swap-in, can access it.
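A minimal sketch of such a batched helper, assuming a per-device "zeromap"
bitmap indexed by swap offset (field names and locking simplified; not the
exact implementation):

static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
                                     bool *is_zeromap)
{
        struct swap_info_struct *sis = swp_swap_info(entry);
        unsigned long start = swp_offset(entry);
        unsigned long end = start + max_nr;
        bool first_bit = test_bit(start, sis->zeromap);

        if (is_zeromap)
                *is_zeromap = first_bit;
        if (max_nr <= 1)
                return max_nr;
        /* one bitmap search replaces a per-entry test_bit() loop */
        if (first_bit)
                return find_next_zero_bit(sis->zeromap, end, start) - start;
        return find_next_bit(sis->zeromap, end, start) - start;
}

A caller can compare the returned count against max_nr to learn whether the
whole batch shares the first entry's zeromap status.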
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
Signed-off-by: Barry Song <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Reviewed-by: Usama Arif <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Chuanhua Han <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gao Xiang <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Kanchana P Sridhar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This replaces all the existing READ_ONCE() based page table accesses with
the respective pxdp_get() helpers. Although these helpers might fall back
to READ_ONCE() by default, they provide an opportunity for various
platforms to override them when required. This change is a step toward
replacing all page table entry accesses with the respective pxdp_get()
helpers.
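For reference, a sketch of the generic fallback (in the
include/linux/pgtable.h style), which an architecture can override:

#ifndef pmdp_get
static inline pmd_t pmdp_get(pmd_t *pmdp)
{
        return READ_ONCE(*pmdp);
}
#endif

Call sites then read "pmd_t pmd = pmdp_get(pmdp);" instead of open-coding
READ_ONCE(*pmdp).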
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The following query shows that architectures that don't provide
asm/set_memory.h don't use the set_memory_...() functions:
$ git grep set_memory_ alpha arc csky hexagon loongarch m68k microblaze mips nios2 openrisc parisc sh sparc um xtensa
The following query shows that all core users of the set_memory_...()
functions always take the returned value into account:
$ git grep -w -e set_memory_ro -e set_memory_rw -e set_memory_x -e set_memory_nx -e set_memory_rox `find . -maxdepth 1 -type d | grep -v arch | grep /`
set_memory_...() functions can fail, leaving the memory attributes
unchanged. Make sure all callers check the returned code.
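The enforced pattern, as a short sketch (err, addr and numpages stand in
for the caller's own locals):

        /* set_memory_*() can fail and leave the attributes unchanged */
        err = set_memory_ro((unsigned long)addr, numpages);
        if (err)
                return err;     /* propagate instead of silently continuing */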
Link: https://github.com/KSPP/linux/issues/7
Link: https://lkml.kernel.org/r/6a89ffc69666de84721216947c6b6c7dcca39d7d.1725723347.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
__split_vma() and mas_store_gfp() return several types of errno on
failure, so don't ignore them in vms_gather_munmap_vmas(). For example,
__split_vma() returns -EINVAL when an unaligned huge page is unmapped.
This issue is reproduced by the LTP memfd_create03 test.
Don't pre-initialise the error variable; assign it only when a failure
actually occurs.
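Sketched (the "abort" label and the call details are illustrative, not the
exact code):

        int error;      /* deliberately left uninitialised */

        error = __split_vma(vmi, vma, start, 1);
        if (error)
                goto abort;     /* propagates -EINVAL, -ENOMEM, ... as-is */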
[[email protected]: fix whitespace, per Liam]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 6898c9039bc8 ("mm/vma: extract the gathering of vmas from do_vmi_align_munmap()")
Signed-off-by: Xiao Yang <[email protected]>
Suggested-by: Lorenzo Stoakes <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Cc: "Liam R. Howlett" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Extern declarations have no definitions with !CONFIG_MEMCG_V1 and no
users, drop them altogether.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Koutný <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Chen Ridong <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Waiman Long <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
We already do this when reporting slab info - more consistent and more
readable.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kent Overstreet <[email protected]>
Reviewed-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Similar to other poison recovery paths, use copy_mc_user_highpage() to
avoid a potential kernel panic while copying a page in copy_present_page()
during fork. Once the copy fails due to hwpoison in the source page, we
need to break out of the copy in copy_pte_range() and release the prealloc
folio, so the copy_mc_user_highpage() call is moved ahead of setting
*prealloc to NULL.
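A simplified sketch of the reordering in copy_present_page() (variable
names follow the surrounding function):

        /* copy first: on failure the prealloc folio must not be consumed */
        if (copy_mc_user_highpage(&new_folio->page, page, addr, src_vma))
                return -EHWPOISON;  /* caller breaks out of copy_pte_range() */
        *prealloc = NULL;           /* consume prealloc only after a good copy */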
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Jane Chu <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Tony Luck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm: hwpoison: two more poison recovery".
One more CoW path, do_cow_fault(), gains poison recovery support, and the
last copy_user_highpage() user, in copy_present_page() during fork, is
replaced with copy_mc_user_highpage() to support poison recovery too.
This patch (of 2):
Like commit a873dfe1032a ("mm, hwpoison: try to recover from copy-on
write faults"), there is another path that could crash because it has no
recovery code where poison is consumed by the kernel in do_cow_fault().
A crash calltrace from an old kernel is shown below, but the same can
happen in the latest mainline code:
CPU: 7 PID: 3248 Comm: mpi Kdump: loaded Tainted: G OE 5.10.0 #1
pc : copy_page+0xc/0xbc
lr : copy_user_highpage+0x50/0x9c
Call trace:
copy_page+0xc/0xbc
do_cow_fault+0x118/0x2bc
do_fault+0x40/0x1a4
handle_pte_fault+0x154/0x230
__handle_mm_fault+0x1a8/0x38c
handle_mm_fault+0xf0/0x250
do_page_fault+0x184/0x454
do_translation_fault+0xac/0xd4
do_mem_abort+0x44/0xbc
Fix it by using copy_mc_user_highpage() to handle this case, returning
VM_FAULT_HWPOISON for the CoW fault.
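A simplified sketch of the change in do_cow_fault() (the unwind label
follows the function's existing error path):

        if (copy_mc_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma)) {
                /* the source page carries poison: back out instead of crashing */
                unlock_page(vmf->page);
                put_page(vmf->page);
                ret = VM_FAULT_HWPOISON;
                goto uncharge_out;
        }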
[[email protected]: unlock/put vmf->page, per Miaohe]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Jane Chu <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jiaqi Yan <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Tony Luck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "resource: Fix region_intersects() vs
add_memory_driver_managed()", v3.
The patchset fixes a bug in region_intersects() for systems with CXL
memory. The details of the bug can be found in [1/3]. To avoid similar
bugs in the future, a kunit test case for region_intersects() is added in
[3/3]; [2/3] is a preparation patch for [3/3].
This patch (of 3):
region_intersects() is important because it's used for /dev/mem permission
checking. To avoid possible bugs in region_intersects() in the future, a
kunit test case for region_intersects() is added.
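A minimal flavor of such a check (ram_start is a placeholder here, not the
actual test code):

static void region_intersects_test(struct kunit *test)
{
        /* a pure System RAM range must report full intersection */
        KUNIT_EXPECT_EQ(test, REGION_INTERSECTS,
                        region_intersects(ram_start, PAGE_SIZE,
                                          IORESOURCE_SYSTEM_RAM,
                                          IORES_DESC_NONE));
}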
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Alison Schofield <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Baoquan He <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
While developing a kunit test case for region_intersects(), some fake
resources need to be inserted into iomem_resource. To do that, a resource
hole needs to be found first in iomem_resource.
However, alloc_free_mem_region() currently cannot work on iomem_resource,
because the start address to check cannot be 0 (gfr_continue() relies on a
non-zero start to detect the address wrapping past 0), while
iomem_resource.start == 0. To make alloc_free_mem_region() work for
iomem_resource, gfr_start() is changed to avoid returning 0 even if
base->start == 0. We don't need to consider 0 as a start address.
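Conceptually (a sketch under the constraints above, not the exact kernel
code):

/*
 * If the first candidate address were 0, a "wrapped past 0" check in
 * gfr_continue() could never fire. Skip 0 by starting at least at @align.
 */
static resource_size_t gfr_start(struct resource *base, resource_size_t size,
                                 resource_size_t align, unsigned long flags)
{
        return ALIGN(max_t(resource_size_t, base->start, align), align);
}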
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Alison Schofield <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Baoquan He <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The z3fold compressed pages allocator is rarely used, most users use
zsmalloc. The only disadvantage of zsmalloc in comparison is the
dependency on MMU, and zbud is a more common option for !MMU as it was the
default zswap allocator for a long time.
Historically, zsmalloc had worse latency than zbud and z3fold but offered
better memory savings. This is no longer the case as shown by a simple
recent analysis [1]. That analysis showed that z3fold does not have any
advantage over zsmalloc or zbud considering both performance and memory
usage. In a kernel build test on tmpfs in a limited cgroup, z3fold took
3% more time and used 1.8% more memory. The latency of zswap_load() was
7% higher, and that of zswap_store() was 10% higher. Zsmalloc is better
in all metrics.
Moreover, z3fold apparently has latent bugs, which were made noticeable by
a recent soft lockup bug report with z3fold [2]. Switching to zsmalloc
not only fixed the problem, but also reduced the swap usage from 6~8G to
1~2G. Other users have also reported being bitten by mistakenly enabling
z3fold.
Other than hurting users, z3fold is repeatedly causing wasted engineering
effort. Apart from investigating the above bug, it came up in multiple
development discussions (e.g. [3]) as something we need to handle, even
though there aren't any legitimate users (at least not intentionally).
The natural course of action is to deprecate z3fold, and remove it in a few
cycles if no objections are raised from active users. Next on the list
should be zbud, as it offers marginal latency gains at the cost of huge
memory waste when compared to zsmalloc. That one will need to wait until
zsmalloc does not depend on MMU.
Rename the user-visible config option from CONFIG_Z3FOLD to
CONFIG_Z3FOLD_DEPRECATED so that users with CONFIG_Z3FOLD=y get a new
prompt with explanation during make oldconfig. Also, remove
CONFIG_Z3FOLD=y from defconfigs.
[1] https://lore.kernel.org/lkml/CAJD7tkbRF6od-2x_L8-A1QL3=2Ww13sCj4S3i4bNndqF+3+_Vg@mail.gmail.com/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/lkml/CAJD7tkbnmeVugfunffSovJf9FAgy9rhBVt_tx=nxUveLUfqVsA@mail.gmail.com/
[[email protected]: deprecate ZSWAP_ZPOOL_DEFAULT_Z3FOLD as well]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Acked-by: Chris Down <[email protected]>
Acked-by: Nhat Pham <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Vitaly Wool <[email protected]>
Acked-by: Christoph Hellwig <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Naveen N. Rao <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
With the addition of pfnmap support in vmf_insert_pfn_{pmd,pud}() we can
take advantage of PMD and PUD faults for PCI BAR mmaps and create more
efficient mappings. PCI BARs are always a power of two and will typically
get at least PMD alignment without userspace even trying. Userspace
alignment for PUD mappings is also not too difficult.
Consolidate faults through a single handler with a new wrapper for
standard single page faults. The pre-faulting behavior of commit
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault") is removed
in this refactoring since huge_fault will cover the bulk of the faults and
results in more efficient page table usage. We also want to avoid
pre-faulted single-page mappings preempting huge-page mappings.
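The shape of the consolidated handler, sketched (bar_pfn() is a
hypothetical helper standing in for the pfn computation):

static vm_fault_t my_huge_fault(struct vm_fault *vmf, unsigned int order)
{
        unsigned long pfn = bar_pfn(vmf);       /* hypothetical */

        switch (order) {
        case 0:         /* the single-page wrapper funnels through here */
                return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
        case PMD_ORDER:
                return vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn, PFN_DEV),
                                          vmf->flags & FAULT_FLAG_WRITE);
        default:
                return VM_FAULT_FALLBACK;       /* retry at a smaller order */
        }
}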
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alex Williamson <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Support huge pfnmaps by using bit 56 (PTE_SPECIAL) for "special" on
pmds/puds. Provide the pmd/pud helpers to set/get the special bit.
There's one more thing missing for arm64 which is the pxx_pgprot() for
pmd/pud. Add them too, which is mostly the same as the pte version by
dropping the pfn field. These helpers are essential to be used in the new
follow_pfnmap*() API to report valid pgprot_t results.
Note that arm64 doesn't support huge PUDs yet, but it's still
straightforward to provide the pud helpers that we need altogether. Only
the PMD helpers will bring an immediate benefit until arm64 supports huge
PUDs in general (e.g. in THPs).
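A sketch of the pmd variants (the pud helpers mirror them), reusing the
existing PTE_SPECIAL bit and arm64's set_pmd_bit():

#define pmd_special(pmd)        (!!(pmd_val(pmd) & PTE_SPECIAL))

static inline pmd_t pmd_mkspecial(pmd_t pmd)
{
        return set_pmd_bit(pmd, __pgprot(PTE_SPECIAL));
}

/* like pte_pgprot(): everything minus the pfn field */
static inline pgprot_t pmd_pgprot(pmd_t pmd)
{
        unsigned long pfn = pmd_pfn(pmd);

        return __pgprot(pmd_val(pfn_pmd(pfn, __pgprot(0))) ^ pmd_val(pmd));
}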
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Helpers to install and detect special pmd/pud entries. In short, bit 9 on
x86 is not used for pmd/pud, so we can directly define them the same as
the pte level. One note is that bit 9 is also used by _PAGE_BIT_CPA_TEST,
but that is only used in a debug test and shouldn't conflict in this case.
Another note is that pxx_set|clear_flags() for pmd/pud need to be moved
up so that they can be referenced by the new special-bit helpers.
There's no change in the code that was moved.
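A sketch of the pmd helpers (pud is analogous), built on the relocated
pmd_set_flags():

static inline bool pmd_special(pmd_t pmd)
{
        return pmd_flags(pmd) & _PAGE_SPECIAL;
}

static inline pmd_t pmd_mkspecial(pmd_t pmd)
{
        return pmd_set_flags(pmd, _PAGE_SPECIAL);
}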
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
follow_pte() users have been converted to follow_pfnmap*(). Remove the
API.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Use the new API that can understand huge pfn mappings.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Use the new API that can understand huge pfn mappings.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Use the new API that can understand huge pfn mappings.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Use the new API that can understand huge pfn mappings.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Use the new API that can understand huge pfn mappings.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Use the new pfnmap API to allow huge MMIO mappings for VMs. The rest of
the work is already handled on the other side (host_pfn_mapping_level()).
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Introduce a pair of APIs to follow pfn mappings and retrieve entry
information. This is very similar to what follow_pte() did before, but
differs in that it recognizes huge pfn mappings.
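Usage follows a start/end pairing so pgtable lock details stay hidden from
the caller; a sketch based on the description here:

        struct follow_pfnmap_args args = { .vma = vma, .address = addr };
        unsigned long pfn;

        if (follow_pfnmap_start(&args))
                return -EFAULT;         /* no valid pfnmap entry here */
        pfn = args.pfn;                 /* entry info, whatever the level */
        /* ... inspect the mapping while it is stable ... */
        follow_pfnmap_end(&args);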
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Currently there are:
- 8 archs (arc, arm64, include, mips, powerpc, s390, sh, x86) that
support pte_pgprot().
- 2 archs (x86, sparc) that support pmd_pgprot().
- 1 arch (x86) that supports pud_pgprot().
Always define them to be used in generic code, and then we don't need to
fiddle with "#ifdef"s when doing so.
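The generic fallbacks are trivial, sketched:

/* fallbacks so generic code can call these without #ifdefs */
#ifndef pmd_pgprot
#define pmd_pgprot(x)   __pgprot(0)
#endif

#ifndef pud_pgprot
#define pud_pgprot(x)   __pgprot(0)
#endif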
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Teach the fork code to properly copy pfnmaps at the pmd/pud levels. The
pud case is much easier, but the write bit needs to be preserved for
writable, shared pud mappings such as PFNMAP ones; otherwise a follow-up
write in either the parent or the child process will trigger a write
fault.
Do the same for the pmd level.
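A sketch of the pud copy logic (the pmd path mirrors it):

        pud = *src_pud;
        /* write-protect only for CoW; a shared PFNMAP pud keeps its write
         * bit, otherwise a follow-up write would fault spuriously */
        if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) {
                pudp_set_wrprotect(src_mm, addr, src_pud);
                pud = pud_wrprotect(pud);
        }
        set_pud_at(dst_mm, addr, dst_pud, pud);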
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Teach folio_walk_start() to recognize special pmd/pud mappings, and fail
them properly as it means there's no folio backing them.
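Sketched check at the pmd level (pud is analogous):

        /* special pmds map raw pfns with no struct page behind them */
        if (pmd_special(pmd)) {
                spin_unlock(ptl);
                return NULL;    /* fail the walk: there is no folio here */
        }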
[[email protected]: remove some stale comments, per David]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Since gup-fast doesn't have the vma reference, teach it to detect such huge
pfnmaps by checking the special bit for pmd/pud too, just as it does for
ptes.
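The check, sketched for the pmd leaf path (pud is the same):

        /* gup-fast has no vma to consult; the special bit is the only signal */
        if (pmd_special(orig))
                return 0;       /* bail out; caller falls back to the slow path */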
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This enables PFNMAPs to be mapped at either the pmd or the pud level.
Generalize the dax case into vma_is_special_huge() so as to cover both.
Meanwhile, rename the macro to THP_ORDERS_ALL_SPECIAL.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
We need these special bits to be around on pfnmaps. Mark them properly in
the !devmap case, reflecting that there's no page struct backing the
entry.
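A sketch of the marking in the pmd injection path (pud analogous):

        entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
        if (pfn_t_devmap(pfn))
                entry = pmd_mkdevmap(entry);  /* backed by device page structs */
        else
                entry = pmd_mkspecial(entry); /* raw pfn: no struct page at all */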
Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
It has constantly returned false since 2017. One assertion was added in
2019, but it should never have triggered; IOW, what is checked should be
asserted instead.
If the case hasn't existed for 7 years, maybe it's a good idea to remove
the check and only add it back when it actually comes up.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm: Support huge pfnmaps", v2.
Overview
========
This series implements huge pfnmap support for mm in general. Huge
pfnmap allows e.g. VM_PFNMAP vmas to map at either the PMD or PUD level,
similar to what we do with dax / thp / hugetlb so far, to benefit from TLB
hits. Now we extend that idea to PFN mappings, e.g. PCI MMIO BARs, which
can grow as large as 8GB or even bigger.
Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. The last
patch (from Alex Williamson) will be the first user of huge pfnmap,
enabling the vfio-pci driver to fault in huge pfn mappings.
Implementation
==============
In reality, it's relatively simple to add such support compared to many
other types of mappings, because of PFNMAP's special property of having no
vmemmap backing it: most of the kernel routines on huge mappings should
simply already fail for them, like GUP or the old-school follow_page()
(which was recently rewritten into the folio_walk* APIs by David).
One catch is that generic paths are still immature regarding PUDs here and
there, as DAX is so far the only user. This patchset will add the 2nd
user. Hugetlb can be a 3rd user if the hugetlb unification work goes on
smoothly, but that is to be discussed later.
The other catch is how to make gup-fast work for such huge mappings even
if there's no direct sign of knowing whether it's a normal page or an MMIO
mapping. This series chose to keep the pte_special solution, reusing the
same idea of setting a special bit on pfnmap PMDs/PUDs so that gup-fast
will be able to identify them and fail properly.
Along the way, we'll also notice that the major pgtable pfn walker, aka
follow_pte(), will need to retire soon because it only works with ptes.
A new set of simple APIs is introduced (the follow_pfnmap* API) that can
do whatever follow_pte() already does, plus process huge pfnmaps. Half of
this series is about that and about converting all existing pfnmap walkers
to use the new API properly. Hopefully the new API also looks better,
avoiding the exposure of e.g. pgtable lock details to the callers, so that
it can be used in an even more straightforward way.
Here, three more options will be introduced and involved in huge pfnmap:
- ARCH_SUPPORTS_HUGE_PFNMAP
Arch developers will need to select this option when huge pfnmap is
supported in the arch's Kconfig. After this patchset is applied, both
x86_64 and arm64 will enable it by default.
- ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP
These options are for driver developers to identify whether the current
arch / config supports huge pfnmaps, so they can decide whether to use the
huge pfnmap APIs to inject them. One can refer to the last vfio-pci patch
from Alex for their proper use in a device driver.
So after the whole set is applied, and if one enables some dynamic debug
lines in the vfio-pci core files, we should observe things like:
vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100
In this specific case, it says that vfio-pci faults in PMDs properly for a
few BAR0 offsets.
Patch Layout
============
Patch 1: Introduce the new options mentioned above for huge PFNMAPs
Patch 2: A tiny cleanup
Patch 3-8: Preparation patches for huge pfnmap (include introduce
special bit for pmd/pud)
Patch 9-16: Introduce follow_pfnmap*() API, use it everywhere, and
then drop follow_pte() API
Patch 17: Add huge pfnmap support for x86_64
Patch 18: Add huge pfnmap support for arm64
Patch 19: Add vfio-pci support for all kinds of huge pfnmaps (Alex)
TODO
====
More architectures / More page sizes
------------------------------------
Currently only x86_64 (2M+1G) and arm64 (2M) are supported. There seems
to be a plan to support arm64 1G later on top of this series [2].
Any arch will need to first support THP / THP_1G, then provide a special
bit in pmds/puds to support huge pfnmaps.
remap_pfn_range() support
-------------------------
Currently, remap_pfn_range() still only maps PTEs. With the new option,
remap_pfn_range() can logically start to inject either PMDs or PUDs when
the alignment requirements match on the VAs.
When the support is there, it should silently benefit all drivers that use
remap_pfn_range() in their mmap() handlers, with a better TLB hit rate and
overall faster MMIO accesses, similar to what the processor gains from
hugepages.
More driver support
-------------------
VFIO is so far the only consumer of huge pfnmaps after this series is
applied. Besides the generic remap_pfn_range() optimization above, a
device driver can also try to optimize its mmap() for better VA alignment
for either PMD or PUD sizes. This may, IIUC, normally require userspace
changes, as the driver doesn't normally decide the VA used to map a BAR.
But I don't think I know all the drivers well enough to have the full
picture.
Credits all go to Alex for his help testing the GPU/NIC use cases above.
[0] https://lore.kernel.org/r/[email protected]
[1] https://lore.kernel.org/r/[email protected]
[2] https://lore.kernel.org/r/[email protected]
This patch (of 19):
This patch introduces the option to add the special pte bit to pmds/puds.
Archs can start to define pmd_special / pud_special when supported, by
selecting the new option. Per-arch support will be added later.
Before that, create fallbacks for these helpers so that they are always
available.
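A sketch of those fallbacks:

#ifndef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
static inline bool pmd_special(pmd_t pmd)
{
        return false;
}
static inline pmd_t pmd_mkspecial(pmd_t pmd)
{
        return pmd;
}
#endif  /* !CONFIG_ARCH_SUPPORTS_PMD_PFNMAP */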
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Gavin Shan <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
In commit 96cfe2c0fd23 ("mm/madvise: replace ptrace attach requirement for
process_madvise") process_madvise() was updated to require the caller to
possess the CAP_SYS_NICE capability to perform the operation, in addition
to a check against PTRACE_MODE_READ performed by mm_access().
The mm_access() function explicitly checks to see if the address space of
the process being referenced is the current one, in which case no check is
performed.
We, however, do not do this when checking the CAP_SYS_NICE capability. This
means that we insist on the caller possessing this capability in order to
perform madvise() operations on its own address space, which seems
nonsensical.
Simply add a check to allow invoking this function with a pidfd referring
to the current process, without elevation.
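The shape of the check, sketched (release_mm stands in for the function's
existing unwind label):

        /* no elevation needed to madvise our own address space */
        if (mm != current->mm && !capable(CAP_SYS_NICE)) {
                ret = -EPERM;
                goto release_mm;
        }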
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 96cfe2c0fd23 ("mm/madvise: replace ptrace attach requirement for process_madvise")
Signed-off-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Ensure huge_zero_folio won't have the large_rmappable flag set, so that it
can be reported correctly as thp,zero through stable_page_flags().
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 5691753d73a2 ("mm: convert huge_zero_page to huge_zero_folio")
Signed-off-by: Miaohe Lin <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Syzbot reports a UAF in hugetlb_fault(). This happens because
vmf_anon_prepare() could drop the per-VMA lock and allow the current VMA
to be freed before hugetlb_vma_unlock_read() is called.
We can fix this by using a modified version of vmf_anon_prepare() that
doesn't release the VMA lock on failure, and then releasing it ourselves
after hugetlb_vma_unlock_read().
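A sketch of the resulting ordering in the fault path (assuming the
__vmf_anon_prepare() helper introduced by the patch below):

  ret = __vmf_anon_prepare(vmf);	/* keeps the per-VMA lock held */
  /* ... fault handling ... */
  hugetlb_vma_unlock_read(vma);
  if (unlikely(ret & VM_FAULT_RETRY))
  	vma_end_read(vma);		/* drop the per-VMA lock last */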
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9acad7ba3e25 ("hugetlb: use vmf_anon_prepare() instead of anon_vma_prepare()")
Reported-by: [email protected]
Closes: https://lore.kernel.org/linux-mm/[email protected]/
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Some callers of vmf_anon_prepare() may not want us to release the per-VMA
lock ourselves. Rename vmf_anon_prepare() to __vmf_anon_prepare() and let
the callers drop the lock when desired.
Also, make vmf_anon_prepare() a wrapper that releases the per-VMA lock
itself for any callers that don't care.
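A sketch of the wrapper, mirroring the description above:

  static inline vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
  {
  	vm_fault_t ret = __vmf_anon_prepare(vmf);

  	/* drop the per-VMA lock for callers that don't care */
  	if (unlikely(ret & VM_FAULT_RETRY))
  		vma_end_read(vmf->vma);
  	return ret;
  }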
This is in preparation for fixing this bug reported by syzbot:
https://lore.kernel.org/linux-mm/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9acad7ba3e25 ("hugetlb: use vmf_anon_prepare() instead of anon_vma_prepare()")
Reported-by: [email protected]
Closes: https://lore.kernel.org/linux-mm/[email protected]/
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
On a system with CXL memory, the resource tree (/proc/iomem) related to
CXL memory may look as follows.
490000000-50fffffff : CXL Window 0
490000000-50fffffff : region0
490000000-50fffffff : dax0.0
490000000-50fffffff : System RAM (kmem)
This is because drivers/dax/kmem.c calls add_memory_driver_managed()
when onlining CXL memory, which makes "System RAM (kmem)" a descendant
of "CXL Window X". This confuses region_intersects(), which expects all
"System RAM" resources to be at the top level of iomem_resource, and can
lead to bugs.
For example, when the following command line is executed to write some
memory in CXL memory range via /dev/mem,
$ dd if=data of=/dev/mem bs=$((1 << 10)) seek=$((0x490000000 >> 10)) count=1
dd: error writing '/dev/mem': Bad address
1+0 records in
0+0 records out
0 bytes copied, 0.0283507 s, 0.0 kB/s
the command fails as expected. However, the error code is wrong: it
should be "Operation not permitted" instead of "Bad address". More
seriously, the /dev/mem permission check in devmem_is_allowed() passes
incorrectly. Although the access is prevented later because ioremap()
isn't allowed to map System RAM, this is a potential security issue.
While the command executes, the following warning is reported in the
kernel log because ioremap() is called on System RAM.
ioremap on RAM at 0x0000000490000000 - 0x0000000490000fff
WARNING: CPU: 2 PID: 416 at arch/x86/mm/ioremap.c:216 __ioremap_caller.constprop.0+0x131/0x35d
Call Trace:
memremap+0xcb/0x184
xlate_dev_mem_ptr+0x25/0x2f
write_mem+0x94/0xfb
vfs_write+0x128/0x26d
ksys_write+0xac/0xfe
do_syscall_64+0x9a/0xfd
entry_SYSCALL_64_after_hwframe+0x4b/0x53
The details are as follows. In the above resource tree, "System RAM" is
a descendant of "CXL Window 0" instead of a top-level resource. So
region_intersects() incorrectly reports that no System RAM resources lie
in the CXL memory region, because it only checks top-level resources.
Consequently, devmem_is_allowed() incorrectly returns 1 (allow access
via /dev/mem) for the CXL memory region. Fortunately, ioremap() doesn't
allow mapping System RAM and rejects the access.
So region_intersects() needs to be fixed to work correctly with a
resource tree where "System RAM" is not at the top level, as above. To
fix it, when an unmatched resource is found at the top level, continue
searching for matching resources among its descendants, so that no
matching resource in the resource tree is missed.
In the new implementation, an example resource tree
|------------- "CXL Window 0" ------------|
|-- "System RAM" --|
will behave, for region_intersects(, IORESOURCE_SYSTEM_RAM, ), similarly
to the following fake resource tree:
|-- "System RAM" --||-- "CXL Window 0a" --|
where "CXL Window 0a" is the part of the original "CXL Window 0" that
isn't covered by "System RAM".
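An illustrative, simplified sketch of such a walk (the helper below is
hypothetical, not the actual implementation):

  /* Does any part of [start, end] that overlaps @p match @flags,
   * looking through @p's children when @p itself does not? */
  static bool range_type_matches(struct resource *p, resource_size_t start,
  				resource_size_t end, unsigned long flags)
  {
  	struct resource *child;

  	if ((p->flags & flags) == flags)
  		return true;		/* the resource itself matches */
  	for (child = p->child; child; child = child->sibling) {
  		if (child->start > end || child->end < start)
  			continue;	/* no overlap with the range */
  		if (range_type_matches(child, start, end, flags))
  			return true;	/* a descendant matches */
  	}
  	return false;
  }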
Link: https://lkml.kernel.org/r/[email protected]
Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM")
Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Alison Schofield <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Each zsmalloc pool maintains several named kmem-caches for zs_handle-s and
zspage-s. On a system with multiple zsmalloc pools and CONFIG_DEBUG_VM
this triggers kmem_cache_sanity_check():
kmem_cache of name 'zspage' already exists
WARNING: at mm/slab_common.c:108 do_kmem_cache_create_usercopy+0xb5/0x310
...
kmem_cache of name 'zs_handle' already exists
WARNING: at mm/slab_common.c:108 do_kmem_cache_create_usercopy+0xb5/0x310
...
We provide the zram device name when initializing its zsmalloc pool, so
we can reuse that same name for the zsmalloc caches and, hence, create
unique names that can easily be linked to the zram device that created
them.
So instead of having this
cat /proc/slabinfo
slabinfo - version: 2.1
zspage 46 46 ...
zs_handle 128 128 ...
zspage 34270 34270 ...
zs_handle 34816 34816 ...
zspage 0 0 ...
zs_handle 0 0 ...
We now have this
cat /proc/slabinfo
slabinfo - version: 2.1
zspage-zram2 46 46 ...
zs_handle-zram2 128 128 ...
zspage-zram0 34270 34270 ...
zs_handle-zram0 34816 34816 ...
zspage-zram1 0 0 ...
zs_handle-zram1 0 0 ...
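A minimal sketch of deriving the per-pool cache names (error handling
and name lifetime management omitted):

  char name[32];

  snprintf(name, sizeof(name), "zs_handle-%s", pool->name);
  pool->handle_cachep = kmem_cache_create(name, ZS_HANDLE_SIZE,
  					  0, 0, NULL);
  snprintf(name, sizeof(name), "zspage-%s", pool->name);
  pool->zspage_cachep = kmem_cache_create(name, sizeof(struct zspage),
  					  0, 0, NULL);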
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 2e40e163a25a ("zsmalloc: decouple handle and object")
Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The allocated size in xen_swiotlb_alloc_coherent() and
xen_swiotlb_free_coherent() is calculated incorrectly when
XEN_PAGE_SIZE does not match PAGE_SIZE. Fix that.
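Illustrative sketch only (not necessarily the exact patch): the buffer
comes from the CPU page allocator, so the truly allocated size is a
power-of-two number of CPU pages, independent of XEN_PAGE_SIZE:

  /* alloc_pages() works in CPU-page orders, so derive the real
   * allocation size from PAGE_SIZE, not from XEN_PAGE_SIZE */
  unsigned long alloc_size = PAGE_SIZE << get_order(size);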
Fixes: 7250f422da04 ("xen-swiotlb: use actually allocated size on check physical continuous")
Reported-by: Jan Beulich <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: Jan Beulich <[email protected]>
Reviewed-by: Stefano Stabellini <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
|
|
When checking that a memory buffer is contiguous in machine memory, its
alignment needs to be checked, too. Failing to do so might result in
DMA memory that is not aligned to its requested size, leading to error
messages like:
4xxx 0000:2b:00.0: enabling device (0140 -> 0142)
4xxx 0000:2b:00.0: Ring address not aligned
4xxx 0000:2b:00.0: Failed to initialise service qat_crypto
4xxx 0000:2b:00.0: Resetting device qat_dev0
4xxx: probe of 0000:2b:00.0 failed with error -14
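Illustrative sketch of the extra check (variable names assumed; the
surrounding contiguity test is not shown):

  /* besides being machine-contiguous, the buffer must start on an
   * address aligned to its size for such devices */
  if (!IS_ALIGNED(dma_addr, size))
  	contiguous = false;	/* exchange for aligned machine memory */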
Fixes: 9435cce87950 ("xen/swiotlb: Add support for 64KB page granularity")
Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: Stefano Stabellini <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
"This is the "last" part of the support for the new nbcon consoles.
Where "nbcon" stays for "No Big console lock CONsoles" aka not under
the console_lock.
New callbacks are added to struct console:
- write_thread() for flushing nbcon consoles in task context.
- write_atomic() for flushing nbcon consoles in atomic context,
including NMI.
- con->device_lock() and device_unlock() for taking the driver
specific lock, for example, port->lock.
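For illustration, a minimal sketch (the "foo" driver and its callback
implementations are hypothetical) of how a console driver wires these
up:

  static struct console foo_console = {
  	.name		= "foo",
  	.flags		= CON_NBCON,		/* opt in to the nbcon model */
  	.write_atomic	= foo_write_atomic,	/* any context, incl. NMI */
  	.write_thread	= foo_write_thread,	/* used from task context */
  	.device_lock	= foo_device_lock,	/* e.g. takes port->lock */
  	.device_unlock	= foo_device_unlock,
  };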
New printk-specific kthreads are created:
- per-console kthreads which become responsible for flushing normal
priority messages on nbcon consoles.
- a thread which becomes responsible for flushing normal priority
messages on all consoles when CONFIG_RT is enabled.
The new callbacks are called under a special per-console lock which
was already added back in v6.7. It makes it possible to distinguish
three severities: normal, emergency, and panic. A context with a
higher priority can take over ownership when it is safe to do so,
even in the middle of handling a record. The panic context can do so
even when it is not safe, but only for the final desperate flush
before entering the infinite loop.
The new lock helps to flush the messages directly in emergency and
panic contexts. But it is not enough in all situations:
- console_lock() is still needed for synchronization against boot
consoles.
- con->device_lock() is needed for synchronization against other
operations on the same HW, e.g. serial port speed setting,
non-printk related read/write.
The dependency on con->device_lock() is mutual. Any code taking the
driver specific lock has to acquire the related nbcon console context
as well. For example, see the new uart_port_lock() API. It provides
the necessary synchronization against emergency and panic contexts
where the messages are flushed only under the new per-console lock.
Maybe surprisingly, a rather tricky part is deciding how to flush the
consoles in various situations. It has to take into account:
- message priority: normal, emergency, panic
- scheduling context: task, atomic, deferred_legacy
- registered consoles: boot, legacy, nbcon
- whether the kthreads are running: early boot, suspend, shutdown, panic
- caller: printk(), pr_flush(), printk_flush_in_panic(),
console_unlock(), console_start(), ...
The primary decision is made in printk_get_console_flush_type(). It
produces a hint about what the caller should do:
- flush nbcon consoles directly or via the kthread
- call the legacy loop (console_unlock()) directly or via irq_work
The existing behavior is preserved for the legacy consoles. The only
exception is that they are no longer flushed directly from printk()
in panic() before CPUs are stopped. But this blocking happens only
when at least one nbcon console is registered. The motivation is to
increase the chance of producing the crash dump. The legacy consoles
might create a deadlock, in contrast with nbcon consoles. The nbcon
consoles should allow the messages to be seen even when the crash
dump fails.
There are three possible ways in which nbcon consoles are flushed:
- The per-nbcon-console kthread is responsible for flushing messages
added with the normal priority. This is the default mode.
- The legacy loop, aka console_unlock(), is used when there is still
a boot console registered. There is no easy way to match an early
console driver with an nbcon console driver, and console_lock()
provides the only reliable serialization at the moment.
The legacy loop uses either con->write_atomic() or
con->write_thread() callbacks depending on whether it is allowed to
schedule. The atomic variant has to be used from printk().
- In other situations, the messages are flushed directly using
write_atomic() which can be called in any context, including NMI.
It is primarily needed during early boot or shutdown, in emergency
situations, and in panic.
The emergency priority is used by a code called within
nbcon_cpu_emergency_enter()/exit(). At the moment, it is used in four
situations: WARN(), Oops, lockdep, and RCU stall reports.
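Roughly, such a section looks like:

  nbcon_cpu_emergency_enter();
  /* ... emit the WARN/Oops/lockdep/RCU-stall report; the messages
   * are flushed directly instead of via the kthreads ... */
  nbcon_cpu_emergency_exit();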
Finally, there is no nbcon console at the moment. This means that the
changes should _not_ modify the existing behavior. The only exception
is CONFIG_RT, which forces offloading of the legacy loop, for normal
priority contexts, into the dedicated kthread"
* tag 'printk-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (54 commits)
printk: Avoid false positive lockdep report for legacy printing
printk: nbcon: Assign nice -20 for printing threads
printk: Implement legacy printer kthread for PREEMPT_RT
tty: sysfs: Add nbcon support for 'active'
proc: Add nbcon support for /proc/consoles
proc: consoles: Add notation to c_start/c_stop
printk: nbcon: Show replay message on takeover
printk: Provide helper for message prepending
printk: nbcon: Rely on kthreads for normal operation
printk: nbcon: Use thread callback if in task context for legacy
printk: nbcon: Relocate nbcon_atomic_emit_one()
printk: nbcon: Introduce printer kthreads
printk: nbcon: Init @nbcon_seq to highest possible
printk: nbcon: Add context to usable() and emit()
printk: Flush console on unregister_console()
printk: Fail pr_flush() if before SYSTEM_SCHEDULING
printk: nbcon: Add function for printers to reacquire ownership
printk: nbcon: Use raw_cpu_ptr() instead of open coding
printk: Use the BITS_PER_LONG macro
lockdep: Mark emergency sections in lockdep splats
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull debugobjects updates from Thomas Gleixner:
- Use the threshold to check for the pool refill condition, and not
the run-time recorded all-time-low fill value, which is lower than
the threshold and therefore causes refills to be delayed (a sketch
follows below).
- KCSAN annotation updates and simplification of the fill_pool() code.
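A sketch of the refill condition referenced above (identifier names
assumed to match lib/debugobjects.c):

  /* refill when the free pool drops below the static threshold,
   * not below the recorded all-time-low watermark */
  if (READ_ONCE(obj_pool_free) < debug_objects_pool_min_level)
  	fill_pool();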
* tag 'core-debugobjects-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
debugobjects: Remove redundant checks in fill_pool()
debugobjects: Fix conditions in fill_pool()
debugobjects: Fix the compilation attributes of some global variables
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Core:
- Overhaul of posix-timers in preparation of removing the workaround
for periodic timers which have signal delivery ignored.
- Remove the historical extra jiffie in msleep()
msleep() adds an extra jiffie to the timeout value to ensure
minimal sleep time. The timer wheel ensures minimal sleep time
since the large rewrite to a non-cascading wheel, but the extra
jiffie in msleep() remained unnoticed. Remove it (a sketch follows
after this summary).
- Make the timer slack handling correct for realtime tasks.
The procfs interface is inconsistent and does neither reflect
reality nor conforms to the man page. Show the correct 0 slack for
real time tasks and enforce it at the core level instead of having
inconsistent individual checks in various timer setup functions.
- The usual set of updates and enhancements all over the place.
Drivers:
- Allow the ACPI PM timer to be turned off during suspend
- No new drivers
- The usual updates and enhancements in various drivers"
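As referenced in the msleep() item above, a sketch of the change (the
body mirrors the well-known msleep() implementation):

  void msleep(unsigned int msecs)
  {
  	/* was: msecs_to_jiffies(msecs) + 1; the timer wheel already
  	 * guarantees a minimal sleep time */
  	unsigned long timeout = msecs_to_jiffies(msecs);

  	while (timeout)
  		timeout = schedule_timeout_uninterruptible(timeout);
  }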
* tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
ntp: Make sure RTC is synchronized when time goes backwards
treewide: Fix wrong singular form of jiffies in comments
cpu: Use already existing usleep_range()
timers: Rename next_expiry_recalc() to be unique
platform/x86:intel/pmc: Fix comment for the pmc_core_acpi_pm_timer_suspend_resume function
clocksource/drivers/jcore: Use request_percpu_irq()
clocksource/drivers/cadence-ttc: Add missing clk_disable_unprepare in ttc_setup_clockevent
clocksource/drivers/asm9260: Add missing clk_disable_unprepare in asm9260_timer_init
clocksource/drivers/qcom: Add missing iounmap() on errors in msm_dt_timer_init()
clocksource/drivers/ingenic: Use devm_clk_get_enabled() helpers
platform/x86:intel/pmc: Enable the ACPI PM Timer to be turned off when suspended
clocksource: acpi_pm: Add external callback for suspend/resume
clocksource/drivers/arm_arch_timer: Using for_each_available_child_of_node_scoped()
dt-bindings: timer: rockchip: Add rk3576 compatible
timers: Annotate possible non critical data race of next_expiry
timers: Remove historical extra jiffie for timeout in msleep()
hrtimer: Use and report correct timerslack values for realtime tasks
hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse.
timers: Add sparse annotation for timer_sync_wait_running().
signal: Replace BUG_ON()s
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Core:
- Remove a global lock in the affinity setting code
The lock protects a cpumask for intermediate results, and it causes
a bottleneck on simultaneous startup of multiple virtual machines.
Replace the lock and the static cpumask with a per-CPU cpumask,
which is nicely serialized by the raw spinlock held when executing
this code (a sketch follows after this summary).
- Provide support for giving a suffix to interrupt domain names.
That's required to support devices with subfunctions so that the
domain names are distinct even if they originate from the same
device node.
- The usual set of cleanups and enhancements all over the place
Drivers:
- Support for the LoongArch AVEC interrupt chip
- Refurbishment of the Armada driver so it can be extended for new
variants.
- The usual set of cleanups and enhancements all over the place"
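As referenced in the affinity item above, a sketch of the per-CPU
replacement (variable names assumed):

  /* a per-CPU scratch mask replaces the global lock + static cpumask;
   * callers already hold the irq descriptor's raw spinlock */
  static DEFINE_PER_CPU(struct cpumask, __tmp_mask);

  int irq_do_set_affinity(struct irq_data *data,
  			  const struct cpumask *mask, bool force)
  {
  	struct cpumask *tmp_mask = this_cpu_ptr(&__tmp_mask);
  	/* ... compute the effective affinity into *tmp_mask and
  	 * program the chip, returning the usual error code ... */
  }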
* tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits)
genirq: Use cpumask_intersects()
genirq/cpuhotplug: Use cpumask_intersects()
irqchip/apple-aic: Only access system registers on SoCs which provide them
irqchip/apple-aic: Add a new "Global fast IPIs only" feature level
irqchip/apple-aic: Skip unnecessary enabling of use_fast_ipi
dt-bindings: apple,aic: Document A7-A11 compatibles
irqdomain: Use IS_ERR_OR_NULL() in irq_domain_trim_hierarchy()
genirq/msi: Use kmemdup_array() instead of kmemdup()
genirq/proc: Change the return value for set affinity permission error
genirq/proc: Use irq_move_pending() in show_irq_affinity()
genirq/proc: Correctly set file permissions for affinity control files
genirq: Get rid of global lock in irq_do_set_affinity()
genirq: Fix typo in struct comment
irqchip/loongarch-avec: Add AVEC irqchip support
irqchip/loongson-pch-msi: Prepare get_pch_msi_handle() for AVECINTC
irqchip/loongson-eiointc: Rename CPUHP_AP_IRQ_LOONGARCH_STARTING
LoongArch: Architectural preparation for AVEC irqchip
LoongArch: Move irqchip function prototypes to irq-loongson.h
irqchip/loongson-pch-msi: Switch to MSI parent domains
softirq: Remove unused 'action' parameter from action callback
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull clocksource watchdog updates from Thomas Gleixner:
- Make the uncertainty margin handling more robust to prevent false
positives
- Clarify comments
* tag 'timers-clocksource-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin
clocksource: Fix comments on WATCHDOG_THRESHOLD & WATCHDOG_MAX_SKEW
clocksource: Improve comments for watchdog skew bounds
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug updates from Thomas Gleixner:
- Prepare the core for supporting parallel hotplug on LoongArch
- A small set of cleanups and enhancements
* tag 'smp-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp: Mark smp_prepare_boot_cpu() __init
cpu: Fix W=1 build kernel-doc warning
cpu/hotplug: Provide weak fallback for arch_cpuhp_init_parallel_bringup()
cpu/hotplug: Make HOTPLUG_PARALLEL independent of HOTPLUG_SMT
|