|
PagePoisoned() accesses page->flags which can be updated concurrently:
| BUG: KCSAN: data-race in next_uptodate_page / unlock_page
|
| write (marked) to 0xffffea00050f37c0 of 8 bytes by task 1872 on cpu 1:
| instrument_atomic_write include/linux/instrumented.h:87 [inline]
| clear_bit_unlock_is_negative_byte include/asm-generic/bitops/instrumented-lock.h:74 [inline]
| unlock_page+0x102/0x1b0 mm/filemap.c:1465
| filemap_map_pages+0x6c6/0x890 mm/filemap.c:3057
| ...
| read to 0xffffea00050f37c0 of 8 bytes by task 1873 on cpu 0:
| PagePoisoned include/linux/page-flags.h:204 [inline]
| PageReadahead include/linux/page-flags.h:382 [inline]
| next_uptodate_page+0x456/0x830 mm/filemap.c:2975
| ...
| CPU: 0 PID: 1873 Comm: systemd-udevd Not tainted 5.11.0-rc4-00001-gf9ce0be71d1f #1
To avoid the compiler tearing or otherwise optimizing the access, use
READ_ONCE() to access flags.
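A minimal sketch of the resulting accessor, assuming PagePoisoned() keeps comparing against the all-ones poison pattern:
static inline int PagePoisoned(const struct page *page)
{
	/* Read the racy field once, so the compiler cannot tear or re-read it. */
	return READ_ONCE(page->flags) == -1l;
}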
Link: https://lore.kernel.org/all/20210826144157.GA26950@xsang-OptiPlex-9020/
Link: https://lkml.kernel.org/r/[email protected]
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Will Deacon <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit 7a5da02de8d6 ("locking/lockdep: check for freed initmem in
static_obj()") added arch_is_kernel_initmem_freed() which is supposed to
report whether an object is part of already freed init memory.
For the time being, the generic version of
arch_is_kernel_initmem_freed() always reports 'false', although
free_initmem() is generically called on all architectures.
Therefore, change the generic version of arch_is_kernel_initmem_freed()
to check whether free_initmem() has been called. If so, then check if a
given address falls into init memory.
To ease the use of system_state, move arch_is_kernel_initmem_freed()
out of line into its only caller, which is lockdep.c.
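A sketch of what the out-of-line helper in lockdep.c could look like, assuming the generic init_section_contains() helper and the SYSTEM_FREEING_INIT state from the companion patch in this series:
static bool is_kernel_initmem_freed(unsigned long addr)
{
	/* Before init memory is freed, nothing can be "freed initmem". */
	if (system_state < SYSTEM_FREEING_INIT)
		return false;

	return init_section_contains((void *)addr, 1);
}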
Link: https://lkml.kernel.org/r/1d40783e676e07858be97d881f449ee7ea8adfb1.1633001016.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Paul Mackerras <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
core_kernel_text() considers that until system_state is at least
SYSTEM_RUNNING, init memory is valid.
But init memory is freed a few lines before setting SYSTEM_RUNNING, so
we have a small period of time when core_kernel_text() is wrong.
Create an intermediate system state called SYSTEM_FREEING_INIT that is
set before starting freeing init memory, and use it in
core_kernel_text() to report init memory invalid earlier.
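A sketch of the intended check (helper names as in kernel/extable.c; the new state slots in just before SYSTEM_RUNNING so ordered comparisons keep working):
/* init/main.c, just before init memory is released: */
system_state = SYSTEM_FREEING_INIT;
free_initmem();

/* kernel/extable.c: */
int notrace core_kernel_text(unsigned long addr)
{
	if (addr >= (unsigned long)_stext &&
	    addr < (unsigned long)_etext)
		return 1;

	/* init text is only valid until we start freeing it */
	if (system_state < SYSTEM_FREEING_INIT &&
	    init_kernel_text(addr))
		return 1;

	return 0;
}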
Link: https://lkml.kernel.org/r/9ecfdee7dd4d741d172cb93ff1d87f1c58127c9a.1633001016.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Heiko Carstens <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There was a report that starting an Ubuntu container in docker, while
using cpuset to bind it to movable nodes (nodes that only have a movable
zone, such as a node meant for hotplug or a Persistent Memory node in
normal usage), fails due to a memory allocation failure; the OOM killer
then gets involved and many other innocent processes get killed.
It can be reproduced with command:
$ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"
(where node 4 is a movable node)
runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G W I E 5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
Call Trace:
dump_stack+0x6b/0x88
dump_header+0x4a/0x1e2
oom_kill_process.cold+0xb/0x10
out_of_memory.part.0+0xaf/0x230
out_of_memory+0x3d/0x80
__alloc_pages_slowpath.constprop.0+0x954/0xa20
__alloc_pages_nodemask+0x2d3/0x300
pipe_write+0x322/0x590
new_sync_write+0x196/0x1b0
vfs_write+0x1c3/0x1f0
ksys_write+0xa7/0xe0
do_syscall_64+0x52/0xd0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Mem-Info:
active_anon:392832 inactive_anon:182 isolated_anon:0
active_file:68130 inactive_file:151527 isolated_file:0
unevictable:2701 dirty:0 writeback:7
slab_reclaimable:51418 slab_unreclaimable:116300
mapped:45825 shmem:735 pagetables:2540 bounce:0
free:159849484 free_pcp:73 free_cma:0
Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0 0
Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB
oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
The reason is that in this case the target cpuset nodes only have a
movable zone, while creating an OS inside docker sometimes needs to
allocate memory from non-movable zones (dma/dma32/normal), e.g. with
GFP_HIGHUSER. The cpuset limit forbids such allocations, so
out-of-memory killing kicks in even though both the normal nodes and
the movable nodes still have plenty of free memory.
The OOM killer cannot help resolve the situation, as there is no usable
memory for the request within the cpuset scope. The only reasonable
measure is to fail the allocation right away and have the caller deal
with it.
So add a check for cases like this in the allocation slowpath and bail
out early, returning NULL for the allocation.
As page allocation is one of the hottest paths in the kernel, this check
would hurt all users with a sane cpuset configuration. Therefore, guard
it with a static branch and detect the abnormal configuration in the
cpuset memory-binding setup, so that the extra cost in page allocation
is not paid by everyone.
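A sketch of the shape of the change (names are illustrative): the static key is flipped once, when cpuset setup detects a movable-only binding, so the common case costs only a patched-out branch:
/* include/linux/cpuset.h (sketch): */
DECLARE_STATIC_KEY_FALSE(cpusets_insane_config_key);

static inline bool cpusets_insane_config(void)
{
	return static_branch_unlikely(&cpusets_insane_config_key);
}

/* mm/page_alloc.c, in the slowpath (sketch): bail out instead of
 * looping into the OOM killer when no allowed zone can ever satisfy
 * this request. */
if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
	if (!zonelist_intersects_allowed(ac))	/* hypothetical helper */
		goto nopage;
}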
[thanks to Michal Hocko and David Rientjes for suggesting not handling
it inside the OOM code, adding the cpuset check, and refining comments]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Feng Tang <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit ffb29b1c255a ("mm/vmalloc: fix numa spreading for large hash
tables") can cause significant performance regressions in some
situations, as Andrew mentioned in [1]. The main case is vmalloc: by
default vmalloc allocates pages with NUMA_NO_NODE, which now results in
pages being allocated one by one.
To solve this, __alloc_pages_bulk and mempolicy should be considered at
the same time.
1) If a node is specified in the memory allocation request, allocate
all pages with __alloc_pages_bulk.
2) For interleaved allocation, calculate how many pages should be
allocated on each node, and use __alloc_pages_bulk to allocate the
pages on each node (see the sketch below).
[1]: https://lore.kernel.org/lkml/CALvZod4G3SzP3kWxQYn0fj+VgG-G3yWXz=gz17+3N57ru1iajw@mail.gmail.com/t/#m750c8e3231206134293b089feaa090590afa0f60
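A sketch of the interleaving split, assuming a policy with nodemask 'nodes' and the existing alloc_pages_bulk_array_node() helper; nr_pages is divided evenly across the nodes, with the remainder spread one extra page per node:
static unsigned long bulk_alloc_interleave(gfp_t gfp, struct mempolicy *pol,
					   unsigned long nr_pages,
					   struct page **page_array)
{
	int nnodes = nodes_weight(pol->nodes);
	unsigned long per_node = nr_pages / nnodes;
	unsigned long rem = nr_pages % nnodes;
	unsigned long done = 0;
	int nid;

	for_each_node_mask(nid, pol->nodes) {
		unsigned long n = per_node;

		if (rem) {		/* spread the remainder */
			n++;
			rem--;
		}
		done += alloc_pages_bulk_array_node(gfp, nid, n,
						    page_array + done);
	}
	return done;
}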
[[email protected]: coding style fixes]
[[email protected]: make two functions static]
[[email protected]: fix CONFIG_NUMA=n build]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chen Wandun <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Hanjun Guo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
With KASAN_VMALLOC and NEED_PER_CPU_PAGE_FIRST_CHUNK the kernel crashes:
Unable to handle kernel paging request at virtual address ffff7000028f2000
...
swapper pgtable: 64k pages, 48-bit VAs, pgdp=0000000042440000
[ffff7000028f2000] pgd=000000063e7c0003, p4d=000000063e7c0003, pud=000000063e7c0003, pmd=000000063e7b0003, pte=0000000000000000
Internal error: Oops: 96000007 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 5.13.0-rc4-00003-gc6e6e28f3f30-dirty #62
Hardware name: linux,dummy-virt (DT)
pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO BTYPE=--)
pc : kasan_check_range+0x90/0x1a0
lr : memcpy+0x88/0xf4
sp : ffff80001378fe20
...
Call trace:
kasan_check_range+0x90/0x1a0
pcpu_page_first_chunk+0x3f0/0x568
setup_per_cpu_areas+0xb8/0x184
start_kernel+0x8c/0x328
The vm area used in vm_area_register_early() has no KASAN shadow memory.
Let's add a new kasan_populate_early_vm_area_shadow() function to
populate the vm area shadow memory and fix the issue.
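A sketch of the hook: a weak no-op default that shadow-populating architectures (arm64 here) override, called from vm_area_register_early():
/* default (sketch): no-op for architectures that don't need early
 * shadow for the vmalloc region. */
void __init __weak kasan_populate_early_vm_area_shadow(void *start,
						       unsigned long size)
{
}

/* mm/vmalloc.c (sketch): */
void __init vm_area_register_early(struct vm_struct *vm, size_t align)
{
	...
	kasan_populate_early_vm_area_shadow(vm->addr, vm->size);
	...
}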
[[email protected]: fix redefinition of 'kasan_populate_early_vm_area_shadow']
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Acked-by: Marco Elver <[email protected]> [KASAN]
Acked-by: Andrey Konovalov <[email protected]> [KASAN]
Acked-by: Catalin Marinas <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The vmalloc guard pages are added on top of each allocation, thereby
isolating any two allocations from one another. The top guard of the
lower allocation is the bottom guard of the higher allocation, and so on.
Therefore VM_NO_GUARD is dangerous; it breaks the basic premise of
isolating separate allocations.
There are only two in-tree users of this flag, neither of which uses it
through the exported interface. Ensure it stays this way.
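For context, roughly how the guard page is accounted (as in include/linux/vmalloc.h), which is why a VM_NO_GUARD area silently abuts its neighbour:
static inline size_t get_vm_area_size(const struct vm_struct *area)
{
	if (!(area->flags & VM_NO_GUARD))
		/* return the usable size without the trailing guard page */
		return area->size - PAGE_SIZE;
	return area->size;
}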
Link: https://lkml.kernel.org/r/YUMfdA36fuyZ+/[email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Will Deacon <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
By using DECLARE_EVENT_CLASS and TRACE_EVENT_FN, we can save a lot of
space by removing duplicated code.
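A sketch of the pattern (event and field names are made up): the shared TP_STRUCT__entry/TP_fast_assign/TP_printk live in one class, and each concrete tracepoint shrinks to a DEFINE_EVENT:
DECLARE_EVENT_CLASS(mm_lock_class,

	TP_PROTO(unsigned long addr, bool write),

	TP_ARGS(addr, write),

	TP_STRUCT__entry(
		__field(unsigned long, addr)
		__field(bool, write)
	),

	TP_fast_assign(
		__entry->addr = addr;
		__entry->write = write;
	),

	TP_printk("addr=%lx write=%d", __entry->addr, __entry->write)
);

DEFINE_EVENT(mm_lock_class, mm_lock_start_locking,
	TP_PROTO(unsigned long addr, bool write),
	TP_ARGS(addr, write)
);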
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gang Li <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The ftrace core adds a newline automatically on printing, so using one
in TP_printk() creates a blank line.
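The fix is mechanical; a representative hunk (the format string is illustrative):
-	TP_printk("memcg_path=%s write=%s\n",
+	TP_printk("memcg_path=%s write=%s",
 		  __get_str(memcg_path), __entry->write ? "true" : "false")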
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gang Li <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The fallback was introduced in commit 80c33624e472 ("io-mapping: Fixup
for different names of writecombine") to fix the build on microblaze.
5 years later, it seems all archs now provide a pgprot_writecombine(),
so just remove the other possible fallbacks. For microblaze,
pgprot_writecombine() is available since commit 97ccedd793ac
("microblaze: Provide pgprot_device/writecombine macros for nommu").
This is build-tested on microblaze with a hack to always build
mm/io-mapping.o, and without hand-rolling a substitute for an x86-only
macro (_PAGE_CACHE_MASK).
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lucas De Marchi <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit 3947be1969a9 ("[PATCH] memory hotplug: sysfs and add/remove
functions") defines CONFIG_MEM_BLOCK_SIZE, but this has never been
utilized anywhere.
It is good practice to keep CONFIG_* defines exclusively for the
Kbuild system. So, drop this unused definition.
This issue was noticed due to running ./scripts/checkkconfigsymbols.py.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lukas Bulwahn <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use the helper for the checks. Rename "check_mapping" to
"zap_mapping", because "check_mapping" looks like a bool when in fact
it stores the mapping itself. When it is set, we check the mapping (it
must be non-NULL); when it is cleared, we skip the check, which matches
the old behavior.
Move the duplicated comments to the helper too.
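A sketch of the helper (name and fields per the description above; the exact form may differ):
static inline bool zap_skip_check_mapping(struct zap_details *details,
					  struct page *page)
{
	/* No details or no page: nothing to skip. */
	if (!details || !page)
		return false;

	/* Skip the page when a mapping filter is set and doesn't match. */
	return details->zap_mapping &&
	       details->zap_mapping != page_rmapping(page);
}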
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The first_index/last_index parameters in zap_details are actually only
used in unmap_mapping_range_tree(). Meanwhile, that function is only
called once, by unmap_mapping_pages().
Instead of passing these two variables through the whole stack of page
zapping code, remove them from zap_details and let them simply be
parameters of unmap_mapping_range_tree(), which is inlined.
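A sketch of the resulting signature, with the index pair now explicit parameters of the inlined walk:
static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
					    pgoff_t first_index,
					    pgoff_t last_index,
					    struct zap_details *details)
{
	struct vm_area_struct *vma;

	vma_interval_tree_foreach(vma, root, first_index, last_index) {
		/* clamp the range to the vma and zap it */
		...
	}
}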
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Liam Howlett <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It is defined in the same file just a few lines above.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Rolf Eike Beer <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now the kmem state is only used to indicate whether kmem is offline.
However, setting ->kmemcg_id to -1 can express the same thing, so
remove the kmem state entirely to simplify the code.
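A sketch of the replacement test (the helper name is hypothetical):
static inline bool memcg_kmem_online(struct mem_cgroup *memcg)
{
	/* -1 in ->kmemcg_id now means "kmem is offline". */
	return memcg->kmemcg_id >= 0;
}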
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move grabbing and releasing the bdi refcount out of the common
wb_init/wb_exit helpers into code that is only used for the non-default
memcg driven bdi_writeback structures.
[[email protected]: add comment]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix typo]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Miquel Raynal <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Vignesh Raghavendra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a new SB_I_ flag to mark superblocks that have an ephemeral bdi
associated with them, and unregister it when the superblock is shut
down.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Miquel Raynal <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Vignesh Raghavendra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As already done in GrapheneOS, add the __alloc_size attribute for
appropriate percpu allocator interfaces, to provide additional hinting
for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
compiler optimizations.
Note that due to the implementation of the percpu API, this is unlikely
to ever actually provide compile-time checking beyond very simple
non-SMP builds. But, since they are technically allocators, mark them
as such.
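What the annotation looks like on the declarations (sketch, per include/linux/percpu.h; argument 1 is the allocation size):
extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align,
					 gfp_t gfp) __alloc_size(1);
extern void __percpu *__alloc_percpu(size_t size, size_t align)
					 __alloc_size(1);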
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Co-developed-by: Daniel Micay <[email protected]>
Signed-off-by: Daniel Micay <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dwaipayan Ray <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Lukas Bulwahn <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Alexandre Bounine <[email protected]>
Cc: Gustavo A. R. Silva <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jing Xiangfeng <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Souptick Joarder <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As already done in GrapheneOS, add the __alloc_size attribute for
appropriate page allocator interfaces, to provide additional hinting for
better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
compiler optimizations.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Co-developed-by: Daniel Micay <[email protected]>
Signed-off-by: Daniel Micay <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dwaipayan Ray <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Lukas Bulwahn <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Alexandre Bounine <[email protected]>
Cc: Gustavo A. R. Silva <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jing Xiangfeng <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Souptick Joarder <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As already done in GrapheneOS, add the __alloc_size attribute for
appropriate vmalloc allocator interfaces, to provide additional hinting
for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
compiler optimizations.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Co-developed-by: Daniel Micay <[email protected]>
Signed-off-by: Daniel Micay <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dwaipayan Ray <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Lukas Bulwahn <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Alexandre Bounine <[email protected]>
Cc: Gustavo A. R. Silva <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jing Xiangfeng <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Souptick Joarder <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As already done in GrapheneOS, add the __alloc_size attribute for
regular kvmalloc interfaces, to provide additional hinting for better
bounds checking, assisting CONFIG_FORTIFY_SOURCE and other compiler
optimizations.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Co-developed-by: Daniel Micay <[email protected]>
Signed-off-by: Daniel Micay <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dwaipayan Ray <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Lukas Bulwahn <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Alexandre Bounine <[email protected]>
Cc: Gustavo A. R. Silva <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jing Xiangfeng <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Souptick Joarder <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As already done in GrapheneOS, add the __alloc_size attribute for
regular kmalloc interfaces, to provide additional hinting for better
bounds checking, assisting CONFIG_FORTIFY_SOURCE and other compiler
optimizations.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Co-developed-by: Daniel Micay <[email protected]>
Signed-off-by: Daniel Micay <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dwaipayan Ray <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Lukas Bulwahn <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Alexandre Bounine <[email protected]>
Cc: Gustavo A. R. Silva <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jing Xiangfeng <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Souptick Joarder <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Based on feedback from Joe Perches and Linus Torvalds, regularize the
slab function prototypes before making attribute changes.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Alexandre Bounine <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dwaipayan Ray <[email protected]>
Cc: Gustavo A. R. Silva <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jing Xiangfeng <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Lukas Bulwahn <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Souptick Joarder <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
GCC and Clang can use the "alloc_size" attribute to better inform the
results of __builtin_object_size() (for compile-time constant values).
Clang can additionally use alloc_size to inform the results of
__builtin_dynamic_object_size() (for run-time values).
Because GCC sees the frequent use of struct_size() as an allocator size
argument, and notices it can return SIZE_MAX (the overflow indication),
it complains about these call sites overflowing (since SIZE_MAX is
greater than the default -Walloc-size-larger-than=PTRDIFF_MAX). This
isn't helpful since we already know a SIZE_MAX will be caught at
run-time (this was an intentional design). To deal with this, we must
disable this check as it is both a false positive and redundant. (Clang
does not have this warning option.)
Unfortunately, just passing -Wno-alloc-size-larger-than is not
sufficient to make the __alloc_size attribute behave correctly under
older GCC versions. The attribute itself must be disabled in those
situations too, as there appears to be no way to reliably silence the
SIZE_MAX constant-expression cases for GCC versions below 9.1:
In file included from ./include/linux/resource_ext.h:11,
from ./include/linux/pci.h:40,
from drivers/net/ethernet/intel/ixgbe/ixgbe.h:9,
from drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c:4:
In function 'kmalloc_node',
inlined from 'ixgbe_alloc_q_vector' at ./include/linux/slab.h:743:9:
./include/linux/slab.h:618:9: error: argument 1 value '18446744073709551615' exceeds maximum object size 9223372036854775807 [-Werror=alloc-size-larger-than=]
return __kmalloc_node(size, flags, node);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/linux/slab.h: In function 'ixgbe_alloc_q_vector':
./include/linux/slab.h:455:7: note: in a call to allocation function '__kmalloc_node' declared here
void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_slab_alignment __malloc;
^~~~~~~~~~~~~~
Specifically:
'-Wno-alloc-size-larger-than' is not correctly handled by GCC < 9.1
https://godbolt.org/z/hqsfG7q84 (doesn't disable)
https://godbolt.org/z/P9jdrPTYh (doesn't admit to not knowing about option)
https://godbolt.org/z/465TPMWKb (only warns when other warnings appear)
'-Walloc-size-larger-than=18446744073709551615' is not handled by GCC < 8.2
https://godbolt.org/z/73hh1EPxz (ignores numeric value)
Since anything marked with __alloc_size would also qualify for marking
with __malloc, just include __malloc along with it to avoid redundant
markings. (Suggested by Linus Torvalds.)
Finally, make sure checkpatch.pl doesn't get confused about finding the
__alloc_size attribute on functions. (Thanks to Joe Perches.)
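A rough sketch of the shape this takes in the compiler headers (the exact gating macros may differ): the attribute only materializes on Clang or GCC >= 9.1, and always brings __malloc along:
#if defined(__clang__) || GCC_VERSION >= 90100
# define __alloc_size(x, ...)	\
	__attribute__((__alloc_size__(x, ## __VA_ARGS__))) __malloc
#else
# define __alloc_size(x, ...)	__malloc
#endif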
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Tested-by: Randy Dunlap <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Dwaipayan Ray <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Lukas Bulwahn <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Alexandre Bounine <[email protected]>
Cc: Gustavo A. R. Silva <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jing Xiangfeng <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: kernel test robot <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Souptick Joarder <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Introduce a variant of kasan_record_aux_stack() that does not do any
memory allocation through stackdepot. This will permit using it in
contexts that cannot allocate any memory.
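A sketch of the split: both entry points share one implementation, and the new variant merely forbids stackdepot allocation via the can_alloc flag plumbed through kasan_save_stack():
static void __kasan_record_aux_stack(void *addr, bool can_alloc)
{
	/* ... look up the object's alloc metadata for addr ... */
	alloc_meta->aux_stack[1] = alloc_meta->aux_stack[0];
	alloc_meta->aux_stack[0] = kasan_save_stack(GFP_NOWAIT, can_alloc);
}

void kasan_record_aux_stack(void *addr)
{
	__kasan_record_aux_stack(addr, true);
}

void kasan_record_aux_stack_noalloc(void *addr)
{
	__kasan_record_aux_stack(addr, false);
}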
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Tested-by: Shuah Khan <[email protected]>
Acked-by: Sebastian Andrzej Siewior <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: "Gustavo A. R. Silva" <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Taras Madan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vijayanand Jitta <[email protected]>
Cc: Vinayak Menon <[email protected]>
Cc: Walter Wu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add __stack_depot_save(), which provides more fine-grained control over
stackdepot's memory allocation behaviour, in case stackdepot runs out of
"stack slabs".
Normally stackdepot uses alloc_pages() in case it runs out of space;
passing can_alloc==false to __stack_depot_save() prohibits this, at the
cost of more likely failure to record a stack trace.
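The resulting interface, sketched; the old entry point keeps its behaviour by delegating with can_alloc == true:
depot_stack_handle_t __stack_depot_save(unsigned long *entries,
					unsigned int nr_entries,
					gfp_t alloc_flags, bool can_alloc);

depot_stack_handle_t stack_depot_save(unsigned long *entries,
				      unsigned int nr_entries,
				      gfp_t alloc_flags)
{
	return __stack_depot_save(entries, nr_entries, alloc_flags, true);
}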
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Tested-by: Shuah Khan <[email protected]>
Acked-by: Sebastian Andrzej Siewior <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: "Gustavo A. R. Silva" <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Taras Madan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vijayanand Jitta <[email protected]>
Cc: Vinayak Menon <[email protected]>
Cc: Walter Wu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "stackdepot, kasan, workqueue: Avoid expanding stackdepot
slabs when holding raw_spin_lock", v2.
Shuah Khan reported [1]:
| When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
| kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
| it tries to allocate memory attempting to acquire spinlock in page
| allocation code while holding workqueue pool raw_spinlock.
|
| There are several instances of this problem when block layer tries
| to __queue_work(). Call trace from one of these instances is below:
|
| kblockd_mod_delayed_work_on()
| mod_delayed_work_on()
| __queue_delayed_work()
| __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
| insert_work()
| kasan_record_aux_stack()
| kasan_save_stack()
| stack_depot_save()
| alloc_pages()
| __alloc_pages()
| get_page_from_freelist()
| rm_queue()
| rm_queue_pcplist()
| local_lock_irqsave(&pagesets.lock, flags);
| [ BUG: Invalid wait context triggered ]
PROVE_RAW_LOCK_NESTING is pointing out that (on RT kernels) the locking
rules are being violated. More generally, memory is being allocated
from a non-preemptible context (a raw_spin_lock'd critical section)
where it is not allowed.
To properly fix this, we must prevent stackdepot from replenishing its
"stack slab" pool if memory allocations cannot be done in the current
context: it is a bug to use either GFP_ATOMIC or GFP_NOWAIT in certain
non-preemptible contexts, including raw_spin_locks (see gfp.h and commit
ab00db216c9c7).
The only downside is that saving a stack trace may fail if: stackdepot
runs out of space AND the same stack trace has not been recorded before.
I expect this to be unlikely, and a simple experiment (boot the kernel)
didn't result in any failure to record a stack trace from insert_work().
The series includes a few minor fixes to stackdepot that I noticed in
preparing the series. It then introduces __stack_depot_save(), which
exposes the option to force stackdepot to not allocate any memory.
Finally, KASAN is changed to use the new stackdepot interface and
provide kasan_record_aux_stack_noalloc(), which is then used by
workqueue code.
[1] https://lkml.kernel.org/r/[email protected]
This patch (of 6):
<linux/stackdepot.h> refers to gfp_t, but doesn't include gfp.h.
Fix it by including <linux/gfp.h>.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Tested-by: Shuah Khan <[email protected]>
Acked-by: Sebastian Andrzej Siewior <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Walter Wu <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Vijayanand Jitta <[email protected]>
Cc: Vinayak Menon <[email protected]>
Cc: "Gustavo A. R. Silva" <[email protected]>
Cc: Taras Madan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Not required at all, and having this causes a huge kernel rebuild as
soon as something in dax.h changes.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
With CONFIG_SLUB_CPU_PARTIAL enabled, SLUB keeps a percpu list of
partial slabs that can be promoted to cpu slab when the previous one is
depleted, without accessing the shared partial list. A slab can be
added to this list by 1) refill of an empty list from get_partial_node()
- once we really have to access the shared partial list, we acquire
multiple slabs to amortize the cost of locking, and 2) first free to a
previously full slab - instead of putting the slab on a shared partial
list, we can more cheaply freeze it and put it on the per-cpu list.
To control how large a percpu partial list can grow for a kmem cache,
set_cpu_partial() calculates a target number of free objects on each
cpu's percpu partial list, and this can also be set via the sysfs file
cpu_partial.
However, the tracking of the actual number of objects is imprecise, in
order to limit the overhead of cpu X freeing an object to a slab on the
percpu partial list of cpu Y. Basically, the percpu partial slabs form
a single linked list, and when we add a new slab to the list with
current head "oldpage", we set in the struct page of the slab we're
adding:
page->pages = oldpage->pages + 1; // this is precise
page->pobjects = oldpage->pobjects + (page->objects - page->inuse); // free count snapshotted at add time only
page->next = oldpage;
Thus the real number of free objects in the slab (objects - inuse) is
only determined at the moment the slab is added to the percpu partial
list, and further freeing doesn't update the pobjects counter or
propagate it to the current list head. As Jann reports [1], this can
easily lead to large inaccuracies, where the target number of objects
(up to 30 by default) can translate to the same number of (empty) slab
pages on the list. In case 2) above, we put a slab with 1 free object
on the list, thus only increasing page->pobjects by 1, even if there are
subsequent frees on the same slab. Jann has noticed this in practice
and so did we [2] when investigating a significant increase of kmemcg
usage after switching from SLAB to SLUB.
While this is no longer a problem in the kmemcg context thanks to the
accounting rewrite in 5.9, the memory waste is still not ideal, and
it's questionable whether it makes sense to perform control based on
free object counts when those counts can so easily become inaccurate.
So this patch converts the accounting to be based on the number of
pages only (which is precise) and removes the page->pobjects field
completely. This is also ultimately simpler.
To retain the existing set_cpu_partial() heuristic, first calculate the
target number of objects as before, but then convert it to a target
number of pages by assuming the pages will be half-filled on average.
This assumption may obviously also be inaccurate in practice, but it
cannot degrade to the actual number of pages equalling the target
number of objects.
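A sketch of that conversion (field and helper names approximate), where oo_objects(s->oo) is the number of objects per slab page:
static void slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
{
	unsigned int nr_pages;

	s->cpu_partial = nr_objects;

	/* Assume slabs on the list are half-filled on average. */
	nr_pages = DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));
	s->cpu_partial_pages = nr_pages;
}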
We could also skip the intermediate step with target number of objects
and rewrite the heuristic in terms of pages. However we still have the
sysfs file cpu_partial which uses number of objects and could break
existing users if it suddenly becomes number of pages, so this patch
doesn't do that.
In practice, after this patch the heuristics limit the size of the
percpu partial list up to 2 pages. In case of a reported regression
(which would mean some workload has benefited from the previous
imprecise object-based counting), we can tune the heuristics to get a
better compromise within the new scheme, while still avoiding
unexpectedly long percpu partial lists.
[1] https://lore.kernel.org/linux-mm/CAG48ez2Qx5K1Cab-m8BdSibp6wLTip6ro4=-umR7BLsEgjEYzA@mail.gmail.com/
[2] https://lore.kernel.org/all/[email protected]/
==========
Evaluation
==========
Mel was kind enough to run v1 through the mmtests machinery for netperf
(localhost) and hackbench; the most significant results are below.
There are some apparent regressions, especially with hackbench, which I
think ultimately boil down to having shorter percpu partial lists on
average, with some benchmarks benefiting from longer ones. Monitoring
slab usage also indicated less memory used by slab. Based on that, the
following patch will bump the defaults to allow longer percpu partial
lists than after this patch.
However the goal is certainly not such that we would limit the percpu
partial lists to 30 pages just because previously a specific alloc/free
pattern could lead to the limit of 30 objects translate to a limit to 30
pages - that would make little sense. This is a correctness patch, and
if a workload benefits from larger lists, the sysfs tuning knobs are
still there to allow that.
Netperf
2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads per socket), 384GB RAM
TCP-RR:
hmean before 127045.79 after 121092.94 (-4.69%, worse)
stddev before 2634.37 after 1254.08
UDP-RR:
hmean before 166985.45 after 160668.94 ( -3.78%, worse)
stddev before 4059.69 after 1943.63
2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads per socket), 512GB RAM
TCP-RR:
hmean before 84173.25 after 76914.72 ( -8.62%, worse)
UDP-RR:
hmean before 93571.12 after 96428.69 ( 3.05%, better)
stddev before 23118.54 after 16828.14
2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads per socket), 64GB RAM
TCP-RR:
hmean before 49984.92 after 48922.27 ( -2.13%, worse)
stddev before 6248.15 after 4740.51
UDP-RR:
hmean before 61854.31 after 68761.81 ( 11.17%, better)
stddev before 4093.54 after 5898.91
other machines - within 2%
Hackbench
(results before and after the patch, negative % means worse)
2-socket AMD EPYC 7713 (64 cores, 128 threads per socket), 256GB RAM
hackbench-process-sockets
Amean 1 0.5380 0.5583 ( -3.78%)
Amean 4 0.7510 0.8150 ( -8.52%)
Amean 7 0.7930 0.9533 ( -20.22%)
Amean 12 0.7853 1.1313 ( -44.06%)
Amean 21 1.1520 1.4993 ( -30.15%)
Amean 30 1.6223 1.9237 ( -18.57%)
Amean 48 2.6767 2.9903 ( -11.72%)
Amean 79 4.0257 5.1150 ( -27.06%)
Amean 110 5.5193 7.4720 ( -35.38%)
Amean 141 7.2207 9.9840 ( -38.27%)
Amean 172 8.4770 12.1963 ( -43.88%)
Amean 203 9.6473 14.3137 ( -48.37%)
Amean 234 11.3960 18.7917 ( -64.90%)
Amean 265 13.9627 22.4607 ( -60.86%)
Amean 296 14.9163 26.0483 ( -74.63%)
hackbench-thread-sockets
Amean 1 0.5597 0.5877 ( -5.00%)
Amean 4 0.7913 0.8960 ( -13.23%)
Amean 7 0.8190 1.0017 ( -22.30%)
Amean 12 0.9560 1.1727 ( -22.66%)
Amean 21 1.7587 1.5660 ( 10.96%)
Amean 30 2.4477 1.9807 ( 19.08%)
Amean 48 3.4573 3.0630 ( 11.41%)
Amean 79 4.7903 5.1733 ( -8.00%)
Amean 110 6.1370 7.4220 ( -20.94%)
Amean 141 7.5777 9.2617 ( -22.22%)
Amean 172 9.2280 11.0907 ( -20.18%)
Amean 203 10.2793 13.3470 ( -29.84%)
Amean 234 11.2410 17.1070 ( -52.18%)
Amean 265 12.5970 23.3323 ( -85.22%)
Amean 296 17.1540 24.2857 ( -41.57%)
2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads
per socket), 384GB RAM
hackbench-process-sockets
Amean 1 0.5760 0.4793 ( 16.78%)
Amean 4 0.9430 0.9707 ( -2.93%)
Amean 7 1.5517 1.8843 ( -21.44%)
Amean 12 2.4903 2.7267 ( -9.49%)
Amean 21 3.9560 4.2877 ( -8.38%)
Amean 30 5.4613 5.8343 ( -6.83%)
Amean 48 8.5337 9.2937 ( -8.91%)
Amean 79 14.0670 15.2630 ( -8.50%)
Amean 110 19.2253 21.2467 ( -10.51%)
Amean 141 23.7557 25.8550 ( -8.84%)
Amean 172 28.4407 29.7603 ( -4.64%)
Amean 203 33.3407 33.9927 ( -1.96%)
Amean 234 38.3633 39.1150 ( -1.96%)
Amean 265 43.4420 43.8470 ( -0.93%)
Amean 296 48.3680 48.9300 ( -1.16%)
hackbench-thread-sockets
Amean 1 0.6080 0.6493 ( -6.80%)
Amean 4 1.0000 1.0513 ( -5.13%)
Amean 7 1.6607 2.0260 ( -22.00%)
Amean 12 2.7637 2.9273 ( -5.92%)
Amean 21 5.0613 4.5153 ( 10.79%)
Amean 30 6.3340 6.1140 ( 3.47%)
Amean 48 9.0567 9.5577 ( -5.53%)
Amean 79 14.5657 15.7983 ( -8.46%)
Amean 110 19.6213 21.6333 ( -10.25%)
Amean 141 24.1563 26.2697 ( -8.75%)
Amean 172 28.9687 30.2187 ( -4.32%)
Amean 203 33.9763 34.6970 ( -2.12%)
Amean 234 38.8647 39.3207 ( -1.17%)
Amean 265 44.0813 44.1507 ( -0.16%)
Amean 296 49.2040 49.4330 ( -0.47%)
2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads
per socket), 512GB RAM
hackbench-process-sockets
Amean 1 0.5027 0.5017 ( 0.20%)
Amean 4 1.1053 1.2033 ( -8.87%)
Amean 7 1.8760 2.1820 ( -16.31%)
Amean 12 2.9053 3.1810 ( -9.49%)
Amean 21 4.6777 4.9920 ( -6.72%)
Amean 30 6.5180 6.7827 ( -4.06%)
Amean 48 10.0710 10.5227 ( -4.48%)
Amean 79 16.4250 17.5053 ( -6.58%)
Amean 110 22.6203 24.4617 ( -8.14%)
Amean 141 28.0967 31.0363 ( -10.46%)
Amean 172 34.4030 36.9233 ( -7.33%)
Amean 203 40.5933 43.0850 ( -6.14%)
Amean 234 46.6477 48.7220 ( -4.45%)
Amean 265 53.0530 53.9597 ( -1.71%)
Amean 296 59.2760 59.9213 ( -1.09%)
hackbench-thread-sockets
Amean 1 0.5363 0.5330 ( 0.62%)
Amean 4 1.1647 1.2157 ( -4.38%)
Amean 7 1.9237 2.2833 ( -18.70%)
Amean 12 2.9943 3.3110 ( -10.58%)
Amean 21 4.9987 5.1880 ( -3.79%)
Amean 30 6.7583 7.0043 ( -3.64%)
Amean 48 10.4547 10.8353 ( -3.64%)
Amean 79 16.6707 17.6790 ( -6.05%)
Amean 110 22.8207 24.4403 ( -7.10%)
Amean 141 28.7090 31.0533 ( -8.17%)
Amean 172 34.9387 36.8260 ( -5.40%)
Amean 203 41.1567 43.0450 ( -4.59%)
Amean 234 47.3790 48.5307 ( -2.43%)
Amean 265 53.9543 54.6987 ( -1.38%)
Amean 296 60.0820 60.2163 ( -0.22%)
1-socket Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz (4 cores, 8 threads),
32 GB RAM
hackbench-process-sockets
Amean 1 1.4760 1.5773 ( -6.87%)
Amean 3 3.9370 4.0910 ( -3.91%)
Amean 5 6.6797 6.9357 ( -3.83%)
Amean 7 9.3367 9.7150 ( -4.05%)
Amean 12 15.7627 16.1400 ( -2.39%)
Amean 18 23.5360 23.6890 ( -0.65%)
Amean 24 31.0663 31.3137 ( -0.80%)
Amean 30 38.7283 39.0037 ( -0.71%)
Amean 32 41.3417 41.6097 ( -0.65%)
hackbench-thread-sockets
Amean 1 1.5250 1.6043 ( -5.20%)
Amean 3 4.0897 4.2603 ( -4.17%)
Amean 5 6.7760 7.0933 ( -4.68%)
Amean 7 9.4817 9.9157 ( -4.58%)
Amean 12 15.9610 16.3937 ( -2.71%)
Amean 18 23.9543 24.3417 ( -1.62%)
Amean 24 31.4400 31.7217 ( -0.90%)
Amean 30 39.2457 39.5467 ( -0.77%)
Amean 32 41.8267 42.1230 ( -0.71%)
2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads
per socket), 64GB RAM
hackbench-process-sockets
Amean 1 1.0347 1.0880 ( -5.15%)
Amean 4 1.7267 1.8527 ( -7.30%)
Amean 7 2.6707 2.8110 ( -5.25%)
Amean 12 4.1617 4.3383 ( -4.25%)
Amean 21 7.0070 7.2600 ( -3.61%)
Amean 30 9.9187 10.2397 ( -3.24%)
Amean 48 15.6710 16.3923 ( -4.60%)
Amean 79 24.7743 26.1247 ( -5.45%)
Amean 110 34.3000 35.9307 ( -4.75%)
Amean 141 44.2043 44.8010 ( -1.35%)
Amean 172 54.2430 54.7260 ( -0.89%)
Amean 192 60.6557 60.9777 ( -0.53%)
hackbench-thread-sockets
Amean 1 1.0610 1.1353 ( -7.01%)
Amean 4 1.7543 1.9140 ( -9.10%)
Amean 7 2.7840 2.9573 ( -6.23%)
Amean 12 4.3813 4.4937 ( -2.56%)
Amean 21 7.3460 7.5350 ( -2.57%)
Amean 30 10.2313 10.5190 ( -2.81%)
Amean 48 15.9700 16.5940 ( -3.91%)
Amean 79 25.3973 26.6637 ( -4.99%)
Amean 110 35.1087 36.4797 ( -3.91%)
Amean 141 45.8220 46.3053 ( -1.05%)
Amean 172 55.4917 55.7320 ( -0.43%)
Amean 192 62.7490 62.5410 ( 0.33%)
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Reported-by: Jann Horn <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Not all files in the kernel should include mm.h. Migrating callers from
kmalloc to kvmalloc is easier if the kvmalloc functions are in slab.h.
[[email protected]: move the new kvrealloc() also]
[[email protected]: drivers/hwmon/occ/p9_sbe.c needs slab.h]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Pekka Enberg <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When handling a shmem page fault, a THP with a corrupted subpage could
be PMD-mapped if certain conditions are satisfied. But the kernel is
supposed to send SIGBUS when trying to map a hwpoisoned page.
There are two paths which may do PMD map: fault around and regular
fault.
Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault()
codepaths") things were even worse in the fault-around path: the THP
could be PMD-mapped as long as the VMA fit, regardless of which subpage
was accessed and corrupted. After that commit, the THP can be
PMD-mapped as long as the head page is not corrupted.
In the regular fault path, the THP can be PMD-mapped as long as the
corrupted page is not accessed and the VMA fits.
This loophole could be fixed by iterating every subpage to check if any
of them is hwpoisoned or not, but it is somewhat costly in page fault
path.
So introduce a new page flag called HasHWPoisoned on the first tail
page. It indicates the THP has hwpoisoned subpage(s). It is set if any
subpage of THP is found hwpoisoned by memory failure and after the
refcount is bumped successfully, then cleared when the THP is freed or
split.
The soft offline path doesn't need this, since the soft offline handler
just marks a subpage hwpoisoned when the subpage is migrated
successfully; but a shmem THP never gets split and then migrated at all.
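On the consumer side, a sketch of the check in the PMD-mapping fault path (do_set_pmd()-like code): refuse the PMD mapping and fall back to PTE mapping, so a later access to the bad subpage still gets SIGBUS:
	/* ret was initialized to VM_FAULT_FALLBACK */
	if (unlikely(PageHasHWPoisoned(page)))
		return ret;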
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Signed-off-by: Yang Shi <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Suggested-by: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from WiFi (mac80211), and BPF.
Current release - regressions:
- skb_expand_head: adjust skb->truesize to fix socket memory
accounting
- mptcp: fix corrupt receiver key in MPC + data + checksum
Previous releases - regressions:
- multicast: calculate csum of looped-back and forwarded packets
- cgroup: fix memory leak caused by missing cgroup_bpf_offline
- cfg80211: fix management registrations locking, prevent list
corruption
- cfg80211: correct false positive in bridge/4addr mode check
- tcp_bpf: fix race in the tcp_bpf_send_verdict resulting in reusing
previous verdict
Previous releases - always broken:
- sctp: enhancements for the verification tag, prevent attackers from
killing SCTP sessions
- tipc: fix size validations for the MSG_CRYPTO type
- mac80211: mesh: fix HE operation element length check, prevent out
of bound access
- tls: fix sign of socket errors, prevent positive error codes being
reported from read()/write()
- cfg80211: scan: extend RCU protection in
cfg80211_add_nontrans_list()
- implement ->sock_is_readable() for UDP and AF_UNIX, fix poll() for
sockets in a BPF sockmap
- bpf: fix potential race in tail call compatibility check resulting
in two operations which would make the map incompatible succeeding
- bpf: prevent increasing bpf_jit_limit above max
- bpf: fix error usage of map_fd and fdget() in generic batch update
- phy: ethtool: lock the phy for consistency of results
- prevent infinite while loop in skb_tx_hash() when Tx races with
driver reconfiguring the queue <> traffic class mapping
- usbnet: fixes for bad HW conjured by syzbot
- xen: stop tx queues during live migration, prevent UAF
- net-sysfs: initialize uid and gid before calling
net_ns_get_ownership
- mlxsw: prevent Rx stalls under memory pressure"
* tag 'net-5.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (67 commits)
Revert "net: hns3: fix pause config problem after autoneg disabled"
mptcp: fix corrupt receiver key in MPC + data + checksum
riscv, bpf: Fix potential NULL dereference
octeontx2-af: Fix possible null pointer dereference.
octeontx2-af: Display all enabled PF VF rsrc_alloc entries.
octeontx2-af: Check whether ipolicers exists
net: ethernet: microchip: lan743x: Fix skb allocation failure
net/tls: Fix flipped sign in async_wait.err assignment
net/tls: Fix flipped sign in tls_err_abort() calls
net/smc: Correct spelling mistake to TCPF_SYN_RECV
net/smc: Fix smc_link->llc_testlink_time overflow
nfp: bpf: relax prog rejection for mtu check through max_pkt_offset
vmxnet3: do not stop tx queues after netif_device_detach()
r8169: Add device 10ec:8162 to driver r8169
ptp: Document the PTP_CLK_MAGIC ioctl number
usbnet: fix error return code in usbnet_probe()
net: hns3: adjust string spaces of some parameters of tx bd info in debugfs
net: hns3: expand buffer len for some debugfs command
net: hns3: add more string spaces for dumping packets number of queue info in debugfs
net: hns3: fix data endian problem of some functions of debugfs
...
|
|
Using packetdrill, it's possible to observe that the receiver key contains
random values when clients transmit MP_CAPABLE with data and checksum (as
specified in RFC8684 §3.1). Fix the layout of mptcp_out_options, to avoid
using the skb extension copy when writing the MP_CAPABLE sub-option.
Fixes: d7b269083786 ("mptcp: shrink mptcp_out_options struct")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/233
Reported-by: Poorva Sonparote <[email protected]>
Signed-off-by: Davide Caratti <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
sk->sk_err appears to expect a positive value, a convention that ktls
doesn't always follow and that leads to memory corruption in other code.
For instance,
[kworker]
tls_encrypt_done(..., err=<negative error from crypto request>)
tls_err_abort(.., err)
sk->sk_err = err;
[task]
splice_from_pipe_feed
...
tls_sw_do_sendpage
if (sk->sk_err) {
ret = -sk->sk_err; // ret is positive
splice_from_pipe_feed (continued)
ret = actor(...) // ret is still positive and interpreted as bytes
// written, resulting in underflow of buf->len and
// sd->len, leading to huge buf->offset and bogus
// addresses computed in later calls to actor()
Fix all tls_err_abort() callers to pass a negative error code
consistently and centralize the error-prone sign flip there, throwing in
a warning to catch future misuse and uninlining the function so it
really does only warn once.
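A sketch of the centralized helper after the change (callers now always pass a negative error):
void tls_err_abort(struct sock *sk, int err)
{
	WARN_ON_ONCE(err >= 0);

	/* sk->sk_err expects a positive error code. */
	sk->sk_err = -err;
	sk_error_report(sk);
}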
Cc: [email protected]
Fixes: c46234ebb4d1e ("tls: RX path for ktls")
Reported-by: [email protected]
Signed-off-by: Daniel Jordan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Two fixes:
* bridge vs. 4-addr mode check was wrong
* management frame registrations locking was
wrong, causing list corruption/crashes
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Daniel Borkmann says:
====================
pull-request: bpf 2021-10-26
We've added 12 non-merge commits during the last 7 day(s) which contain
a total of 23 files changed, 118 insertions(+), 98 deletions(-).
The main changes are:
1) Fix potential race window in BPF tail call compatibility check, from Toke Høiland-Jørgensen.
2) Fix memory leak in cgroup fs due to missing cgroup_bpf_offline(), from Quanyang Wang.
3) Fix file descriptor reference counting in generic_map_update_batch(), from Xu Kuohai.
4) Fix bpf_jit_limit knob to the max supported limit by the arch's JIT, from Lorenz Bauer.
5) Fix BPF sockmap ->poll callbacks for UDP and AF_UNIX sockets, from Cong Wang and Yucong Sun.
6) Fix BPF sockmap concurrency issue in TCP on non-blocking sendmsg calls, from Liu Jian.
7) Fix build failure of INODE_STORAGE and TASK_STORAGE maps on !CONFIG_NET, from Tejun Heo.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Fix potential race in tail call compatibility check
bpf: Move BPF_MAP_TYPE for INODE_STORAGE and TASK_STORAGE outside of CONFIG_NET
selftests/bpf: Use recv_timeout() instead of retries
net: Implement ->sock_is_readable() for UDP and AF_UNIX
skmsg: Extract and reuse sk_msg_is_readable()
net: Rename ->stream_memory_read to ->sock_is_readable
tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function
cgroup: Fix memory leak caused by missing cgroup_bpf_offline
bpf: Fix error usage of map_fd and fdget() in generic_map_update_batch()
bpf: Prevent increasing bpf_jit_limit above max
bpf: Define bpf_jit_alloc_exec_limit for arm64 JIT
bpf: Define bpf_jit_alloc_exec_limit for riscv JIT
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Lorenzo noticed that the code testing for program type compatibility of
tail call maps is potentially racy in that two threads could encounter a
map with an unset type simultaneously and both return true even though they
are inserting incompatible programs.
The race window is quite small, but artificially enlarging it by adding a
usleep_range() inside the check in bpf_prog_array_compatible() makes it
trivial to trigger from userspace with a program that does, essentially:
  int map_fd, err;
  __u32 key, value;
  pid_t pid;

  /* xdp_fd and tc_fd are fds of already-loaded XDP and TC programs */
  map_fd = bpf_create_map(BPF_MAP_TYPE_PROG_ARRAY, 4, 4, 2, 0);
  pid = fork();
  if (pid) {
          key = 0;
          value = xdp_fd;
  } else {
          key = 1;
          value = tc_fd;
  }
  /* parent and child race through the first updates on an owner-less
   * map; both can pass the compatibility check
   */
  err = bpf_map_update_elem(map_fd, &key, &value, 0);
While the race window is small, it has potentially serious ramifications in
that triggering it would allow a BPF program to tail call to a program of a
different type. So let's get rid of it by protecting the update with a
spinlock. The commit in the Fixes tag is the last commit that touches the
code in question.
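A minimal sketch of the locked check, using the embedded 'owner' layout
mentioned in the v3 note below; treat the field names as illustrative:

  static bool bpf_prog_array_compatible(struct bpf_array *array,
                                        const struct bpf_prog *fp)
  {
          bool ret;

          spin_lock(&array->aux->owner.lock);
          if (!array->aux->owner.type) {
                  /* the first program inserted decides the owner type */
                  array->aux->owner.type  = fp->type;
                  array->aux->owner.jited = fp->jited;
                  ret = true;
          } else {
                  ret = array->aux->owner.type  == fp->type &&
                        array->aux->owner.jited == fp->jited;
          }
          spin_unlock(&array->aux->owner.lock);
          return ret;
  }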
v2:
- Use a spinlock instead of an atomic variable and cmpxchg() (Alexei)
v3:
- Put lock and the members it protects into an embedded 'owner' struct (Daniel)
Fixes: 3324b584b6f6 ("ebpf: misc core cleanup")
Reported-by: Lorenzo Bianconi <[email protected]>
Signed-off-by: Toke Høiland-Jørgensen <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
bpf_types.h has BPF_MAP_TYPE_INODE_STORAGE and BPF_MAP_TYPE_TASK_STORAGE
declared inside #ifdef CONFIG_NET although they are built regardless of
CONFIG_NET. So, when CONFIG_BPF_SYSCALL && !CONFIG_NET, they are built
without their declarations, which leads to spurious build failures and
leaves them unregistered in bpf_map_types, making them unavailable.
Fix it by moving the BPF_MAP_TYPE for the two map types outside of
CONFIG_NET.
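A sketch of the intended bpf_types.h layout; the neighboring entries and
config guards shown here are illustrative:

  /* include/linux/bpf_types.h (sketch) */
  #ifdef CONFIG_NET
  BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
  /* ... other net-only map types stay under CONFIG_NET ... */
  #endif
  #ifdef CONFIG_BPF_LSM
  BPF_MAP_TYPE(BPF_MAP_TYPE_INODE_STORAGE, inode_storage_map_ops)
  #endif
  BPF_MAP_TYPE(BPF_MAP_TYPE_TASK_STORAGE, task_storage_map_ops)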
Reported-by: kernel test robot <[email protected]>
Fixes: a10787e6d58c ("bpf: Enable task local storage for tracing programs")
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
tcp_bpf_sock_is_readable() is pretty much generic; extract it so it can
be reused for non-TCP sockets.
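A sketch of the extracted helper, assuming it lands next to the other
sk_msg helpers; the body mirrors what the TCP-specific version has to do:

  bool sk_msg_is_readable(struct sock *sk)
  {
          struct sk_psock *psock;
          bool empty = true;

          rcu_read_lock();
          psock = sk_psock(sk);
          if (likely(psock))
                  empty = list_empty(&psock->ingress_msg);
          rcu_read_unlock();
          return !empty;
  }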
Signed-off-by: Cong Wang <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
The proto op ->stream_memory_read() is currently only used by TCP to
check whether the psock queue is empty. We need to rename it before
reusing it for non-TCP protocols, and adjust the existing users
accordingly.
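A sketch of how callers can dispatch through the renamed op; the
sk_is_readable() wrapper name is an assumption for illustration:

  static inline bool sk_is_readable(struct sock *sk)
  {
          if (sk->sk_prot->sock_is_readable)
                  return sk->sk_prot->sock_is_readable(sk);
          return false;
  }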
Signed-off-by: Cong Wang <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
While testing a user-space application that transmits UDP multicast
datagrams and uses multicast routing to send the UDP datagrams out of
defined network interfaces, I found that a multicast router does not fill
in the UDP checksum of locally produced, looped-back and forwarded UDP
datagrams if the original output NIC the datagrams are sent to has UDP TX
checksum offload enabled.
The datagrams are then sent out malformed on the NIC they have been
forwarded to.
This happens because:
1. If TX checksum offload is enabled on the output NIC, UDP checksum
is not calculated by kernel and is not filled into skb data.
2. dev_loopback_xmit(), which is called solely by
ip_mc_finish_output(), sets skb->ip_summed = CHECKSUM_UNNECESSARY
unconditionally.
3. Since 35fc92a9 ("[NET]: Allow forwarding of ip_summed except
CHECKSUM_COMPLETE"), the ip_summed value is preserved during
forwarding.
4. If ip_summed != CHECKSUM_PARTIAL, checksum is not calculated during
a packet egress.
The minimal fix in dev_loopback_xmit() (sketched below):
1. Preserves skb->ip_summed when it is CHECKSUM_PARTIAL. This is the
case when the original output NIC has TX checksum offload enabled.
The effects are:
a) If the forwarding destination interface supports TX checksum
offloading, the NIC driver is responsible to fill-in the
checksum.
b) If the forwarding destination interface does NOT support TX
checksum offloading, checksums are filled-in by kernel before
skb is submitted to the NIC driver.
c) For local delivery, checksum validation is skipped as in the
case of CHECKSUM_UNNECESSARY, thanks to skb_csum_unnecessary().
2. Translates ip_summed CHECKSUM_NONE to CHECKSUM_UNNECESSARY. The
behavior for CHECKSUM_NONE is therefore unmodified: checksum
validation is still skipped on local delivery of looped-back packets.
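A sketch of the minimal fix, assuming the current shape of
dev_loopback_xmit(); only the ip_summed handling changes:

  int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
  {
          skb_reset_mac_header(skb);
          __skb_pull(skb, skb_network_offset(skb));
          skb->pkt_type = PACKET_LOOPBACK;

          /* keep CHECKSUM_PARTIAL so the checksum is still filled in
           * later; only promote CHECKSUM_NONE to CHECKSUM_UNNECESSARY
           */
          if (skb->ip_summed == CHECKSUM_NONE)
                  skb->ip_summed = CHECKSUM_UNNECESSARY;

          WARN_ON(!skb_dst(skb));
          skb_dst_force(skb);
          netif_rx_ni(skb);
          return 0;
  }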
Signed-off-by: Cyril Strejc <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The management registrations locking was broken: the list was
locked for each wdev, but cfg80211_mgmt_registrations_update()
iterated it without holding all the correct spinlocks, causing
list corruption.
Rather than trying to fix it with fine-grained locking, just
move the lock to the wiphy/rdev (we still need the list on each
wdev); we already hold the wdev lock when changing the list, so
there's no contention on the lock in any case. This trivially
fixes the bug since we hold one wdev's lock already, and now
will hold the lock that protects all lists.
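A sketch of the resulting locking, assuming the spinlock now lives on the
rdev; names follow the existing code but the snippet is illustrative:

  /* cfg80211_mgmt_registrations_update() runs per rdev and may walk
   * every wdev's list, so one lock must cover them all
   */
  spin_lock_bh(&rdev->mgmt_registrations_lock);
  list_for_each_entry(reg, &wdev->mgmt_registrations, list) {
          /* ... recompute the combined registrations ... */
  }
  spin_unlock_bh(&rdev->mgmt_registrations_lock);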
Cc: [email protected]
Reported-by: Jouni Malinen <[email protected]>
Fixes: 6cd536fe62ef ("cfg80211: change internal management frame registration API")
Link: https://lore.kernel.org/r/20211025133111.5cf733eab0f4.I7b0abb0494ab712f74e2efcd24bb31ac33f7eee9@changeid
Signed-off-by: Johannes Berg <[email protected]>
|
|
Restrict bpf_jit_limit to the maximum supported by the arch's JIT.
Signed-off-by: Lorenz Bauer <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix two regressions, one related to ACPI power resources
management and one that broke ACPI tools compilation.
Specifics:
- Stop turning off unused ACPI power resources in an unknown state to
address a regression introduced during the 5.14 cycle (Rafael
Wysocki).
- Fix an ACPI tools build issue introduced recently when the minimal
stdarg.h was added (Miguel Bernal Marin)"
* tag 'acpi-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: PM: Do not turn off power resources in unknown state
ACPI: tools: fix compilation error
|
|
Merge a fix for a recent ACPI tools build regression.
* acpi-tools:
ACPI: tools: fix compilation error
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ucounts fixes from Eric Biederman:
"There has been one very hard to track down bug in the ucount code that
we have been tracking since roughly v5.14 was released. Alex managed
to find a reliable reproducer a few days ago and then I was able to
instrument the code and figure out what the issue was.
It turns out the sigqueue_alloc single atomic operation optimization
did not play nicely with ucounts multiple level rlimits. It turned out
that either sigqueue_alloc or sigqueue_free could be operating on
multiple levels and trigger the conditions for the optimization on
more than one level at the same time.
To deal with that situation I have introduced inc_rlimit_get_ucounts
and dec_rlimit_put_ucounts that just focuses on the optimization and
the rlimit and ucount changes.
While looking into the big bug I found a couple of other little issues
so I am including those fixes here as well.
When I have time I would very much like to dig into process ownership
of the shared signal queue and see if we could pick a single owner for
the entire queue so that all of the rlimits can count to that owner.
That should entirely remove the need to call get_ucounts and
put_ucounts in sigqueue_alloc and sigqueue_free. It is difficult
because Linux unlike POSIX supports setuid that works on a single
thread"
* 'ucount-fixes-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucounts: Move get_ucounts from cred_alloc_blank to key_change_session_keyring
ucounts: Proper error handling in set_cred_ucounts
ucounts: Pair inc_rlimit_ucounts with dec_rlimit_ucoutns in commit_creds
ucounts: Fix signal ucount refcounting
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, and can.
We'll have one more fix for a socket accounting regression; it's still
getting polished. Otherwise things look fine.
Current release - regressions:
- revert "vrf: reset skb conntrack connection on VRF rcv", there are
valid uses for previous behavior
- can: m_can: fix iomap_read_fifo() and iomap_write_fifo()
Current release - new code bugs:
- mlx5: e-switch, return correct error code on group creation failure
Previous releases - regressions:
- sctp: fix transport encap_port update in sctp_vtag_verify
- stmmac: fix E2E delay mechanism (in PTP timestamping)
Previous releases - always broken:
- netfilter: ip6t_rt: fix out-of-bounds read of ipv6_rt_hdr
- netfilter: xt_IDLETIMER: fix out-of-bound read caused by lack of
init
- netfilter: ipvs: make global sysctl read-only in non-init netns
- tcp: md5: fix selection between vrf and non-vrf keys
- ipv6: count rx stats on the orig netdev when forwarding
- bridge: mcast: use multicast_membership_interval for IGMPv3
- can:
- j1939: fix UAF for rx_kref of j1939_priv; abort sessions on
receiving bad messages
- isotp: fix TX buffer concurrent access in isotp_sendmsg(); fix
return error on FC timeout on TX path
- ice: fix re-init of RDMA Tx queues and crash if RDMA was not inited
- hns3: schedule the polling again when allocation fails, prevent
stalls
- drivers: add missing of_node_put() when aborting
for_each_available_child_of_node()
- ptp: fix possible memory leak and UAF in ptp_clock_register()
- e1000e: fix packet loss in burst mode on Tiger Lake and later
- mlx5e: ipsec: fix more checksum offload issues"
* tag 'net-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (75 commits)
usbnet: sanity check for maxpacket
net: enetc: make sure all traffic classes can send large frames
net: enetc: fix ethtool counter name for PM0_TERR
ptp: free 'vclock_index' in ptp_clock_release()
sfc: Don't use netif_info before net_device setup
sfc: Export fibre-specific supported link modes
net/mlx5e: IPsec: Fix work queue entry ethernet segment checksum flags
net/mlx5e: IPsec: Fix a misuse of the software parser's fields
net/mlx5e: Fix vlan data lost during suspend flow
net/mlx5: E-switch, Return correct error code on group creation failure
net/mlx5: Lag, change multipath and bonding to be mutually exclusive
ice: Add missing E810 device ids
igc: Update I226_K device ID
e1000e: Fix packet loss on Tiger Lake and later
e1000e: Separate TGP board type from SPT
ptp: Fix possible memory leak in ptp_clock_register()
net: stmmac: Fix E2E delay mechanism
nfc: st95hf: Make spi remove() callback return zero
net: hns3: disable sriov before unload hclge layer
net: hns3: fix vf reset workqueue cannot exit
...
|
|
Both multipath and bonding events change the HW LAG state
independently. Handling an event for one feature while the other is
already enabled can cause unwanted behavior; for example, handling a
bonding event while multipath is enabled will disable the LAG and
cause multipath to stop working.
Fix it by ignoring bonding events while in multipath mode and ignoring
FIB events while in bonding mode.
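A sketch of the guards; the helper names are illustrative, the point is
the mutual exclusion between the two LAG owners:

  /* in the bonding (netdev) event handler */
  if (mlx5_lag_is_multipath(dev))
          return;   /* multipath owns the LAG, ignore bonding events */

  /* in the FIB event handler */
  if (mlx5_lag_is_bonded(ldev))
          return;   /* bonding owns the LAG, ignore FIB events */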
Fixes: 544fe7c2e654 ("net/mlx5e: Activate HW multipath and handle port affinity based on FIB events")
Signed-off-by: Maor Dickman <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Reviewed-by: Mark Bloch <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Recursion fix for tracing.
While cleaning up some of the tracing recursion protection logic, I
discovered a scenario that the current design would miss, and would
allow an infinite recursion. Removing an optimization trick that
opened the hole fixes the issue and cleans up the code as well"
* tag 'trace-v5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have all levels of checks prevent recursion
|
|
Check for a NULL page->mapping before dereferencing the mapping in
page_is_secretmem(), as the page's mapping can be nullified while gup()
is running, e.g. by reclaim or truncation.
BUG: kernel NULL pointer dereference, address: 0000000000000068
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 6 PID: 4173897 Comm: CPU 3/KVM Tainted: G W
RIP: 0010:internal_get_user_pages_fast+0x621/0x9d0
Code: <48> 81 7a 68 80 08 04 bc 0f 85 21 ff ff 8 89 c7 be
RSP: 0018:ffffaa90087679b0 EFLAGS: 00010046
RAX: ffffe3f37905b900 RBX: 00007f2dd561e000 RCX: ffffe3f37905b934
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffe3f37905b900
...
CR2: 0000000000000068 CR3: 00000004c5898003 CR4: 00000000001726e0
Call Trace:
get_user_pages_fast_only+0x13/0x20
hva_to_pfn+0xa9/0x3e0
try_async_pf+0xa1/0x270
direct_page_fault+0x113/0xad0
kvm_mmu_page_fault+0x69/0x680
vmx_handle_exit+0xe1/0x5d0
kvm_arch_vcpu_ioctl_run+0xd81/0x1c70
kvm_vcpu_ioctl+0x267/0x670
__x64_sys_ioctl+0x83/0xa0
do_syscall_64+0x56/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
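A sketch of the fixed helper, keeping page_is_secretmem()'s existing
shape; the NULL check plus the recheck against page->mapping is the guard:

  static inline bool page_is_secretmem(struct page *page)
  {
          struct address_space *mapping;

          if (PageCompound(page) || !PageLRU(page))
                  return false;

          mapping = (struct address_space *)
                  ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);

          /* mapping may be cleared concurrently by reclaim/truncation */
          if (!mapping || mapping != page->mapping)
                  return false;

          return mapping->a_ops == &secretmem_aops;
  }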
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 1507f51255c9 ("mm: introduce memfd_secret system call to create "secret" memory areas")
Signed-off-by: Sean Christopherson <[email protected]>
Reported-by: Darrick J. Wong <[email protected]>
Reported-by: Stephen <[email protected]>
Tested-by: Darrick J. Wong <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|