Age | Commit message (Collapse) | Author | Files | Lines |
|
Among other data, CPU entry area holds exception stacks, so addresses from
this area can be passed to kmsan_get_metadata().
This previously led to kmsan_get_metadata() returning NULL, which in turn
resulted in a warning that triggered further attempts to call
kmsan_get_metadata() in the exception context, which quickly exhausted the
exception stack.
This patch allocates shadow and origin for the CPU entry area on x86 and
introduces arch_kmsan_get_meta_or_null(), which performs arch-specific
metadata mapping.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Fixes: 21d723a7c1409 ("kmsan: add KMSAN runtime core")
Cc: Alexander Viro <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ilya Leoshkevich <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vegard Nossum <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Make KMSAN usable by adding the necessary Kconfig bits.
Also declare x86-specific functions checking address validity in
arch/x86/include/asm/kmsan.h.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ilya Leoshkevich <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vegard Nossum <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This is needed to allow memory tools like KASAN and KMSAN see the memory
accesses from the checksum code. Without CONFIG_GENERIC_CSUM the tools
can't see memory accesses originating from handwritten assembly code.
For KASAN it's a question of detecting more bugs, for KMSAN using the C
implementation also helps avoid false positives originating from seemingly
uninitialized checksum values.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ilya Leoshkevich <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vegard Nossum <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Unless stated otherwise (by explicitly calling __memcpy(), __memset() or
__memmove()) we want all string functions to call their __msan_ versions
(e.g. __msan_memcpy() instead of memcpy()), so that shadow and origin
values are updated accordingly.
Bootloader must still use the default string functions to avoid crashes.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ilya Leoshkevich <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vegard Nossum <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Insert KMSAN hooks that make the necessary bookkeeping changes:
- poison page shadow and origins in alloc_pages()/free_page();
- clear page shadow and origins in clear_page(), copy_user_highpage();
- copy page metadata in copy_highpage(), wp_page_copy();
- handle vmap()/vunmap()/iounmap();
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ilya Leoshkevich <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vegard Nossum <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
KMSAN is going to use 3/4 of existing vmalloc space to hold the metadata,
therefore we lower VMALLOC_END to make sure vmalloc() doesn't allocate
past the first 1/4.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ilya Leoshkevich <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vegard Nossum <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Use hooks from instrumented.h to notify bug detection tools about usercopy
events in variations of get_user() and put_user().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ilya Leoshkevich <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vegard Nossum <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Add KernelMemorySanitizer infrastructure", v7.
KernelMemorySanitizer (KMSAN) is a detector of errors related to uses of
uninitialized memory. It relies on compile-time Clang instrumentation
(similar to MSan in the userspace [1]) and tracks the state of every bit
of kernel memory, being able to report an error if uninitialized value is
used in a condition, dereferenced, or escapes to userspace, USB or DMA.
KMSAN has reported more than 300 bugs in the past few years (recently
fixed bugs: [2]), most of them with the help of syzkaller. Such bugs keep
getting introduced into the kernel despite new compiler warnings and other
analyses (the 6.0 cycle already resulted in several KMSAN-reported bugs,
e.g. [3]). Mitigations like total stack and heap initialization are
unfortunately very far from being deployable.
The proposed patchset contains KMSAN runtime implementation together with
small changes to other subsystems needed to make KMSAN work.
The latter changes fall into several categories:
1. Changes and refactorings of existing code required to add KMSAN:
- [01/43] x86: add missing include to sparsemem.h
- [02/43] stackdepot: reserve 5 extra bits in depot_stack_handle_t
- [03/43] instrumented.h: allow instrumenting both sides of copy_from_user()
- [04/43] x86: asm: instrument usercopy in get_user() and __put_user_size()
- [05/43] asm-generic: instrument usercopy in cacheflush.h
- [10/43] libnvdimm/pfn_dev: increase MAX_STRUCT_PAGE_SIZE
2. KMSAN-related declarations in generic code, KMSAN runtime library,
docs and configs:
- [06/43] kmsan: add ReST documentation
- [07/43] kmsan: introduce __no_sanitize_memory and __no_kmsan_checks
- [09/43] x86: kmsan: pgtable: reduce vmalloc space
- [11/43] kmsan: add KMSAN runtime core
- [13/43] MAINTAINERS: add entry for KMSAN
- [24/43] kmsan: add tests for KMSAN
- [31/43] objtool: kmsan: list KMSAN API functions as uaccess-safe
- [35/43] x86: kmsan: use __msan_ string functions where possible
- [43/43] x86: kmsan: enable KMSAN builds for x86
3. Adding hooks from different subsystems to notify KMSAN about memory
state changes:
- [14/43] mm: kmsan: maintain KMSAN metadata for page
- [15/43] mm: kmsan: call KMSAN hooks from SLUB code
- [16/43] kmsan: handle task creation and exiting
- [17/43] init: kmsan: call KMSAN initialization routines
- [18/43] instrumented.h: add KMSAN support
- [19/43] kmsan: add iomap support
- [20/43] Input: libps2: mark data received in __ps2_command() as initialized
- [21/43] dma: kmsan: unpoison DMA mappings
- [34/43] x86: kmsan: handle open-coded assembly in lib/iomem.c
- [36/43] x86: kmsan: sync metadata pages on page fault
4. Changes that prevent false reports by explicitly initializing memory,
disabling optimized code that may trick KMSAN, selectively skipping
instrumentation:
- [08/43] kmsan: mark noinstr as __no_sanitize_memory
- [12/43] kmsan: disable instrumentation of unsupported common kernel code
- [22/43] virtio: kmsan: check/unpoison scatterlist in vring_map_one_sg()
- [23/43] kmsan: handle memory sent to/from USB
- [25/43] kmsan: disable strscpy() optimization under KMSAN
- [26/43] crypto: kmsan: disable accelerated configs under KMSAN
- [27/43] kmsan: disable physical page merging in biovec
- [28/43] block: kmsan: skip bio block merging logic for KMSAN
- [29/43] kcov: kmsan: unpoison area->list in kcov_remote_area_put()
- [30/43] security: kmsan: fix interoperability with auto-initialization
- [32/43] x86: kmsan: disable instrumentation of unsupported code
- [33/43] x86: kmsan: skip shadow checks in __switch_to()
- [37/43] x86: kasan: kmsan: support CONFIG_GENERIC_CSUM on x86, enable it for KASAN/KMSAN
- [38/43] x86: fs: kmsan: disable CONFIG_DCACHE_WORD_ACCESS
- [39/43] x86: kmsan: don't instrument stack walking functions
- [40/43] entry: kmsan: introduce kmsan_unpoison_entry_regs()
5. Fixes for bugs detected with CONFIG_KMSAN_CHECK_PARAM_RETVAL:
- [41/43] bpf: kmsan: initialize BPF registers with zeroes
- [42/43] mm: fs: initialize fsdata passed to write_begin/write_end interface
This patchset allows one to boot and run a defconfig+KMSAN kernel on a
QEMU without known false positives. It however doesn't guarantee there
are no false positives in drivers of certain devices or less tested
subsystems, although KMSAN is actively tested on syzbot with a large
config.
By default, KMSAN enforces conservative checks of most kernel function
parameters passed by value (via CONFIG_KMSAN_CHECK_PARAM_RETVAL, which
maps to the -fsanitize-memory-param-retval compiler flag). As discussed
in [4] and [5], passing uninitialized values as function parameters is
considered undefined behavior, therefore KMSAN now reports such cases as
errors. Several newly added patches fix known manifestations of these
errors.
This patch (of 43):
Including sparsemem.h from other files (e.g. transitively via
asm/pgtable_64_types.h) results in compilation errors due to unknown
types:
sparsemem.h:34:32: error: unknown type name 'phys_addr_t'
extern int phys_to_target_node(phys_addr_t start);
^
sparsemem.h:36:39: error: unknown type name 'u64'
extern int memory_add_physaddr_to_nid(u64 start);
^
Fix these errors by including linux/types.h from sparsemem.h This is
required for the upcoming KMSAN patches.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ilya Leoshkevich <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vegard Nossum <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Eric Biggers <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Add the respective UP last level cache mask accessors in order not to
cause segfaults when lscpu accesses their representation in sysfs
- Fix for a race in the alternatives batch patching machinery when
kprobes are set
* tag 'x86_urgent_for_v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cacheinfo: Add a cpu_llc_shared_mask() UP variant
x86/alternative: Fix race in try_get_desc()
|
|
Keep all constants using lowercase "x".
Signed-off-by: Peng Hao <[email protected]>
Message-Id: <CAPm50aKnctFL_7fZ-eqrz-QGnjW2+DTyDDrhxi7UZVO3HjD8UA@mail.gmail.com>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
No 'kvmp_mmu_pages', it should be 'kvm_mmu_page'. And
struct kvm_mmu_pages and struct kvm_mmu_page are different structures,
here should be kvm_mmu_page.
kvm_mmu_pages is defined in arch/x86/kvm/mmu/mmu.c.
Suggested-by: David Matlack <[email protected]>
Signed-off-by: Peng Hao <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Message-Id: <CAPm50aL=0smbohhjAcK=ciUwcQJ=uAQP1xNQi52YsE7U8NFpEw@mail.gmail.com>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
KVM x86 updates for 6.1, batch #2:
- Misc PMU fixes and cleanups.
- Fixes for Hyper-V hypercall selftest
|
|
IBS_OP_DATA2 DataSrc provides detail about location of the data
being accessed from by load ops. Define macros for legacy and
extended DataSrc values.
Signed-off-by: Ravi Bangoria <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Merge upstream to get RAPTORLAKE_S
Signed-off-by: Peter Zijlstra <[email protected]>
|
|
On a CONFIG_SMP=n kernel, the LLC shared mask is 0, which prevents
__cache_amd_cpumap_setup() from doing the L3 masks setup, and more
specifically from setting up the shared_cpu_map and shared_cpu_list
files in sysfs, leading to lscpu from util-linux getting confused and
segfaulting.
Add a cpu_llc_shared_mask() UP variant which returns a mask with a
single bit set, i.e., for CPU0.
Fixes: 2b83809a5e6d ("x86/cpu/amd: Derive L3 shared_cpu_map from cpu_llc_shared_mask")
Reported-by: Saurabh Sengar <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Some architectures support the accessed bit in non-leaf PMD entries, e.g.,
x86 sets the accessed bit in a non-leaf PMD entry when using it as part of
linear address translation [1]. Page table walkers that clear the
accessed bit may use this capability to reduce their search space.
Note that:
1. Although an inline function is preferable, this capability is added
as a configuration option for consistency with the existing macros.
2. Due to the little interest in other varieties, this capability was
only tested on Intel and AMD CPUs.
Thanks to the following developers for their efforts [2][3].
Randy Dunlap <[email protected]>
Stephen Rothwell <[email protected]>
[1]: Intel 64 and IA-32 Architectures Software Developer's Manual
Volume 3 (June 2021), section 4.8
[2] https://lore.kernel.org/r/[email protected]/
[3] https://lore.kernel.org/r/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Reviewed-by: Barry Song <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Michael Larabel <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Multi-Gen LRU Framework", v14.
What's new
==========
1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
machines. The old direct reclaim backoff, which tries to enforce a
minimum fairness among all eligible memcgs, over-swapped by about
(total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
pulls the plug on swapping once the target is met, trades some
fairness for curtailed latency:
https://lore.kernel.org/r/[email protected]/
3. Fixed minior build warnings and conflicts. More comments and nits.
TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.
Patchset overview
=================
The design and implementation overview is in patch 14:
https://lore.kernel.org/r/[email protected]/
01. mm: x86, arm64: add arch_has_hw_pte_young()
02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Take advantage of hardware features when trying to clear the accessed
bit in many PTEs.
03. mm/vmscan.c: refactor shrink_node()
04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
its sole caller"
Minor refactors to improve readability for the following patches.
05. mm: multi-gen LRU: groundwork
Adds the basic data structure and the functions that insert pages to
and remove pages from the multi-gen LRU (MGLRU) lists.
06. mm: multi-gen LRU: minimal implementation
A minimal implementation without optimizations.
07. mm: multi-gen LRU: exploit locality in rmap
Exploits spatial locality to improve efficiency when using the rmap.
08. mm: multi-gen LRU: support page table walks
Further exploits spatial locality by optionally scanning page tables.
09. mm: multi-gen LRU: optimize multiple memcgs
Optimizes the overall performance for multiple memcgs running mixed
types of workloads.
10. mm: multi-gen LRU: kill switch
Adds a kill switch to enable or disable MGLRU at runtime.
11. mm: multi-gen LRU: thrashing prevention
12. mm: multi-gen LRU: debugfs interface
Provide userspace with features like thrashing prevention, working set
estimation and proactive reclaim.
13. mm: multi-gen LRU: admin guide
14. mm: multi-gen LRU: design doc
Add an admin guide and a design doc.
Benchmark results
=================
Independent lab results
-----------------------
Based on the popularity of searches [01] and the memory usage in
Google's public cloud, the most popular open-source memory-hungry
applications, in alphabetical order, are:
Apache Cassandra Memcached
Apache Hadoop MongoDB
Apache Spark PostgreSQL
MariaDB (MySQL) Redis
An independent lab evaluated MGLRU with the most widely used benchmark
suites for the above applications. They posted 960 data points along
with kernel metrics and perf profiles collected over more than 500
hours of total benchmark time. Their final reports show that, with 95%
confidence intervals (CIs), the above applications all performed
significantly better for at least part of their benchmark matrices.
On 5.14:
1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
less wall time to sort three billion random integers, respectively,
under the medium- and the high-concurrency conditions, when
overcommitting memory. There were no statistically significant
changes in wall time for the rest of the benchmark matrix.
2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
more transactions per minute (TPM), respectively, under the medium-
and the high-concurrency conditions, when overcommitting memory.
There were no statistically significant changes in TPM for the rest
of the benchmark matrix.
3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
and [21.59, 30.02]% more operations per second (OPS), respectively,
for sequential access, random access and Gaussian (distribution)
access, when THP=always; 95% CIs [13.85, 15.97]% and
[23.94, 29.92]% more OPS, respectively, for random access and
Gaussian access, when THP=never. There were no statistically
significant changes in OPS for the rest of the benchmark matrix.
4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
[2.16, 3.55]% more operations per second (OPS), respectively, for
exponential (distribution) access, random access and Zipfian
(distribution) access, when underutilizing memory; 95% CIs
[8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
respectively, for exponential access, random access and Zipfian
access, when overcommitting memory.
On 5.15:
5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
and [4.11, 7.50]% more operations per second (OPS), respectively,
for exponential (distribution) access, random access and Zipfian
(distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
[6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
exponential access, random access and Zipfian access, when swap was
on.
6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
less average wall time to finish twelve parallel TeraSort jobs,
respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically
significant changes in average wall time for the rest of the
benchmark matrix.
7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
minute (TPM) under the high-concurrency condition, when swap was
off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically
significant changes in TPM for the rest of the benchmark matrix.
8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
[11.47, 19.36]% more total operations per second (OPS),
respectively, for sequential access, random access and Gaussian
(distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
[10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
for sequential access, random access and Gaussian access, when
THP=never.
Our lab results
---------------
To supplement the above results, we ran the following benchmark suites
on 5.16-rc7 and found no regressions [10].
fs_fio_bench_hdd_mq pft
fs_lmbench pgsql-hammerdb
fs_parallelio redis
fs_postmark stream
hackbench sysbenchthread
kernbench tpcc_spark
memcached unixbench
multichase vm-scalability
mutilate will-it-scale
nginx
[01] https://trends.google.com
[02] https://lore.kernel.org/r/[email protected]/
[03] https://lore.kernel.org/r/[email protected]/
[04] https://lore.kernel.org/r/[email protected]/
[05] https://lore.kernel.org/r/[email protected]/
[06] https://lore.kernel.org/r/[email protected]/
[07] https://lore.kernel.org/r/[email protected]/
[08] https://lore.kernel.org/r/[email protected]/
[09] https://lore.kernel.org/r/[email protected]/
[10] https://lore.kernel.org/r/[email protected]/
Read-world applications
=======================
Third-party testimonials
------------------------
Konstantin reported [11]:
I have Archlinux with 8G RAM + zswap + swap. While developing, I
have lots of apps opened such as multiple LSP-servers for different
langs, chats, two browsers, etc... Usually, my system gets quickly
to a point of SWAP-storms, where I have to kill LSP-servers,
restart browsers to free memory, etc, otherwise the system lags
heavily and is barely usable.
1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
patchset, and I started up by opening lots of apps to create memory
pressure, and worked for a day like this. Till now I had not a
single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
getting to the point of 3G in SWAP before without a single
SWAP-storm.
Vaibhav from IBM reported [12]:
In a synthetic MongoDB Benchmark, seeing an average of ~19%
throughput improvement on POWER10(Radix MMU + 64K Page Size) with
MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
three different request distributions, namely, Exponential, Uniform
and Zipfan.
Shuang from U of Rochester reported [13]:
With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
and [9.26, 10.36]% higher throughput, respectively, for random
access, Zipfian (distribution) access and Gaussian (distribution)
access, when the average number of jobs per CPU is 1; 95% CIs
[42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
throughput, respectively, for random access, Zipfian access and
Gaussian access, when the average number of jobs per CPU is 2.
Daniel from Michigan Tech reported [14]:
With Memcached allocating ~100GB of byte-addressable Optante,
performance improvement in terms of throughput (measured as queries
per second) was about 10% for a series of workloads.
Large-scale deployments
-----------------------
We've rolled out MGLRU to tens of millions of ChromeOS users and
about a million Android users. Google's fleetwide profiling [15] shows
an overall 40% decrease in kswapd CPU usage, in addition to
improvements in other UX metrics, e.g., an 85% decrease in the number
of low-memory kills at the 75th percentile and an 18% decrease in
app launch time at the 50th percentile.
The downstream kernels that have been using MGLRU include:
1. Android [16]
2. Arch Linux Zen [17]
3. Armbian [18]
4. ChromeOS [19]
5. Liquorix [20]
6. OpenWrt [21]
7. post-factum [22]
8. XanMod [23]
[11] https://lore.kernel.org/r/[email protected]/
[12] https://lore.kernel.org/r/[email protected]/
[13] https://lore.kernel.org/r/[email protected]/
[14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
[15] https://dl.acm.org/doi/10.1145/2749469.2750392
[16] https://android.com
[17] https://archlinux.org
[18] https://armbian.com
[19] https://chromium.org
[20] https://liquorix.net
[21] https://openwrt.org
[22] https://codeberg.org/pf-kernel
[23] https://xanmod.org
Summary
=======
The facts are:
1. The independent lab results and the real-world applications
indicate substantial improvements; there are no known regressions.
2. Thrashing prevention, working set estimation and proactive reclaim
work out of the box; there are no equivalent solutions.
3. There is a lot of new code; no smaller changes have been
demonstrated similar effects.
Our options, accordingly, are:
1. Given the amount of evidence, the reported improvements will likely
materialize for a wide range of workloads.
2. Gauging the interest from the past discussions, the new features
will likely be put to use for both personal computers and data
centers.
3. Based on Google's track record, the new code will likely be well
maintained in the long term. It'd be more difficult if not
impossible to achieve similar effects with other approaches.
This patch (of 14):
Some architectures automatically set the accessed bit in PTEs, e.g., x86
and arm64 v8.2. On architectures that do not have this capability,
clearing the accessed bit in a PTE usually triggers a page fault following
the TLB miss of this PTE (to emulate the accessed bit).
Being aware of this capability can help make better decisions, e.g.,
whether to spread the work out over a period of time to reduce bursty page
faults when trying to clear the accessed bit in many PTEs.
Note that theoretically this capability can be unreliable, e.g.,
hotplugged CPUs might be different from builtin ones. Therefore it should
not be used in architecture-independent code that involves correctness,
e.g., to determine whether TLB flushes are required (in combination with
the accessed bit).
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Reviewed-by: Barry Song <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Acked-by: Will Deacon <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Larabel <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Qi Zheng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm: Remember a/d bits for migration entries", v4.
Problem
=======
When migrating a page, right now we always mark the migrated page as old &
clean.
However that could lead to at least two problems:
(1) We lost the real hot/cold information while we could have persisted.
That information shouldn't change even if the backing page is changed
after the migration,
(2) There can be always extra overhead on the immediate next access to
any migrated page, because hardware MMU needs cycles to set the young
bit again for reads, and dirty bits for write, as long as the
hardware MMU supports these bits.
Many of the recent upstream works showed that (2) is not something trivial
and actually very measurable. In my test case, reading 1G chunk of memory
- jumping in page size intervals - could take 99ms just because of the
extra setting on the young bit on a generic x86_64 system, comparing to
4ms if young set.
This issue is originally reported by Andrea Arcangeli.
Solution
========
To solve this problem, this patchset tries to remember the young/dirty
bits in the migration entries and carry them over when recovering the
ptes.
We have the chance to do so because in many systems the swap offset is not
really fully used. Migration entries use swp offset to store PFN only,
while the PFN is normally not as large as swp offset and normally smaller.
It means we do have some free bits in swp offset that we can use to store
things like A/D bits, and that's how this series tried to approach this
problem.
max_swapfile_size() is used here to detect per-arch offset length in swp
entries. We'll automatically remember the A/D bits when we find that we
have enough swp offset field to keep both the PFN and the extra bits.
Since max_swapfile_size() can be slow, the last two patches cache the
results for it and also swap_migration_ad_supported as a whole.
Known Issues / TODOs
====================
We still haven't taught madvise() to recognize the new A/D bits in
migration entries, namely MADV_COLD/MADV_FREE. E.g. when MADV_COLD upon
a migration entry. It's not clear yet on whether we should clear the A
bit, or we should just drop the entry directly.
We didn't teach idle page tracking on the new migration entries, because
it'll need larger rework on the tree on rmap pgtable walk. However it
should make it already better because before this patchset page will be
old page after migration, so the series will fix potential false negative
of idle page tracking when pages were migrated before observing.
The other thing is migration A/D bits will not start to working for
private device swap entries. The code is there for completeness but since
private device swap entries do not yet have fields to store A/D bits, even
if we'll persistent A/D across present pte switching to migration entry,
we'll lose it again when the migration entry converted to private device
swap entry.
Tests
=====
After the patchset applied, the immediate read access test [1] of above 1G
chunk after migration can shrink from 99ms to 4ms. The test is done by
moving 1G pages from node 0->1->0 then read it in page size jumps. The
test is with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
Similar effect can also be measured when writting the memory the 1st time
after migration.
After applying the patchset, both initial immediate read/write after page
migrated will perform similarly like before migration happened.
Patch Layout
============
Patch 1-2: Cleanups from either previous versions or on swapops.h macros.
Patch 3-4: Prepare for the introduction of migration A/D bits
Patch 5: The core patch to remember young/dirty bit in swap offsets.
Patch 6-7: Cache relevant fields to make migration_entry_supports_ad() fast.
[1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c
This patch (of 7):
Replace all the magic "5" with the macro.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Huang Ying <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Dave Hansen:
- A performance fix for recent large AMD systems that avoids an ancient
cpu idle hardware workaround
- A new Intel model number. Folks like these upstream as soon as
possible so that each developer doing feature development doesn't
need to carry their own #define
- SGX fixes for a userspace crash and a rare kernel warning
* tag 'x86_urgent_for_v6.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ACPI: processor idle: Practically limit "Dummy wait" workaround to old Intel systems
x86/sgx: Handle VA page allocation failure for EAUG on PF.
x86/sgx: Do not fail on incomplete sanitization on premature stop of ksgxd
x86/cpu: Add CPU model numbers for Meteor Lake
|
|
The ZERO_CALL_USED_REGS feature may zero out caller-saved registers
before returning.
In spurious_kernel_fault(), the "pte_offset_kernel()" call results in
this assembly code:
.Ltmp151:
#APP
# ALT: oldnstr
.Ltmp152:
.Ltmp153:
.Ltmp154:
.section .discard.retpoline_safe,"",@progbits
.quad .Ltmp154
.text
callq *pv_ops+536(%rip)
.Ltmp155:
.section .parainstructions,"a",@progbits
.p2align 3, 0x0
.quad .Ltmp153
.byte 67
.byte .Ltmp155-.Ltmp153
.short 1
.text
.Ltmp156:
# ALT: padding
.zero (-(((.Ltmp157-.Ltmp158)-(.Ltmp156-.Ltmp152))>0))*((.Ltmp157-.Ltmp158)-(.Ltmp156-.Ltmp152)),144
.Ltmp159:
.section .altinstructions,"a",@progbits
.Ltmp160:
.long .Ltmp152-.Ltmp160
.Ltmp161:
.long .Ltmp158-.Ltmp161
.short 33040
.byte .Ltmp159-.Ltmp152
.byte .Ltmp157-.Ltmp158
.text
.section .altinstr_replacement,"ax",@progbits
# ALT: replacement 1
.Ltmp158:
movq %rdi, %rax
.Ltmp157:
.text
#NO_APP
.Ltmp162:
testb $-128, %dil
The "testb" here is using %dil, but the %rdi register was cleared before
returning from "callq *pv_ops+536(%rip)". Adding the proper constraints
results in the use of a different register:
movq %r11, %rdi
# Similar to above.
testb $-128, %r11b
Link: https://github.com/KSPP/linux/issues/192
Signed-off-by: Bill Wendling <[email protected]>
Reported-and-tested-by: Nathan Chancellor <[email protected]>
Fixes: 035f7f87b729 ("randstruct: Enable Clang support")
Reviewed-by: Juergen Gross <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Drive-by clean up of the comment.
[ Impact: cleanup]
Signed-off-by: Bill Wendling <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
All uses of arch_kprobe_override_function() have been removed by
commit 540adea3809f ("error-injection: Separate error-injection
from kprobe"), so remove the declaration, too.
Link: https://lkml.kernel.org/r/[email protected]
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Signed-off-by: Gaosheng Cui <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
All uses of modifying_ftrace_code have been removed by
commit 768ae4406a5c ("x86/ftrace: Use text_poke()"),
so remove the declaration, too.
Link: https://lkml.kernel.org/r/[email protected]
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Signed-off-by: Gaosheng Cui <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
With CONFIG_CFI_CLANG, the compiler injects a type preamble immediately
before each function and a check to validate the target function type
before indirect calls:
; type preamble
__cfi_function:
mov <id>, %eax
function:
...
; indirect call check
mov -<id>,%r10d
add -0x4(%r11),%r10d
je .Ltmp1
ud2
.Ltmp1:
call __x86_indirect_thunk_r11
Add error handling code for the ud2 traps emitted for the checks, and
allow CONFIG_CFI_CLANG to be selected on x86_64.
This produces the following oops on CFI failure (generated using lkdtm):
[ 21.441706] CFI failure at lkdtm_indirect_call+0x16/0x20 [lkdtm]
(target: lkdtm_increment_int+0x0/0x10 [lkdtm]; expected type: 0x7e0c52a)
[ 21.444579] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 21.445296] CPU: 0 PID: 132 Comm: sh Not tainted
5.19.0-rc8-00020-g9f27360e674c #1
[ 21.445296] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 21.445296] RIP: 0010:lkdtm_indirect_call+0x16/0x20 [lkdtm]
[ 21.445296] Code: 52 1c c0 48 c7 c1 c5 50 1c c0 e9 25 48 2a cc 0f 1f
44 00 00 49 89 fb 48 c7 c7 50 b4 1c c0 41 ba 5b ad f3 81 45 03 53 f8
[ 21.445296] RSP: 0018:ffffa9f9c02ffdc0 EFLAGS: 00000292
[ 21.445296] RAX: 0000000000000027 RBX: ffffffffc01cb300 RCX: 385cbbd2e070a700
[ 21.445296] RDX: 0000000000000000 RSI: c0000000ffffdfff RDI: ffffffffc01cb450
[ 21.445296] RBP: 0000000000000006 R08: 0000000000000000 R09: ffffffff8d081610
[ 21.445296] R10: 00000000bcc90825 R11: ffffffffc01c2fc0 R12: 0000000000000000
[ 21.445296] R13: ffffa31b827a6000 R14: 0000000000000000 R15: 0000000000000002
[ 21.445296] FS: 00007f08b42216a0(0000) GS:ffffa31b9f400000(0000)
knlGS:0000000000000000
[ 21.445296] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 21.445296] CR2: 0000000000c76678 CR3: 0000000001940000 CR4: 00000000000006f0
[ 21.445296] Call Trace:
[ 21.445296] <TASK>
[ 21.445296] lkdtm_CFI_FORWARD_PROTO+0x30/0x50 [lkdtm]
[ 21.445296] direct_entry+0x12d/0x140 [lkdtm]
[ 21.445296] full_proxy_write+0x5d/0xb0
[ 21.445296] vfs_write+0x144/0x460
[ 21.445296] ? __x64_sys_wait4+0x5a/0xc0
[ 21.445296] ksys_write+0x69/0xd0
[ 21.445296] do_syscall_64+0x51/0xa0
[ 21.445296] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 21.445296] RIP: 0033:0x7f08b41a6fe1
[ 21.445296] Code: be 07 00 00 00 41 89 c0 e8 7e ff ff ff 44 89 c7 89
04 24 e8 91 c6 02 00 8b 04 24 48 83 c4 68 c3 48 63 ff b8 01 00 00 03
[ 21.445296] RSP: 002b:00007ffcdf65c2e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 21.445296] RAX: ffffffffffffffda RBX: 00007f08b4221690 RCX: 00007f08b41a6fe1
[ 21.445296] RDX: 0000000000000012 RSI: 0000000000c738f0 RDI: 0000000000000001
[ 21.445296] RBP: 0000000000000001 R08: fefefefefefefeff R09: fefefefeffc5ff4e
[ 21.445296] R10: 00007f08b42222b0 R11: 0000000000000246 R12: 0000000000c738f0
[ 21.445296] R13: 0000000000000012 R14: 00007ffcdf65c401 R15: 0000000000c70450
[ 21.445296] </TASK>
[ 21.445296] Modules linked in: lkdtm
[ 21.445296] Dumping ftrace buffer:
[ 21.445296] (ftrace buffer empty)
[ 21.471442] ---[ end trace 0000000000000000 ]---
[ 21.471811] RIP: 0010:lkdtm_indirect_call+0x16/0x20 [lkdtm]
[ 21.472467] Code: 52 1c c0 48 c7 c1 c5 50 1c c0 e9 25 48 2a cc 0f 1f
44 00 00 49 89 fb 48 c7 c7 50 b4 1c c0 41 ba 5b ad f3 81 45 03 53 f8
[ 21.474400] RSP: 0018:ffffa9f9c02ffdc0 EFLAGS: 00000292
[ 21.474735] RAX: 0000000000000027 RBX: ffffffffc01cb300 RCX: 385cbbd2e070a700
[ 21.475664] RDX: 0000000000000000 RSI: c0000000ffffdfff RDI: ffffffffc01cb450
[ 21.476471] RBP: 0000000000000006 R08: 0000000000000000 R09: ffffffff8d081610
[ 21.477127] R10: 00000000bcc90825 R11: ffffffffc01c2fc0 R12: 0000000000000000
[ 21.477959] R13: ffffa31b827a6000 R14: 0000000000000000 R15: 0000000000000002
[ 21.478657] FS: 00007f08b42216a0(0000) GS:ffffa31b9f400000(0000)
knlGS:0000000000000000
[ 21.479577] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 21.480307] CR2: 0000000000c76678 CR3: 0000000001940000 CR4: 00000000000006f0
[ 21.481460] Kernel panic - not syncing: Fatal exception
Signed-off-by: Sami Tolvanen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Tested-by: Kees Cook <[email protected]>
Tested-by: Nathan Chancellor <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
With CONFIG_CFI_CLANG, assembly functions indirectly called
from C code must be annotated with type identifiers to pass CFI
checking. Define the __CFI_TYPE helper macro to match the compiler
generated function preamble, and ensure SYM_TYPED_FUNC_START also
emits ENDBR with IBT.
Signed-off-by: Sami Tolvanen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Tested-by: Kees Cook <[email protected]>
Tested-by: Nathan Chancellor <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Interrupts, NMIs etc. sent while in guest mode are already handled
properly by the *_interrupt_allowed callbacks, but other events can
cause a vCPU to be runnable that are specific to guest mode.
In the case of VMX there are two, the preemption timer and the
monitor trap. The VMX preemption timer is already special cased via
the hv_timer_pending callback, but the purpose of the callback can be
easily extended to MTF or in fact any other event that can occur only
in guest mode.
Rename the callback and add an MTF check; kvm_arch_vcpu_runnable()
now can return true if an MTF is pending, without relying on
kvm_vcpu_running()'s call to kvm_check_nested_events(). Until that call
is removed, however, the patch introduces no functional change.
Reviewed-by: Maxim Levitsky <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Morph pending exceptions to pending VM-Exits (due to interception) when
the exception is queued instead of waiting until nested events are
checked at VM-Entry. This fixes a longstanding bug where KVM fails to
handle an exception that occurs during delivery of a previous exception,
KVM (L0) and L1 both want to intercept the exception (e.g. #PF for shadow
paging), and KVM determines that the exception is in the guest's domain,
i.e. queues the new exception for L2. Deferring the interception check
causes KVM to esclate various combinations of injected+pending exceptions
to double fault (#DF) without consulting L1's interception desires, and
ends up injecting a spurious #DF into L2.
KVM has fudged around the issue for #PF by special casing emulated #PF
injection for shadow paging, but the underlying issue is not unique to
shadow paging in L0, e.g. if KVM is intercepting #PF because the guest
has a smaller maxphyaddr and L1 (but not L0) is using shadow paging.
Other exceptions are affected as well, e.g. if KVM is intercepting #GP
for one of SVM's workaround or for the VMware backdoor emulation stuff.
The other cases have gone unnoticed because the #DF is spurious if and
only if L1 resolves the exception, e.g. KVM's goofs go unnoticed if L1
would have injected #DF anyways.
The hack-a-fix has also led to ugly code, e.g. bailing from the emulator
if #PF injection forced a nested VM-Exit and the emulator finds itself
back in L1. Allowing for direct-to-VM-Exit queueing also neatly solves
the async #PF in L2 mess; no need to set a magic flag and token, simply
queue a #PF nested VM-Exit.
Deal with event migration by flagging that a pending exception was queued
by userspace and check for interception at the next KVM_RUN, e.g. so that
KVM does the right thing regardless of the order in which userspace
restores nested state vs. event state.
When "getting" events from userspace, simply drop any pending excpetion
that is destined to be intercepted if there is also an injected exception
to be migrated. Ideally, KVM would migrate both events, but that would
require new ABI, and practically speaking losing the event is unlikely to
be noticed, let alone fatal. The injected exception is captured, RIP
still points at the original faulting instruction, etc... So either the
injection on the target will trigger the same intercepted exception, or
the source of the intercepted exception was transient and/or
non-deterministic, thus dropping it is ok-ish.
Fixes: a04aead144fd ("KVM: nSVM: fix running nested guests when npt=0")
Fixes: feaf0c7dc473 ("KVM: nVMX: Do not generate #DF if #PF happens during exception delivery into L2")
Cc: Jim Mattson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Move the definition of "struct kvm_queued_exception" out of kvm_vcpu_arch
in anticipation of adding a second instance in kvm_vcpu_arch to handle
exceptions that occur when vectoring an injected exception and are
morphed to VM-Exit instead of leading to #DF.
Opportunistically take advantage of the churn to rename "nr" to "vector".
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Rename the kvm_x86_ops hook for exception injection to better reflect
reality, and to align with pretty much every other related function name
in KVM.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
KVM has to check guest visible HYPERV_CPUID_NESTED_FEATURES.EBX CPUID
leaf to know which Enlightened VMCS definition to use (original or 2022
update). Cache the leaf along with other Hyper-V CPUID feature leaves
to make the check quick.
Reviewed-by: Maxim Levitsky <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Updated Hyper-V Enlightened VMCS specification lists several new
fields for the following features:
- PerfGlobalCtrl
- EnclsExitingBitmap
- Tsc Scaling
- GuestLbrCtl
- CET
- SSP
Update the definition.
Note, the updated spec also provides an additional CPUID feature flag,
CPUIDD.0x4000000A.EBX BIT(0), for PerfGlobalCtrl to workaround a Windows
11 quirk. Despite what the TLFS says:
Indicates support for the GuestPerfGlobalCtrl and HostPerfGlobalCtrl
fields in the enlightened VMCS.
guests can safely use the fields if they are enumerated in the
architectural VMX MSRs. I.e. KVM-on-HyperV doesn't need to check the
CPUID bit, but KVM-as-HyperV must ensure the bit is set if PerfGlobalCtrl
fields are exposed to L1.
https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/tlfs
Signed-off-by: Vitaly Kuznetsov <[email protected]>
[sean: tweak CPUID name to make it PerfGlobalCtrl only]
Signed-off-by: Sean Christopherson <[email protected]>
Acked-by: Wei Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Section 1.9 of TLFS v6.0b says:
"All structures are padded in such a way that fields are aligned
naturally (that is, an 8-byte field is aligned to an offset of 8 bytes
and so on)".
'struct enlightened_vmcs' has a glitch:
...
struct {
u32 nested_flush_hypercall:1; /* 836: 0 4 */
u32 msr_bitmap:1; /* 836: 1 4 */
u32 reserved:30; /* 836: 2 4 */
} hv_enlightenments_control; /* 836 4 */
u32 hv_vp_id; /* 840 4 */
u64 hv_vm_id; /* 844 8 */
u64 partition_assist_page; /* 852 8 */
...
And the observed values in 'partition_assist_page' make no sense at
all. Fix the layout by padding the structure properly.
Fixes: 68d1eb72ee99 ("x86/hyper-v: define struct hv_enlightened_vmcs and clean field bits")
Reviewed-by: Maxim Levitsky <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
More pci fixes
Fix for a code analyser warning
|
|
resctrl_cqm_threshold is stored in a hardware specific chunk size,
but exposed to user-space as bytes.
This means the filesystem parts of resctrl need to know how the hardware
counts, to convert the user provided byte value to chunks. The interface
between the architecture's resctrl code and the filesystem ought to
treat everything as bytes.
Change the unit of resctrl_cqm_threshold to bytes. resctrl_arch_rmid_read()
still returns its value in chunks, so this needs converting to bytes.
As all the users have been touched, rename the variable to
resctrl_rmid_realloc_threshold, which describes what the value is for.
Neither r->num_rmid nor hw_res->mon_scale are guaranteed to be a power
of 2, so the existing code introduces a rounding error from resctrl's
theoretical fraction of the cache usage. This behaviour is kept as it
ensures the user visible value matches the value read from hardware
when the rmid will be reallocated.
Signed-off-by: James Morse <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Jamie Iles <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Xin Hao <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Tested-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
In preparation for reducing the use of ksize(), record the actual
allocation size for later memcpy(). This avoids copying extra
(uninitialized!) bytes into the patch buffer when the requested
allocation size isn't exactly the size of a kmalloc bucket.
Additionally, fix potential future issues where runtime bounds checking
will notice that the buffer was allocated to a smaller value than
returned by ksize().
Fixes: 757885e94a22 ("x86, microcode, amd: Early microcode patch loading support for AMD")
Suggested-by: Daniel Micay <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/lkml/CA+DvKQ+bp7Y7gmaVhacjv9uF6Ar-o4tet872h4Q8RPYPJjcJQA@mail.gmail.com/
|
|
Reinstate the per-vCPU guest_supported_xcr0 by partially reverting
commit 988896bb6182; the implicit assessment that guest_supported_xcr0 is
always the same as guest_fpu.fpstate->user_xfeatures was incorrect.
kvm_vcpu_after_set_cpuid() isn't the only place that sets user_xfeatures,
as user_xfeatures is set to fpu_user_cfg.default_features when guest_fpu
is allocated via fpu_alloc_guest_fpstate() => __fpstate_reset().
guest_supported_xcr0 on the other hand is zero-allocated. If userspace
never invokes KVM_SET_CPUID2, supported XCR0 will be '0', whereas the
allowed user XFEATURES will be non-zero.
Practically speaking, the edge case likely doesn't matter as no sane
userspace will live migrate a VM without ever doing KVM_SET_CPUID2. The
primary motivation is to prepare for KVM intentionally and explicitly
setting bits in user_xfeatures that are not set in guest_supported_xcr0.
Because KVM_{G,S}ET_XSAVE can be used to svae/restore FP+SSE state even
if the host doesn't support XSAVE, KVM needs to set the FP+SSE bits in
user_xfeatures even if they're not allowed in XCR0, e.g. because XCR0
isn't exposed to the guest. At that point, the simplest fix is to track
the two things separately (allowed save/restore vs. allowed XCR0).
Fixes: 988896bb6182 ("x86/kvm/fpu: Remove kvm_vcpu_arch.guest_supported_xcr0")
Cc: [email protected]
Cc: Leonardo Bras <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The entries in the .parainstructions sections are 8 byte aligned and the
corresponding C struct paravirt_patch_site makes the array offset 16
bytes.
Though the pushed entries are only using 12 bytes, __parainstructions_end
is therefore 4 bytes short.
That works by chance because it's only used in a loop:
for (p = start; p < end; p++)
But this falls flat when calculating the number of elements:
n = end - start
That's obviously off by one.
Ensure that the gap is filled and the last entry is occupying 16 bytes.
[ bp: Add the proper struct and section names. ]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Juergen Gross <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
If x is not 0, __ffs(x) is equivalent to:
(unsigned long)__builtin_ctzl(x)
And if x is not ~0UL, ffz(x) is equivalent to:
(unsigned long)__builtin_ctzl(~x)
Because __builting_ctzl() returns an int, a cast to (unsigned long) is
necessary to avoid potential warnings on implicit casts.
Concerning the edge cases, __builtin_ctzl(0) is always undefined,
whereas __ffs(0) and ffz(~0UL) may or may not be defined, depending on
the processor. Regardless, for both functions, developers are asked to
check against 0 or ~0UL so replacing __ffs() or ffz() by
__builting_ctzl() is safe.
For x86_64, the current __ffs() and ffz() implementations do not
produce optimized code when called with a constant expression. On the
contrary, the __builtin_ctzl() folds into a single instruction.
However, for non constant expressions, the __ffs() and ffz() asm
versions of the kernel remains slightly better than the code produced
by GCC (it produces a useless instruction to clear eax).
Use __builtin_constant_p() to select between the kernel's
__ffs()/ffz() and the __builtin_ctzl() depending on whether the
argument is constant or not.
** Statistics **
On a allyesconfig, before...:
$ objdump -d vmlinux.o | grep tzcnt | wc -l
3607
...and after:
$ objdump -d vmlinux.o | grep tzcnt | wc -l
2600
So, roughly 27.9% of the calls to either __ffs() or ffz() were using
constant expressions and could be optimized out.
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
Note: on x86_64, the BSF instruction produces TZCNT when used with the
REP prefix (which explain the use of `grep tzcnt' instead of `grep bsf'
in above benchmark). c.f. [1]
[1] e26a44a2d618 ("x86: Use REP BSF unconditionally")
[ bp: Massage commit message. ]
Signed-off-by: Vincent Mailhol <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Reviewed-by: Yury Norov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
For x86_64, the current ffs() implementation does not produce optimized
code when called with a constant expression. On the contrary, the
__builtin_ffs() functions of both GCC and clang are able to fold the
expression into a single instruction.
** Example **
Consider two dummy functions foo() and bar() as below:
#include <linux/bitops.h>
#define CONST 0x01000000
unsigned int foo(void)
{
return ffs(CONST);
}
unsigned int bar(void)
{
return __builtin_ffs(CONST);
}
GCC would produce below assembly code:
0000000000000000 <foo>:
0: ba 00 00 00 01 mov $0x1000000,%edx
5: b8 ff ff ff ff mov $0xffffffff,%eax
a: 0f bc c2 bsf %edx,%eax
d: 83 c0 01 add $0x1,%eax
10: c3 ret
<Instructions after ret and before next function were redacted>
0000000000000020 <bar>:
20: b8 19 00 00 00 mov $0x19,%eax
25: c3 ret
And clang would produce:
0000000000000000 <foo>:
0: b8 ff ff ff ff mov $0xffffffff,%eax
5: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # c <foo+0xc>
c: 83 c0 01 add $0x1,%eax
f: c3 ret
0000000000000010 <bar>:
10: b8 19 00 00 00 mov $0x19,%eax
15: c3 ret
Both examples clearly demonstrate the benefit of using __builtin_ffs()
instead of the kernel's asm implementation for constant expressions.
However, for non constant expressions, the kernel's ffs() asm version
remains better for x86_64 because, contrary to GCC, it doesn't emit the
CMOV assembly instruction, c.f. [1] (noticeably, clang is able optimize
out the CMOV call).
Use __builtin_constant_p() to select between the kernel's ffs() and
the __builtin_ffs() depending on whether the argument is constant or
not.
As a side benefit, replacing the ffs() function declaration by a macro
also removes below -Wshadow warning:
./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
283 | static __always_inline int ffs(int x)
** Statistics **
On a allyesconfig, before...:
$ objdump -d vmlinux.o | grep bsf | wc -l
1081
...and after:
$ objdump -d vmlinux.o | grep bsf | wc -l
792
So, roughly 26.7% of the calls to ffs() were using constant
expressions and could be optimized out.
(tests done on linux v5.18-rc5 x86_64 using GCC 11.2.1)
[1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
[ bp: Massage commit message. ]
Signed-off-by: Vincent Mailhol <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Reviewed-by: Yury Norov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Only x86 has own release_thread(), introduce a new weak release_thread()
function to clean empty definitions in other ARCHs.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Acked-by: Guo Ren <[email protected]> [csky]
Acked-by: Russell King (Oracle) <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]>
Acked-by: Brian Cain <[email protected]>
Acked-by: Michael Ellerman <[email protected]> [powerpc]
Acked-by: Stafford Horne <[email protected]> [openrisc]
Acked-by: Catalin Marinas <[email protected]> [arm64]
Acked-by: Huacai Chen <[email protected]> [LoongArch]
Cc: Alexander Gordeev <[email protected]>
Cc: Anton Ivanov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Guo Ren <[email protected]> [csky]
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Stefan Kristiansson <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Xuerui Wang <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Current i10nm_edac only supports firmware decoder (ACPI DSM methods).
MCA bank registers of Ice Lake or Tremont CPUs contain the information
to decode DDR memory errors. To get better decoding performance, add
the driver decoder (decoding DDR memory errors via extracting error
information from MCA bank registers) for Ice Lake and Tremont CPUs.
Co-developed-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Qiuxu Zhuo <[email protected]>
Signed-off-by: Youquan Song <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
|
|
Remove the CONFIG_PREEMPT_RT symbol from the ifdef around
do_softirq_own_stack() and move it to Kconfig instead.
Enable softirq stacks based on SOFTIRQ_ON_OWN_STACK which depends on
HAVE_SOFTIRQ_ON_OWN_STACK and its default value is set to !PREEMPT_RT.
This ensures that softirq stacks are not used on PREEMPT_RT and avoids
a 'select' statement on an option which has a 'depends' statement.
Link: https://lore.kernel.org/YvN5E%[email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
|
|
The APIC supports two modes, legacy APIC (or xAPIC), and Extended APIC
(or x2APIC). X2APIC mode is mostly compatible with legacy APIC, but
it disables the memory-mapped APIC interface in favor of one that uses
MSRs. The APIC mode is controlled by the EXT bit in the APIC MSR.
The MMIO/xAPIC interface has some problems, most notably the APIC LEAK
[1]. This bug allows an attacker to use the APIC MMIO interface to
extract data from the SGX enclave.
Introduce support for a new feature that will allow the BIOS to lock
the APIC in x2APIC mode. If the APIC is locked in x2APIC mode and the
kernel tries to disable the APIC or revert to legacy APIC mode a GP
fault will occur.
Introduce support for a new MSR (IA32_XAPIC_DISABLE_STATUS) and handle
the new locked mode when the LEGACY_XAPIC_DISABLED bit is set by
preventing the kernel from trying to disable the x2APIC.
On platforms with the IA32_XAPIC_DISABLE_STATUS MSR, if SGX or TDX are
enabled the LEGACY_XAPIC_DISABLED will be set by the BIOS. If
legacy APIC is required, then it SGX and TDX need to be disabled in the
BIOS.
[1]: https://aepicleak.com/aepicleak.pdf
Signed-off-by: Daniel Sneddon <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Tested-by: Neelima Krishnan <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Add model numbers for client and mobile parts.
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Optimize internal hw_breakpoint state if the architecture's number of
breakpoint slots is constant. This avoids several kmalloc() calls and
potentially unnecessary failures if the allocations fail, as well as
subtly improves code generation and cache locality.
The protocol is that if an architecture defines hw_breakpoint_slots via
the preprocessor, it must be constant and the same for all types.
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Dmitry Vyukov <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix PAT on Xen, which caused i915 driver failures
- Fix compat INT 80 entry crash on Xen PV guests
- Fix 'MMIO Stale Data' mitigation status reporting on older Intel CPUs
- Fix RSB stuffing regressions
- Fix ORC unwinding on ftrace trampolines
- Add Intel Raptor Lake CPU model number
- Fix (work around) a SEV-SNP bootloader bug providing bogus values in
boot_params->cc_blob_address, by ignoring the value on !SEV-SNP
bootups.
- Fix SEV-SNP early boot failure
- Fix the objtool list of noreturn functions and annotate snp_abort(),
which bug confused objtool on gcc-12.
- Fix the documentation for retbleed
* tag 'x86-urgent-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/ABI: Mention retbleed vulnerability info file for sysfs
x86/sev: Mark snp_abort() noreturn
x86/sev: Don't use cc_platform_has() for early SEV-SNP calls
x86/boot: Don't propagate uninitialized boot_params->cc_blob_address
x86/cpu: Add new Raptor Lake CPU model number
x86/unwind/orc: Unwind ftrace trampolines with correct ORC entry
x86/nospec: Fix i386 RSB stuffing
x86/nospec: Unwreck the RSB stuffing
x86/bugs: Add "unknown" reporting for MMIO Stale Data
x86/entry: Fix entry_INT80_compat for Xen PV guests
x86/PAT: Have pat_enabled() properly reflect state when running on Xen
|
|
If AMD Last Branch Record Extension Version 2 (LbrExtV2) is detected,
enable it alongside LBR Freeze on PMI when an event requests branch stack
i.e. PERF_SAMPLE_BRANCH_STACK.
Each branch record is represented by a pair of registers, LBR From and LBR
To. The freeze feature prevents any updates to these registers once a PMC
overflows. The contents remain unchanged until the freeze bit is cleared by
the PMI handler.
The branch records are read and copied to sample data before unfreezing.
However, only valid entries are copied. There is no additional register to
denote which of the register pairs represent the top of the stack (TOS)
since internal register renaming always ensures that the first pair (i.e.
index 0) is the one representing the most recent branch and so on.
The LBR registers are per-thread resources and are cleared explicitly
whenever a new task is scheduled in. There are no special implications on
the contents of these registers when transitioning to deep C-states.
Signed-off-by: Sandipan Das <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/d3b8500a3627a0d4d0259b005891ee248f248d91.1660211399.git.sandipan.das@amd.com
|
|
AMD Last Branch Record Extension Version 2 (LbrExtV2) is driven by Core PMC
overflows. It records recently taken branches up to the moment when the PMC
overflow occurs.
Detect the feature during PMU initialization and set the branch stack depth
using CPUID leaf 0x80000022 EBX.
Signed-off-by: Sandipan Das <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/fc6e45378ada258f1bab79b0de6e05c393a8f1dd.1660211399.git.sandipan.das@amd.com
|
|
CPUID leaf 0x80000022 i.e. ExtPerfMonAndDbg advertises some new performance
monitoring features for AMD processors.
Bit 1 of EAX indicates support for Last Branch Record Extension Version 2
(LbrExtV2) features. If found to be set during PMU initialization, the EBX
bits of the same leaf can be used to determine the number of available LBR
entries.
For better utilization of feature words, LbrExtV2 is added as a scattered
feature bit.
[peterz: Rename to AMD_LBR_V2]
Signed-off-by: Sandipan Das <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/172d2b0df39306ed77221c45ee1aa62e8ae0548d.1660211399.git.sandipan.das@amd.com
|