Age | Commit message (Collapse) | Author | Files | Lines |
|
Mel Gorman says:
"32-bit NUMA systems should be non-existent in practice. The last NUMA
system I'm aware of that was both NUMA and 32-bit only died somewhere
between 2004 and 2007. If someone is running a 64-bit capable system in
32-bit mode with NUMA, they really are just punishing themselves for fun."
Mark DISCONTIGMEM broken for now as suggested by Christoph Hellwig,
and (hopefully) remove it in a couple of releases.
Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Sparsemem has been a default memory model for x86-64 for over a decade,
since:
b263295dbffd ("x86: 64-bit, make sparsemem vmemmap the only memory model").
Make it the default for 32-bit NUMA systems (if there any left) as well.
Signed-off-by: Mike Rapoport <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
PC8/PC9/PC10 counters
Kaby Lake (and Coffee Lake) has PC8/PC9/PC10 residency counters.
This patch updates the list of Kaby/Coffee Lake PMU event counters
from the snb_cstates[] list of events to the hswult_cstates[]
list of events, which keeps all previously supported events and
also adds the PKG_C8, PKG_C9 and PKG_C10 residency counters.
This allows user space tools to profile them through the perf interface.
Signed-off-by: Harry Pan <[email protected]>
Cc: <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
We have supported per-device dma_map_ops in generic code for a long
time, and this symbol just guards the inclusion of the dma_map_ops
registry used for vmd. Stop enabling it for anything but vmd.
No change in functionality intended.
Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
When building x86 with Clang LTO and CFI, CFI jump regions are
automatically added to the end of the .text section late in linking. As a
result, the _etext position was being labelled before the appended jump
regions, causing confusion about where the boundaries of the executable
region actually are in the running kernel, and broke at least the fault
injection code. This moves the _etext mark to outside (and immediately
after) the .text area, as it already the case on other architectures
(e.g. arm64, arm).
Reported-and-tested-by: Sami Tolvanen <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/20190423183827.GA4012@beast
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Towards the goal of removing cc-ldoption, it seems that --hash-style=
was added to binutils 2.17.50.0.2 in 2006. The minimal required version
of binutils for the kernel according to
Documentation/process/changes.rst is 2.20.
Suggested-by: Masahiro Yamada <[email protected]>
Signed-off-by: Nick Desaulniers <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Link: https://gcc.gnu.org/ml/gcc/2007-01/msg01141.html
Signed-off-by: Ingo Molnar <[email protected]>
|
|
In-NMI warnings have been added to vmalloc_fault() via:
ebc8827f75 ("x86: Barf when vmalloc and kmemcheck faults happen in NMI")
back in the time when our NMI entry code could not cope with nested NMIs.
These days, it's perfectly fine to take a fault in NMI context and we
don't have to care about the fact that IRET from the fault handler might
cause NMI nesting.
This warning has already been removed from 32-bit implementation of
vmalloc_fault() in:
6863ea0cda8 ("x86/mm: Remove in_nmi() warning from vmalloc_fault()")
but the 64-bit version was omitted.
Remove the bogus warning also from 64-bit implementation of vmalloc_fault().
Reported-by: Nicolai Stange <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: 6863ea0cda8 ("x86/mm: Remove in_nmi() warning from vmalloc_fault()")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The __put_user() macro evaluates it's @ptr argument inside the
__uaccess_begin() / __uaccess_end() region. While this would normally
not be expected to be an issue, an UBSAN bug (it ignored -fwrapv,
fixed in GCC 8+) would transform the @ptr evaluation for:
drivers/gpu/drm/i915/i915_gem_execbuffer.c: if (unlikely(__put_user(offset, &urelocs[r-stack].presumed_offset))) {
into a signed-overflow-UB check and trigger the objtool AC validation.
Finish this commit:
2a418cf3f5f1 ("x86/uaccess: Don't leak the AC flag into __put_user() value evaluation")
and explicitly evaluate all 3 arguments early.
Reported-by: Randy Dunlap <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Randy Dunlap <[email protected]> # build-tested
Acked-by: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Fixes: 2a418cf3f5f1 ("x86/uaccess: Don't leak the AC flag into __put_user() value evaluation")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The first kmemleak_scan() call after boot would trigger the crash below
because this callpath:
kernel_init
free_initmem
mem_encrypt_free_decrypted_mem
free_init_pages
unmaps memory inside the .bss when DEBUG_PAGEALLOC=y.
kmemleak_init() will register the .data/.bss sections and then
kmemleak_scan() will scan those addresses and dereference them looking
for pointer references. If free_init_pages() frees and unmaps pages in
those sections, kmemleak_scan() will crash if referencing one of those
addresses:
BUG: unable to handle kernel paging request at ffffffffbd402000
CPU: 12 PID: 325 Comm: kmemleak Not tainted 5.1.0-rc4+ #4
RIP: 0010:scan_block
Call Trace:
scan_gray_list
kmemleak_scan
kmemleak_scan_thread
kthread
ret_from_fork
Since kmemleak_free_part() is tolerant to unknown objects (not tracked
by kmemleak), it is fine to call it from free_init_pages() even if not
all address ranges passed to this function are known to kmemleak.
[ bp: Massage. ]
Fixes: b3f0907c71e0 ("x86/mm: Add .bss..decrypted section to hold shared variables")
Signed-off-by: Qian Cai <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brijesh Singh <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: x86-ml <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
git://anongit.freedesktop.org/drm/drm-intel into drm-next
UAPI Changes:
- uAPI "Fixes:" patch for the upcoming kernel 5.1, included here too
We have an Ack from the media folks (only current user) for this
late tweak
Cross-subsystem Changes:
- ALSA: hda: Fix racy display power access (Takashi, Chris)
Driver Changes:
- DDI and MIPI-DSI clocks fixes for Icelake (Vandita)
- Fix Icelake frequency change/locking (RPS) (Mika)
- Temporarily disable ppGTT read-only bit on Icelake (Mika)
- Add missing Icelake W/As (Mika)
- Enable 12 deep CSB status FIFO on Icelake (Mika)
- Inherit more Icelake code for Elkhartlake (Bob, Jani)
- Handle catastrophic error on engine reset (Mika)
- Shortcut readiness to reset check (Mika)
- Regression fix for GEM_BUSY causing us to report a mixed uabi-class request as not busy (Chris)
- Revert back to max link rate and lane count on eDP (Jani)
- Fix pipe BPP readout for BXT/GLK DSI (Ville)
- Set DP min_bpp to 8*3 for non-RGB output formats (Ville)
- Enable coarse preemption boundaries for Gen8 (Chris)
- Do not enable FEC without DSC (Ville)
- Restore correct BXT DDI latency optim setting calculation (Ville)
- Always reset context's RING registers to avoid running workload twice during reset (Chris)
- Set GPU wedged on driver unload (Janusz)
- Consolidate two similar barries from timeline into one (Chris)
- Only reset the pinned kernel contexts on resume (Chris)
- Wakeref tracking improvements (Chris, Imre)
- Lockdep fixes for shrinker interactions (Chris)
- Bump ready tasks ahead of busywaits in prep of semaphore use (Chris)
- Huge step in splitting display code into fine grained files (Jani)
- Refactor the IRQ init/reset macros for code saving (Paulo)
- Convert IRQ initialization code to uncore MMIO access (Paulo)
- Convert workarounds code to use uncore MMIO access (Chris)
- Nuke drm_crtc_state and use intel_atomic_state instead (Manasi)
- Update SKL clock-gating WA (Radhakrishna, Ville)
- Isolate GuC reset code flow (Chris)
- Expose force_dsc_enable through debugfs (Manasi)
- Header standalone compile testing framework (Jani)
- Code cleanups to reduce driver footprint (Chris)
- PSR code fixes and cleanups (Jose)
- Sparse and kerneldoc updates (Chris)
- Suppress spurious combo PHY B warning (Vile)
Signed-off-by: Dave Airlie <[email protected]>
From: Joonas Lahtinen <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
AMD family 17h Models 10h-2Fh may report a high number of L1 BTB MCA
errors under certain conditions. The errors are benign and can safely be
ignored. However, the high error rate may cause the MCA threshold
counter to overflow causing a high rate of thresholding interrupts.
In addition, users may see the errors reported through the AMD MCE
decoder module, even with the interrupt disabled, due to MCA polling.
Clear the "Counter Present" bit in the Instruction Fetch bank's
MCA_MISC0 register. This will prevent enabling MCA thresholding on this
bank which will prevent the high interrupt rate due to this error.
Define an AMD-specific function to filter these errors from the MCE
event pool so that they don't get reported during early boot.
Rename filter function in EDAC/mce_amd to avoid a naming conflict, while
at it.
[ bp: Move function prototype to the internal header and
massage/cleanup, fix typos. ]
Reported-by: Rafał Miłecki <[email protected]>
Signed-off-by: Yazen Ghannam <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "[email protected]" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morse <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Pu Wen <[email protected]>
Cc: Qiuxu Zhuo <[email protected]>
Cc: Shirish S <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: linux-edac <[email protected]>
Cc: x86-ml <[email protected]>
Cc: <[email protected]> # 5.0.x: c95b323dcd35: x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
Cc: <[email protected]> # 5.0.x: 30aa3d26edb0: x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk
Cc: <[email protected]> # 5.0.x: 9308fd407455: x86/MCE: Group AMD function prototypes in <asm/mce.h>
Cc: <[email protected]> # 5.0.x
Link: https://lkml.kernel.org/r/[email protected]
|
|
Some systems may report spurious MCA errors. In general, spurious MCA
errors may be disabled by clearing a particular bit in MCA_CTL. However,
clearing a bit in MCA_CTL may not be recommended for some errors, so the
only option is to ignore them.
An MCA error is printed and handled after it has been added to the MCE
event pool. So an MCA error can be ignored by not adding it to that pool
in the first place.
Add such a filtering function.
[ bp: Move function prototype to the internal header and massage. ]
Signed-off-by: Yazen Ghannam <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "[email protected]" <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Pu Wen <[email protected]>
Cc: Qiuxu Zhuo <[email protected]>
Cc: "[email protected]" <[email protected]>
Cc: Shirish S <[email protected]>
Cc: <[email protected]> # 5.0.x
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: x86-ml <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Add a new command-line option "xen_timer_slop=<INT>" that sets the
minimum delta of virtual Xen timers. This commit does not change the
default timer slop value for virtual Xen timers.
Lowering the timer slop value should improve the accuracy of virtual
timers (e.g., better process dispatch latency), but it will likely
increase the number of virtual timer interrupts (relative to the
original slop setting).
The original timer slop value has not changed since the introduction
of the Xen-aware Linux kernel code. This commit provides users an
opportunity to tune timer performance given the refinements to
hardware and the Xen event channel processing. It also mirrors
a feature in the Xen hypervisor - the "timer_slop" Xen command line
option.
[boris: updated comment describing TIMER_SLOP]
Signed-off-by: Ryan Thibodeaux <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Signed-off-by: Boris Ostrovsky <[email protected]>
|
|
INVALIDATE_TLB_VECTOR_START has been removed by:
52aec3308db8("x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR")
while VSYSCALL_EMU_VECTO(204) has also been removed, by:
3ae36655b97a("x86-64: Rework vsyscall emulation and add vsyscall= parameter")
so update the comments in <asm/irq_vectors.h> accordingly.
Signed-off-by: Jiang Biao <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The original intention to move RDSP parsing very early, before KASLR
does its ranges selection, was to accommodate movable memory regions
machines (CONFIG_MEMORY_HOTREMOVE) to still be able to do memory
hotplug.
However, that broke kexec'ing a kernel on EFI machines because depending
on where the EFI systab was mapped, on at least one machine it isn't
present in the kexec mapping of the second kernel, leading to a triple
fault in the early code.
Fixing this properly requires significantly involved surgery and we
cannot allow ourselves to do that, that close to the merge window.
So disable the RSDP parsing code temporarily until it is fixed properly
in the next release cycle.
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Chao Fan <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Ingo Molnar <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: [email protected]
Cc: Kees Cook <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: [email protected]
Cc: Thomas Gleixner <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: x86-ml <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
crashkernel=xM tries to reserve memory for the crash kernel under 4G,
which is enough, usually. But this could fail sometimes, for example
when one tries to reserve a big chunk like 2G, for example.
So let the crashkernel=xM just fall back to use high memory in case it
fails to find a suitable low range. Do not set the ,high as default
because it allocates extra low memory for DMA buffers and swiotlb, and
this is not always necessary for all machines.
Typically, crashkernel=128M usually works with low reservation under 4G,
so keep <4G as default.
[ bp: Massage. ]
Signed-off-by: Dave Young <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Ingo Molnar <[email protected]>
Acked-by: Baoquan He <[email protected]>
Cc: Dave Young <[email protected]>
Cc: David Howells <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: [email protected]
Cc: "Paul E. McKenney" <[email protected]>
Cc: Petr Tesarik <[email protected]>
Cc: [email protected]
Cc: Ram Pai <[email protected]>
Cc: Sinan Kaya <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Thymo van Beers <[email protected]>
Cc: [email protected]
Cc: x86-ml <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Zhimin Gu <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The kdump crashkernel low reservation is limited to under 896M even for
X86_64. This obscure and miserable limitation exists for compatibility
with old kexec-tools but the reason is not documented anywhere.
Some more tests/investigations about the background:
a) Previously, old kexec-tools could only load purgatory to memory under
2G. Eric removed that limitation in 2012 in kexec-tools:
b4f9f8599679 ("kexec x86_64: Make purgatory relocatable anywhere
in the 64bit address space.")
b) Back in 2013 Yinghai removed all the limitations in new kexec-tools,
bzImage64 can be loaded anywhere:
82c3dd2280d2 ("kexec, x86_64: Load bzImage64 above 4G")
c) Test results with old kexec-tools with old and latest kernels:
1. Old kexec-tools can not build with modern toolchain anymore,
I built it in a RHEL6 vm.
2. 2.0.0 kexec-tools does not work with the latest kernel even with
memory under 896M and gives an error:
"ELF core (kcore) parse failed"
For that it needs below kexec-tools fix:
ed15ba1b9977 ("build_mem_phdrs(): check if p_paddr is invalid")
3. Even with patched kexec-tools which fixes 2), it still needs some
other fixes to work correctly for KASLR-enabled kernels.
So the situation is:
* Old kexec-tools is already broken with latest kernels.
* We can not keep these limitations forever just for compatibility with very
old kexec-tools.
* If one must use old tools then he/she can choose crashkernel=X@Y.
* People have reported bugs where crashkernel=384M failed because KASLR
makes the 0-896M space sparse.
* Crashkernel can reserve in low or high area, it is natural to understand
low as memory under 4G.
Hence drop the 896M limitation and change crashkernel low reservation to
reserve under 4G by default.
Signed-off-by: Dave Young <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Ingo Molnar <[email protected]>
Acked-by: Baoquan He <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Howells <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Petr Tesarik <[email protected]>
Cc: [email protected]
Cc: Ram Pai <[email protected]>
Cc: Sinan Kaya <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: x86-ml <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Zhimin Gu <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
So we are going to be staring at those in the next years, let's make
them more succinct. In particular:
- change "address = " to "address: "
- "-privileged" reads funny. It should be simply "kernel" or "user"
- "from kernel code" reads funny too. "kernel mode" or "user mode" is
more natural.
An actual example says more than 1000 words, of course:
[ 0.248370] BUG: kernel NULL pointer dereference, address: 00000000000005b8
[ 0.249120] #PF: supervisor write access in kernel mode
[ 0.249717] #PF: error_code(0x0002) - not-present page
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Misc fixes:
- various tooling fixes
- kretprobe fixes
- kprobes annotation fixes
- kprobes error checking fix
- fix the default events for AMD Family 17h CPUs
- PEBS fix
- AUX record fix
- address filtering fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kprobes: Avoid kretprobe recursion bug
kprobes: Mark ftrace mcount handler functions nokprobe
x86/kprobes: Verify stack frame on kretprobe
perf/x86/amd: Add event map for AMD Family 17h
perf bpf: Return NULL when RB tree lookup fails in perf_env__find_btf()
perf tools: Fix map reference counting
perf evlist: Fix side band thread draining
perf tools: Check maps for bpf programs
perf bpf: Return NULL when RB tree lookup fails in perf_env__find_bpf_prog_info()
tools include uapi: Sync sound/asound.h copy
perf top: Always sample time to satisfy needs of use of ordered queuing
perf evsel: Use hweight64() instead of hweight_long(attr.sample_regs_user)
tools lib traceevent: Fix missing equality check for strcmp
perf stat: Disable DIR_FORMAT feature for 'perf stat record'
perf scripts python: export-to-sqlite.py: Fix use of parent_id in calls_view
perf header: Fix lock/unlock imbalances when processing BPF/BTF info
perf/x86: Fix incorrect PEBS_REGS
perf/ring_buffer: Fix AUX record suppression
perf/core: Fix the address filtering fix
kprobes: Fix error check when reusing optimized probes
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Misc fixes all over the place: a console spam fix, section attributes
fixes, a KASLR fix, a TLB stack-variable alignment fix, a reboot
quirk, boot options related warnings fix, an LTO fix, a deadlock fix
and an RDT fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/intel: Lower the "ENERGY_PERF_BIAS: Set to normal" message's log priority
x86/cpu/bugs: Use __initconst for 'const' init data
x86/mm/KASLR: Fix the size of the direct mapping section
x86/Kconfig: Fix spelling mistake "effectivness" -> "effectiveness"
x86/mm/tlb: Revert "x86/mm: Align TLB invalidation info"
x86/reboot, efi: Use EFI reboot for Acer TravelMate X514-51T
x86/mm: Prevent bogus warnings with "noexec=off"
x86/build/lto: Fix truncated .bss with -fdata-sections
x86/speculation: Prevent deadlock on ssb_state::lock
x86/resctrl: Do not repeat rdtgroup mode initialization
|
|
ia64, parisc and sparc just use a copy of the generic version
of asm/sockios.h, and x86 is a redirect to the same file, so we
can just let the header file be generated.
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The go32() and go64() functions has an argument and a local variable called ‘name’.
Rename both to clarify the code and to fix a warning with -Wshadow.
Signed-off-by: Leonardo Brás <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Michal Marek <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/20181023011022.GA6574@WindFlash
Signed-off-by: Ingo Molnar <[email protected]>
|
|
In case when the number of entries in the section header table is larger
then or equal to SHN_LORESERVE the size of the table is held in the sh_size
member of the initial entry in section header table instead of e_shnum.
Same with the string table index which is located in sh_link instead of
e_shstrndx.
This case is easily reproducible with KCFLAGS="-ffunction-sections",
bzImage build fails with "String table index out of bounds" error.
Signed-off-by: Artem Savkov <[email protected]>
Reviewed-by: Josh Poimboeuf <[email protected]>
Acked-by: Joe Lawrence <[email protected]>
Cc: Eric W . Biederman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Simplify the die() lines. ]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
DEBUG_HOTPLUG_CPU0 debug feature offlines a CPU as early as possible
allowing userspace to boot up without that CPU (so that it is possible
to check for unwanted dependencies towards the offlined CPU). After
doing so it emits a "CPU %u is now offline" pr_info, which is not enough
descriptive of why the CPU was offlined (e.g., one might be running with
a config that triggered some problem, not being aware that CONFIG_DEBUG_
HOTPLUG_CPU0 is set).
Add a bit more of informative text to the pr_info, so that it is
immediately obvious why a CPU has been offlined in early boot stages.
Background:
Got to scratch my head a bit while debugging a WARNING splat related to
the offlining of CPU0. Without being aware yet of this debug option it
wasn't immediately obvious why CPU0 was being offlined by the kernel.
Signed-off-by: Juri Lelli <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
[ Merge line-broken line. ]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Linus pointed out that deciphering the raw #PF error code and printing
a more human readable message are two different things, and also that
printing the negative cases is mostly just noise[1]. For example, the
USER bit doesn't mean the fault originated in user code and stating
that an oops wasn't due to a protection keys violation isn't interesting
since an oops on a keys violation is a one-in-a-million scenario.
Remove the per-bit decoding of the error code and instead print:
- the raw error code
- why the fault occurred
- the effective privilege level of the access
- the type of access
- whether the fault originated in user code or kernel code
This provides the user with the information needed to triage 99.9% of
oopses without polluting the log with useless information or conflating
the error_code with the CPL.
Sample output:
BUG: kernel NULL pointer dereference, address = 0000000000000008
#PF: supervisor-privileged instruction fetch from kernel code
#PF: error_code(0x0010) - not-present page
BUG: unable to handle page fault for address = ffffbeef00000000
#PF: supervisor-privileged instruction fetch from kernel code
#PF: error_code(0x0010) - not-present page
BUG: unable to handle page fault for address = ffffc90000230000
#PF: supervisor-privileged write access from kernel code
#PF: error_code(0x000b) - reserved bit violation
[1] https://lkml.kernel.org/r/CAHk-=whk_fsnxVMvF1T2fFCaP2WrvSybABrLQCWLJyCvHw6NKA@mail.gmail.com
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Reword the NULL pointer dereference case to simply state that a NULL
pointer was dereferenced, i.e. drop "unable to handle" as that implies
that there are instances where the kernel actual does handle NULL
pointer dereferences, which is not true barring funky exception fixup.
For the non-NULL case, replace "kernel paging request" with "page fault"
as the kernel can technically oops on faults that originated in user
code. Dropping "kernel" also allows future patches to provide detailed
information on where the fault occurred, e.g. user vs. kernel, without
conflicting with the initial BUG message.
In both cases, replace "at address=" with wording more appropriate to
the oops, as "at" may be interpreted as stating that the address is the
RIP of the instruction that faulted.
Last, and probably least, further qualify the NULL-pointer path by
checking that the fault actually originated in kernel code. It's
technically possible for userspace to map address 0, and not printing
a super specific message is the least of our worries if the kernel does
manage to oops on an actual NULL pointer dereference from userspace.
Before:
BUG: unable to handle kernel NULL pointer dereference at ffffbeef00000000
BUG: unable to handle kernel paging request at ffffbeef00000000
After:
BUG: kernel NULL pointer dereference, address = 0000000000000008
BUG: unable to handle page fault for address = ffffbeef00000000
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yu-cheng Yu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
For new Centaur CPUs the ucode will take care of the preservation of cache coherence
between CPU cores in C-states regardless of how deep the C-states are. So, it is not
necessary to flush the caches in software befor entering C3. This useless operation
will cause performance drop for the cores which share some caches with the idling core.
Signed-off-by: David Wang <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Pavel Machek <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
[ Tidy up the comment. ]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Use DEFINE_DEBUGFS_ATTRIBUTE() rather than DEFINE_SIMPLE_ATTRIBUTE()
for debugfs files.
Semantic patch information:
Rationale: DEFINE_SIMPLE_ATTRIBUTE + debugfs_create_file()
imposes some significant overhead as compared to
DEFINE_DEBUGFS_ATTRIBUTE + debugfs_create_file_unsafe().
Generated by: scripts/coccinelle/api/debugfs/debugfs_simple_attr.cocci
Signed-off-by: YueHaibing <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Yazen Ghannam <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
priority
The "ENERGY_PERF_BIAS: Set to 'normal', was 'performance'" message triggers
on pretty much every Intel machine. The purpose of log messages with
a warning level is to notify the user of something which potentially is
a problem, or at least somewhat unexpected.
This message clearly does not match those criteria, so lower its log
priority from warning to info.
Signed-off-by: Hans de Goede <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
This per cpu variable is accessed from assembler code, so it needs
to be visible for LTO.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
|
|
This function is referrenced from assembler, so it needs to be marked
visible for LTO.
Fixes: 3a025de64bf8 ("x86/hyperv: Enable PV qspinlock for Hyper-V")
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Yi Sun <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
|
|
LTO will happily inline __const_udelay() everywhere it is used. Forcing it
noinline saves ~44k text in a LTO build.
13999560 1740864 1499136 17239560 1070e08 vmlinux-with-udelay-inline
13954764 1736768 1499136 17190668 1064f0c vmlinux-wo-udelay-inline
Even without LTO this function should never be inlined.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The "vide" inline assembler is only needed on 32bit kernels for old
32bit only CPUs.
Guard it with an #ifdef so it's not included in 64bit builds.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
With gcc toplevel assembler statements that do not mark themselves as .text
may end up in other sections. This causes LTO boot crashes because various
assembler statements ended up in the middle of the initcall section. It's
also a latent problem without LTO, although it's currently not known to
cause any real problems.
According to the gcc team it's expected behavior.
Always mark all the top level assembler statements as text so that they
switch to the right section.
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Some of the recently added const tables use __initdata which causes section
attribute conflicts.
Use __initconst instead.
Fixes: fa1202ef2243 ("x86/speculation: Add command line control")
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
|
|
Avoid kretprobe recursion loop bg by setting a dummy
kprobes to current_kprobe per-CPU variable.
This bug has been introduced with the asm-coded trampoline
code, since previously it used another kprobe for hooking
the function return placeholder (which only has a nop) and
trampoline handler was called from that kprobe.
This revives the old lost kprobe again.
With this fix, we don't see deadlock anymore.
And you can see that all inner-called kretprobe are skipped.
event_1 235 0
event_2 19375 19612
The 1st column is recorded count and the 2nd is missed count.
Above shows (event_1 rec) + (event_2 rec) ~= (event_2 missed)
(some difference are here because the counter is racy)
Reported-by: Andrea Righi <[email protected]>
Tested-by: Andrea Righi <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Fixes: c9becf58d935 ("[PATCH] kretprobe: kretprobe-booster")
Link: http://lkml.kernel.org/r/155094064889.6137.972160690963039.stgit@devbox
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Verify the stack frame pointer on kretprobe trampoline handler,
If the stack frame pointer does not match, it skips the wrong
entry and tries to find correct one.
This can happen if user puts the kretprobe on the function
which can be used in the path of ftrace user-function call.
Such functions should not be probed, so this adds a warning
message that reports which function should be blacklisted.
Tested-by: Andrea Righi <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/155094059185.6137.15527904013362842072.stgit@devbox
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Make the anon_inodes facility unconditional so that it can be used by core
VFS code and pidfd code.
Signed-off-by: David Howells <[email protected]>
Signed-off-by: Al Viro <[email protected]>
[[email protected]: adapt commit message to mention pidfds]
Signed-off-by: Christian Brauner <[email protected]>
|
|
The "event counter" was removed from rseq before it was merged upstream.
However, a few comments in the source code still refer to it. Adapt the
comments to match reality.
Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ben Maurer <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chris Lameter <[email protected]>
Cc: Dave Watson <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
On x86 systems, only MSDOS and GPT partition tables are typically
encountered. Remove all the rest.
Note, CONFIG_EFI_PARTITION is also removed since it defaults to `y'.
Signed-off-by: Ahmed S. Darwish <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/20190306004425.GA30537@darwi-home-pc
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Syntax only, no functional or semantic change.
This routine matches packages, not die, so name it thus.
Signed-off-by: Len Brown <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Link: http://lkml.kernel.org/r/7ca18c4ae7816a1f9eda37414725df676e63589d.1551160674.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The XO-1 and XO-1.5 batteries apparently differ in an ability to report
ambient temperature. We need to use a different compatible string for the
XO-1.5 battery.
Previously olpc_dt_fixup() used the presence of the battery node's
compatible property to decide whether the DT is up to date. Now we need
to look for a particular value in the compatible string, to decide
Signed-off-by: Lubomir Rintel <[email protected]>
Acked-by: Pavel Machek <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Reichel <[email protected]>
|
|
This makes the following patch more concise.
Signed-off-by: Lubomir Rintel <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Reichel <[email protected]>
|
|
It was pointed out in a review, and checkpatch.pl complains about this.
Breaking it down into multiple ofw evaluations works just as well and
reads better.
Signed-off-by: Lubomir Rintel <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Reichel <[email protected]>
|
|
To minimize the latency of timer interrupts as observed by the guest,
KVM adjusts the values it programs into the host timers to account for
the host's overhead of programming and handling the timer event. In
the event that the adjustments are too aggressive, i.e. the timer fires
earlier than the guest expects, KVM busy waits immediately prior to
entering the guest.
Currently, KVM manually converts the delay from nanoseconds to clock
cycles. But, the conversion is done in the guest's time domain, while
the delay occurs in the host's time domain. This is perfectly ok when
the guest and host are using the same TSC ratio, but if the guest is
using a different ratio then the delay may not be accurate and could
wait too little or too long.
When the guest is not using the host's ratio, convert the delay from
guest clock cycles to host nanoseconds and use ndelay() instead of
__delay() to provide more accurate timing. Because converting to
nanoseconds is relatively expensive, e.g. requires division and more
multiplication ops, continue using __delay() directly when guest and
host TSCs are running at the same ratio.
Cc: Liran Alon <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: [email protected]
Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The introduction of adaptive tuning of lapic timer advancement did not
allow for the scenario where userspace would want to disable adaptive
tuning but still employ timer advancement, e.g. for testing purposes or
to handle a use case where adaptive tuning is unable to settle on a
suitable time. This is epecially pertinent now that KVM places a hard
threshold on the maximum advancment time.
Rework the timer semantics to accept signed values, with a value of '-1'
being interpreted as "use adaptive tuning with KVM's internal default",
and any other value being used as an explicit advancement time, e.g. a
time of '0' effectively disables advancement.
Note, this does not completely restore the original behavior of
lapic_timer_advance_ns. Prior to tracking the advancement per vCPU,
which is necessary to support autotuning, userspace could adjust
lapic_timer_advance_ns for *running* vCPU. With per-vCPU tracking, the
module params are snapshotted at vCPU creation, i.e. applying a new
advancement effectively requires restarting a VM.
Dynamically updating a running vCPU is possible, e.g. a helper could be
added to retrieve the desired delay, choosing between the global module
param and the per-VCPU value depending on whether or not auto-tuning is
(globally) enabled, but introduces a great deal of complexity. The
wrapper itself is not complex, but understanding and documenting the
effects of dynamically toggling auto-tuning and/or adjusting the timer
advancement is nigh impossible since the behavior would be dependent on
KVM's implementation as well as compiler optimizations. In other words,
providing stable behavior would require extremely careful consideration
now and in the future.
Given that the expected use of a manually-tuned timer advancement is to
"tune once, run many", use the vastly simpler approach of recognizing
changes to the module params only when creating a new vCPU.
Cc: Liran Alon <[email protected]>
Cc: Wanpeng Li <[email protected]>
Reviewed-by: Liran Alon <[email protected]>
Cc: [email protected]
Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Automatically adjusting the globally-shared timer advancement could
corrupt the timer, e.g. if multiple vCPUs are concurrently adjusting
the advancement value. That could be partially fixed by using a local
variable for the arithmetic, but it would still be susceptible to a
race when setting timer_advance_adjust_done.
And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the
correct calibration for a given vCPU may not apply to all vCPUs.
Furthermore, lapic_timer_advance_ns is marked __read_mostly, which is
effectively violated when finding a stable advancement takes an extended
amount of timer.
Opportunistically change the definition of lapic_timer_advance_ns to
a u32 so that it matches the style of struct kvm_timer. Explicitly
pass the param to kvm_create_lapic() so that it doesn't have to be
exposed to lapic.c, thus reducing the probability of unintentionally
using the global value instead of the per-vCPU value.
Cc: Liran Alon <[email protected]>
Cc: Wanpeng Li <[email protected]>
Reviewed-by: Liran Alon <[email protected]>
Cc: [email protected]
Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
To minimize the latency of timer interrupts as observed by the guest,
KVM adjusts the values it programs into the host timers to account for
the host's overhead of programming and handling the timer event. Now
that the timer advancement is automatically tuned during runtime, it's
effectively unbounded by default, e.g. if KVM is running as L1 the
advancement can measure in hundreds of milliseconds.
Disable timer advancement if adaptive tuning yields an advancement of
more than 5000ns, as large advancements can break reasonable assumptions
of the guest, e.g. that a timer configured to fire after 1ms won't
arrive on the next instruction. Although KVM busy waits to mitigate the
case of a timer event arriving too early, complications can arise when
shifting the interrupt too far, e.g. kvm-unit-test's vmx.interrupt test
will fail when its "host" exits on interrupts as KVM may inject the INTR
before the guest executes STI+HLT. Arguably the unit test is "broken"
in the sense that delaying a timer interrupt by 1ms doesn't technically
guarantee the interrupt will arrive after STI+HLT, but it's a reasonable
assumption that KVM should support.
Furthermore, an unbounded advancement also effectively unbounds the time
spent busy waiting, e.g. if the guest programs a timer with a very large
delay.
5000ns is a somewhat arbitrary threshold. When running on bare metal,
which is the intended use case, timer advancement is expected to be in
the general vicinity of 1000ns. 5000ns is high enough that false
positives are unlikely, while not being so high as to negatively affect
the host's performance/stability.
Note, a future patch will enable userspace to disable KVM's adaptive
tuning, which will allow priveleged userspace will to specifying an
advancement value in excess of this arbitrary threshold in order to
satisfy an abnormal use case.
Cc: Liran Alon <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: [email protected]
Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
It was reported that with some special Multi Processor Group configuration,
e.g:
bcdedit.exe /set groupsize 1
bcdedit.exe /set maxgroup on
bcdedit.exe /set groupaware on
for a 16-vCPU guest WS2012 shows BSOD on boot when PV TLB flush mechanism
is in use.
Tracing kvm_hv_flush_tlb immediately reveals the issue:
kvm_hv_flush_tlb: processor_mask 0x0 address_space 0x0 flags 0x2
The only flag set in this request is HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES,
however, processor_mask is 0x0 and no HV_FLUSH_ALL_PROCESSORS is specified.
We don't flush anything and apparently it's not what Windows expects.
TLFS doesn't say anything about such requests and newer Windows versions
seem to be unaffected. This all feels like a WS2012 bug, which is, however,
easy to workaround in KVM: let's flush everything when we see an empty
flush request, over-flushing doesn't hurt.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
If guest sets MSR_IA32_TSCDEADLINE to value such that in host
time-domain it's shorter than lapic_timer_advance_ns, we can
reach a case that we call hrtimer_start() with expiration time set at
the past.
Because lapic_timer.timer is init with HRTIMER_MODE_ABS_PINNED, it
is not allowed to run in softirq and therefore will never expire.
To avoid such a scenario, verify that deadline expiration time is set on
host time-domain further than (now + lapic_timer_advance_ns).
A future patch can also consider adding a min_timer_deadline_ns module parameter,
similar to min_timer_period_us to avoid races that amount of ns it takes
to run logic could still call hrtimer_start() with expiration timer set
at the past.
Reviewed-by: Joao Martins <[email protected]>
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|