Age | Commit message (Collapse) | Author | Files | Lines |
|
* intel_pstate-fix:
cpufreq: intel_pstate: report correct CPU frequencies during trace
* cpufreq-x86-fix:
cpufreq: x86: Disable interrupts during MSRs reading
|
|
Currently KASLR will parse all e820 entries of RAM type and add all
candidate positions into the slots array. After that we choose one slot
randomly as the new position which the kernel will be decompressed into
and run at.
On systems with EFI enabled, e820 memory regions are coming from EFI
memory regions by combining adjacent regions.
These EFI memory regions have various attributes, and the "mirrored"
attribute is one of them. The physical memory region whose descriptors
in EFI memory map has EFI_MEMORY_MORE_RELIABLE attribute (bit: 16) are
mirrored. The address range mirroring feature of the kernel arranges such
mirrored regions into normal zones and other regions into movable zones.
With the mirroring feature enabled, the code and data of the kernel can only
be located in the more reliable mirrored regions. However, the current KASLR
code doesn't check EFI memory entries, and could choose a new kernel position
in non-mirrored regions. This will break the intended functionality of the
address range mirroring feature.
To fix this, if EFI is detected, iterate EFI memory map and pick the mirrored
region to process for adding candidate of randomization slot. If EFI is disabled
or no mirrored region found, still process the e820 memory map.
Signed-off-by: Baoquan He <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
[ Rewrote most of the text. ]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The existing map iteration helper for_each_efi_memory_desc_in_map can
only be used after the kernel initializes the EFI subsystem to set up
struct efi_memory_map.
Before that we also need iterate map descriptors which are stored in several
intermediate structures, like struct efi_boot_memmap for arch independent
usage and struct efi_info for x86 arch only.
Introduce efi_early_memdesc_ptr() to get pointer to a map descriptor, and
replace several places where that primitive is open coded.
Signed-off-by: Baoquan He <[email protected]>
[ Various improvements to the text. ]
Acked-by: Matt Fleming <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/20170816134651.GF21273@x1
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
This implements refcount_t overflow protection on x86 without a noticeable
performance impact, though without the fuller checking of REFCOUNT_FULL.
This is done by duplicating the existing atomic_t refcount implementation
but with normally a single instruction added to detect if the refcount
has gone negative (e.g. wrapped past INT_MAX or below zero). When detected,
the handler saturates the refcount_t to INT_MIN / 2. With this overflow
protection, the erroneous reference release that would follow a wrap back
to zero is blocked from happening, avoiding the class of refcount-overflow
use-after-free vulnerabilities entirely.
Only the overflow case of refcounting can be perfectly protected, since
it can be detected and stopped before the reference is freed and left to
be abused by an attacker. There isn't a way to block early decrements,
and while REFCOUNT_FULL stops increment-from-zero cases (which would
be the state _after_ an early decrement and stops potential double-free
conditions), this fast implementation does not, since it would require
the more expensive cmpxchg loops. Since the overflow case is much more
common (e.g. missing a "put" during an error path), this protection
provides real-world protection. For example, the two public refcount
overflow use-after-free exploits published in 2016 would have been
rendered unexploitable:
http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/
http://cyseclabs.com/page?n=02012016
This implementation does, however, notice an unchecked decrement to zero
(i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it
resulted in a zero). Decrements under zero are noticed (since they will
have resulted in a negative value), though this only indicates that a
use-after-free may have already happened. Such notifications are likely
avoidable by an attacker that has already exploited a use-after-free
vulnerability, but it's better to have them reported than allow such
conditions to remain universally silent.
On first overflow detection, the refcount value is reset to INT_MIN / 2
(which serves as a saturation value) and a report and stack trace are
produced. When operations detect only negative value results (such as
changing an already saturated value), saturation still happens but no
notification is performed (since the value was already saturated).
On the matter of races, since the entire range beyond INT_MAX but before
0 is negative, every operation at INT_MIN / 2 will trap, leaving no
overflow-only race condition.
As for performance, this implementation adds a single "js" instruction
to the regular execution flow of a copy of the standard atomic_t refcount
operations. (The non-"and_test" refcount_dec() function, which is uncommon
in regular refcount design patterns, has an additional "jz" instruction
to detect reaching exactly zero.) Since this is a forward jump, it is by
default the non-predicted path, which will be reinforced by dynamic branch
prediction. The result is this protection having virtually no measurable
change in performance over standard atomic_t operations. The error path,
located in .text.unlikely, saves the refcount location and then uses UD0
to fire a refcount exception handler, which resets the refcount, handles
reporting, and returns to regular execution. This keeps the changes to
.text size minimal, avoiding return jumps and open-coded calls to the
error reporting routine.
Example assembly comparison:
refcount_inc() before:
.text:
ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp)
refcount_inc() after:
.text:
ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp)
ffffffff8154614d: 0f 88 80 d5 17 00 js ffffffff816c36d3
...
.text.unlikely:
ffffffff816c36d3: 48 8d 4d f4 lea -0xc(%rbp),%rcx
ffffffff816c36d7: 0f ff (bad)
These are the cycle counts comparing a loop of refcount_inc() from 1
to INT_MAX and back down to 0 (via refcount_dec_and_test()), between
unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL
(refcount_t-full), and this overflow-protected refcount (refcount_t-fast):
2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s:
cycles protections
atomic_t 82249267387 none
refcount_t-fast 82211446892 overflow, untested dec-to-zero
refcount_t-full 144814735193 overflow, untested dec-to-zero, inc-from-zero
This code is a modified version of the x86 PAX_REFCOUNT atomic_t
overflow defense from the last public patch of PaX/grsecurity, based
on my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code. Thanks
to PaX Team for various suggestions for improvement for repurposing this
code to be a refcount-only protection.
Signed-off-by: Kees Cook <[email protected]>
Reviewed-by: Josh Poimboeuf <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Elena Reshetova <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Hans Liljestrand <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Manfred Spraul <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Serge E. Hallyn <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: linux-arch <[email protected]>
Link: http://lkml.kernel.org/r/20170815161924.GA133115@beast
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Speculative processor accesses may reference any memory that has a
valid page table entry. While a speculative access won't generate
a machine check, it will log the error in a machine check bank. That
could cause escalation of a subsequent error since the overflow bit
will be then set in the machine check bank status register.
Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual
address of the page we want to map out otherwise we may trigger the
very problem we are trying to avoid. We use a non-canonical address
that passes through the usual Linux table walking code to get to the
same "pte".
Thanks to Dave Hansen for reviewing several iterations of this.
Also see:
http://marc.info/?l=linux-mm&m=149860136413338&w=2
Signed-off-by: Tony Luck <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Elliott, Robert (Persistent Memory) <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Commit:
d77698df39a5 ("x86/build: Specify stack alignment for clang")
intended to use the same stack alignment for clang as with gcc.
The two compilers use different options to configure the stack alignment
(gcc: -mpreferred-stack-boundary=n, clang: -mstack-alignment=n).
The above commit assumes that the clang option uses the same parameter
type as gcc, i.e. that the alignment is specified as 2^n. However clang
interprets the value of this option literally to use an alignment of n,
in consequence the stack remains misaligned.
Change the values used with -mstack-alignment to be the actual alignment
instead of a power of two.
cc-option isn't used here with the typical pattern of KBUILD_CFLAGS +=
$(call cc-option ...). The reason is that older gcc versions don't
support the -mpreferred-stack-boundary option, since cc-option doesn't
verify whether the alternative option is valid it would incorrectly
select the clang option -mstack-alignment..
Signed-off-by: Matthias Kaehlcke <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: [email protected]
Cc: Greg Hackmann <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Michael Davidson <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephen Hines <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
__startup_64() is normally using fixup_pointer() to access globals in a
position-independent fashion. However 'next_early_pgt' was accessed
directly, which wasn't guaranteed to work.
Luckily GCC was generating a R_X86_64_PC32 PC-relative relocation for
'next_early_pgt', but Clang emitted a R_X86_64_32S, which led to
accessing invalid memory and rebooting the kernel.
Signed-off-by: Alexander Potapenko <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Davidson <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: c88d71508e36 ("x86/boot/64: Rewrite startup_64() in C")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
register_nmi_handler() can be called from PREEMPT_RT atomic context
(e.g. wakeup_cpu_via_init_nmi() or native_stop_other_cpus()), and thus
ordinary spinlocks cannot be used.
Signed-off-by: Scott Wood <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Don Zickus <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
The ADDR_NO_RANDOMIZE checks in stack_maxrandom_size() and
randomize_stack_top() are not required.
PF_RANDOMIZE is set by load_elf_binary() only if ADDR_NO_RANDOMIZE is not
set, no need to re-check after that.
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Dmitry Safonov <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
Documentation/admin-guide/kernel-parameters.txt says:
norandmaps Don't use address space randomization. Equivalent
to echo 0 > /proc/sys/kernel/randomize_va_space
but it doesn't work because arch_rnd() which is used to randomize
mm->mmap_base returns a random value unconditionally. And as Kirill
pointed out, ADDR_NO_RANDOMIZE is broken by the same reason.
Just shift the PF_RANDOMIZE check from arch_mmap_rnd() to arch_rnd().
Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Cyrill Gorcunov <[email protected]>
Reviewed-by: Dmitry Safonov <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
The use of the ternary operator is redundant as ret can never be
non-zero at that point. Instead, just return nbytes.
Detected by CoverityScan, CID#1452658 ("Logically dead code")
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Vikas Shivappa <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
|
|
During a mkdir, the entire limbo list is synchronously checked on each
package for free RMIDs by sending IPIs. With a large number of RMIDs (SKL
has 192) this creates a intolerable amount of work in IPIs.
Replace the IPI based checking of the limbo list with asynchronous worker
threads on each package which periodically scan the limbo list and move the
RMIDs that have:
llc_occupancy < threshold_occupancy
on all packages to the free list.
mkdir now returns -ENOSPC if the free list and the limbo list ere empty or
returns -EBUSY if there are RMIDs on the limbo list and the free list is
empty.
Getting rid of the IPIs also simplifies the data structures and the
serialization required for handling the lists.
[ tglx: Rewrote changelog ... ]
Signed-off-by: Vikas Shivappa <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
|
|
When a CPU is dying, the overflow worker is canceled and rescheduled on a
different CPU in the same domain. But if the timer is already about to
expire this essentially doubles the interval which might result in a non
detected overflow.
Cancel the overflow worker and reschedule it immediately on a different CPU
in same domain. The work could be flushed as well, but that would
reschedule it on the same CPU.
[ tglx: Rewrote changelog once again ]
Reported-by: Thomas Gleixner <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
|
|
|
|
Larry reported a CPU hotplug lock recursion in the MTRR code.
============================================
WARNING: possible recursive locking detected
systemd-udevd/153 is trying to acquire lock:
(cpu_hotplug_lock.rw_sem){.+.+.+}, at: [<c030fc26>] stop_machine+0x16/0x30
but task is already holding lock:
(cpu_hotplug_lock.rw_sem){.+.+.+}, at: [<c0234353>] mtrr_add_page+0x83/0x470
....
cpus_read_lock+0x48/0x90
stop_machine+0x16/0x30
mtrr_add_page+0x18b/0x470
mtrr_add+0x3e/0x70
mtrr_add_page() holds the hotplug rwsem already and calls stop_machine()
which acquires it again.
Call stop_machine_cpuslocked() instead.
Reported-and-tested-by: Larry Finger <[email protected]>
Reported-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708140920250.1865@nanos
Cc: "Paul E. McKenney" <[email protected]>
Cc: Borislav Petkov <[email protected]>
|
|
When I cleaned up the Xen SYSCALL entries, I inadvertently changed
the reported segment registers. Before my patch, regs->ss was
__USER(32)_DS and regs->cs was __USER(32)_CS. After the patch, they
are FLAT_USER_CS/DS(32).
This had a couple unfortunate effects. It confused the
opportunistic fast return logic. It also significantly increased
the risk of triggering a nasty glibc bug:
https://sourceware.org/bugzilla/show_bug.cgi?id=21269
Update the Xen entry code to change it back.
Reported-by: Brian Gerst <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Fixes: 8a9949bc71a7 ("x86/xen/64: Rearrange the SYSCALL entries")
Link: http://lkml.kernel.org/r/daba8351ea2764bb30272296ab9ce08a81bd8264.1502775273.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Linux 4.13-rc5
There's a really nasty nouveau collision, hopefully someone can take a look
once I pushed this out.
|
|
We want the firmware, and other changes, in here as well.
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
"Fix an error path bug in ixp4xx as well as a read overrun in
sha1-avx2"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: x86/sha1 - Fix reads beyond the number of blocks passed
crypto: ixp4xx - Fix error handling path in 'aead_perform()'
|
|
Currently we have pqr_state and rdt_default_state which store the cached
CLOSID/RMIDs and the user configured cpu default values respectively. We
touch both of these during context switch. Put all of them in one
structure so that we can spare a cache line.
Reported-by: Thomas Gleixner <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
|
|
The user configured per cpu default RMID is not cleared during cpu
hotplug. This may lead to incorrect RMID values after a cpu goes offline
and again comes back online. Clear the per cpu default RMID during cpu
offline and online handling.
Reported-by: Prakyha Sai Praneeth <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
|
|
include/linux/i2c is not for client devices. Move the header file to a
more appropriate location.
Signed-off-by: Wolfram Sang <[email protected]>
Acked-by: Patrik Jakobsson <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Some fixes for Xen:
- a fix for a regression introduced in 4.13 for a Xen HVM-guest
configured with KASLR
- a fix for a possible deadlock in the xenbus driver when booting the
system
- a fix for lost interrupts in Xen guests"
* tag 'for-linus-4.13b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/events: Fix interrupt lost during irq_disable and irq_enable
xen: avoid deadlock in xenbus
xen: fix hvm guest with kaslr enabled
xen: split up xen_hvm_init_shared_info()
x86: provide an init_mem_mapping hypervisor hook
|
|
Host-initiated writes to the IA32_APIC_BASE MSR do not have to follow
local APIC state transition constraints, but the value written must be
valid.
Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Bailing out immediately if there is no available mmu page to alloc.
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [warn_test:3089]
irq event stamp: 20532
hardirqs last enabled at (20531): [<ffffffff8e9b6908>] restore_regs_and_iret+0x0/0x1d
hardirqs last disabled at (20532): [<ffffffff8e9b7ae8>] apic_timer_interrupt+0x98/0xb0
softirqs last enabled at (8266): [<ffffffff8e9badc6>] __do_softirq+0x206/0x4c1
softirqs last disabled at (8253): [<ffffffff8e083918>] irq_exit+0xf8/0x100
CPU: 5 PID: 3089 Comm: warn_test Tainted: G OE 4.13.0-rc3+ #8
RIP: 0010:kvm_mmu_prepare_zap_page+0x72/0x4b0 [kvm]
Call Trace:
make_mmu_pages_available.isra.120+0x71/0xc0 [kvm]
kvm_mmu_load+0x1cf/0x410 [kvm]
kvm_arch_vcpu_ioctl_run+0x1316/0x1bf0 [kvm]
kvm_vcpu_ioctl+0x340/0x700 [kvm]
? kvm_vcpu_ioctl+0x340/0x700 [kvm]
? __fget+0xfc/0x210
do_vfs_ioctl+0xa4/0x6a0
? __fget+0x11d/0x210
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x23/0xc2
? __this_cpu_preempt_check+0x13/0x20
This can be reproduced readily by ept=N and running syzkaller tests since
many syzkaller testcases don't setup any memory regions. However, if ept=Y
rmode identity map will be created, then kvm_mmu_calculate_mmu_pages() will
extend the number of VM's mmu pages to at least KVM_MIN_ALLOC_MMU_PAGES
which just hide the issue.
I saw the scenario kvm->arch.n_max_mmu_pages == 0 && kvm->arch.n_used_mmu_pages == 1,
so there is one active mmu page on the list, kvm_mmu_prepare_zap_page() fails
to zap any pages, however prepare_zap_oldest_mmu_page() always returns true.
It incurs infinite loop in make_mmu_pages_available() which causes mmu->lock
softlockup.
This patch fixes it by setting the return value of prepare_zap_oldest_mmu_page()
according to whether or not there is mmu page zapped.
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Let's reuse the function introduced with eptp switching.
We don't explicitly have to check against enable_ept_ad_bits, as this
is implicitly done when checking against nested_vmx_ept_caps in
valid_ept_address().
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
A Xen HVM guest running with KASLR enabled will die rather soon today
because the shared info page mapping is using va() too early. This was
introduced by commit a5d5f328b0e2baa5ee7c119fd66324eb79eeeb66 ("xen:
allocate page for shared info page from low memory").
In order to fix this use early_memremap() to get a temporary virtual
address for shared info until va() can be used safely.
Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
|
|
Instead of calling xen_hvm_init_shared_info() on boot and resume split
it up into a boot time function searching for the pfn to use and a
mapping function doing the hypervisor mapping call.
Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
|
|
Provide a hook in hypervisor_x86 called after setting up initial
memory mapping.
This is needed e.g. by Xen HVM guests to map the hypervisor shared
info page.
Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
|
|
The newly introduced function is only used when CONFIG_SMP is set:
arch/x86/kernel/cpu/amd.c:305:13: warning: 'legacy_fixup_core_id' defined but not used
This moves the existing #ifdef around the caller so it covers
legacy_fixup_core_id() as well.
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Emanuel Czirai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suravee Suthikulpanit <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Yazen Ghannam <[email protected]>
Fixes: b89b41d0b841 ("x86/cpu/amd: Limit cpu_core_id fixup to families older than F17h")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Mark a couple of structures and functions as 'static', pointed out by Sparse:
warning: symbol 'bts_pmu' was not declared. Should it be static?
warning: symbol 'p4_event_aliases' was not declared. Should it be static?
warning: symbol 'rapl_attr_groups' was not declared. Should it be static?
symbol 'process_uv2_message' was not declared. Should it be static?
Signed-off-by: Colin Ian King <[email protected]>
Acked-by: Andrew Banman <[email protected]> # for the UV change
Cc: Alexander Shishkin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
VMSAVE/VMLOAD" CPUID flag
"virtual_vmload_vmsave" is what is going to land in /proc/cpuinfo now
as per v4.13-rc4, for a single feature bit which is clearly too long.
So rename it to what it is called in the processor manual.
"v_vmsave_vmload" is a bit shorter, after all.
We could go more aggressively here but having it the same as in the
processor manual is advantageous.
Signed-off-by: Borislav Petkov <[email protected]>
Acked-by: Radim Krčmář <[email protected]>
Cc: Janakarajan Natarajan <[email protected]>
Cc: Jörg Rödel <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: kvm-ML <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
According to Intel 64 and IA-32 Architectures SDM, Volume 3,
Chapter 14.2, "Software needs to exercise care to avoid delays
between the two RDMSRs (for example interrupts)".
So, disable interrupts during reading MSRs IA32_APERF and IA32_MPERF.
See also: commit 4ab60c3f32c7 (cpufreq: intel_pstate: Disable
interrupts during MSRs reading).
Signed-off-by: Doug Smythies <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
Hyper-V host can suggest us to use hypercall for doing remote TLB flush,
this is supposed to work faster than IPIs.
Implementation details: to do HvFlushVirtualAddress{Space,List} hypercalls
we need to put the input somewhere in memory and we don't really want to
have memory allocation on each call so we pre-allocate per cpu memory areas
on boot.
pv_ops patching is happening very early so we need to separate
hyperv_setup_mmu_ops() and hyper_alloc_mmu().
It is possible and easy to implement local TLB flushing too and there is
even a hint for that. However, I don't see a room for optimization on the
host side as both hypercall and native tlb flush will result in vmexit. The
hint is also not set on modern Hyper-V versions.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
For systems with X86_FEATURE_TOPOEXT, current logic uses the APIC ID
to calculate shared_cpu_map. However, APIC IDs are not guaranteed to
be contiguous for cores across different L3s (e.g. family17h system
w/ downcore configuration). This breaks the logic, and results in an
incorrect L3 shared_cpu_map.
Instead, always use the previously calculated cpu_llc_shared_mask of
each CPU to derive the L3 shared_cpu_map.
Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Current cpu_core_id fixup causes downcored F17h configurations to be
incorrect:
NODE: 0
processor 0 core id : 0
processor 1 core id : 1
processor 2 core id : 2
processor 3 core id : 4
processor 4 core id : 5
processor 5 core id : 0
NODE: 1
processor 6 core id : 2
processor 7 core id : 3
processor 8 core id : 4
processor 9 core id : 0
processor 10 core id : 1
processor 11 core id : 2
Code that relies on the cpu_core_id, like match_smt(), for example,
which builds the thread siblings masks used by the scheduler, is
mislead.
So, limit the fixup to pre-F17h machines. The new value for cpu_core_id
for F17h and later will represent the CPUID_Fn8000001E_EBX[CoreId],
which is guaranteed to be unique for each core within a socket.
This way we have:
NODE: 0
processor 0 core id : 0
processor 1 core id : 1
processor 2 core id : 2
processor 3 core id : 4
processor 4 core id : 5
processor 5 core id : 6
NODE: 1
processor 6 core id : 8
processor 7 core id : 9
processor 8 core id : 10
processor 9 core id : 12
processor 10 core id : 13
processor 11 core id : 14
Signed-off-by: Suravee Suthikulpanit <[email protected]>
[ Heavily massaged. ]
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yazen Ghannam <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
So I was looking at text_poke_bp() today and I couldn't make sense of
the barriers there.
How's for something like so?
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Acked-by: Jiri Kosina <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Switching FS and GS is a mess, and the current code is still subtly
wrong: it assumes that "Loading a nonzero value into FS sets the
index and base", which is false on AMD CPUs if the value being
loaded is 1, 2, or 3.
(The current code came from commit 3e2b68d752c9 ("x86/asm,
sched/x86: Rewrite the FS and GS context switch code"), which made
it better but didn't fully fix it.)
Rewrite it to be much simpler and more obviously correct. This
should fix it fully on AMD CPUs and shouldn't adversely affect
performance.
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Chang Seok <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
In ELF_COPY_CORE_REGS, we're copying from the current task, so
accessing thread.fsbase and thread.gsbase makes no sense. Just read
the values from the CPU registers.
In practice, the old code would have been correct most of the time
simply because thread.fsbase and thread.gsbase usually matched the
CPU registers.
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Chang Seok <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
execve used to leak FSBASE and GSBASE on AMD CPUs. Fix it.
The security impact of this bug is small but not quite zero -- it
could weaken ASLR when a privileged task execs a less privileged
program, but only if program changed bitness across the exec, or the
child binary was highly unusual or actively malicious. A child
program that was compromised after the exec would not have access to
the leaked base.
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Chang Seok <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
A hang on CPU0 onlining after a preceding offlining is observed. Trace
shows that CPU0 is stuck in check_tsc_sync_target() waiting for source
CPU to run check_tsc_sync_source() but this never happens. Source CPU,
in its turn, is stuck on synchronize_sched() which is called from
native_cpu_up() -> do_boot_cpu() -> unregister_nmi_handler().
So it's a classic ABBA deadlock, due to the use of synchronize_sched() in
unregister_nmi_handler().
Fix the bug by moving unregister_nmi_handler() from do_boot_cpu() to
native_cpu_up() after cpu onlining is done.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
To support implementing remote TLB flushing on Hyper-V with a hypercall
we need to make vp_index available outside of vmbus module. Rename and
globalize.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Rep hypercalls are normal hypercalls which perform multiple actions at
once. Hyper-V guarantees to return exectution to the caller in not more
than 50us and the caller needs to use hypercall continuation. Touch NMI
watchdog between hypercall invocations.
This is going to be used for HvFlushVirtualAddressList hypercall for
remote TLB flushing.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Hyper-V supports 'fast' hypercalls when all parameters are passed through
registers. Implement an inline version of a simpliest of these calls:
hypercall with one 8-byte input and no output.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
We have only three call sites for hv_do_hypercall() and we're going to
change HVCALL_SIGNAL_EVENT to doing fast hypercall so we can inline this
function for optimization.
Hyper-V top level functional specification states that r9-r11 registers
and flags may be clobbered by the hypervisor during hypercall and with
inlining this is somewhat important, add the clobbers.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Code is arch/x86/hyperv/ is only needed when CONFIG_HYPERV is set, the
'basic' support and detection lives in arch/x86/kernel/cpu/mshyperv.c
which is included when CONFIG_HYPERVISOR_GUEST is set.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|