path: root/arch/x86
2017-08-17  Merge branches 'intel_pstate-fix' and 'cpufreq-x86-fix'  (Rafael J. Wysocki, 1 file, -0/+3)

* intel_pstate-fix:
  cpufreq: intel_pstate: report correct CPU frequencies during trace

* cpufreq-x86-fix:
  cpufreq: x86: Disable interrupts during MSRs reading
2017-08-17  x86/boot/KASLR: Prefer mirrored memory regions for the kernel physical address  (Baoquan He, 1 file, -2/+66)

Currently KASLR parses all e820 entries of RAM type and adds all candidate positions to the slots array. After that we choose one slot randomly as the new position at which the kernel will be decompressed and run.

On systems with EFI enabled, e820 memory regions come from EFI memory regions by combining adjacent regions. These EFI memory regions have various attributes, and the "mirrored" attribute is one of them. Physical memory regions whose descriptors in the EFI memory map have the EFI_MEMORY_MORE_RELIABLE attribute (bit 16) are mirrored. The address range mirroring feature of the kernel arranges such mirrored regions into normal zones and other regions into movable zones.

With the mirroring feature enabled, the code and data of the kernel can only be located in the more reliable mirrored regions. However, the current KASLR code doesn't check EFI memory entries, and could choose a new kernel position in a non-mirrored region. This would break the intended functionality of the address range mirroring feature.

To fix this, if EFI is detected, iterate over the EFI memory map and pick only mirrored regions when adding randomization slot candidates. If EFI is disabled or no mirrored region is found, fall back to processing the e820 memory map.

Signed-off-by: Baoquan He <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
[ Rewrote most of the text. ]
Signed-off-by: Ingo Molnar <[email protected]>
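A rough sketch of the shape of this change (assumed helper and field names, e.g. process_mem_region(); not the verbatim patch):

    /* Sketch: collect KASLR slot candidates from mirrored EFI regions.
     * Returns false if EFI is absent or no mirrored region was found,
     * in which case the caller falls back to the e820 map. */
    static bool process_efi_entries(unsigned long minimum,
                                    unsigned long image_size)
    {
            struct efi_info *e = &boot_params->efi_info;
            unsigned long pmap = e->efi_memmap;
            int i, nr_desc = e->efi_memmap_size / e->efi_memdesc_size;
            bool slots_found = false;

            for (i = 0; i < nr_desc; i++) {
                    efi_memory_desc_t *md =
                            efi_early_memdesc_ptr(pmap, e->efi_memdesc_size, i);

                    if (!(md->attribute & EFI_MEMORY_MORE_RELIABLE))
                            continue;       /* not mirrored: skip it */

                    /* Hypothetical helper: add slots within this region. */
                    process_mem_region(md->phys_addr,
                                       md->num_pages << EFI_PAGE_SHIFT,
                                       minimum, image_size);
                    slots_found = true;
            }
            return slots_found;
    }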
2017-08-17  efi: Introduce efi_early_memdesc_ptr to get pointer to memmap descriptor  (Baoquan He, 1 file, -1/+1)

The existing map iteration helper for_each_efi_memory_desc_in_map can only be used after the kernel initializes the EFI subsystem to set up struct efi_memory_map. Before that we also need to iterate map descriptors, which are stored in several intermediate structures, like struct efi_boot_memmap for arch-independent usage and struct efi_info for x86 only.

Introduce efi_early_memdesc_ptr() to get a pointer to a map descriptor, and replace the several places where that primitive is open coded.

Signed-off-by: Baoquan He <[email protected]>
[ Various improvements to the text. ]
Acked-by: Matt Fleming <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/20170816134651.GF21273@x1
Signed-off-by: Ingo Molnar <[email protected]>
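The helper itself amounts to descriptor-size-aware pointer arithmetic, along these lines (the descriptor size must come from the firmware, since it may exceed sizeof(efi_memory_desc_t)):

    /* Return a pointer to the n-th descriptor of an early EFI memory
     * map that is not yet wrapped in struct efi_memory_map. */
    #define efi_early_memdesc_ptr(map, desc_size, n)                    \
            (efi_memory_desc_t *)((void *)(map) + ((n) * (desc_size)))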
2017-08-17  Merge branch 'linus' into x86/boot, to pick up fixes  (Ingo Molnar, 53 files, -290/+744)
Signed-off-by: Ingo Molnar <[email protected]>
2017-08-17  locking/refcounts, x86/asm: Implement fast refcount overflow protection  (Kees Cook, 4 files, -0/+158)
This implements refcount_t overflow protection on x86 without a noticeable performance impact, though without the fuller checking of REFCOUNT_FULL. This is done by duplicating the existing atomic_t refcount implementation but with normally a single instruction added to detect if the refcount has gone negative (e.g. wrapped past INT_MAX or below zero). When detected, the handler saturates the refcount_t to INT_MIN / 2. With this overflow protection, the erroneous reference release that would follow a wrap back to zero is blocked from happening, avoiding the class of refcount-overflow use-after-free vulnerabilities entirely.

Only the overflow case of refcounting can be perfectly protected, since it can be detected and stopped before the reference is freed and left to be abused by an attacker. There isn't a way to block early decrements, and while REFCOUNT_FULL stops increment-from-zero cases (which would be the state _after_ an early decrement and stops potential double-free conditions), this fast implementation does not, since it would require the more expensive cmpxchg loops. Since the overflow case is much more common (e.g. missing a "put" during an error path), this protection provides real-world protection. For example, the two public refcount overflow use-after-free exploits published in 2016 would have been rendered unexploitable:

  http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/
  http://cyseclabs.com/page?n=02012016

This implementation does, however, notice an unchecked decrement to zero (i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it resulted in a zero). Decrements under zero are noticed (since they will have resulted in a negative value), though this only indicates that a use-after-free may have already happened. Such notifications are likely avoidable by an attacker that has already exploited a use-after-free vulnerability, but it's better to have them reported than allow such conditions to remain universally silent.

On first overflow detection, the refcount value is reset to INT_MIN / 2 (which serves as a saturation value) and a report and stack trace are produced. When operations detect only negative value results (such as changing an already saturated value), saturation still happens but no notification is performed (since the value was already saturated).

On the matter of races, since the entire range beyond INT_MAX but before 0 is negative, every operation at INT_MIN / 2 will trap, leaving no overflow-only race condition.

As for performance, this implementation adds a single "js" instruction to the regular execution flow of a copy of the standard atomic_t refcount operations. (The non-"and_test" refcount_dec() function, which is uncommon in regular refcount design patterns, has an additional "jz" instruction to detect reaching exactly zero.) Since this is a forward jump, it is by default the non-predicted path, which will be reinforced by dynamic branch prediction. The result is this protection having virtually no measurable change in performance over standard atomic_t operations.

The error path, located in .text.unlikely, saves the refcount location and then uses UD0 to fire a refcount exception handler, which resets the refcount, handles reporting, and returns to regular execution. This keeps the changes to .text size minimal, avoiding return jumps and open-coded calls to the error reporting routine.
Example assembly comparison:

refcount_inc() before:

  .text:
  ffffffff81546149:  f0 ff 45 f4          lock incl -0xc(%rbp)

refcount_inc() after:

  .text:
  ffffffff81546149:  f0 ff 45 f4          lock incl -0xc(%rbp)
  ffffffff8154614d:  0f 88 80 d5 17 00    js     ffffffff816c36d3
  ...
  .text.unlikely:
  ffffffff816c36d3:  48 8d 4d f4          lea    -0xc(%rbp),%rcx
  ffffffff816c36d7:  0f ff                (bad)

These are the cycle counts comparing a loop of refcount_inc() from 1 to INT_MAX and back down to 0 (via refcount_dec_and_test()), between unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL (refcount_t-full), and this overflow-protected refcount (refcount_t-fast):

2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s:

                     cycles          protections
  atomic_t            82249267387   none
  refcount_t-fast     82211446892   overflow, untested dec-to-zero
  refcount_t-full    144814735193   overflow, untested dec-to-zero, inc-from-zero

This code is a modified version of the x86 PAX_REFCOUNT atomic_t overflow defense from the last public patch of PaX/grsecurity, based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Thanks to PaX Team for various suggestions for improvement for repurposing this code to be a refcount-only protection.

Signed-off-by: Kees Cook <[email protected]>
Reviewed-by: Josh Poimboeuf <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Elena Reshetova <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Hans Liljestrand <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Manfred Spraul <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Serge E. Hallyn <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: linux-arch <[email protected]>
Link: http://lkml.kernel.org/r/20170815161924.GA133115@beast
Signed-off-by: Ingo Molnar <[email protected]>
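In C terms, the semantics of the fast path are roughly the following (an illustrative sketch only; the real implementation is the asm above, with saturation and reporting done in the exception handler):

    /* Sketch: semantics of the protected refcount_inc() fast path. */
    static inline void refcount_inc_sketch(refcount_t *r)
    {
            /* One extra conditional ("js" in the real asm): any result
             * with the sign bit set means we wrapped past INT_MAX or
             * went below zero. */
            if (unlikely(atomic_inc_return(&r->refs) < 0)) {
                    /* Saturate: INT_MIN / 2 keeps every further op
                     * negative, so the counter can never reach zero. */
                    atomic_set(&r->refs, INT_MIN / 2);
                    WARN_ONCE(1, "refcount overflow detected");
            }
    }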
2017-08-17  x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages  (Tony Luck, 2 files, -0/+47)

Speculative processor accesses may reference any memory that has a valid page table entry. While a speculative access won't generate a machine check, it will log the error in a machine check bank. That could cause escalation of a subsequent error, since the overflow bit will then be set in the machine check bank status register.

Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual address of the page we want to map out, otherwise we may trigger the very problem we are trying to avoid. We use a non-canonical address that passes through the usual Linux table walking code to get to the same "pte".

Thanks to Dave Hansen for reviewing several iterations of this.

Also see: http://marc.info/?l=linux-mm&m=149860136413338&w=2

Signed-off-by: Tony Luck <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Elliott, Robert (Persistent Memory) <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
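The trick looks roughly like this (a sketch from the description above; it assumes an unused virtual-address bit, here bit 63, and a hypothetical function name):

    /* Sketch: unmap the 1:1 mapping of a poisoned pfn without ever
     * materializing its canonical 1:1 virtual address in a register. */
    static void unmap_poisoned_kpfn(unsigned long pfn)
    {
            /* Flip an unused high bit so the address is non-canonical,
             * yet walks down to the same PTE as the real 1:1 address. */
            unsigned long decoy_addr =
                    (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));

            if (set_memory_np(decoy_addr, 1))
                    pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n",
                            pfn);
    }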
2017-08-17  x86/build: Fix stack alignment for CLang  (Matthias Kaehlcke, 1 file, -6/+8)

Commit:

  d77698df39a5 ("x86/build: Specify stack alignment for clang")

intended to use the same stack alignment for clang as with gcc. The two compilers use different options to configure the stack alignment (gcc: -mpreferred-stack-boundary=n, clang: -mstack-alignment=n).

The above commit assumes that the clang option uses the same parameter type as gcc, i.e. that the alignment is specified as 2^n. However clang interprets the value of this option literally and uses an alignment of n; in consequence the stack remains misaligned.

Change the values used with -mstack-alignment to be the actual alignment instead of a power of two.

cc-option isn't used here with the typical pattern of KBUILD_CFLAGS += $(call cc-option ...). The reason is that older gcc versions don't support the -mpreferred-stack-boundary option; since cc-option doesn't verify whether the alternative option is valid, it would incorrectly select the clang option -mstack-alignment.

Signed-off-by: Matthias Kaehlcke <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: [email protected]
Cc: Greg Hackmann <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Michael Davidson <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephen Hines <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2017-08-17  x86/boot/64/clang: Use fixup_pointer() to access 'next_early_pgt'  (Alexander Potapenko, 1 file, -3/+4)

__startup_64() normally uses fixup_pointer() to access globals in a position-independent fashion. However, 'next_early_pgt' was accessed directly, which wasn't guaranteed to work.

Luckily GCC was generating a R_X86_64_PC32 PC-relative relocation for 'next_early_pgt', but Clang emitted a R_X86_64_32S, which led to accessing invalid memory and rebooting the kernel.

Signed-off-by: Alexander Potapenko <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Davidson <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: c88d71508e36 ("x86/boot/64: Rewrite startup_64() in C")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
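Shape of the fix, as a sketch (the surrounding code is the early page-table setup in arch/x86/kernel/head64.c; variable names approximate):

    /* Before (sketch): direct access relies on the compiler emitting a
     * PC-relative relocation, which Clang does not guarantee here. */
    pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);

    /* After: compute the global's runtime address explicitly. */
    next_pgt_ptr = fixup_pointer(&next_early_pgt, physaddr);
    pud = fixup_pointer(early_dynamic_pgts[(*next_pgt_ptr)++], physaddr);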
2017-08-17  Merge branch 'linus' into perf/core, to pick up fixes  (Ingo Molnar, 5 files, -54/+87)
Signed-off-by: Ingo Molnar <[email protected]>
2017-08-16  x86/nmi: Use raw lock  (Scott Wood, 1 file, -9/+9)
register_nmi_handler() can be called from PREEMPT_RT atomic context (e.g. wakeup_cpu_via_init_nmi() or native_stop_other_cpus()), and thus ordinary spinlocks cannot be used. Signed-off-by: Scott Wood <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Don Zickus <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-08-16  x86/elf: Remove the unnecessary ADDR_NO_RANDOMIZE checks  (Oleg Nesterov, 1 file, -2/+1)
The ADDR_NO_RANDOMIZE checks in stack_maxrandom_size() and randomize_stack_top() are not required. PF_RANDOMIZE is set by load_elf_binary() only if ADDR_NO_RANDOMIZE is not set, no need to re-check after that. Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Dmitry Safonov <[email protected]> Cc: [email protected] Cc: Andy Lutomirski <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-08-16  x86: Fix norandmaps/ADDR_NO_RANDOMIZE  (Oleg Nesterov, 1 file, -2/+2)

Documentation/admin-guide/kernel-parameters.txt says:

  norandmaps    Don't use address space randomization. Equivalent
                to echo 0 > /proc/sys/kernel/randomize_va_space

but it doesn't work, because arch_rnd(), which is used to randomize mm->mmap_base, returns a random value unconditionally. And as Kirill pointed out, ADDR_NO_RANDOMIZE is broken for the same reason.

Just shift the PF_RANDOMIZE check from arch_mmap_rnd() to arch_rnd().

Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Cyrill Gorcunov <[email protected]>
Reviewed-by: Dmitry Safonov <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
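The fix amounts to moving the check one level down, roughly (a sketch of arch/x86/mm/mmap.c after the change):

    static unsigned long arch_rnd(unsigned int rndbits)
    {
            /* Honor norandmaps and ADDR_NO_RANDOMIZE for both the
             * native and the compat mmap base. */
            if (!(current->flags & PF_RANDOMIZE))
                    return 0;
            return (get_random_long() & ((1UL << rndbits) - 1)) << PAGE_SHIFT;
    }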
2017-08-16  x86/intel_rdt: Remove redundant ternary operator on return  (Colin Ian King, 1 file, -1/+1)
The use of the ternary operator is redundant as ret can never be non-zero at that point. Instead, just return nbytes. Detected by CoverityScan, CID#1452658 ("Logically dead code") Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Vikas Shivappa <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected]
2017-08-16  x86/intel_rdt/cqm: Improve limbo list processing  (Vikas Shivappa, 3 files, -122/+133)

During a mkdir, the entire limbo list is synchronously checked on each package for free RMIDs by sending IPIs. With a large number of RMIDs (SKL has 192) this creates an intolerable amount of work in IPIs.

Replace the IPI based checking of the limbo list with asynchronous worker threads on each package which periodically scan the limbo list and move the RMIDs that have:

  llc_occupancy < threshold_occupancy

on all packages to the free list.

mkdir now returns -ENOSPC if both the free list and the limbo list are empty, or returns -EBUSY if there are RMIDs on the limbo list and the free list is empty.

Getting rid of the IPIs also simplifies the data structures and the serialization required for handling the lists; see the sketch below.

[ tglx: Rewrote changelog ... ]

Signed-off-by: Vikas Shivappa <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
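In rough terms, each domain (package) gets a delayed work item instead of being IPI'd at mkdir time. A sketch with assumed names (the helper signatures and the rescan interval constant are illustrative, not the exact upstream code):

    /* Sketch: per-package worker that retires limbo RMIDs. */
    static void cqm_handle_limbo(struct work_struct *work)
    {
            struct rdt_domain *d = container_of(to_delayed_work(work),
                                                struct rdt_domain, cqm_limbo);

            mutex_lock(&rdtgroup_mutex);
            __check_limbo(d);               /* move quiet RMIDs to free list */
            if (has_busy_rmid(d))           /* anything still above threshold? */
                    schedule_delayed_work_on(cpumask_any(&d->cpu_mask),
                                             &d->cqm_limbo,
                                             CQM_LIMBOCHECK_INTERVAL);
            mutex_unlock(&rdtgroup_mutex);
    }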
2017-08-16  x86/intel_rdt/mbm: Fix MBM overflow handler during CPU hotplug  (Vikas Shivappa, 4 files, -6/+6)

When a CPU is dying, the overflow worker is canceled and rescheduled on a different CPU in the same domain. But if the timer is already about to expire, this essentially doubles the interval, which might result in an undetected overflow.

Cancel the overflow worker and reschedule it immediately on a different CPU in the same domain. The work could be flushed as well, but that would reschedule it on the same CPU.

[ tglx: Rewrote changelog once again ]

Reported-by: Thomas Gleixner <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
2017-08-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller, 5 files, -54/+87)
2017-08-15  x86/mtrr: Prevent CPU hotplug lock recursion  (Thomas Gleixner, 1 file, -3/+15)

Larry reported a CPU hotplug lock recursion in the MTRR code.

  ============================================
  WARNING: possible recursive locking detected
  systemd-udevd/153 is trying to acquire lock:
    (cpu_hotplug_lock.rw_sem){.+.+.+}, at: [<c030fc26>] stop_machine+0x16/0x30

  but task is already holding lock:
    (cpu_hotplug_lock.rw_sem){.+.+.+}, at: [<c0234353>] mtrr_add_page+0x83/0x470

  ....

   cpus_read_lock+0x48/0x90
   stop_machine+0x16/0x30
   mtrr_add_page+0x18b/0x470
   mtrr_add+0x3e/0x70

mtrr_add_page() holds the hotplug rwsem already and calls stop_machine() which acquires it again. Call stop_machine_cpuslocked() instead.

Reported-and-tested-by: Larry Finger <[email protected]>
Reported-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708140920250.1865@nanos
Cc: "Paul E. McKenney" <[email protected]>
Cc: Borislav Petkov <[email protected]>
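The fix in sketch form (handler and argument names approximate the MTRR rendezvous code; simplified):

    /* Before: recursive acquisition of cpu_hotplug_lock. */
    stop_machine(mtrr_rendezvous_handler, &data, cpu_online_mask);

    /* After: the caller already holds the hotplug lock, so use the
     * variant that does not take it again. */
    stop_machine_cpuslocked(mtrr_rendezvous_handler, &data, cpu_online_mask);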
2017-08-15  x86/xen/64: Fix the reported SS and CS in SYSCALL  (Andy Lutomirski, 1 file, -0/+18)
When I cleaned up the Xen SYSCALL entries, I inadvertently changed the reported segment registers. Before my patch, regs->ss was __USER(32)_DS and regs->cs was __USER(32)_CS. After the patch, they are FLAT_USER_CS/DS(32). This had a couple unfortunate effects. It confused the opportunistic fast return logic. It also significantly increased the risk of triggering a nasty glibc bug: https://sourceware.org/bugzilla/show_bug.cgi?id=21269 Update the Xen entry code to change it back. Reported-by: Brian Gerst <[email protected]> Signed-off-by: Andy Lutomirski <[email protected]> Cc: Andrew Cooper <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Fixes: 8a9949bc71a7 ("x86/xen/64: Rearrange the SYSCALL entries") Link: http://lkml.kernel.org/r/daba8351ea2764bb30272296ab9ce08a81bd8264.1502775273.git.luto@kernel.org Signed-off-by: Ingo Molnar <[email protected]>
2017-08-15  Backmerge tag 'v4.13-rc5' into drm-next  (Dave Airlie, 22 files, -168/+397)

Linux 4.13-rc5

There's a really nasty nouveau collision; hopefully someone can take a look once I've pushed this out.
2017-08-14  Merge 4.13-rc5 into char-misc-next  (Greg Kroah-Hartman, 22 files, -168/+397)
We want the firmware, and other changes, in here as well. Signed-off-by: Greg Kroah-Hartman <[email protected]>
2017-08-14  Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6  (Linus Torvalds, 2 files, -32/+37)

Pull crypto fixes from Herbert Xu:
 "Fix an error path bug in ixp4xx as well as a read overrun in sha1-avx2"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: x86/sha1 - Fix reads beyond the number of blocks passed
  crypto: ixp4xx - Fix error handling path in 'aead_perform()'
2017-08-14  x86/intel_rdt: Modify the intel_pqr_state for better performance  (Vikas Shivappa, 3 files, -24/+26)
Currently we have pqr_state and rdt_default_state which store the cached CLOSID/RMIDs and the user configured cpu default values respectively. We touch both of these during context switch. Put all of them in one structure so that we can spare a cache line. Reported-by: Thomas Gleixner <[email protected]> Signed-off-by: Vikas Shivappa <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected]
2017-08-14  x86/intel_rdt/cqm: Clear the default RMID during hotcpu  (Vikas Shivappa, 1 file, -0/+1)
The user configured per cpu default RMID is not cleared during cpu hotplug. This may lead to incorrect RMID values after a cpu goes offline and again comes back online. Clear the per cpu default RMID during cpu offline and online handling. Reported-by: Prakyha Sai Praneeth <[email protected]> Signed-off-by: Vikas Shivappa <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected]
2017-08-13  gpu: drm: tc35876x: move header file out of I2C realm  (Wolfram Sang, 1 file, -1/+1)
include/linux/i2c is not for client devices. Move the header file to a more appropriate location. Signed-off-by: Wolfram Sang <[email protected]> Acked-by: Patrik Jakobsson <[email protected]>
2017-08-12  Merge tag 'for-linus-4.13b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip  (Linus Torvalds, 3 files, -22/+50)

Pull xen fixes from Juergen Gross:
 "Some fixes for Xen:

  - a fix for a regression introduced in 4.13 for a Xen HVM-guest
    configured with KASLR

  - a fix for a possible deadlock in the xenbus driver when booting
    the system

  - a fix for lost interrupts in Xen guests"

* tag 'for-linus-4.13b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/events: Fix interrupt lost during irq_disable and irq_enable
  xen: avoid deadlock in xenbus
  xen: fix hvm guest with kaslr enabled
  xen: split up xen_hvm_init_shared_info()
  x86: provide an init_mem_mapping hypervisor hook
2017-08-11  kvm: x86: Disallow illegal IA32_APIC_BASE MSR values  (Jim Mattson, 1 file, -6/+8)
Host-initiated writes to the IA32_APIC_BASE MSR do not have to follow local APIC state transition constraints, but the value written must be valid. Signed-off-by: Jim Mattson <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2017-08-11  KVM: MMU: Bail out immediately if there is no available mmu page  (Wanpeng Li, 2 files, -10/+29)

Bail out immediately if there is no available mmu page to allocate.

Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2017-08-11  KVM: MMU: Fix softlockup due to mmu_lock is held too long  (Wanpeng Li, 1 file, -3/+1)

  watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [warn_test:3089]
  irq event stamp: 20532
  hardirqs last  enabled at (20531): [<ffffffff8e9b6908>] restore_regs_and_iret+0x0/0x1d
  hardirqs last disabled at (20532): [<ffffffff8e9b7ae8>] apic_timer_interrupt+0x98/0xb0
  softirqs last  enabled at (8266):  [<ffffffff8e9badc6>] __do_softirq+0x206/0x4c1
  softirqs last disabled at (8253):  [<ffffffff8e083918>] irq_exit+0xf8/0x100
  CPU: 5 PID: 3089 Comm: warn_test Tainted: G  OE  4.13.0-rc3+ #8
  RIP: 0010:kvm_mmu_prepare_zap_page+0x72/0x4b0 [kvm]
  Call Trace:
   make_mmu_pages_available.isra.120+0x71/0xc0 [kvm]
   kvm_mmu_load+0x1cf/0x410 [kvm]
   kvm_arch_vcpu_ioctl_run+0x1316/0x1bf0 [kvm]
   kvm_vcpu_ioctl+0x340/0x700 [kvm]
   ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
   ? __fget+0xfc/0x210
   do_vfs_ioctl+0xa4/0x6a0
   ? __fget+0x11d/0x210
   SyS_ioctl+0x79/0x90
   entry_SYSCALL_64_fastpath+0x23/0xc2
   ? __this_cpu_preempt_check+0x13/0x20

This can be reproduced readily with ept=N by running syzkaller tests, since many syzkaller testcases don't set up any memory regions. However, if ept=Y, an rmode identity map will be created, and kvm_mmu_calculate_mmu_pages() will then extend the number of the VM's mmu pages to at least KVM_MIN_ALLOC_MMU_PAGES, which just hides the issue.

I saw the scenario kvm->arch.n_max_mmu_pages == 0 && kvm->arch.n_used_mmu_pages == 1, so there is one active mmu page on the list; kvm_mmu_prepare_zap_page() fails to zap any pages, yet prepare_zap_oldest_mmu_page() always returns true. This incurs an infinite loop in make_mmu_pages_available(), which causes the mmu->lock softlockup.

This patch fixes it by setting the return value of prepare_zap_oldest_mmu_page() according to whether or not a mmu page was zapped.

Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
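A sketch of the fixed helper (simplified; the point is propagating whether kvm_mmu_prepare_zap_page() actually zapped anything instead of returning true unconditionally):

    /* Sketch: report whether anything was actually zapped so the
     * caller's loop in make_mmu_pages_available() can terminate. */
    static bool prepare_zap_oldest_mmu_page(struct kvm *kvm,
                                            struct list_head *invalid_list)
    {
            struct kvm_mmu_page *sp;

            if (list_empty(&kvm->arch.active_mmu_pages))
                    return false;

            sp = list_last_entry(&kvm->arch.active_mmu_pages,
                                 struct kvm_mmu_page, link);
            return kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
    }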
2017-08-11  KVM: nVMX: validate eptp pointer  (David Hildenbrand, 1 file, -5/+2)
Let's reuse the function introduced with eptp switching. We don't explicitly have to check against enable_ept_ad_bits, as this is implicitly done when checking against nested_vmx_ept_caps in valid_ept_address(). Signed-off-by: David Hildenbrand <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2017-08-11  xen: fix hvm guest with kaslr enabled  (Juergen Gross, 1 file, -2/+14)
A Xen HVM guest running with KASLR enabled will die rather soon today because the shared info page mapping is using va() too early. This was introduced by commit a5d5f328b0e2baa5ee7c119fd66324eb79eeeb66 ("xen: allocate page for shared info page from low memory"). In order to fix this use early_memremap() to get a temporary virtual address for shared info until va() can be used safely. Signed-off-by: Juergen Gross <[email protected]> Reviewed-by: Boris Ostrovsky <[email protected]> Acked-by: Ingo Molnar <[email protected]> Signed-off-by: Juergen Gross <[email protected]>
2017-08-11  xen: split up xen_hvm_init_shared_info()  (Juergen Gross, 1 file, -21/+24)
Instead of calling xen_hvm_init_shared_info() on boot and resume split it up into a boot time function searching for the pfn to use and a mapping function doing the hypervisor mapping call. Signed-off-by: Juergen Gross <[email protected]> Reviewed-by: Boris Ostrovsky <[email protected]> Acked-by: Ingo Molnar <[email protected]> Signed-off-by: Juergen Gross <[email protected]>
2017-08-11  x86: provide an init_mem_mapping hypervisor hook  (Juergen Gross, 2 files, -0/+13)
Provide a hook in hypervisor_x86 called after setting up initial memory mapping. This is needed e.g. by Xen HVM guests to map the hypervisor shared info page. Signed-off-by: Juergen Gross <[email protected]> Reviewed-by: Boris Ostrovsky <[email protected]> Acked-by: Ingo Molnar <[email protected]> Signed-off-by: Juergen Gross <[email protected]>
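The hook follows the existing hypervisor_x86 callback pattern; roughly (a sketch of the two pieces, with the call placed at the end of init_mem_mapping()):

    /* New optional callback in the hypervisor descriptor. */
    struct hypervisor_x86 {
            const char *name;
            /* ... existing callbacks ... */
            void (*init_mem_mapping)(void);
    };

    /* Called once the initial memory mapping has been set up. */
    static inline void hypervisor_init_mem_mapping(void)
    {
            if (x86_hyper && x86_hyper->init_mem_mapping)
                    x86_hyper->init_mem_mapping();
    }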
2017-08-11  x86/cpu/amd: Hide unused legacy_fixup_core_id() function  (Arnd Bergmann, 1 file, -1/+1)

The newly introduced function is only used when CONFIG_SMP is set:

  arch/x86/kernel/cpu/amd.c:305:13: warning: 'legacy_fixup_core_id' defined but not used

This moves the existing #ifdef around the caller so it covers legacy_fixup_core_id() as well.

Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Emanuel Czirai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suravee Suthikulpanit <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Yazen Ghannam <[email protected]>
Fixes: b89b41d0b841 ("x86/cpu/amd: Limit cpu_core_id fixup to families older than F17h")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2017-08-11  x86: Mark various structures and functions as 'static'  (Colin Ian King, 4 files, -5/+5)

Mark a couple of structures and functions as 'static', pointed out by Sparse:

  warning: symbol 'bts_pmu' was not declared. Should it be static?
  warning: symbol 'p4_event_aliases' was not declared. Should it be static?
  warning: symbol 'rapl_attr_groups' was not declared. Should it be static?
  symbol 'process_uv2_message' was not declared. Should it be static?

Signed-off-by: Colin Ian King <[email protected]>
Acked-by: Andrew Banman <[email protected]> # for the UV change
Cc: Alexander Shishkin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2017-08-11  x86/cpufeature, kvm/svm: Rename (shorten) the new "virtualized VMSAVE/VMLOAD" CPUID flag  (Borislav Petkov, 2 files, -2/+2)

"virtual_vmload_vmsave" is what is going to land in /proc/cpuinfo now as per v4.13-rc4, for a single feature bit, which is clearly too long. So rename it to what it is called in the processor manual: "v_vmsave_vmload" is a bit shorter, after all.

We could go more aggressively here but having it the same as in the processor manual is advantageous.

Signed-off-by: Borislav Petkov <[email protected]>
Acked-by: Radim Krčmář <[email protected]>
Cc: Janakarajan Natarajan <[email protected]>
Cc: Jörg Rödel <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: kvm-ML <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2017-08-11  cpufreq: x86: Disable interrupts during MSRs reading  (Doug Smythies, 1 file, -0/+3)

According to the Intel 64 and IA-32 Architectures SDM, Volume 3, Chapter 14.2, "Software needs to exercise care to avoid delays between the two RDMSRs (for example interrupts)".

So, disable interrupts while reading the IA32_APERF and IA32_MPERF MSRs.

See also: commit 4ab60c3f32c7 (cpufreq: intel_pstate: Disable interrupts during MSRs reading).

Signed-off-by: Doug Smythies <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
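The pattern is simply to keep the two reads back to back with interrupts off; a sketch (the real change lives in the aperfmperf sampling code):

    /* Sketch: sample APERF and MPERF with no window for an interrupt
     * to delay the second read relative to the first. */
    static void sample_aperf_mperf(u64 *aperf, u64 *mperf)
    {
            unsigned long flags;

            local_irq_save(flags);
            rdmsrl(MSR_IA32_APERF, *aperf);
            rdmsrl(MSR_IA32_MPERF, *mperf);
            local_irq_restore(flags);
    }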
2017-08-10  x86/hyper-v: Use hypercall for remote TLB flush  (Vitaly Kuznetsov, 6 files, -1/+152)

The Hyper-V host can suggest that we use a hypercall for doing remote TLB flushes; this is supposed to work faster than IPIs.

Implementation details: to do HvFlushVirtualAddress{Space,List} hypercalls we need to put the input somewhere in memory, and we don't really want to do a memory allocation on each call, so we pre-allocate per-cpu memory areas on boot.

pv_ops patching happens very early, so we need to separate hyperv_setup_mmu_ops() and hyper_alloc_mmu().

It is possible and easy to implement local TLB flushing too, and there is even a hint for that. However, I don't see room for optimization on the host side, as both a hypercall and a native TLB flush will result in a vmexit. The hint is also not set on modern Hyper-V versions.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2017-08-10  x86/cpu/amd: Derive L3 shared_cpu_map from cpu_llc_shared_mask  (Suravee Suthikulpanit, 1 file, -14/+18)
For systems with X86_FEATURE_TOPOEXT, current logic uses the APIC ID to calculate shared_cpu_map. However, APIC IDs are not guaranteed to be contiguous for cores across different L3s (e.g. family17h system w/ downcore configuration). This breaks the logic, and results in an incorrect L3 shared_cpu_map. Instead, always use the previously calculated cpu_llc_shared_mask of each CPU to derive the L3 shared_cpu_map. Signed-off-by: Suravee Suthikulpanit <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-08-10  x86/cpu/amd: Limit cpu_core_id fixup to families older than F17h  (Suravee Suthikulpanit, 1 file, -7/+17)

Current cpu_core_id fixup causes downcored F17h configurations to be incorrect:

  NODE: 0
  processor  0 core id : 0
  processor  1 core id : 1
  processor  2 core id : 2
  processor  3 core id : 4
  processor  4 core id : 5
  processor  5 core id : 0

  NODE: 1
  processor  6 core id : 2
  processor  7 core id : 3
  processor  8 core id : 4
  processor  9 core id : 0
  processor 10 core id : 1
  processor 11 core id : 2

Code that relies on the cpu_core_id, like match_smt(), for example, which builds the thread siblings masks used by the scheduler, is misled.

So, limit the fixup to pre-F17h machines. The new value for cpu_core_id for F17h and later will represent the CPUID_Fn8000001E_EBX[CoreId], which is guaranteed to be unique for each core within a socket.

This way we have:

  NODE: 0
  processor  0 core id : 0
  processor  1 core id : 1
  processor  2 core id : 2
  processor  3 core id : 4
  processor  4 core id : 5
  processor  5 core id : 6

  NODE: 1
  processor  6 core id : 8
  processor  7 core id : 9
  processor  8 core id : 10
  processor  9 core id : 12
  processor 10 core id : 13
  processor 11 core id : 14

Signed-off-by: Suravee Suthikulpanit <[email protected]>
[ Heavily massaged. ]
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yazen Ghannam <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
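In sketch form, the family split looks like this (simplified from the AMD topology detection path; CoreId is bits 7:0 of CPUID_Fn8000001E_EBX):

    /* Sketch: on F17h and later, take the core ID straight from CPUID
     * instead of applying the legacy fixup. */
    if (c->x86 >= 0x17) {
            u32 ebx = cpuid_ebx(0x8000001e);

            c->cpu_core_id = ebx & 0xff;    /* CPUID_Fn8000001E_EBX[CoreId] */
    } else {
            legacy_fixup_core_id(c);
    }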
2017-08-10  x86: Clarify/fix no-op barriers for text_poke_bp()  (Peter Zijlstra, 1 file, -6/+16)
So I was looking at text_poke_bp() today and I couldn't make sense of the barriers there. How's for something like so? Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Acked-by: Jiri Kosina <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-08-10  x86/switch_to/64: Rewrite FS/GS switching yet again to fix AMD CPUs  (Andy Lutomirski, 1 file, -105/+122)
Switching FS and GS is a mess, and the current code is still subtly wrong: it assumes that "Loading a nonzero value into FS sets the index and base", which is false on AMD CPUs if the value being loaded is 1, 2, or 3. (The current code came from commit 3e2b68d752c9 ("x86/asm, sched/x86: Rewrite the FS and GS context switch code"), which made it better but didn't fully fix it.) Rewrite it to be much simpler and more obviously correct. This should fix it fully on AMD CPUs and shouldn't adversely affect performance. Signed-off-by: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Chang Seok <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-08-10  x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps  (Andy Lutomirski, 1 file, -2/+3)
In ELF_COPY_CORE_REGS, we're copying from the current task, so accessing thread.fsbase and thread.gsbase makes no sense. Just read the values from the CPU registers. In practice, the old code would have been correct most of the time simply because thread.fsbase and thread.gsbase usually matched the CPU registers. Signed-off-by: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Chang Seok <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-08-10  x86/fsgsbase/64: Fully initialize FS and GS state in start_thread_common  (Andy Lutomirski, 1 file, -0/+9)

execve used to leak FSBASE and GSBASE on AMD CPUs. Fix it.

The security impact of this bug is small but not quite zero: it could weaken ASLR when a privileged task execs a less privileged program, but only if the program changed bitness across the exec, or if the child binary was highly unusual or actively malicious. A child program that was compromised after the exec would not have access to the leaked base.

Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Chang Seok <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2017-08-10  x86/smpboot: Unbreak CPU0 hotplug  (Vitaly Kuznetsov, 1 file, -13/+17)
A hang on CPU0 onlining after a preceding offlining is observed. Trace shows that CPU0 is stuck in check_tsc_sync_target() waiting for source CPU to run check_tsc_sync_source() but this never happens. Source CPU, in its turn, is stuck on synchronize_sched() which is called from native_cpu_up() -> do_boot_cpu() -> unregister_nmi_handler(). So it's a classic ABBA deadlock, due to the use of synchronize_sched() in unregister_nmi_handler(). Fix the bug by moving unregister_nmi_handler() from do_boot_cpu() to native_cpu_up() after cpu onlining is done. Signed-off-by: Vitaly Kuznetsov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-08-10  hyper-v: Globalize vp_index  (Vitaly Kuznetsov, 2 files, -1/+57)
To support implementing remote TLB flushing on Hyper-V with a hypercall we need to make vp_index available outside of vmbus module. Rename and globalize. Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Reviewed-by: Stephen Hemminger <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Jork Loeser <[email protected]> Cc: K. Y. Srinivasan <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Simon Xiao <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-08-10  x86/hyper-v: Implement rep hypercalls  (Vitaly Kuznetsov, 1 file, -0/+39)

Rep hypercalls are normal hypercalls which perform multiple actions at once. Hyper-V guarantees to return execution to the caller in no more than 50us, and the caller needs to use hypercall continuation. Touch the NMI watchdog between hypercall invocations.

This is going to be used for the HvFlushVirtualAddressList hypercall for remote TLB flushing.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
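A sketch of the continuation loop (field offsets and mask names follow the Hyper-V TLFS hypercall input/output layout, but treat them as assumptions; the "rep start" field is advanced to the count of already-completed reps returned in the status):

    /* Sketch: issue a rep hypercall, resuming until all rep_count
     * elements are processed; Hyper-V returns to us every <= 50us. */
    static u64 hv_do_rep_hypercall_sketch(u16 code, u16 rep_count,
                                          void *input, void *output)
    {
            u64 control = (u64)code |
                          ((u64)rep_count << HV_HYPERCALL_REP_COMP_OFFSET);
            u64 status;
            u16 rep_comp;

            do {
                    status = hv_do_hypercall(control, input, output);
                    if ((status & HV_HYPERCALL_RESULT_MASK) != HV_STATUS_SUCCESS)
                            break;

                    /* Completed reps so far; resume from there. */
                    rep_comp = (status & HV_HYPERCALL_REP_COMP_MASK) >>
                               HV_HYPERCALL_REP_COMP_OFFSET;
                    control &= ~HV_HYPERCALL_REP_START_MASK;
                    control |= (u64)rep_comp << HV_HYPERCALL_REP_START_OFFSET;

                    touch_nmi_watchdog();   /* long series of continuations */
            } while (rep_comp < rep_count);

            return status;
    }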
2017-08-10  x86/hyper-v: Introduce fast hypercall implementation  (Vitaly Kuznetsov, 1 file, -0/+34)

Hyper-V supports 'fast' hypercalls when all parameters are passed through registers. Implement an inline version of the simplest of these calls: a hypercall with one 8-byte input and no output.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
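A simplified x86_64-only sketch of such a call (register assignments per the Hyper-V calling convention: control word in RCX, the single input in RDX, status back in RAX; the exact asm constraints in the kernel differ slightly):

    /* Sketch: 'fast' hypercall with one 8-byte input, no output, and
     * no hypercall input/output pages involved. */
    static inline u64 hv_do_fast_hypercall8_sketch(u16 code, u64 input1)
    {
            u64 status, control = (u64)code | HV_HYPERCALL_FAST_BIT;

            __asm__ __volatile__("call *%3"
                                 : "=a" (status), "+c" (control), "+d" (input1)
                                 : "m" (hv_hypercall_pg)
                                 : "cc", "memory", "r8", "r9", "r10", "r11");
            return status;
    }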
2017-08-10  x86/hyper-v: Make hv_do_hypercall() inline  (Vitaly Kuznetsov, 2 files, -49/+45)

We have only three call sites for hv_do_hypercall(), and we're going to change HVCALL_SIGNAL_EVENT to a fast hypercall, so we can inline this function for optimization.

The Hyper-V top level functional specification states that the r9-r11 registers and flags may be clobbered by the hypervisor during a hypercall; with inlining this is somewhat important, so add the clobbers.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2017-08-10  x86/hyper-v: Include hyperv/ only when CONFIG_HYPERV is set  (Vitaly Kuznetsov, 2 files, -2/+7)

Code in arch/x86/hyperv/ is only needed when CONFIG_HYPERV is set; the 'basic' support and detection lives in arch/x86/kernel/cpu/mshyperv.c, which is included when CONFIG_HYPERVISOR_GUEST is set.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2017-08-10  Merge branch 'linus' into x86/platform, to pick up fixes  (Ingo Molnar, 19 files, -146/+347)
Signed-off-by: Ingo Molnar <[email protected]>