path: root/arch
Age         Commit message  (Author, files changed, -deleted/+added lines)
2017-02-03  KVM: MIPS: Add vcpu_run() & vcpu_reenter() callbacks  (James Hogan, 3 files, -41/+52)
Add implementation callbacks for entering the guest (vcpu_run()) and reentering the guest (vcpu_reenter()), allowing implementation specific operations to be performed before entering the guest or after returning to the host without cluttering kvm_arch_vcpu_ioctl_run(). This allows the T&E specific lazy user GVA flush to be moved into trap_emul.c, along with disabling of the HTW. We also move kvm_mips_deliver_interrupts() as VZ will need to restore the guest timer state prior to delivering interrupts. Signed-off-by: James Hogan <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: "Radim Krčmář" <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: [email protected] Cc: [email protected]
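A hedged sketch of the callback shape this commit describes; the struct below is a cut-down illustration, not the actual MIPS KVM header, and the signatures follow the kvm_run/kvm_vcpu convention of that era as an assumption:

```c
/* Illustrative sketch only: the real callback struct has many more members. */
struct kvm_run;
struct kvm_vcpu;

struct kvm_mips_callbacks_sketch {
	/* called once before entering the guest */
	int  (*vcpu_run)(struct kvm_run *run, struct kvm_vcpu *vcpu);
	/* called each time execution re-enters the guest after an exit */
	void (*vcpu_reenter)(struct kvm_run *run, struct kvm_vcpu *vcpu);
};
```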
2017-02-03  KVM: MIPS: Remove duplicated ASIDs from vcpu  (James Hogan, 7 files, -48/+44)
The kvm_vcpu_arch structure contains both mm_structs for allocating MMU contexts (primarily the ASID), but it also copies the resulting ASIDs into guest_{user,kernel}_asid[] arrays which are referenced from uasm generated code. This duplication doesn't seem to serve any purpose, and it gets in the way of generalising the ASID handling across guest kernel/user modes, so let's just extract the ASID straight out of the mm_struct on demand; in fact there are convenient cpu_context() and cpu_asid() macros for doing so. To reduce the verbosity of this code we also add kern_mm and user_mm local variables where the kernel and user mm_structs are used. Signed-off-by: James Hogan <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: "Radim Krčmář" <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: [email protected] Cc: [email protected]
2017-02-03  KVM: MIPS/MMU: Move preempt/ASID handling to implementation  (James Hogan, 2 files, -53/+54)
The MIPS KVM host and guest GVA ASIDs may need regenerating when scheduling a process in guest context, which is done from the kvm_arch_vcpu_load() / kvm_arch_vcpu_put() functions in mmu.c. However this is a fairly implementation specific detail. VZ for example may use GuestIDs instead of normal ASIDs to distinguish mappings belonging to different guests, and even on VZ without GuestID the root TLB will be used differently to trap & emulate. Trap & emulate GVA ASIDs only relate to the user part of the full address space, so can be left active during guest exit handling (guest context) to allow guest instructions to be easily read and translated. VZ root ASIDs however are for GPA mappings so can't be left active during normal kernel code. They also aren't useful for accessing guest virtual memory, and we should have CP0_BadInstr[P] registers available to provide encodings of trapping guest instructions anyway. Therefore move the ASID preemption handling into the implementation callback. Signed-off-by: James Hogan <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: "Radim Krčmář" <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: [email protected] Cc: [email protected]
2017-02-03  KVM: MIPS: Convert get/set_regs -> vcpu_load/put  (James Hogan, 3 files, -10/+10)
Convert the get_regs() and set_regs() callbacks to vcpu_load() and vcpu_put(), which provide a cpu argument and more closely match the kvm_arch_vcpu_load() / kvm_arch_vcpu_put() that they are called by. This is in preparation for moving ASID management into the implementations. Signed-off-by: James Hogan <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: "Radim Krčmář" <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: [email protected] Cc: [email protected]
2017-02-03  KVM: MIPS/MMU: Simplify ASID restoration  (James Hogan, 2 files, -37/+12)
KVM T&E uses an ASID for guest kernel mode and an ASID for guest user mode. The current ASID is saved when the guest is scheduled out, and restored when scheduling back in, with checks for whether the ASID needs to be regenerated. This isn't really necessary as the ASID can be easily determined by the current guest mode, so let's simplify it to just read the required ASID from guest_kernel_asid or guest_user_asid, even if the ASID hasn't been regenerated. Signed-off-by: James Hogan <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: "Radim Krčmář" <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: [email protected] Cc: [email protected]
2017-02-03  KVM: MIPS: Drop partial KVM_NMI implementation  (James Hogan, 1 file, -16/+0)
MIPS incompletely implements the KVM_NMI ioctl to supposedly perform a CPU reset, but all it actually does is invalidate the ASIDs. It doesn't expose the KVM_CAP_USER_NMI capability which is supposed to indicate the presence of the KVM_NMI ioctl, and no user software actually uses it on MIPS. Since this is dead code that would technically need updating for GVA page table handling in upcoming patches, remove it now. If we wanted to implement NMI injection later it can always be done properly along with the KVM_CAP_USER_NMI capability, and if we wanted to implement a proper CPU reset it would be better done with a separate ioctl. Signed-off-by: James Hogan <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: "Radim Krčmář" <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: [email protected] Cc: [email protected]
2017-02-03  Merge MIPS prerequisites  (James Hogan, 9 files, -54/+116)
Merge in MIPS prerequisites from GVA page tables and GPA page tables series. The same branch can also merge into the MIPS tree. Signed-off-by: James Hogan <[email protected]>
2017-02-03  MIPS: Add return errors to protected cache ops  (James Hogan, 1 file, -20/+35)
The protected cache ops contain no out of line fixup code to return an error code in the event of a fault, with the cache op being skipped in that case. For KVM however we'd like to detect this case as page faulting will be disabled so it could happen during normal operation if the GVA page tables were flushed, and need to be handled by the caller. Add the out-of-line fixup code to load the error value -EFAULT into the return variable, and adapt the protected cache line functions to pass the error back to the caller. Signed-off-by: James Hogan <[email protected]> Acked-by: Ralf Baechle <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: "Radim Krčmář" <[email protected]> Cc: [email protected] Cc: [email protected]
2017-02-03  MIPS: Export some tlbex internals for KVM to use  (James Hogan, 2 files, -17/+42)
Export the TLB exception code generating functions so that KVM can construct a fast TLB refill handler for guest context without reinventing the wheel quite so much. Signed-off-by: James Hogan <[email protected]> Acked-by: Ralf Baechle <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: "Radim Krčmář" <[email protected]> Cc: [email protected] Cc: [email protected]
2017-02-03  MIPS: uasm: Add include guards in asm/uasm.h  (James Hogan, 1 file, -0/+5)
Add include guards in asm/uasm.h to allow it to be safely used by a new header asm/tlbex.h in the next patch to expose TLB exception building functions for KVM to use. Signed-off-by: James Hogan <[email protected]> Acked-by: Ralf Baechle <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: "Radim Krčmář" <[email protected]> Cc: [email protected] Cc: [email protected]
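The change being described is the standard include-guard pattern; a minimal sketch follows, with the guard macro name chosen for illustration rather than copied from the actual header:

```c
/* Standard include-guard skeleton; the macro name below is illustrative,
 * not necessarily the one used in asm/uasm.h. */
#ifndef __ASM_UASM_H
#define __ASM_UASM_H

/* ... existing uasm declarations ... */

#endif /* __ASM_UASM_H */
```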
2017-02-03  MIPS: Export pgd/pmd symbols for KVM  (James Hogan, 3 files, -1/+7)
Export pmd_init(), invalid_pmd_table and tlbmiss_handler_setup_pgd to GPL kernel modules so that MIPS KVM can use the inline page table management functions and switch between page tables: - pmd_init() will be used directly by KVM to initialise newly allocated pmd tables with invalid lower level table pointers. - invalid_pmd_table is used by pud_present(), pud_none(), and pud_clear(), which KVM will use to test and clear pud entries. - tlbmiss_handler_setup_pgd() will be called by KVM entry code to switch to the appropriate GVA page tables. Signed-off-by: James Hogan <[email protected]> Acked-by: Ralf Baechle <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: "Radim Krčmář" <[email protected]> Cc: [email protected] Cc: [email protected]
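The export mechanism referred to here is EXPORT_SYMBOL_GPL(). Below is a hedged sketch of how one of the listed functions can be made visible to GPL-licensed modules such as kvm.ko; the prototype and body are illustrative, not the exact MIPS code:

```c
#include <linux/export.h>

/* Hedged sketch: the prototype is illustrative, not the exact MIPS one. */
void pmd_init(unsigned long addr, unsigned long pagetable)
{
	/* Fill a freshly allocated pmd table with invalid lower-level
	 * table pointers, as described in the commit message. */
}
/* Make the symbol usable from GPL-licensed modules such as kvm.ko;
 * data symbols like invalid_pmd_table are exported the same way. */
EXPORT_SYMBOL_GPL(pmd_init);
```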
2017-02-02  MIPS: Move pgd_alloc() out of header  (James Hogan, 3 files, -16/+27)
pgd_alloc() references init_mm which is not exported to modules. In order for KVM to be able to use pgd_alloc() to allocate GVA page tables, move pgd_alloc() into a new pgtable.c file and export it to modules. Signed-off-by: James Hogan <[email protected]> Acked-by: Ralf Baechle <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: "Radim Krčmář" <[email protected]> Cc: [email protected] Cc: [email protected]
2017-02-02  MIPS: KVM: Return directly after a failed copy_from_user() in kvm_arch_vcpu_ioctl()  (Markus Elfring, 1 file, -7/+2)
* Return directly after a call to copy_from_user() fails in a case block.
* Delete the jump label "out", which became unnecessary with this refactoring.
Signed-off-by: Markus Elfring <[email protected]> Reviewed-by: Paolo Bonzini <[email protected]> Signed-off-by: James Hogan <[email protected]>
2017-01-20  Revert "KVM: nested VMX: disable perf cpuid reporting"  (Jim Mattson, 2 files, -8/+0)
This reverts commit bc6134942dbbf31c25e9bd7c876be5da81c9e1ce. A CPUID instruction executed in VMX non-root mode always causes a VM-exit, regardless of the leaf being queried. Fixes: bc6134942dbb ("KVM: nested VMX: disable perf cpuid reporting") Signed-off-by: Jim Mattson <[email protected]> [The issue solved by bc6134942dbb has been resolved with ff651cb613b4 ("KVM: nVMX: Add nested msr load/restore algorithm").] Signed-off-by: Radim Krčmář <[email protected]>
2017-01-17  kvm: x86: Expose Intel VPOPCNTDQ feature to guest  (Piotr Luc, 1 file, -1/+1)
Vector population count instructions for dwords and qwords are to be used in future Intel Xeon & Xeon Phi processors. Bit 14 of CPUID[level:0x07, ECX] indicates that the new instructions are supported by a processor. The spec can be found in the Intel Software Developer Manual (SDM) or in the Instruction Set Extensions Programming Reference (ISE). Signed-off-by: Piotr Luc <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Radim Krčmář <[email protected]>
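A small user-space sketch of how the same feature bit can be probed on the host, assuming GCC's <cpuid.h> helpers; the check mirrors the CPUID[0x07].ECX bit 14 encoding described above:

```c
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* Leaf 0x07, sub-leaf 0: structured extended feature flags. */
	if (__get_cpuid_max(0, NULL) < 7) {
		puts("CPUID leaf 0x07 not supported");
		return 1;
	}
	__cpuid_count(0x07, 0, eax, ebx, ecx, edx);

	/* ECX bit 14 is AVX512_VPOPCNTDQ, as described in the commit. */
	printf("AVX512_VPOPCNTDQ: %s\n", (ecx & (1u << 14)) ? "yes" : "no");
	return 0;
}
```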
2017-01-17  Merge branch 'x86/cpufeature' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next  (Radim Krčmář, 37 files, -131/+312)
For AVX512_VPOPCNTDQ.
2017-01-16  x86/cpufeature: Add AVX512_VPOPCNTDQ feature  (Piotr Luc, 2 files, -1/+2)
Vector population count instructions for dwords and qwords are going to be available in future Intel Xeon & Xeon Phi processors. Bit 14 of CPUID[level:0x07, ECX] indicates that the instructions are supported by a processor. The specification can be found in the Intel Software Developer Manual (SDM) and in the Instruction Set Extensions Programming Reference (ISE). Populate the feature bit and clear it when xsave is disabled. Signed-off-by: Piotr Luc <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: [email protected] Cc: Radim Krčmář <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2017-01-15  Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 18 files, -100/+129)
Pull x86 fixes from Ingo Molnar:
 "Misc fixes:
  - unwinder fixes
  - AMD CPU topology enumeration fixes
  - microcode loader fixes
  - x86 embedded platform fixes
  - fix for a bootup crash that may trigger when clearcpuid= is used with invalid values"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mpx: Use compatible types in comparison to fix sparse error
  x86/tsc: Add the Intel Denverton Processor to native_calibrate_tsc()
  x86/entry: Fix the end of the stack for newly forked tasks
  x86/unwind: Include __schedule() in stack traces
  x86/unwind: Disable KASAN checks for non-current tasks
  x86/unwind: Silence warnings for non-current tasks
  x86/microcode/intel: Use correct buffer size for saving microcode data
  x86/microcode/intel: Fix allocation size of struct ucode_patch
  x86/microcode/intel: Add a helper which gives the microcode revision
  x86/microcode: Use native CPUID to tickle out microcode revision
  x86/CPU: Add native CPUID variants returning a single datum
  x86/boot: Add missing declaration of string functions
  x86/CPU/AMD: Fix Bulldozer topology
  x86/platform/intel-mid: Rename 'spidev' to 'mrfld_spidev'
  x86/cpu: Fix typo in the comment for Anniedale
  x86/cpu: Fix bootup crashes by sanitizing the argument of the 'clearcpuid=' command-line option
2017-01-15  Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 7 files, -3/+15)
Pull perf fixes from Ingo Molnar:
 "Misc race fixes uncovered by fuzzing efforts, a Sparse fix, two PMU driver fixes, plus miscellaneous tooling fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Reject non sampling events with precise_ip
  perf/x86/intel: Account interrupts for PEBS errors
  perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
  perf/core: Fix sys_perf_event_open() vs. hotplug
  perf/x86/intel: Use ULL constant to prevent undefined shift behaviour
  perf/x86/intel/uncore: Fix hardcoded socket 0 assumption in the Haswell init code
  perf/x86: Set pmu->module in Intel PMU modules
  perf probe: Fix to probe on gcc generated symbols for offline kernel
  perf probe: Fix --funcs to show correct symbols for offline module
  perf symbols: Robustify reading of build-id from sysfs
  perf tools: Install tools/lib/traceevent plugins with install-bin
  tools lib traceevent: Fix prev/next_prio for deadline tasks
  perf record: Fix --switch-output documentation and comment
  perf record: Make __record_options static
  tools lib subcmd: Add OPT_STRING_OPTARG_SET option
  perf probe: Fix to get correct modname from elf header
  samples/bpf trace_output_user: Remove duplicate sys/ioctl.h include
  samples/bpf sock_example: Avoid getting ethhdr from two includes
  perf sched timehist: Show total scheduling time
2017-01-15  Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2 files, -2/+68)
Pull EFI fixes from Ingo Molnar:
 "A number of regression fixes:
  - Fix a boot hang on machines that have somewhat unusual memory map entries of phys_addr=0x0 num_pages=0, which broke due to a recent commit. This commit got cherry-picked from the v4.11 queue because the bug is affecting real machines.
  - Fix a boot hang also reported by KASAN, caused by incorrect init ordering introduced by a recent optimization.
  - Fix a recent robustification fix to allocate_new_fdt_and_exit_boot() that introduced an invalid assumption.
  Neither bugs were seen in the wild AFAIK"
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi/x86: Prune invalid memory map entries and fix boot regression
  x86/efi: Don't allocate memmap through memblock after mm_init()
  efi/libstub/arm*: Pass latest memory map to the kernel
2017-01-14  efi/x86: Prune invalid memory map entries and fix boot regression  (Peter Jones, 1 file, -0/+66)
Some machines, such as the Lenovo ThinkPad W541 with firmware GNET80WW (2.28), include memory map entries with phys_addr=0x0 and num_pages=0. These machines fail to boot after the following commit: 8e80632fb23f ("efi/esrt: Use efi_mem_reserve() and avoid a kmalloc()"). Fix this by removing such bogus entries from the memory map. Furthermore, currently the log output for this case (with efi=debug) looks like:
  [ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[0x0000000000000000-0xffffffffffffffff] (0MB)
This is clearly wrong, and also not as informative as it could be. This patch changes it so that if we find obviously invalid memory map entries, we print an error and skip those entries. It also detects the address range calculation overflow in the displayed range, so the new output is:
  [ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
  [ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[0x0000000000000000-0x0000000000000000] (invalid)
It also detects memory map sizes that would overflow the physical address, for example phys_addr=0xfffffffffffff000 and num_pages=0x0200000000000001, and prints:
  [ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
  [ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[phys_addr=0xfffffffffffff000-0x20ffffffffffffffff] (invalid)
It then removes these entries from the memory map. Signed-off-by: Peter Jones <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]> [ardb: refactor for clarity with no functional changes, avoid PAGE_SHIFT] Signed-off-by: Matt Fleming <[email protected]> [Matt: Include bugzilla info in commit log] Cc: <[email protected]> # v4.9+ Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://bugzilla.kernel.org/show_bug.cgi?id=191121 Signed-off-by: Ingo Molnar <[email protected]>
2017-01-14  perf/x86: Reject non sampling events with precise_ip  (Jiri Olsa, 1 file, -0/+4)
As Peter suggested [1], reject non-sampling PEBS events, because they don't make any sense and could cause bugs in the NMI handler [2]. [1] http://lkml.kernel.org/r/20170103094059.GC3093@worktop [2] http://lkml.kernel.org/r/[email protected] Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Cc: Vince Weaver <[email protected]> Link: http://lkml.kernel.org/r/20170103142454.GA26251@krava Signed-off-by: Ingo Molnar <[email protected]>
2017-01-14  perf/x86/intel: Account interrupts for PEBS errors  (Jiri Olsa, 1 file, -1/+5)
It's possible to set up PEBS events to get only errors and not any data, like on SNB-X (model 45) and IVB-EP (model 62), via two perf commands running simultaneously:
  taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10
This leads to a soft lockup, because the error path of intel_pmu_drain_pebs_nhm() does not account event->hw.interrupt for error PEBS interrupts, so in case you're getting ONLY errors you don't have a way to stop the event when it's over the max_samples_per_tick limit:
  NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
  ...
  RIP: 0010:[<ffffffff81159232>] [<ffffffff81159232>] smp_call_function_single+0xe2/0x140
  ...
  Call Trace:
   ? trace_hardirqs_on_caller+0xf5/0x1b0
   ? perf_cgroup_attach+0x70/0x70
   perf_install_in_context+0x199/0x1b0
   ? ctx_resched+0x90/0x90
   SYSC_perf_event_open+0x641/0xf90
   SyS_perf_event_open+0x9/0x10
   do_syscall_64+0x6c/0x1f0
   entry_SYSCALL64_slow_path+0x25/0x25
Add perf_event_account_interrupt(), which does the interrupt and frequency checks, and call it from intel_pmu_drain_pebs_nhm()'s error path. We keep the pending_kill and pending_wakeup logic only in the __perf_event_overflow() path, because they make sense only if there's any data to deliver. Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Cc: Vince Weaver <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-01-14  x86/mpx: Use compatible types in comparison to fix sparse error  (Tobias Klauser, 1 file, -1/+1)
info->si_addr is of type void __user *, so it should be compared against something from the same address space. This fixes the following sparse error:
  arch/x86/mm/mpx.c:296:27: error: incompatible types in comparison expression (different address spaces)
Signed-off-by: Tobias Klauser <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
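A minimal illustration of the class of warning sparse emits here, not the actual mpx.c code; the __user stub follows the kernel's compiler.h convention, and the struct and helper names are hypothetical:

```c
/* Under sparse (__CHECKER__), __user marks a pointer as belonging to
 * address space 1; comparing it with a plain kernel pointer triggers
 * "incompatible types in comparison expression". */
#ifdef __CHECKER__
# define __user __attribute__((noderef, address_space(1)))
#else
# define __user
#endif

struct fault_info {			/* hypothetical, for illustration */
	void __user *si_addr;		/* user-space fault address */
};

static int fault_hit_addr(struct fault_info *info, unsigned long addr)
{
	/* Cast the scalar into the *same* address space before comparing. */
	return info->si_addr == (void __user *)addr;
}
```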
2017-01-14  x86/tsc: Add the Intel Denverton Processor to native_calibrate_tsc()  (Len Brown, 1 file, -0/+1)
The Intel Denverton microserver uses a 25 MHz TSC crystal, so we can derive its exact [*] TSC frequency using CPUID and some arithmetic, e.g.:
  TSC: 1800 MHz (25000000 Hz * 216 / 3 / 1000000)
[*] 'exact' is only as good as the crystal, which should be +/- 20ppm
Signed-off-by: Len Brown <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/306899f94804aece6d8fa8b4223ede3b48dbb59c.1484287748.git.len.brown@intel.com Signed-off-by: Ingo Molnar <[email protected]>
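The arithmetic in that example can be spelled out as a small standalone program; the 216/3 ratio and 25 MHz crystal are taken from the commit's example output (on real hardware they come from CPUID leaf 0x15 and the CPU model), so treat the numbers as illustrative:

```c
#include <stdio.h>

int main(void)
{
	/* Values from the example above; on real hardware the numerator and
	 * denominator come from CPUID leaf 0x15 (EBX/EAX). */
	unsigned long long crystal_hz = 25000000ULL;	/* 25 MHz crystal   */
	unsigned long long num = 216, den = 3;		/* TSC/crystal ratio */

	unsigned long long tsc_mhz = crystal_hz * num / den / 1000000;
	printf("TSC: %llu MHz\n", tsc_mhz);		/* prints 1800      */
	return 0;
}
```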
2017-01-13  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds, 4 files, -14/+66)
Pull KVM fixes from Paolo Bonzini:
 - fix for module unload vs deferred jump labels (note: there might be other buggy modules!)
 - two NULL pointer dereferences from syzkaller
 - also syzkaller: fix emulation of fxsave/fxrstor/sgdt/sidt, problem made worse during this merge window, "just" kernel memory leak on releases
 - fix emulation of "mov ss" - somewhat serious on AMD, less so on Intel
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: fix emulation of "MOV SS, null selector"
  KVM: x86: fix NULL deref in vcpu_scan_ioapic
  KVM: eventfd: fix NULL deref irqbypass consumer
  KVM: x86: Introduce segmented_write_std
  KVM: x86: flush pending lapic jump label updates on module unload
  jump_labels: API for flushing deferred jump label updates
2017-01-13  Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux  (Linus Torvalds, 2 files, -10/+28)
Pull arm64 fixes from Catalin Marinas:
 - Fix huge_ptep_set_access_flags() to return "changed" when any of the ptes in the contiguous range is changed, not just the last one
 - Fix the adr_l assembly macro to work in modules under KASLR
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: assembler: make adr_l work in modules under KASLR
  arm64: hugetlb: fix the wrong return value for huge_ptep_set_access_flags
2017-01-12  arm64: assembler: make adr_l work in modules under KASLR  (Ard Biesheuvel, 1 file, -9/+27)
When CONFIG_RANDOMIZE_MODULE_REGION_FULL=y, the offset between loaded modules and the core kernel may exceed 4 GB, putting symbols exported by the core kernel out of the reach of the ordinary adrp/add instruction pairs used to generate relative symbol references. So make the adr_l macro emit a movz/movk sequence instead when executing in module context. While at it, remove the pointless special case for the stack pointer. Acked-by: Mark Rutland <[email protected]> Acked-by: Will Deacon <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]> Signed-off-by: Catalin Marinas <[email protected]>
2017-01-12  KVM: x86: fix emulation of "MOV SS, null selector"  (Paolo Bonzini, 1 file, -10/+38)
This is CVE-2017-2583. On Intel this causes a failed vmentry because SS's type is neither 3 nor 7 (even though the manual says this check is only done for usable SS, and the dmesg splat says that SS is unusable!). On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb. The fix fabricates a data segment descriptor when SS is set to a null selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb. Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3; this in turn ensures CPL < 3 because RPL must be equal to CPL. Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing the bug and deciphering the manuals. Reported-by: Xiaohan Zhang <[email protected]> Fixes: 79d5b4c3cd809c770d4bf9812635647016c56011 Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
2017-01-12  KVM: x86: fix NULL deref in vcpu_scan_ioapic  (Wanpeng Li, 1 file, -0/+2)
Reported by syzkaller:
  BUG: unable to handle kernel NULL pointer dereference at 00000000000001b0
  IP: _raw_spin_lock+0xc/0x30
  PGD 3e28eb067 PUD 3f0ac6067 PMD 0
  Oops: 0002 [#1] SMP
  CPU: 0 PID: 2431 Comm: test Tainted: G OE 4.10.0-rc1+ #3
  Call Trace:
   ? kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
   kvm_arch_vcpu_ioctl_run+0x10a8/0x15f0 [kvm]
   ? pick_next_task_fair+0xe1/0x4e0
   ? kvm_arch_vcpu_load+0xea/0x260 [kvm]
   kvm_vcpu_ioctl+0x33a/0x600 [kvm]
   ? hrtimer_try_to_cancel+0x29/0x130
   ? do_nanosleep+0x97/0xf0
   do_vfs_ioctl+0xa1/0x5d0
   ? __hrtimer_init+0x90/0x90
   ? do_nanosleep+0x5b/0xf0
   SyS_ioctl+0x79/0x90
   do_syscall_64+0x6e/0x180
   entry_SYSCALL64_slow_path+0x25/0x25
  RIP: _raw_spin_lock+0xc/0x30 RSP: ffffa43688973cc0
The syzkaller folks reported a NULL pointer dereference due to ENABLE_CAP succeeding even without an irqchip. The Hyper-V synthetic interrupt controller is activated, resulting in a wrong request to rescan the ioapic and a NULL pointer dereference. Reproducer:
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <sys/types.h>
  #include <linux/kvm.h>
  #include <pthread.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  #ifndef KVM_CAP_HYPERV_SYNIC
  #define KVM_CAP_HYPERV_SYNIC 123
  #endif

  void *thr(void *arg)
  {
  	struct kvm_enable_cap cap;
  	cap.flags = 0;
  	cap.cap = KVM_CAP_HYPERV_SYNIC;
  	ioctl((long)arg, KVM_ENABLE_CAP, &cap);
  	return 0;
  }

  int main()
  {
  	void *host_mem = mmap(0, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  	int kvmfd = open("/dev/kvm", 0);
  	int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
  	struct kvm_userspace_memory_region memreg;
  	memreg.slot = 0;
  	memreg.flags = 0;
  	memreg.guest_phys_addr = 0;
  	memreg.memory_size = 0x1000;
  	memreg.userspace_addr = (unsigned long)host_mem;
  	((char *)host_mem)[0] = 0xf4;	/* hlt */
  	ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &memreg);
  	int cpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
  	struct kvm_sregs sregs;
  	ioctl(cpufd, KVM_GET_SREGS, &sregs);
  	sregs.cr0 = 0;
  	sregs.cr4 = 0;
  	sregs.efer = 0;
  	sregs.cs.selector = 0;
  	sregs.cs.base = 0;
  	ioctl(cpufd, KVM_SET_SREGS, &sregs);
  	struct kvm_regs regs = { .rflags = 2 };
  	ioctl(cpufd, KVM_SET_REGS, &regs);
  	ioctl(vmfd, KVM_CREATE_IRQCHIP, 0);
  	pthread_t th;
  	pthread_create(&th, 0, thr, (void *)(long)cpufd);
  	usleep(rand() % 10000);
  	ioctl(cpufd, KVM_RUN, 0);
  	pthread_join(th, 0);
  	return 0;
  }
This patch fixes it by failing ENABLE_CAP if no irqchip is present. Reported-by: Dmitry Vyukov <[email protected]> Fixes: 5c919412fe61 (kvm/x86: Hyper-V synthetic interrupt controller) Cc: [email protected] # 4.5+ Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Dmitry Vyukov <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2017-01-12  KVM: x86: Introduce segmented_write_std  (Steve Rutherford, 1 file, -4/+18)
Introduces segmented_write_std. Switches from emulated reads/writes to standard read/writes in fxsave, fxrstor, sgdt, and sidt. This fixes CVE-2017-2584, a longstanding kernel memory leak. Since commit 283c95d0e389 ("KVM: x86: emulate FXSAVE and FXRSTOR", 2016-11-09), which is luckily not yet in any final release, this would also be an exploitable kernel memory *write*! Reported-by: Dmitry Vyukov <[email protected]> Cc: [email protected] Fixes: 96051572c819194c37a8367624b285be10297eca Fixes: 283c95d0e3891b64087706b344a4b545d04a6e62 Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Steve Rutherford <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2017-01-12  KVM: x86: flush pending lapic jump label updates on module unload  (David Matlack, 3 files, -0/+8)
KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled). These are implemented with delayed_work structs which can still be pending when the KVM module is unloaded. We've seen this cause kernel panics when the kvm_intel module is quickly reloaded. Use the new static_key_deferred_flush() API to flush pending updates on module unload. Signed-off-by: David Matlack <[email protected]> Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
2017-01-12  x86/entry: Fix the end of the stack for newly forked tasks  (Josh Poimboeuf, 2 files, -23/+18)
When unwinding a task, the end of the stack is always at the same offset right below the saved pt_regs, regardless of which syscall was used to enter the kernel. That convention allows the unwinder to verify that a stack is sane. However, newly forked tasks don't always follow that convention, as reported by the following unwinder warning seen by Dave Jones:
  WARNING: kernel stack frame pointer at ffffc90001443f30 in kworker/u8:8:30468 has bad value (null)
The warning was due to the following call chain:
  (ftrace handler)
  call_usermodehelper_exec_async+0x5/0x140
  ret_from_fork+0x22/0x30
The problem is that ret_from_fork() doesn't create a stack frame before calling other functions. Fix that by carefully using the frame pointer macros. In addition to conforming to the end of stack convention, this also makes related stack traces more sensible by making it clear to the user that ret_from_fork() was involved. Reported-by: Dave Jones <[email protected]> Signed-off-by: Josh Poimboeuf <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Miroslav Benes <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/8854cdaab980e9700a81e9ebf0d4238e4bbb68ef.1483978430.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <[email protected]>
2017-01-12  x86/unwind: Include __schedule() in stack traces  (Josh Poimboeuf, 2 files, -5/+10)
In the following commit: 0100301bfdf5 ("sched/x86: Rewrite the switch_to() code") ... the layout of the 'inactive_task_frame' struct was designed to have a frame pointer header embedded in it, so that the unwinder could use the 'bp' and 'ret_addr' fields to report __schedule() on the stack (or ret_from_fork() for newly forked tasks which haven't actually run yet). Finish the job by changing get_frame_pointer() to return a pointer to inactive_task_frame's 'bp' field rather than 'bp' itself. This allows the unwinder to start one frame higher on the stack, so that it properly reports __schedule(). Reported-by: Miroslav Benes <[email protected]> Signed-off-by: Josh Poimboeuf <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Dave Jones <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/598e9f7505ed0aba86e8b9590aa528c6c7ae8dcd.1483978430.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <[email protected]>
2017-01-12  x86/unwind: Disable KASAN checks for non-current tasks  (Josh Poimboeuf, 2 files, -3/+22)
There are a handful of callers to save_stack_trace_tsk() and show_stack() which try to unwind the stack of a task other than current. In such cases, it's remotely possible that the task is running on one CPU while the unwinder is reading its stack from another CPU, causing the unwinder to see stack corruption. These cases seem to be mostly harmless. The unwinder has checks which prevent it from following bad pointers beyond the bounds of the stack. So it's not really a bug as long as the caller understands that unwinding another task will not always succeed. In such cases, it's possible that the unwinder may read a KASAN-poisoned region of the stack. Account for that by using READ_ONCE_NOCHECK() when reading the stack of another task. Use READ_ONCE() when reading the stack of the current task, since KASAN warnings can still be useful for finding bugs in that case. Reported-by: Dmitry Vyukov <[email protected]> Signed-off-by: Josh Poimboeuf <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Dave Jones <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Miroslav Benes <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/4c575eb288ba9f73d498dfe0acde2f58674598f1.1483978430.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <[email protected]>
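A hedged sketch of the accessor choice the commit describes; the helper name and surrounding context are illustrative, not the actual x86 unwinder code:

```c
#include <linux/compiler.h>
#include <linux/sched.h>

/* Illustrative sketch only: pick the KASAN-checked accessor for the current
 * task, and the unchecked one when reading another task's stack, which may
 * be modified concurrently by the CPU it is running on. */
static unsigned long read_stack_word(struct task_struct *task,
				     unsigned long *addr)
{
	if (task == current)
		return READ_ONCE(*addr);	  /* keep KASAN checks */

	return READ_ONCE_NOCHECK(*addr);	  /* may race with remote CPU */
}
```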
2017-01-12  x86/unwind: Silence warnings for non-current tasks  (Josh Poimboeuf, 1 file, -0/+10)
There are a handful of callers to save_stack_trace_tsk() and show_stack() which try to unwind the stack of a task other than current. In such cases, it's remotely possible that the task is running on one CPU while the unwinder is reading its stack from another CPU, causing the unwinder to see stack corruption. These cases seem to be mostly harmless. The unwinder has checks which prevent it from following bad pointers beyond the bounds of the stack. So it's not really a bug as long as the caller understands that unwinding another task will not always succeed. Since stack "corruption" on another task's stack isn't necessarily a bug, silence the warnings when unwinding tasks other than current. Reported-by: Dave Jones <[email protected]> Signed-off-by: Josh Poimboeuf <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Miroslav Benes <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/00d8c50eea3446c1524a2a755397a3966629354c.1483978430.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <[email protected]>
2017-01-11  Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6  (Linus Torvalds, 1 file, -1/+2)
Pull crypto fix from Herbert Xu:
 "This fixes a regression in aesni that renders it useless if it's built-in with a modular pcbc configuration"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: aesni - Fix failure when built-in with modular pcbc
2017-01-11  perf/x86/intel: Use ULL constant to prevent undefined shift behaviour  (Colin King, 1 file, -1/+1)
When x86_pmu.num_counters is 32, shifting the 32-bit integer constant 1 by 32 bits is undefined behaviour. Fix this by shifting 1ULL instead of 1. Reported-by: CoverityScan CID#1192105 ("Bad bit shift operation") Signed-off-by: Colin Ian King <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Kan Liang <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Alexander Shishkin <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
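A minimal user-space illustration of the undefined-shift pattern and the 1ULL fix; this is a standalone sketch, not the perf code itself:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	unsigned int num_counters = 32;

	/* Undefined behaviour: the literal 1 is a 32-bit int, so shifting it
	 * by 32 overflows the type before the result is ever widened:
	 *   uint64_t bad = (1 << num_counters) - 1;
	 * Correct: force a 64-bit shift with a ULL constant. */
	uint64_t mask = (1ULL << num_counters) - 1;

	printf("%#llx\n", (unsigned long long)mask);	/* 0xffffffff */
	return 0;
}
```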
2017-01-11  perf/x86/intel/uncore: Fix hardcoded socket 0 assumption in the Haswell init code  (Prarit Bhargava, 1 file, -1/+1)
hswep_uncore_cpu_init() uses a hardcoded physical package id 0 for the boot cpu. This works as long as the boot CPU is actually on physical package 0, which is normally the case after power on / reboot. But it fails with a NULL pointer dereference when a kdump kernel is started on a secondary socket which has a different physical package id, because the logical package translation for physical package 0 does not exist. Use the logical package id of the boot cpu instead of the hardcoded 0. [ tglx: Rewrote changelog once more ] Fixes: cf6d445f6897 ("perf/x86/uncore: Track packages, not per CPU data") Signed-off-by: Prarit Bhargava <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Harish Chegondi <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kan Liang <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2017-01-11  arm64: hugetlb: fix the wrong return value for huge_ptep_set_access_flags  (Huang Shijie, 1 file, -1/+1)
In the current code, @changed always returns the status of the last PTE for a huge page with the contiguous bit set. This is not what we want: if even one of the PTEs is changed, we should report it to the caller. This patch fixes this issue. Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit") Cc: <[email protected]> # 4.5.x- Signed-off-by: Huang Shijie <[email protected]> Signed-off-by: Catalin Marinas <[email protected]>
2017-01-09  x86/microcode/intel: Use correct buffer size for saving microcode data  (Junichi Nomura, 1 file, -2/+3)
In generic_load_microcode(), curr_mc_size is the size of the last allocated buffer and, since we have this performance "optimization" there to vmalloc a new buffer only when the current one is bigger, curr_mc_size ends up becoming the size of the biggest buffer we've seen so far. However, we end up saving the microcode patch which matches our CPU, and its size is not curr_mc_size but the respective mc_size during the iteration while we're staring at it. So save that mc_size into a separate variable and use it to store the previously found microcode buffer. Without this fix, we could get an oops like this:
  BUG: unable to handle kernel paging request at ffffc9000e30f000
  IP: __memcpy+0x12/0x20
  ...
  Call Trace:
   ? kmemdup+0x43/0x60
   __alloc_microcode_buf+0x44/0x70
   save_microcode_patch+0xd4/0x150
   generic_load_microcode+0x1b8/0x260
   request_microcode_user+0x15/0x20
   microcode_write+0x91/0x100
   __vfs_write+0x34/0x120
   vfs_write+0xc1/0x130
   SyS_write+0x56/0xc0
   do_syscall_64+0x6c/0x160
   entry_SYSCALL64_slow_path+0x25/0x25
Fixes: 06b8534cb728 ("x86/microcode: Rework microcode loading") Signed-off-by: Jun'ichi Nomura <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2017-01-09  x86/microcode/intel: Fix allocation size of struct ucode_patch  (Junichi Nomura, 1 file, -1/+1)
We allocate struct ucode_patch here, but @size is the size of the microcode data and is used for kmemdup() later in this function. Fixes: 06b8534cb728 ("x86/microcode: Rework microcode loading") Signed-off-by: Jun'ichi Nomura <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2017-01-09  x86/microcode/intel: Add a helper which gives the microcode revision  (Borislav Petkov, 3 files, -38/+31)
Since on Intel we're required to do CPUID(1) first, before reading the microcode revision MSR, let's add a special helper which does the required steps so that we don't forget to do them next time, when we want to read the microcode revision. Signed-off-by: Borislav Petkov <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
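A hedged sketch of the CPUID-then-MSR sequence the commit describes; the helper name is hypothetical and the exact MSR layout is an assumption based on the commit text and the SDM, not the actual kernel helper:

```c
#include <linux/types.h>
#include <asm/processor.h>	/* native_cpuid()              */
#include <asm/msr.h>		/* native_rdmsr(), MSR indices */

/*
 * Sketch only: per the commit text, CPUID(1) must be executed before
 * reading MSR 0x8b (IA32_BIOS_SIGN_ID); the microcode revision is then
 * reported in the upper 32 bits of that MSR.
 */
static u32 example_read_ucode_rev(void)
{
	u32 eax = 1, ebx, ecx, edx;
	u32 rev_lo, rev_hi;

	native_cpuid(&eax, &ebx, &ecx, &edx);		/* CPUID(1) first */
	native_rdmsr(MSR_IA32_UCODE_REV, rev_lo, rev_hi);

	return rev_hi;					/* revision in high half */
}
```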
2017-01-09  x86/microcode: Use native CPUID to tickle out microcode revision  (Borislav Petkov, 2 files, -24/+4)
Intel supplies the microcode revision value in MSR 0x8b (IA32_BIOS_SIGN_ID) after CPUID(1) has been executed. Execute it each time before reading that MSR. It used to do sync_core() which did do CPUID but c198b121b1a1 ("x86/asm: Rewrite sync_core() to use IRET-to-self") changed the sync_core() implementation so we better make the microcode loading case explicit, as the SDM documents it. Reported-and-tested-by: Jun'ichi Nomura <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2017-01-09  x86/CPU: Add native CPUID variants returning a single datum  (Borislav Petkov, 1 file, -0/+18)
... similarly to the cpuid_<reg>() variants. Signed-off-by: Borislav Petkov <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
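A sketch of what such single-datum wrappers can look like, modelled on the cpuid_<reg>() helpers the commit refers to; the macro below is illustrative rather than the exact patch:

```c
#include <asm/processor.h>	/* native_cpuid() */

/* Illustrative generator for native_cpuid_eax(op), _ebx(), _ecx(), _edx(). */
#define native_cpuid_reg(reg)						\
static inline unsigned int native_cpuid_##reg(unsigned int op)		\
{									\
	unsigned int eax = op, ebx = 0, ecx = 0, edx = 0;		\
									\
	native_cpuid(&eax, &ebx, &ecx, &edx);				\
									\
	return reg;							\
}

native_cpuid_reg(eax)
native_cpuid_reg(ebx)
native_cpuid_reg(ecx)
native_cpuid_reg(edx)
```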
2017-01-09  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (Linus Torvalds, 1 file, -0/+2)
Pull networking fixes from David Miller:
 1) Fix dumping of nft_quota entries, from Pablo Neira Ayuso.
 2) Fix out of bounds access in nf_tables discovered by KASAN, from Florian Westphal.
 3) Fix IRQ enabling in dp83867 driver, from Grygorii Strashko.
 4) Fix unicast filtering in be2net driver, from Ivan Vecera.
 5) tg3_get_stats64() can race with driver close and ethtool reconfigurations, fix from Michael Chan.
 6) Fix error handling when pass limit is reached in bpf code gen on x86. From Daniel Borkmann.
 7) Don't clobber switch ops and use proper MDIO nested reads and writes in bcm_sf2 driver, from Florian Fainelli.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (21 commits)
  net: dsa: bcm_sf2: Utilize nested MDIO read/write
  net: dsa: bcm_sf2: Do not clobber b53_switch_ops
  net: stmmac: fix maxmtu assignment to be within valid range
  bpf: change back to orig prog on too many passes
  tg3: Fix race condition in tg3_get_stats64().
  be2net: fix unicast list filling
  be2net: fix accesses to unicast list
  netlabel: add CALIPSO to the list of built-in protocols
  vti6: fix device register to report IFLA_INFO_KIND
  net: phy: dp83867: fix irq generation
  amd-xgbe: Fix IRQ processing when running in single IRQ mode
  sh_eth: R8A7740 supports packet shecksumming
  sh_eth: fix EESIPR values for SH77{34|63}
  r8169: fix the typo in the comment
  nl80211: fix sched scan netlink socket owner destruction
  bridge: netfilter: Fix dropping packets that moving through bridge interface
  netfilter: ipt_CLUSTERIP: check duplicate config when initializing
  netfilter: nft_payload: mangle ckecksum if NFT_PAYLOAD_L4CSUM_PSEUDOHDR is set
  netfilter: nf_tables: fix oob access
  netfilter: nft_queue: use raw_smp_processor_id()
  ...
2017-01-09  kvm: nVMX: Reorder error checks for emulated VMXON  (Jim Mattson, 1 file, -3/+3)
Checks on the operand to VMXON are performed after the check for legacy mode operation and the #GP checks, according to the pseudo-code in Intel's SDM. Signed-off-by: Jim Mattson <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2017-01-09  KVM: lapic: do not scan IRR when delivering an interrupt  (Paolo Bonzini, 1 file, -4/+15)
On interrupt delivery the PPR can only grow (except for auto-EOI), so it is impossible that non-auto-EOI interrupt delivery results in KVM_REQ_EVENT. We can therefore use __apic_update_ppr. Signed-off-by: Paolo Bonzini <[email protected]>
2017-01-09  KVM: lapic: do not set KVM_REQ_EVENT unnecessarily on PPR update  (Paolo Bonzini, 1 file, -1/+2)
On PPR update, we set KVM_REQ_EVENT unconditionally anytime PPR is lowered. But we can take into account IRR here already. Reviewed-by: Roman Kagan <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2017-01-09  KVM: lapic: remove unnecessary KVM_REQ_EVENT on PPR update  (Paolo Bonzini, 1 file, -12/+24)
PPR needs to be updated on every IRR read because we may have missed TPR writes that _increased_ PPR. However, these writes need not generate KVM_REQ_EVENT, because either KVM_REQ_EVENT has been set already in __apic_accept_irq, or we are going to process the interrupt right away. Reviewed-by: Roman Kagan <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>