path: root/arch/x86
Age | Commit message | Author | Files | Lines
2017-07-18Merge branch 'x86/boot' into x86/mm, to pick up interacting changesIngo Molnar99-1182/+2138
The SME patches we are about to apply add some E820 logic, so merge in pending E820 code changes first, to have a single code base. Signed-off-by: Ingo Molnar <[email protected]>
2017-07-18x86/numa_emulation: Recalculate numa_nodes_parsed from emulated nodesWei Yang1-0/+7
When emulating NUMA, the kernel's emulated NUMA configuration may contain more or fewer nodes than there are physical nodes. In numa_emulation(), we recalculate numa_meminfo/numa_distance/__apicid_to_node according to the number of emulated nodes, except numa_nodes_parsed, which is arguably an omission. Recalculate numa_nodes_parsed as well. Signed-off-by: Wei Yang <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] [ Changelog fixes. ] Signed-off-by: Ingo Molnar <[email protected]>
2017-07-18x86/numa_emulation: Assign physnode_mask directly from numa_nodes_parsedWei Yang1-10/+8
numa_init() has already called init_func(), which is responsible for setting numa_nodes_parsed, so use this nodemask instead of re-finding it when calling numa_emulation(). This patch gets physnode_mask directly from numa_nodes_parsed. At the same time, it corrects the comments of these two functions. Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-07-18x86/numa_emulation: Refine the calculation of max_emu_nid and dfl_phys_nidWei Yang1-13/+17
max_emu_nid and dfl_phys_nid are calculated from emu_nid_to_phys[], which is filled in by split_nodes_xxx_interleave(). From the logic in these functions, emu_nid_to_phys[] is guaranteed to contain meaningful values when they return successfully, which in turn ensures dfl_phys_nid gets a valid value. This patch removes the error branch checking for an invalid dfl_phys_nid and abstracts this part out into a function for readability. Signed-off-by: Wei Yang <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-07-18x86/boot/KASLR: Rename process_e820_entry() into process_mem_region()Baoquan He1-3/+3
Now that process_e820_entry() is no longer limited to e820 entry processing, rename it to process_mem_region() and adjust the code comments accordingly. Signed-off-by: Baoquan He <[email protected]> Acked-by: Kees Cook <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-07-18x86/boot/KASLR: Switch to pass struct mem_vector to process_e820_entry()Baoquan He1-11/+14
This makes process_e820_entry() able to process any kind of memory region. Signed-off-by: Baoquan He <[email protected]> Acked-by: Kees Cook <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-07-18x86/boot/KASLR: Wrap e820 entries walking code into new function process_e820_entries()Baoquan He1-17/+21
The original function process_e820_entry() only takes care of a single e820 entry passed to it. Wrap the walk over all e820 entries into a new function, process_e820_entries(), move the E820_TYPE_RAM checking logic into it, and remove the redundant local variable 'addr' definition in find_random_phys_addr(). Signed-off-by: Baoquan He <[email protected]> Acked-by: Kees Cook <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
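For illustration, a minimal sketch of such a wrapper, assuming the boot_params e820 table layout and the process_e820_entry() signature used elsewhere in arch/x86/boot/compressed/kaslr.c (names are indicative, not the exact upstream code):

    /* Sketch: walk all e820 entries and hand usable RAM regions to the per-entry helper. */
    static void process_e820_entries(unsigned long minimum, unsigned long image_size)
    {
        struct boot_e820_entry *entry;
        int i;

        for (i = 0; i < boot_params->e820_entries; i++) {
            entry = &boot_params->e820_table[i];
            /* Only usable RAM can hold the randomized kernel image. */
            if (entry->type != E820_TYPE_RAM)
                continue;
            process_e820_entry(entry, minimum, image_size);
        }
    }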
2017-07-18x86/asm: Add unwind hint annotations to sync_core()Josh Poimboeuf1-0/+3
This enables objtool to grok the iret in the middle of a C function. Signed-off-by: Josh Poimboeuf <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/b057be26193c11d2ed3337b2107bc7adcba42c99.1499786555.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <[email protected]>
2017-07-18x86/entry/64: Add unwind hint annotationsJosh Poimboeuf3-11/+66
Add unwind hint annotations to entry_64.S. This will enable the ORC unwinder to unwind through any location in the entry code including syscalls, interrupts, and exceptions. Signed-off-by: Josh Poimboeuf <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/b9f6d478aadf68ba57c739dcfac34ec0dc021c4c.1499786555.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <[email protected]>
2017-07-18objtool, x86: Add facility for asm code to provide unwind hintsJosh Poimboeuf2-0/+210
Some asm (and inline asm) code does special things to the stack which objtool can't understand. (Nor can GCC or GNU assembler, for that matter.) In such cases we need a facility for the code to provide annotations, so the unwinder can unwind through it. This provides such a facility, in the form of unwind hints. They're similar to the GNU assembler .cfi* directives, but they give more information, and are needed in far fewer places, because objtool can fill in the blanks by following branches and adjusting the stack pointer for pushes and pops. Signed-off-by: Josh Poimboeuf <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/0f5f3c9104fca559ff4088bece1d14ae3bca52d5.1499786555.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <[email protected]>
2017-07-18x86/dumpstack: Fix interrupt and exception stack boundary checksJosh Poimboeuf2-4/+4
On x86_64, the double fault exception stack is located immediately after the interrupt stack in memory. This causes confusion in the unwinder when it tries to unwind through an empty interrupt stack, where the stack pointer points to the address bordering the two stacks. The unwinder incorrectly thinks it's running on the double fault stack. Fix this kind of stack border confusion by never considering the beginning address of an exception or interrupt stack to be part of the stack. Signed-off-by: Josh Poimboeuf <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Fixes: 5fe599e02e41 ("x86/dumpstack: Add support for unwinding empty IRQ stacks") Link: http://lkml.kernel.org/r/bcc142160a5104de5c354c21c394c93a0173943f.1499786555.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <[email protected]>
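As a hedged illustration of the rule described above (not the literal upstream diff), the stack-classification check has to exclude the lowest address of the range, since that address is the border shared with the neighbouring stack:

    /* Sketch: 'begin' is the border with the adjacent stack, so it is NOT part
     * of this stack; using '<=' rather than '<' rejects the border address. */
    static bool addr_on_stack(unsigned long addr, unsigned long begin, unsigned long end)
    {
        if (addr <= begin || addr >= end)
            return false;
        return true;
    }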
2017-07-18x86/dumpstack: Fix occasionally missing registersJosh Poimboeuf1-5/+7
If two consecutive stack frames have pt_regs, the oops dump code fails to print the second frame's registers. Fix that. Signed-off-by: Josh Poimboeuf <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Fixes: 3b3fa11bc700 ("x86/dumpstack: Print any pt_regs found on the stack") Link: http://lkml.kernel.org/r/269c5c00c7d45c699f3dcea42a3a594c6cf7a9a3.1499786555.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <[email protected]>
2017-07-18x86/entry/64: Initialize the top of the IRQ stack before switching stacksAndy Lutomirski1-1/+23
The OOPS unwinder wants the word at the top of the IRQ stack to point back to the previous stack at all times when the IRQ stack is in use. There's currently a one-instruction window in ENTER_IRQ_STACK during which this isn't the case. Fix it by writing the old RSP to the top of the IRQ stack before jumping. This currently writes the pointer to the stack twice, which is a bit ugly. We could get rid of this by replacing irq_stack_ptr with irq_stack_ptr_minus_eight (better name welcome). OTOH, there may be all kinds of odd microarchitectural considerations in play that affect performance by a few cycles here. Reported-by: Mike Galbraith <[email protected]> Reported-by: Josh Poimboeuf <[email protected]> Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Josh Poimboeuf <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/aae7e79e49914808440ad5310ace138ced2179ca.1499786555.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <[email protected]>
2017-07-18x86/entry/64: Refactor IRQ stacks and make them NMI-safeAndy Lutomirski3-26/+64
This will allow IRQ stacks to nest inside NMIs or similar entries that can happen during IRQ stack setup or teardown. The new macros won't work correctly if they're invoked with IRQs on. Add a check under CONFIG_DEBUG_ENTRY to detect that. Signed-off-by: Andy Lutomirski <[email protected]> [ Use %r10 instead of %r11 in xen_do_hypervisor_callback to make objtool and ORC unwinder's lives a little easier. ] Signed-off-by: Josh Poimboeuf <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/b0b2ff5fb97d2da2e1d7e1f380190c92545c8bb5.1499786555.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <[email protected]>
2017-07-18x86/mm, KVM: Fix warning when !CONFIG_PREEMPT_COUNTRoman Kagan1-1/+1
A recent commit: d6e41f1151fe ("x86/mm, KVM: Teach KVM's VMX code that CR3 isn't a constant") introduced a VM_WARN_ON(!in_atomic()) which generates false positives on every VM entry on !CONFIG_PREEMPT_COUNT kernels. Replace it with a test for preemptible(), which appears to match the original intent and works across different CONFIG_PREEMPT* variations. Signed-off-by: Roman Kagan <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Fixes: d6e41f1151fe ("x86/mm, KVM: Teach KVM's VMX code that CR3 isn't a constant") Signed-off-by: Ingo Molnar <[email protected]>
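A minimal sketch of the one-line change described above; the surrounding function body is omitted:

    /* Before: in_atomic() is constant on !CONFIG_PREEMPT_COUNT, so this warns on every VM entry. */
    VM_WARN_ON(!in_atomic());

    /* After: preemptible() expresses the original intent and behaves correctly
     * across all CONFIG_PREEMPT* variants. */
    VM_WARN_ON(preemptible());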
2017-07-17x86/hyper-v: stash the max number of virtual/logical processorVitaly Kuznetsov2-3/+11
The maximum number of virtual processors will be needed for 'extended' hypercalls supporting more than 64 vCPUs. While at it, unify on the 'Hyper-V' spelling in mshyperv.c, as we currently have a mix, and report the acquired misc features as well. Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Signed-off-by: K. Y. Srinivasan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2017-07-17x86/hyper-v: include hyperv/ only when CONFIG_HYPERV is setVitaly Kuznetsov2-2/+7
Code in arch/x86/hyperv/ is only needed when CONFIG_HYPERV is set; the 'basic' support and detection lives in arch/x86/kernel/cpu/mshyperv.c, which is included when CONFIG_HYPERVISOR_GUEST is set. Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Signed-off-by: K. Y. Srinivasan <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2017-07-16x86/platform/uv/BAU: Fix congested_response_us not taking effectJustin Ernst1-4/+2
Fix the BAU tunable congested_cycles not being set to the user-defined value. Instead of referencing a global variable when deciding on BAU shutdown, a node now references its own tunable value (cong_response_us). This results in the user-set tunable congested_response_us taking effect correctly. Signed-off-by: Justin Ernst <[email protected]> Acked-by: Andrew Banman <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-07-16x86/cpu: Use indirect call to measure performance in init_amd_k6()Mikulas Patocka1-0/+1
This old piece of code is supposed to measure the performance of indirect calls to determine if the processor is buggy or not, however the compiler optimizer turns it into a direct call. Use the OPTIMIZER_HIDE_VAR() macro to thwart the optimization, so that a real indirect call is generated. Signed-off-by: Mikulas Patocka <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1707110737530.8746@file01.intranet.prod.int.rdu2.redhat.com Signed-off-by: Ingo Molnar <[email protected]>
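A hedged sketch of the measurement loop with the optimization barrier applied (variable names approximate the ones in init_amd_k6(); vide() is the tiny asm stub defined earlier in that file):

        void (*f_vide)(void);
        unsigned long d, d2;
        int n = 1000000;

        f_vide = vide;
        OPTIMIZER_HIDE_VAR(f_vide);  /* keep the compiler from proving the call target */
        d = rdtsc();
        while (n--)
            f_vide();                /* now a genuine indirect call */
        d2 = rdtsc();
        d = d2 - d;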
2017-07-15Merge branch 'work.uaccess-unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-3/+0
Pull uaccess-unaligned removal from Al Viro: "That stuff had just one user, and an exotic one, at that - binfmt_flat on arm and m68k" * 'work.uaccess-unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: kill {__,}{get,put}_user_unaligned() binfmt_flat: flat_{get,put}_addr_from_rp() should be able to fail
2017-07-15Merge branch 'for-linus-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/umlLinus Torvalds4-21/+29
Pull UML updates from Richard Weinberger: "Mostly fixes for UML: - First round of fixes for PTRACE_GETRESET/SETREGSET - A printf vs printk cleanup - Minor improvements" * 'for-linus-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: um: Correctly check for PTRACE_GETRESET/SETREGSET um: v2: Use generic NOTES macro um: Add kerneldoc for userspace_tramp() and start_userspace() um: Add kerneldoc for segv_handler um: stub-data.h: remove superfluous include um: userspace - be more verbose in ptrace set regs error um: add dummy ioremap and iounmap functions um: Allow building and running on older hosts um: Avoid longjmp/setjmp symbol clashes with libpthread.a um: console: Ignore console= option um: Use os_warn to print out pre-boot warning/error messages um: Add os_warn() for pre-boot warning/error messages um: Use os_info for the messages on normal path um: Add os_info() for pre-boot information messages um: Use printk instead of printf in make_uml_dir
2017-07-15Merge tag 'kvm-4.13-2' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds14-172/+320
Pull more KVM updates from Radim Krčmář: "Second batch of KVM updates for v4.13 Common: - add uevents for VM creation/destruction - annotate and properly access RCU-protected objects s390: - rename IOCTL added in the first v4.13 merge x86: - emulate VMLOAD VMSAVE feature in SVM - support paravirtual asynchronous page fault while nested - add Hyper-V userspace interfaces for better migration - improve master clock corner cases - extend internal error reporting after EPT misconfig - correct single-stepping of emulated instructions in SVM - handle MCE during VM entry - fix nVMX VM entry checks and nVMX VMCS shadowing" * tag 'kvm-4.13-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (28 commits) kvm: x86: hyperv: make VP_INDEX managed by userspace KVM: async_pf: Let guest support delivery of async_pf from guest mode KVM: async_pf: Force a nested vmexit if the injected #PF is async_pf KVM: async_pf: Add L1 guest async_pf #PF vmexit handler KVM: x86: Simplify kvm_x86_ops->queue_exception parameter list kvm: x86: hyperv: add KVM_CAP_HYPERV_SYNIC2 KVM: x86: make backwards_tsc_observed a per-VM variable KVM: trigger uevents when creating or destroying a VM KVM: SVM: Enable Virtual VMLOAD VMSAVE feature KVM: SVM: Add Virtual VMLOAD VMSAVE feature definition KVM: SVM: Rename lbr_ctl field in the vmcb control area KVM: SVM: Prepare for new bit definition in lbr_ctl KVM: SVM: handle singlestep exception when skipping emulated instructions KVM: x86: take slots_lock in kvm_free_pit KVM: s390: Fix KVM_S390_GET_CMMA_BITS ioctl definition kvm: vmx: Properly handle machine check during VM-entry KVM: x86: update master clock before computing kvmclock_offset kvm: nVMX: Shadow "high" parts of shadowed 64-bit VMCS fields kvm: nVMX: Fix nested_vmx_check_msr_bitmap_controls kvm: nVMX: Validate the I/O bitmaps on nested VM-entry ...
2017-07-14Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds1-1/+1
Pull crypto fixes from Herbert Xu: - fix new compiler warnings in cavium - set post-op IV properly in caam (this fixes chaining) - fix potential use-after-free in atmel in case of EBUSY - fix sleeping in softirq path in chcr - disable buggy sha1-avx2 driver (may overread and page fault) - fix use-after-free on signals in caam * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: cavium - make several functions static crypto: chcr - Avoid algo allocation in softirq. crypto: caam - properly set IV after {en,de}crypt crypto: atmel - only treat EBUSY as transient if backlog crypto: af_alg - Avoid sock_graft call warning crypto: caam - fix signals handling crypto: sha1-ssse3 - Disable avx2
2017-07-14kvm: x86: hyperv: make VP_INDEX managed by userspaceRoman Kagan4-19/+40
Hyper-V identifies vCPUs by Virtual Processor Index, which can be queried via HV_X64_MSR_VP_INDEX msr. It is defined by the spec as a sequential number which can't exceed the maximum number of vCPUs per VM. APIC ids can be sparse and thus aren't a valid replacement for VP indices. Current KVM uses its internal vcpu index as VP_INDEX. However, to make it predictable and persistent across VM migrations, the userspace has to control the value of VP_INDEX. This patch achieves that, by storing vp_index explicitly on vcpu, and allowing HV_X64_MSR_VP_INDEX to be set from the host side. For compatibility it's initialized to KVM vcpu index. Also a few variables are renamed to make a clear distinction between this Hyper-V vp_index and KVM vcpu_id (== APIC id). Besides, a new capability, KVM_CAP_HYPERV_VP_INDEX, is added to allow the userspace to skip attempting msr writes where unsupported, to avoid spamming error logs. Signed-off-by: Roman Kagan <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2017-07-14KVM: async_pf: Let guest support delivery of async_pf from guest modeWanpeng Li6-5/+13
Adds another flag bit (bit 2) to MSR_KVM_ASYNC_PF_EN. If bit 2 is 1, async page faults are delivered to L1 as #PF vmexits; if bit 2 is 0, kvm_can_do_async_pf returns 0 if in guest mode. This is similar to what svm.c wanted to do all along, but it is only enabled for Linux as L1 hypervisor. Foreign hypervisors must never receive async page faults as vmexits, because they'd probably be very confused about that. Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
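A hedged sketch of the new bit and the host-side check it gates (macro and field names follow the existing MSR_KVM_ASYNC_PF_EN flag naming and are meant as an illustration):

    #define KVM_ASYNC_PF_ENABLED                  (1 << 0)
    #define KVM_ASYNC_PF_SEND_ALWAYS              (1 << 1)
    #define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT    (1 << 2)   /* new in this change */

        /* In kvm_can_do_async_pf(), sketch: without the opt-in bit, refuse
         * async #PF delivery while the vCPU is running in guest (L2) mode. */
        if (is_guest_mode(vcpu) && !vcpu->arch.apf.delivery_as_pf_vmexit)
            return false;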
2017-07-14KVM: async_pf: Force a nested vmexit if the injected #PF is async_pfWanpeng Li5-10/+35
Add a nested_apf field to vcpu->arch.exception to identify an async page fault, and construct the expected VM-exit information fields. Force a nested VM exit from nested_vmx_check_exception() if the injected #PF is an async page fault. Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2017-07-14KVM: async_pf: Add L1 guest async_pf #PF vmexit handlerWanpeng Li5-37/+51
This patch adds the L1 guest async page fault #PF vmexit handler; such a #PF is handled by L1 in the same way as an ordinary async page fault. Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> [Passed insn parameters to kvm_mmu_page_fault().] Signed-off-by: Radim Krčmář <[email protected]>
2017-07-14KVM: x86: Simplify kvm_x86_ops->queue_exception parameter listWanpeng Li4-13/+12
This patch removes all arguments except the first in kvm_x86_ops->queue_exception, since the callbacks can extract the remaining arguments from vcpu->arch.exception themselves. Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
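A hedged sketch of the interface change: the callback keeps only the vcpu argument and reads the exception details from vcpu->arch.exception.

    /* Before (sketch): every argument duplicated state already held in vcpu->arch.exception. */
    void (*queue_exception)(struct kvm_vcpu *vcpu, unsigned nr,
                            bool has_error_code, u32 error_code, bool reinject);

    /* After: only the vcpu is passed; the implementation pulls the details out itself. */
    void (*queue_exception)(struct kvm_vcpu *vcpu);

    static void vmx_queue_exception(struct kvm_vcpu *vcpu)
    {
        unsigned nr = vcpu->arch.exception.nr;
        bool has_error_code = vcpu->arch.exception.has_error_code;
        u32 error_code = vcpu->arch.exception.error_code;
        bool reinject = vcpu->arch.exception.reinject;
        /* ... program the VM-entry interruption-information field from these ... */
    }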
2017-07-13kvm: x86: hyperv: add KVM_CAP_HYPERV_SYNIC2Roman Kagan4-6/+17
There is a flaw in the Hyper-V SynIC implementation in KVM: when message page or event flags page is enabled by setting the corresponding msr, KVM zeroes it out. This is problematic because on migration the corresponding MSRs are loaded on the destination, so the content of those pages is lost. This went unnoticed so far because the only user of those pages was in-KVM hyperv synic timers, which could continue working despite that zeroing. Newer QEMU uses those pages for Hyper-V VMBus implementation, and zeroing them breaks the migration. Besides, in newer QEMU the content of those pages is fully managed by QEMU, so zeroing them is undesirable even when writing the MSRs from the guest side. To support this new scheme, introduce a new capability, KVM_CAP_HYPERV_SYNIC2, which, when enabled, makes sure that the synic pages aren't zeroed out in KVM. Signed-off-by: Roman Kagan <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2017-07-13KVM: x86: make backwards_tsc_observed a per-VM variableLadi Prosek2-4/+3
The backwards_tsc_observed global introduced in commit 16a9602 is never reset to false. If a VM happens to be running while the host is suspended (a common source of the TSC jumping backwards), master clock will never be enabled again for any VM. In contrast, if no VM is running while the host is suspended, master clock is unaffected. This is inconsistent and unnecessarily strict. Let's track the backwards_tsc_observed variable separately and let each VM start with a clean slate. Real world impact: My Windows VMs get slower after my laptop undergoes a suspend/resume cycle. The only way to get the perf back is unloading and reloading the kvm module. Signed-off-by: Ladi Prosek <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
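A minimal sketch of the data-structure change described above; field placement in struct kvm_arch is illustrative:

    /* Before: a single global flag that is never reset. */
    static bool backwards_tsc_observed;

    /* After (sketch): per-VM state, so each new VM starts with a clean slate. */
    struct kvm_arch {
        /* ... */
        bool backwards_tsc_observed;
    };

        /* The master-clock decision now consults the per-VM flag: */
        ka->use_master_clock = host_tsc_clocksource && vcpus_matched &&
                               !ka->backwards_tsc_observed &&
                               !ka->boot_vcpu_runs_old_kvmclock;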
2017-07-12x86/efi: move asmlinkage before return typeJoe Perches1-2/+2
Make the code like the rest of the kernel. Link: http://lkml.kernel.org/r/1cd3d401626e51ea0e2333a860e76e80bc560a4c.1499284835.git.joe@perches.com Signed-off-by: Joe Perches <[email protected]> Cc: Matt Fleming <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-12x86/mmap: properly account for stack randomization in mmap_baseRik van Riel1-1/+6
When RLIMIT_STACK is, for example, 256MB, the current code results in a gap between the top of the task and mmap_base of 256MB, failing to take into account the amount by which the stack address was randomized. In other words, the stack gets less than RLIMIT_STACK space. Ensure that the gap between the stack and mmap_base always takes stack randomization and the stack guard gap into account. Obtained from Daniel Micay's linux-hardened tree. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Daniel Micay <[email protected]> Signed-off-by: Rik van Riel <[email protected]> Reported-by: Florian Weimer <[email protected]> Acked-by: Ingo Molnar <[email protected]> Cc: Will Deacon <[email protected]> Cc: Daniel Micay <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Catalin Marinas <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
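A hedged sketch of the adjusted gap calculation (helper names such as stack_maxrandom_size() and stack_guard_gap reflect the mmap code of that period and are meant as an approximation, not the exact diff):

    static unsigned long mmap_base(unsigned long rnd, unsigned long task_size)
    {
        unsigned long gap = rlimit(RLIMIT_STACK);
        unsigned long pad = stack_maxrandom_size(task_size) + stack_guard_gap;
        unsigned long gap_min, gap_max;

        /* Reserve space for the randomized stack offset and the guard gap,
         * guarding against overflow of the addition. */
        if (gap + pad > gap)
            gap += pad;

        gap_min = SIZE_128M;
        gap_max = (task_size / 6) * 5;

        if (gap < gap_min)
            gap = gap_min;
        else if (gap > gap_max)
            gap = gap_max;

        return PAGE_ALIGN(task_size - gap - rnd);
    }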
2017-07-12x86: ascii armor the x86_64 boot init stack canaryRik van Riel1-0/+1
Use the ascii-armor canary to prevent unterminated C string overflows from being able to successfully overwrite the canary, even if they somehow obtain the canary value. Inspired by execshield ascii-armor and Daniel Micay's linux-hardened tree. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Rik van Riel <[email protected]> Acked-by: Kees Cook <[email protected]> Cc: Daniel Micay <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
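A hedged sketch of the ascii-armor step, assuming a CANARY_MASK that zeroes the leading byte (as used by the companion patches in this series); the rest of boot_init_stack_canary() is abbreviated:

    #define CANARY_MASK 0xffffffffffffff00UL  /* leading NUL byte terminates string copies */

        u64 canary;
        u64 tsc;

        get_random_bytes(&canary, sizeof(canary));
        tsc = rdtsc();
        canary += tsc + (tsc << 32UL);
        canary &= CANARY_MASK;                /* ascii-armor the canary */

        current->stack_canary = canary;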
2017-07-12include/linux/string.h: add the option of fortified string.h functionsDaniel Micay5-1/+23
This adds support for compiling with a rough equivalent to the glibc _FORTIFY_SOURCE=1 feature, providing compile-time and runtime buffer overflow checks for string.h functions when the compiler determines the size of the source or destination buffer at compile-time. Unlike glibc, it covers buffer reads in addition to writes. GNU C __builtin_*_chk intrinsics are avoided because they would force a much more complex implementation. They aren't designed to detect read overflows and offer no real benefit when using an implementation based on inline checks. Inline checks don't add up to much code size and allow full use of the regular string intrinsics while avoiding the need for a bunch of _chk functions and per-arch assembly to avoid wrapper overhead. This detects various overflows at compile-time in various drivers and some non-x86 core kernel code. There will likely be issues caught in regular use at runtime too. Future improvements left out of initial implementation for simplicity, as it's all quite optional and can be done incrementally: * Some of the fortified string functions (strncpy, strcat), don't yet place a limit on reads from the source based on __builtin_object_size of the source buffer. * Extending coverage to more string functions like strlcat. * It should be possible to optionally use __builtin_object_size(x, 1) for some functions (C strings) to detect intra-object overflows (like glibc's _FORTIFY_SOURCE=2), but for now this takes the conservative approach to avoid likely compatibility issues. * The compile-time checks should be made available via a separate config option which can be enabled by default (or always enabled) once enough time has passed to get the issues it catches fixed. Kees said: "This is great to have. While it was out-of-tree code, it would have blocked at least CVE-2016-3858 from being exploitable (improper size argument to strlcpy()). I've sent a number of fixes for out-of-bounds-reads that this detected upstream already" [[email protected]: x86: fix fortified memcpy] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: avoid panic() in favor of BUG()] Link: http://lkml.kernel.org/r/20170626235122.GA25261@beast [[email protected]: move from -mm, add ARCH_HAS_FORTIFY_SOURCE, tweak Kconfig help] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Daniel Micay <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Acked-by: Kees Cook <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Daniel Axtens <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Andy Shevchenko <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
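A heavily simplified, hedged sketch of the inline-check approach for a single function (the real fortified string.h covers many more functions and corner cases):

    void fortify_panic(const char *name) __noreturn __cold;

    extern __always_inline __attribute__((gnu_inline))
    char *strcpy(char *p, const char *q)
    {
        size_t p_size = __builtin_object_size(p, 0);
        size_t q_size = __builtin_object_size(q, 0);

        if (p_size == (size_t)-1 && q_size == (size_t)-1)
            return __builtin_strcpy(p, q);   /* sizes unknown at compile time */
        if (strscpy(p, q, p_size < q_size ? p_size : q_size) < 0)
            fortify_panic(__func__);         /* runtime overflow detected */
        return p;
    }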
2017-07-12kernel/watchdog: split up config optionsNicholas Piggin2-1/+2
Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR. LOCKUP_DETECTOR implies the general boot, sysctl, and programming interfaces for the lockup detectors. An architecture that wants to use a hard lockup detector must define HAVE_HARDLOCKUP_DETECTOR_PERF or HAVE_HARDLOCKUP_DETECTOR_ARCH. Alternatively an arch can define HAVE_NMI_WATCHDOG, which provides the minimum arch_touch_nmi_watchdog, and it otherwise does its own thing and does not implement the LOCKUP_DETECTOR interfaces. sparc is unusual in that it has started to implement some of the interfaces, but not fully yet. It should probably be converted to a full HAVE_HARDLOCKUP_DETECTOR_ARCH. [[email protected]: fix] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Don Zickus <[email protected]> Reviewed-by: Babu Moger <[email protected]> Tested-by: Babu Moger <[email protected]> [sparc] Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Michael Ellerman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-12kexec: move vmcoreinfo out of the kernel's .bss sectionXunlei Pang2-3/+3
As Eric said, "what we need to do is move the variable vmcoreinfo_note out of the kernel's .bss section. And modify the code to regenerate and keep this information in something like the control page. Definitely something like this needs a page all to itself, and ideally far away from any other kernel data structures. I clearly was not watching closely the data someone decided to keep this silly thing in the kernel's .bss section." This patch allocates extra pages for these vmcoreinfo_XXX variables, one advantage is that it enhances some safety of vmcoreinfo, because vmcoreinfo now is kept far away from other kernel data structures. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Xunlei Pang <[email protected]> Tested-by: Michael Holzheu <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Suggested-by: Eric Biederman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dave Young <[email protected]> Cc: Hari Bathini <[email protected]> Cc: Mahesh Salgaonkar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-12KVM: SVM: Enable Virtual VMLOAD VMSAVE featureJanakarajan Natarajan2-0/+25
Enable the Virtual VMLOAD VMSAVE feature. This is done by setting bit 1 at position B8h in the vmcb. The processor must have nested paging enabled, be in 64-bit mode and have support for the Virtual VMLOAD VMSAVE feature for the bit to be set in the vmcb. Signed-off-by: Janakarajan Natarajan <[email protected]> Reviewed-by: Paolo Bonzini <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
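A hedged sketch of the enable logic (mask, field, and feature names are approximations of the SVM code; guest_is_64bit() is a hypothetical stand-in for the actual 64-bit-mode check):

    #define VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK  (1 << 1)  /* bit 1 of the byte at VMCB offset 0xB8 */

        struct vmcb_control_area *ctrl = &svm->vmcb->control;

        ctrl->virt_ext &= ~VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
        if (npt_enabled &&                                        /* nested paging required */
            guest_is_64bit(&svm->vcpu) &&                         /* hypothetical helper */
            boot_cpu_has(X86_FEATURE_VIRTUAL_VMLOAD_VMSAVE))      /* CPU support */
            ctrl->virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;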
2017-07-12KVM: SVM: Add Virtual VMLOAD VMSAVE feature definitionJanakarajan Natarajan1-0/+1
Define a new cpufeature definition for Virtual VMLOAD VMSAVE. Signed-off-by: Janakarajan Natarajan <[email protected]> Reviewed-by: Paolo Bonzini <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2017-07-12KVM: SVM: Rename lbr_ctl field in the vmcb control areaJanakarajan Natarajan2-6/+6
Rename the lbr_ctl variable to better reflect the purpose of the field - provide support for virtualization extensions. Signed-off-by: Janakarajan Natarajan <[email protected]> Reviewed-by: Paolo Bonzini <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2017-07-12KVM: SVM: Prepare for new bit definition in lbr_ctlJanakarajan Natarajan2-2/+4
The lbr_ctl variable in the vmcb control area is used to enable or disable Last Branch Record (LBR) virtualization. However, this is to be done using only bit 0 of the variable. To correct this and to prepare for a new feature, change the current usage to work only on a particular bit. Signed-off-by: Janakarajan Natarajan <[email protected]> Reviewed-by: Paolo Bonzini <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2017-07-12KVM: SVM: handle singlestep exception when skipping emulated instructionsLadi Prosek1-26/+33
kvm_skip_emulated_instruction handles the singlestep debug exception which is something we almost always want. This commit (specifically the change in rdmsr_interception) makes the debug.flat KVM unit test pass on AMD. Two call sites still call skip_emulated_instruction directly: * In svm_queue_exception where it's used only for moving the rip forward * In task_switch_interception which is analogous to handle_task_switch in VMX Signed-off-by: Ladi Prosek <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
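A hedged sketch contrasting the two ways of finishing rdmsr_interception():

        /* Before: advance the rip but silently lose a pending single-step trap. */
        skip_emulated_instruction(&svm->vcpu);
        return 1;

        /* After: the common helper also raises the #DB for guest single-stepping. */
        return kvm_skip_emulated_instruction(&svm->vcpu);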
2017-07-12KVM: x86: take slots_lock in kvm_free_pitRadim Krčmář1-0/+2
kvm_vm_release() did not have slots_lock when calling kvm_io_bus_unregister_dev() and this went unnoticed until 4a12f9517728 ("KVM: mark kvm->busses as rcu protected") added dynamic checks. Luckily, there should be no race at that point: ============================= WARNING: suspicious RCU usage 4.12.0.kvm+ #0 Not tainted ----------------------------- ./include/linux/kvm_host.h:479 suspicious rcu_dereference_check() usage! lockdep_rcu_suspicious+0xc5/0x100 kvm_io_bus_unregister_dev+0x173/0x190 [kvm] kvm_free_pit+0x28/0x80 [kvm] kvm_arch_sync_events+0x2d/0x30 [kvm] kvm_put_kvm+0xa7/0x2a0 [kvm] kvm_vm_release+0x21/0x30 [kvm] Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
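A hedged sketch of the locking added to kvm_free_pit() (device-teardown details abbreviated):

    void kvm_free_pit(struct kvm *kvm)
    {
        struct kvm_pit *pit = kvm->arch.vpit;

        if (pit) {
            /* kvm_io_bus_unregister_dev() dereferences the RCU-protected
             * kvm->buses and expects slots_lock to be held. */
            mutex_lock(&kvm->slots_lock);
            kvm_io_bus_unregister_dev(kvm, KVM_PIO_BUS, &pit->dev);
            kvm_io_bus_unregister_dev(kvm, KVM_PIO_BUS, &pit->speaker_dev);
            mutex_unlock(&kvm->slots_lock);
            /* ... stop the timer, destroy the worker, free the pit ... */
        }
    }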
2017-07-12kvm: vmx: Properly handle machine check during VM-entryJim Mattson1-6/+9
vmx_complete_atomic_exit should call kvm_machine_check for any VM-entry failure due to a machine-check event. Such an exit should be recognized solely by its basic exit reason (i.e. the low 16 bits of the VMCS exit reason field). None of the other VMCS exit information fields contain valid information when the VM-exit is due to "VM-entry failure due to machine-check event". Signed-off-by: Jim Mattson <[email protected]> Reviewed-by: Xiao Guangrong <[email protected]> [Changed VM_EXIT_INTR_INFO condition to better describe its reason.] Signed-off-by: Radim Krčmář <[email protected]>
2017-07-12KVM: x86: update master clock before computing kvmclock_offsetRadim Krčmář1-1/+7
kvm master clock usually has a different frequency than the kernel boot clock. This is not a problem until the master clock is updated; update uses the current kernel boot clock to compute new kvm clock, which erases any kvm clock cycles that might have built up due to frequency difference over a long period. KVM_SET_CLOCK is one of places where we can safely update master clock as the guest-visible clock is going to be shifted anyway. The problem with current code is that it updates the kvm master clock after updating the offset. If the master clock was enabled before calling KVM_SET_CLOCK, then it might have built up a significant delta from kernel boot clock. In the worst case, the time set by userspace would be shifted by so much that it couldn't have been set at any point during KVM_SET_CLOCK. To fix this, move kvm_gen_update_masterclock() before computing kvmclock_offset, which means that the master clock and kernel boot clock will be sufficiently close together. Another solution would be to replace get_kvmclock_ns() with "ktime_get_boot_ns() + ka->kvmclock_offset", which is marginally more accurate, but would break symmetry with KVM_GET_CLOCK. Signed-off-by: Radim Krčmář <[email protected]>
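A hedged sketch of the reordered KVM_SET_CLOCK path (error handling trimmed):

        case KVM_SET_CLOCK: {
            struct kvm_clock_data user_ns;
            u64 now_ns;

            if (copy_from_user(&user_ns, argp, sizeof(user_ns)))
                return -EFAULT;
            if (user_ns.flags)
                return -EINVAL;

            /* Update the master clock FIRST, so the offset below is computed
             * against a freshly synchronized clock rather than a stale one. */
            kvm_gen_update_masterclock(kvm);
            now_ns = get_kvmclock_ns(kvm);
            kvm->arch.kvmclock_offset += user_ns.clock - now_ns;
            kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
            break;
        }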
2017-07-12kvm: nVMX: Shadow "high" parts of shadowed 64-bit VMCS fieldsJim Mattson1-26/+34
Inconsistencies result from shadowing only accesses to the full 64-bits of a 64-bit VMCS field, but not shadowing accesses to the high 32-bits of the field. The "high" part of a 64-bit field should be shadowed whenever the full 64-bit field is shadowed. Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2017-07-12kvm: nVMX: Fix nested_vmx_check_msr_bitmap_controlsJim Mattson1-11/+6
Allow the L1 guest to specify the last page of addressable guest physical memory for an L2 MSR permission bitmap. Also remove the vmcs12_read_any() check that should never fail. Fixes: 3af18d9c5fe95 ("KVM: nVMX: Prepare for using hardware MSR bitmap") Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2017-07-12kvm: nVMX: Validate the I/O bitmaps on nested VM-entryJim Mattson1-0/+16
According to the SDM, if the "use I/O bitmaps" VM-execution control is 1, bits 11:0 of each I/O-bitmap address must be 0. Neither address should set any bits beyond the processor's physical-address width. Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
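A hedged sketch of the new check (helper and field names approximate the nested VMX code):

    static int nested_vmx_check_io_bitmap_controls(struct kvm_vcpu *vcpu,
                                                   struct vmcs12 *vmcs12)
    {
        if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS))
            return 0;

        /* Each bitmap address must be 4K-aligned (bits 11:0 clear) and must
         * not exceed the guest's physical-address width. */
        if (!page_address_valid(vcpu, vmcs12->io_bitmap_a) ||
            !page_address_valid(vcpu, vmcs12->io_bitmap_b))
            return -EINVAL;

        return 0;
    }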
2017-07-12kvm: nVMX: Don't set vmcs12 to "launched" when VMLAUNCH failsJim Mattson1-2/+2
The VMCS launch state is not set to "launched" unless the VMLAUNCH actually succeeds. VMLAUNCH failure includes VM-exits with bit 31 set. Note that this change does not address the general problem that a failure to launch/resume vmcs02 (i.e. vmx->fail) is not handled correctly. Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2017-07-10binfmt_elf: use ELF_ET_DYN_BASE only for PIEKees Cook1-6/+7
The ELF_ET_DYN_BASE position was originally intended to keep loaders away from ET_EXEC binaries. (For example, running "/lib/ld-linux.so.2 /bin/cat" might cause the subsequent load of /bin/cat into where the loader had been loaded.) With the advent of PIE (ET_DYN binaries with an INTERP Program Header), ELF_ET_DYN_BASE continued to be used since the kernel was only looking at ET_DYN. However, since ELF_ET_DYN_BASE is traditionally set at the top 1/3rd of the TASK_SIZE, a substantial portion of the address space is unused. For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs are loaded above the mmap region. This means they can be made to collide (CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with pathological stack regions. Lowering ELF_ET_DYN_BASE solves both by moving programs below the mmap region in all cases, and will now additionally avoid programs falling back to the mmap region by enforcing MAP_FIXED for program loads (i.e. if it would have collided with the stack, now it will fail to load instead of falling back to the mmap region). To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP) are loaded into the mmap region, leaving space available for either an ET_EXEC binary with a fixed location or PIE being loaded into mmap by the loader. Only PIE programs are loaded offset from ELF_ET_DYN_BASE, which means architectures can now safely lower their values without risk of loaders colliding with their subsequently loaded programs. For 64-bit, ELF_ET_DYN_BASE is best set to 4GB to allow runtimes to use the entire 32-bit address space for 32-bit pointers. Thanks to PaX Team, Daniel Micay, and Rik van Riel for inspiration and suggestions on how to implement this solution. Fixes: d1fd836dcf00 ("mm: split ET_DYN ASLR from mmap ASLR") Link: http://lkml.kernel.org/r/20170621173201.GA114489@beast Signed-off-by: Kees Cook <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Daniel Micay <[email protected]> Cc: Qualys Security Advisory <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Dmitry Safonov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Grzegorz Andrejczuk <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: James Hogan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Pratyush Anand <[email protected]> Cc: Russell King <[email protected]> Cc: Will Deacon <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10x86/kasan: don't allocate extra shadow memoryAndrey Ryabinin1-6/+1
We used to read several bytes of shadow memory in advance, and therefore additional shadow memory was mapped to prevent a crash if a speculative load happened near the end of the mapped shadow memory. Now we don't have such speculative loads, so we no longer need to map additional shadow memory. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Ryabinin <[email protected]> Cc: Mark Rutland <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>