path: root/arch/x86
2018-08-06 kvm: x86: Add ability to skip TLB flush when switching CR3 (Junaid Shahid, 4 files, -6/+2)
Remove the implicit flush from the set_cr3 handlers, so that the callers are able to decide whether to flush the TLB or not. Signed-off-by: Junaid Shahid <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
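In sketch form, the shape this gives the handlers (names are illustrative stand-ins, not the real kvm_x86_ops hooks; kernel context and types assumed):

struct demo_vcpu {
	unsigned long hw_cr3;
};

static void demo_flush_tlb(struct demo_vcpu *vcpu)
{
	/* stand-in for INVVPID/INVEPT or a flushing CR3 reload */
}

static void demo_set_cr3(struct demo_vcpu *vcpu, unsigned long cr3,
			 bool flush_tlb)
{
	vcpu->hw_cr3 = cr3;		/* no implicit TLB flush anymore */
	if (flush_tlb)			/* the caller now decides */
		demo_flush_tlb(vcpu);
}

This is what lets the fast CR3 switch later in the series skip the flush entirely when the new root is known to still be in sync.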
2018-08-06 kvm: x86: Use fast CR3 switch for nested VMX (Junaid Shahid, 3 files, -9/+15)
Use the fast CR3 switch mechanism to locklessly change the MMU root page when switching between L1 and L2. The switch from L2 to L1 should always go through the fast path, while the switch from L1 to L2 should go through the fast path if L1's CR3/EPTP for L2 hasn't changed since the last time. Signed-off-by: Junaid Shahid <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 kvm: x86: Support resetting the MMU context without resetting roots (Junaid Shahid, 2 files, -17/+10)
This adds support for re-initializing the MMU context in a different mode while preserving the active root_hpa and the prev_root. Signed-off-by: Junaid Shahid <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 kvm: x86: Add support for fast CR3 switch across different MMU modes (Junaid Shahid, 1 file, -6/+15)
This generalizes the lockless CR3 switch path to be able to work across different MMU modes (e.g. nested vs non-nested) by checking that the expected page role of the new root page matches the page role of the previously stored root page in addition to checking that the new CR3 matches the previous CR3. Furthermore, instead of loading the hardware CR3 in fast_cr3_switch(), it is now done in vcpu_enter_guest(), as by that time the MMU context would be up-to-date with the VCPU mode. Signed-off-by: Junaid Shahid <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
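The matching rule reads roughly like this (a minimal sketch; demo_role only gestures at the real kvm_mmu_page_role union):

union demo_role {
	unsigned int word;	/* packs level, guest_mode, nxe, ... */
};

struct demo_cached_root {
	unsigned long cr3;
	unsigned long root_hpa;
	union demo_role role;
};

/* Reuse the cached root only if both the CR3 value and the page role
 * match; the hardware CR3 load itself is deferred to vcpu_enter_guest(),
 * where the MMU context is already up to date with the VCPU mode. */
static bool demo_fast_cr3_switch(struct demo_cached_root *prev,
				 unsigned long new_cr3,
				 union demo_role new_role)
{
	return prev->cr3 == new_cr3 && prev->role.word == new_role.word;
}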
2018-08-06 kvm: x86: Introduce KVM_REQ_LOAD_CR3 (Junaid Shahid, 4 files, -2/+11)
The KVM_REQ_LOAD_CR3 request loads the hardware CR3 using the current root_hpa. Signed-off-by: Junaid Shahid <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
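The request is consumed like any other KVM request bit; a simplified sketch of the handling (kvm_check_request() is the real helper, the loading helper below is assumed):

/* Checked once per guest entry; kernel context assumed. */
static void demo_service_requests(struct kvm_vcpu *vcpu)
{
	if (kvm_check_request(KVM_REQ_LOAD_CR3, vcpu))
		demo_load_hw_cr3(vcpu);	/* assumed helper: loads the current root_hpa */
}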
2018-08-06 kvm: x86: Introduce kvm_mmu_calc_root_page_role() (Junaid Shahid, 1 file, -27/+85)
These functions factor out the base role calculation from the corresponding kvm_init_*_mmu() functions. The new functions return what would be the role assigned to a root page in the current VCPU state. This can be masked with mmu_base_role_mask to derive the base role. Signed-off-by: Junaid Shahid <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 kvm: x86: Add fast CR3 switch code path (Junaid Shahid, 3 files, -8/+71)
When using shadow paging, a CR3 switch in the guest results in a VM Exit. In the common case, that VM exit doesn't require much processing by KVM. However, it does acquire the MMU lock, which can start showing signs of contention under some workloads even on a 2 VCPU VM when the guest is using KPTI. Therefore, we add a fast path that avoids acquiring the MMU lock in the most common cases, e.g. when switching back and forth between the kernel and user mode CR3s used by KPTI with no guest page table changes in between. For now, this fast path is implemented only for 64-bit guests and hosts to avoid the handling of PDPTEs, but it can be extended later to 32-bit guests and/or hosts as well. Signed-off-by: Junaid Shahid <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 kvm: x86: Avoid taking MMU lock in kvm_mmu_sync_roots if no sync is needed (Junaid Shahid, 1 file, -8/+67)
kvm_mmu_sync_roots() can locklessly check whether a sync is needed and just bail out if it isn't. Signed-off-by: Junaid Shahid <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
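A sketch of the lockless probe (the field names echo the real struct kvm_mmu_page, but the structure here is a stand-in):

struct demo_sp {
	bool unsync;
	unsigned int unsync_children;
};

/* If neither flag is set, kvm_mmu_sync_roots() can return without
 * taking the MMU lock; otherwise it falls back to the locked path.
 * READ_ONCE() pairs with the stores that mark pages unsync. */
static bool demo_root_needs_sync(struct demo_sp *root)
{
	return READ_ONCE(root->unsync) ||
	       READ_ONCE(root->unsync_children);
}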
2018-08-06 kvm: x86: Make sync_page() flush remote TLBs once only (Junaid Shahid, 2 files, -8/+20)
sync_page() calls set_spte() from a loop across a page table. It would work better if set_spte() left the TLB flushing to its callers, so that sync_page() can aggregate into a single call. Signed-off-by: Junaid Shahid <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
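The aggregation in miniature (all demo_ helpers are stand-ins for the real spte logic):

typedef unsigned long long u64_demo;

static bool demo_spte_needs_flush(u64_demo old, u64_demo new)
{
	return old != new;	/* stand-in for the real accessed/dirty rules */
}

static void demo_flush_remote_tlbs(void)
{
	/* IPI all vcpus, elided */
}

/* set_spte() now *reports* the need to flush instead of flushing. */
static bool demo_set_spte(u64_demo *sptep, u64_demo spte)
{
	bool flush = demo_spte_needs_flush(*sptep, spte);

	*sptep = spte;
	return flush;
}

static void demo_sync_page(u64_demo *sptes, const u64_demo *new, int n)
{
	bool flush = false;
	int i;

	for (i = 0; i < n; i++)
		flush |= demo_set_spte(&sptes[i], new[i]);
	if (flush)
		demo_flush_remote_tlbs();	/* once per page, not per spte */
}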
2018-08-06 KVM: MMU: drop vcpu param in gpte_access (Peter Xu, 1 file, -5/+5)
It's never used. Drop it. Signed-off-by: Peter Xu <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 KVM: nVMX: Separate logic allocating shadow vmcs to a function (Liran Alon, 1 file, -9/+28)
No functionality change. This is done as a preparation for VMCS shadowing virtualization. Signed-off-by: Liran Alon <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 KVM: VMX: Mark vmcs header as shadow in case alloc_vmcs_cpu() allocates shadow vmcs (Liran Alon, 1 file, -8/+8)
No functionality change. Signed-off-by: Liran Alon <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 KVM: nVMX: Expose VMCS shadowing to L1 guest (Liran Alon, 1 file, -0/+9)
Expose VMCS shadowing to L1 as a VMX capability of the virtual CPU, whether or not VMCS shadowing is supported by the physical CPU (VMCS shadowing emulation). Shadowed VMREADs and VMWRITEs from L2 are handled by L0, without a VM-exit to L1. Signed-off-by: Liran Alon <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 KVM: nVMX: Do not forward VMREAD/VMWRITE VMExits to L1 if so required by vmcs12 vmread/vmwrite bitmaps (Liran Alon, 1 file, -2/+31)
This is done as a preparation for VMCS shadowing emulation. Signed-off-by: Liran Alon <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
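The decision reduces to a bit test against the guest-provided bitmap; a sketch (the 15-bit index follows the SDM's VMREAD/VMWRITE bitmap definition; mapping the guest page into the host is elided):

/* bitmap_page: the 4K vmread or vmwrite bitmap from vmcs12, already
 * mapped into the host. A set bit for the field encoding means L1
 * wants to see the exit; a clear bit means L0 handles the access
 * itself. Kernel context assumed. */
static bool demo_forward_vmexit_to_l1(const unsigned long *bitmap_page,
				      unsigned long field)
{
	return test_bit(field & 0x7fff, bitmap_page);
}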
2018-08-06 KVM: nVMX: vmread/vmwrite: Use shadow vmcs12 if running L2 (Liran Alon, 1 file, -12/+49)
This is done as a preparation for VMCS shadowing emulation. Signed-off-by: Liran Alon <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 KVM: nVMX: include shadow vmcs12 in nested state (Paolo Bonzini, 1 file, -1/+30)
The shadow vmcs12 cannot be flushed on KVM_GET_NESTED_STATE, because at that point guest memory is assumed by userspace to be immutable. Capture the cache in vmx_get_nested_state, adding another page at the end if there is an active shadow vmcs12. Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 KVM: nVMX: Cache shadow vmcs12 on VMEntry and flush to memory on VMExit (Liran Alon, 1 file, -0/+74)
This is done as a preparation for VMCS shadowing emulation. Signed-off-by: Liran Alon <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 KVM: nVMX: Verify VMCS shadowing VMCS link pointer (Liran Alon, 1 file, -2/+28)
The Intel SDM considers these checks to be part of "Checks on Guest Non-Register State". Note that it is legal for vmcs->vmcs_link_pointer to be -1ull when VMCS shadowing is enabled. In this case, any VMREAD/VMWRITE to a shadowed field sets the ALU flags for VMfailInvalid (i.e. CF=1). Signed-off-by: Liran Alon <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
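As a sketch, the added validity rule (demo_link_ptr_backed() stands in for the full physical-address, alignment and revision-id checks):

static bool demo_link_ptr_backed(unsigned long long p)
{
	return (p & 0xfff) == 0;	/* alignment only, as a stand-in */
}

static bool demo_shadow_vmcs_link_ok(unsigned long long link_ptr,
				     bool shadowing)
{
	if (!shadowing)
		return true;	/* this patch adds nothing to check here */
	/* -1ull ("no linked VMCS") stays legal with shadowing enabled:
	 * shadowed VMREAD/VMWRITE then just yield VMfailInvalid. */
	if (link_ptr == -1ull)
		return true;
	return demo_link_ptr_backed(link_ptr);
}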
2018-08-06 KVM: nVMX: Verify VMCS shadowing controls (Liran Alon, 1 file, -0/+16)
Signed-off-by: Liran Alon <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 KVM: nVMX: Introduce nested_cpu_has_shadow_vmcs() (Liran Alon, 1 file, -1/+6)
Signed-off-by: Liran Alon <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 KVM: nVMX: Fail VMLAUNCH and VMRESUME on shadow VMCS (Liran Alon, 1 file, -0/+11)
Signed-off-by: Liran Alon <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 KVM: nVMX: Allow VMPTRLD for shadow VMCS if vCPU supports VMCS shadowing (Liran Alon, 1 file, -1/+8)
Signed-off-by: Liran Alon <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 KVM: VMX: Change vmcs12_{read,write}_any() to receive vmcs12 as parameter (Liran Alon, 1 file, -10/+11)
No functionality change. This is done as a preparation for VMCS shadowing emulation. Signed-off-by: Liran Alon <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 KVM: VMX: Create struct for VMCS header (Liran Alon, 1 file, -9/+15)
No functionality change. Signed-off-by: Liran Alon <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 kvm: nVMX: Introduce KVM_CAP_NESTED_STATE (Jim Mattson, 4 files, -2/+270)
For nested virtualization, L0 KVM manages a bit of state for L2 guests; this state cannot be captured through the currently available IOCTLs. In fact, the state captured through all of these IOCTLs is usually a mix of L1 and L2 state. It is also dependent on whether the L2 guest was running at the moment when the process was interrupted to save its state. With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM that is in VMX operation. Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Jim Mattson <[email protected]> [karahmed@ - rename structs and functions and make them ready for AMD and address previous comments. - handle nested.smm state. - rebase & a bit of refactoring. - Merge 7/8 and 8/8 into one patch. ] Signed-off-by: KarimAllah Ahmed <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
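A hypothetical userspace save path using the two new ioctls (the E2BIG size-probing convention shown is an assumption about the ioctl; check the KVM API documentation before relying on it):

#include <errno.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static struct kvm_nested_state *demo_save_nested(int vcpu_fd)
{
	struct kvm_nested_state probe = { .size = sizeof(probe) };
	struct kvm_nested_state *state;

	/* Probe with a header-sized buffer; a too-small buffer is
	 * assumed to fail with E2BIG and report the needed size. */
	if (ioctl(vcpu_fd, KVM_GET_NESTED_STATE, &probe) < 0 &&
	    errno != E2BIG)
		return NULL;

	state = calloc(1, probe.size);
	if (!state)
		return NULL;
	state->size = probe.size;
	if (ioctl(vcpu_fd, KVM_GET_NESTED_STATE, state) < 0) {
		free(state);
		return NULL;
	}
	return state;	/* restore later via KVM_SET_NESTED_STATE */
}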
2018-08-06 KVM: x86: do not load vmcs12 pages while still in SMM (Paolo Bonzini, 3 files, -18/+41)
If the vCPU enters system management mode while running a nested guest, RSM starts processing the vmentry while still in SMM. In that case, however, the pages pointed to by the vmcs12 might be incorrectly loaded from SMRAM. To avoid this, delay the handling of the pages until just before the next vmentry. This is done with a new request and a new entry in kvm_x86_ops, which we will be able to reuse for nested VMX state migration. Extracted from a patch by Jim Mattson and KarimAllah Ahmed. Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 KVM: x86: ensure all MSRs can always be KVM_GET/SET_MSR'd (Paolo Bonzini, 3 files, -14/+30)
Some of the MSRs returned by GET_MSR_INDEX_LIST currently cannot be sent back to KVM_GET_MSR and/or KVM_SET_MSR; either they can never be sent back, or they are only accepted under special conditions. This makes the API a pain to use. To avoid this pain, this patch makes it so that the result of the get-list ioctl can always be used for host-initiated get and set. Since we don't have a separate way to check for read-only MSRs, this means some Hyper-V MSRs are ignored when written. Arguably they should not even be in the result of GET_MSR_INDEX_LIST, but I am leaving them there in case userspace is using the outcome of GET_MSR_INDEX_LIST to derive the support for the corresponding Hyper-V feature. Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06 KVM: vmx: remove save/restore of host BNDCFGS MSR (Sean Christopherson, 1 file, -5/+6)
Linux does not support Memory Protection Extensions (MPX) in the kernel itself, thus the BNDCFGS (Bound Config Supervisor) MSR will always be zero in the KVM host, i.e. RDMSR in vmx_save_host_state() is superfluous. KVM unconditionally sets VM_EXIT_CLEAR_BNDCFGS, i.e. BNDCFGS will always be zero after VMEXIT, thus manually loading BNDCFGS is also superfluous. And in the event the MPX kernel support is added (unlikely given that MPX for userspace is in its death throes[1]), BNDCFGS will likely be common across all CPUs[2], and at the least shouldn't change on a regular basis, i.e. saving the MSR on every VMENTRY is completely unnecessary. WARN_ONCE in hardware_setup() if the host's BNDCFGS is non-zero to document that KVM does not preserve BNDCFGS and to serve as a hint as to how BNDCFGS likely should be handled if MPX is used in the kernel, e.g. BNDCFGS should be saved once during KVM setup. [1] https://lkml.org/lkml/2018/4/27/1046 [2] http://www.openwall.com/lists/kernel-hardening/2017/07/24/28 Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-05 x86: vdso: Use $LD instead of $CC to link (Alistair Strachan, 1 file, -13/+9)
The vdso{32,64}.so can fail to link with CC=clang when clang tries to find a suitable GCC toolchain to link these libraries with.
/usr/bin/ld: arch/x86/entry/vdso/vclock_gettime.o: access beyond end of merged section (782)
This happens because the host environment leaked into the cross compiler environment due to the way clang searches for suitable GCC toolchains. Clang is a retargetable compiler, and each invocation of it must provide --target=<something> --gcc-toolchain=<something> to allow it to find the correct binutils for cross compilation. These flags had been added to KBUILD_CFLAGS, but the vdso code uses CC and not KBUILD_CFLAGS (for various reasons), which breaks clang's ability to find the correct linker when cross compiling. Most of the time this goes unnoticed because the host linker is new enough to work anyway, or is incompatible and skipped, but this cannot be reliably assumed. This change alters the vdso makefile to just use LD directly, which bypasses clang and thus the searching problem. The makefile will just use ${CROSS_COMPILE}ld instead, which is always what we want. This matches the method used to link vmlinux. This drops references to DISABLE_LTO; this option doesn't seem to be set anywhere, and since its possible values are unknown, it is not clear how to convert it from a CC flag to an LD flag. Signed-off-by: Alistair Strachan <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Andi Kleen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2018-08-05 x86/irqflags: Provide a declaration for native_save_fl (Nick Desaulniers, 1 file, -0/+2)
It was reported that commit d0a8d9378d16 causes users of gcc < 4.9 to observe -Werror=missing-prototypes errors. Indeed, it seems that: extern inline unsigned long native_save_fl(void) { return 0; } compiled with -Werror=missing-prototypes produces this warning in gcc < 4.9, but not gcc >= 4.9. Fixes: d0a8d9378d16 ("x86/paravirt: Make native_save_fl() extern inline") Reported-by: David Laight <[email protected]> Reported-by: Jean Delvare <[email protected]> Signed-off-by: Nick Desaulniers <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
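The fix in miniature: a declaration ahead of the extern inline definition keeps old gcc quiet (the asm body below mirrors the kernel's native_save_fl(), reproduced from memory):

/* Declaration added by the patch... */
extern unsigned long native_save_fl(void);

/* ...so this pre-existing extern inline definition no longer trips
 * -Werror=missing-prototypes on gcc < 4.9. */
extern inline unsigned long native_save_fl(void)
{
	unsigned long flags;

	asm volatile("pushf ; pop %0"
		     : "=rm" (flags)
		     : /* no input */
		     : "memory");
	return flags;
}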
2018-08-05 x86/mm/init: Add helper for freeing kernel image pages (Dave Hansen, 3 files, -5/+15)
When chunks of the kernel image are freed, free_init_pages() is used directly. Consolidate the three sites that do this. Also update the string to give an incrementally better description of that memory versus what was there before. Signed-off-by: Dave Hansen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Kees Cook <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Andi Kleen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2018-08-05 x86/mm/init: Pass unconverted symbol addresses to free_init_pages() (Dave Hansen, 1 file, -6/+2)
The x86 code has several places where it frees parts of kernel image: 1. Unused SMP alternative 2. __init code 3. The hole between text and rodata 4. The hole between rodata and data We call free_init_pages() to do this. Strangely, we convert the symbol addresses to kernel direct map addresses in some cases (#3, #4) but not others (#1, #2). The virt_to_page() and the other code in free_reserved_area() now works fine for symbol addresses on x86, so don't bother converting the addresses to direct map addresses before freeing them. Signed-off-by: Dave Hansen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Kees Cook <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Andi Kleen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2018-08-05 x86/mm/pti: Clear Global bit more aggressively (Dave Hansen, 2 files, -10/+30)
The kernel image starts out with the Global bit set across the entire kernel image. The bit is cleared with set_memory_nonglobal() in the configurations with PCIDs where the performance benefits of the Global bit are not needed. However, this is fragile. It means that we are stuck opting *out* of the less-secure (Global bit set) configuration, which seems backwards. Let's start more secure (Global bit clear) and then let things opt back in if they want performance, or are truly mapping common data between kernel and userspace.
This fixes a bug. Before this patch, there are areas that are unmapped from the user page tables (like everything above 0xffffffff82600000 in the example below). These have the hallmark of being a wrong Global area: they are not identical in the 'current_kernel' and 'current_user' page table dumps. They are also read-write, which means they're much more likely to contain secrets.
Before this patch:
current_kernel:---[ High Kernel Mapping ]---
current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd
current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K RW GLB NX pte
current_kernel-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE GLB NX pmd
current_kernel-0xffffffff82c00000-0xffffffff82e00000 2M RW GLB NX pte
current_kernel-0xffffffff82e00000-0xffffffff83200000 4M RW PSE GLB NX pmd
current_kernel-0xffffffff83200000-0xffffffffa0000000 462M pmd
current_user:---[ High Kernel Mapping ]---
current_user-0xffffffff80000000-0xffffffff81000000 16M pmd
current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_user-0xffffffff81e11000-0xffffffff82000000 1980K RW GLB NX pte
current_user-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd
After this patch:
current_kernel:---[ High Kernel Mapping ]---
current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd
current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte
current_kernel-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE NX pmd
current_kernel-0xffffffff82c00000-0xffffffff82e00000 2M RW NX pte
current_kernel-0xffffffff82e00000-0xffffffff83200000 4M RW PSE NX pmd
current_kernel-0xffffffff83200000-0xffffffffa0000000 462M pmd
current_user:---[ High Kernel Mapping ]---
current_user-0xffffffff80000000-0xffffffff81000000 16M pmd
current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_user-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte
current_user-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd
Fixes: 0f561fce4d69 ("x86/pti: Enable global pages for shared areas")
Reported-by: Hugh Dickins <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Kees Cook <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Andi Kleen <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
2018-08-05 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net (David S. Miller, 4 files, -34/+73)
Lots of overlapping changes, mostly trivial in nature. The mlxsw conflict was resolved using the example resolution at: https://github.com/jpirko/linux_mlxsw/blob/combined_queue/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c Signed-off-by: David S. Miller <[email protected]>
2018-08-05 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds, 1 file, -18/+55)
Pull x86 fix from Thomas Gleixner: "A single fix, which addresses boot failures on machines which do not report EBDA correctly, which can place the trampoline into reserved memory regions. Validating against E820 prevents that"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot/compressed/64: Validate trampoline placement against E820
2018-08-05 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds, 2 files, -4/+8)
Pull perf fixes from Thomas Gleixner: "A set of fixes for perf:
Kernel side:
- Fix the hardcoded index of extra PCI devices on Broadwell which caused a resource conflict and triggered warnings on CPU hotplug.
Tooling:
- Update the tools copy of several files, including perf_event.h, powerpc's asm/unistd.h (new io_pgetevents syscall), bpf.h and x86's memcpy_64.s (used in 'perf bench mem'), silencing the respective warnings during the perf tools build.
- Fix the build on the alpine:edge distro"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Fix hardcoded index of Broadwell extra PCI devices
  perf tools: Fix the build on the alpine:edge distro
  tools arch: Update arch/x86/lib/memcpy_64.S copy used in 'perf bench mem memcpy'
  tools headers uapi: Refresh linux/bpf.h copy
  tools headers powerpc: Update asm/unistd.h copy to pick new
  tools headers uapi: Update tools's copy of linux/perf_event.h
2018-08-05 KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry (Paolo Bonzini, 3 files, -3/+27)
When nested virtualization is in use, VMENTER operations from the nested hypervisor into the nested guest will always be processed by the bare metal hypervisor, and KVM's "conditional cache flushes" mode in particular does a flush on nested vmentry. Therefore, include the "skip L1D flush on vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting. Add the relevant Documentation. Signed-off-by: Paolo Bonzini <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2018-08-05 x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry (Paolo Bonzini, 4 files, -0/+13)
Bit 3 of ARCH_CAPABILITIES tells a hypervisor that L1D flush on vmentry is not needed. Add a new value to enum vmx_l1d_flush_state, which is used either if there is no L1TF bug at all, or if bit 3 is set in ARCH_CAPABILITIES. Signed-off-by: Paolo Bonzini <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
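As a feature test, the new gate looks like this (the bit value matches the ARCH_CAPABILITIES definition described above; the macro and helper names here are made up):

#define DEMO_ARCH_CAP_SKIP_L1DFL_VMENTRY	(1ULL << 3)

/* True when a guest hypervisor still has to flush L1D on VMENTER,
 * i.e. when the (virtual) CPU does not advertise bit 3. */
static bool demo_l1d_flush_needed(unsigned long long arch_capabilities)
{
	return !(arch_capabilities & DEMO_ARCH_CAP_SKIP_L1DFL_VMENTRY);
}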
2018-08-05 x86/speculation: Simplify sysfs report of VMX L1TF vulnerability (Paolo Bonzini, 1 file, -3/+9)
Three changes to the content of the sysfs file:
- If EPT is disabled, L1TF cannot be exploited even across threads on the same core, and SMT is irrelevant.
- If mitigation is completely disabled, and SMT is enabled, print "vulnerable" instead of "vulnerable, SMT vulnerable".
- Reorder the two parts so that the main vulnerability state comes first and the detail on SMT is second.
Signed-off-by: Paolo Bonzini <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2018-08-05 Merge 4.18-rc7 into master to pick up the KVM dependency (Thomas Gleixner, 60 files, -152/+478)
Signed-off-by: Thomas Gleixner <[email protected]>
2018-08-05 x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr() (Nicolai Stange, 1 file, -1/+0)
For VMEXITs caused by external interrupts, vmx_handle_external_intr() indirectly calls into the interrupt handlers through the host's IDT. It follows that these interrupts get accounted for in the kvm_cpu_l1tf_flush_l1d per-cpu flag. The subsequently executed vmx_l1d_flush() will thus be aware that some interrupts have happened and conduct a L1d flush anyway. Setting l1tf_flush_l1d from vmx_handle_external_intr() isn't needed anymore. Drop it. Signed-off-by: Nicolai Stange <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2018-08-05 x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d (Nicolai Stange, 3 files, -0/+5)
The last missing piece to having vmx_l1d_flush() take interrupts after VMEXIT into account is to set the kvm_cpu_l1tf_flush_l1d per-cpu flag on irq entry. Issue calls to kvm_set_cpu_l1tf_flush_l1d() from entering_irq(), ipi_entering_ack_irq(), smp_reschedule_interrupt() and uv_bau_message_interrupt(). Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Nicolai Stange <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
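The hook itself is tiny; after the patch, entering_irq() reads roughly (kernel context assumed, body abbreviated):

static inline void entering_irq(void)
{
	irq_enter();
	kvm_set_cpu_l1tf_flush_l1d();	/* note the interrupt for vmx_l1d_flush() */
}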
2018-08-05 x86: Don't include linux/irq.h from asm/hardirq.h (Nicolai Stange, 20 files, -2/+19)
The next patch in this series will have to make the definition of irq_cpustat_t available to entering_irq(). Inclusion of asm/hardirq.h into asm/apic.h would cause circular header dependencies like
asm/smp.h -> asm/apic.h -> asm/hardirq.h -> linux/irq.h -> linux/topology.h -> linux/smp.h -> asm/smp.h
or
linux/gfp.h -> linux/mmzone.h -> asm/mmzone.h -> asm/mmzone_64.h -> asm/smp.h -> asm/apic.h -> asm/hardirq.h -> linux/irq.h -> linux/irqdesc.h -> linux/kobject.h -> linux/sysfs.h -> linux/kernfs.h -> linux/idr.h -> linux/gfp.h
and others. This causes compilation errors because the header guards become effective in the second inclusion: symbols/macros that had been defined before wouldn't be available to intermediate headers in the #include chain anymore. A possible workaround would be to move the definition of irq_cpustat_t into its own header and include that from both asm/hardirq.h and asm/apic.h. However, this wouldn't solve the real problem, namely asm/hardirq.h unnecessarily pulling in all the linux/irq.h cruft: nothing in asm/hardirq.h itself requires it. Also, note that some other archs, e.g. arm64, don't have that #include in their asm/hardirq.h. Remove the linux/irq.h #include from x86's asm/hardirq.h. Fix resulting compilation errors by adding appropriate #includes to *.c files as needed. Note that some of these *.c files could be cleaned up a bit wrt. their set of #includes, but that should better be done from separate patches, if at all. Signed-off-by: Nicolai Stange <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2018-08-05 x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d (Nicolai Stange, 2 files, -4/+36)
Part of the L1TF mitigation for vmx includes flushing the L1D cache upon VMENTRY. L1D flushes are costly and two modes of operations are provided to users: "always" and the more selective "conditional" mode. If operating in the latter, the cache would get flushed only if a host side code path considered unconfined had been traversed. "Unconfined" in this context means that it might have pulled in sensitive data like user data or kernel crypto keys. The need for L1D flushes is tracked by means of the per-vcpu flag l1tf_flush_l1d. KVM exit handlers considered unconfined set it. A vmx_l1d_flush() subsequently invoked before the next VMENTER will conduct an L1D flush based on its value and reset that flag again. Currently, interrupts delivered "normally" while in root operation between VMEXIT and VMENTER are not taken into account. Part of the reason is that these don't leave any traces and thus, the vmx code is unable to tell if any such has happened. As proposed by Paolo Bonzini, prepare for tracking all interrupts by introducing a new per-cpu flag, "kvm_cpu_l1tf_flush_l1d". It will be in strong analogy to the per-vcpu ->l1tf_flush_l1d. A later patch will make interrupt handlers set it. For the sake of cache locality, group kvm_cpu_l1tf_flush_l1d into x86' per-cpu irq_cpustat_t as suggested by Peter Zijlstra. Provide the helpers kvm_set_cpu_l1tf_flush_l1d(), kvm_clear_cpu_l1tf_flush_l1d() and kvm_get_cpu_l1tf_flush_l1d(). Make them trivial or non-existent for !CONFIG_KVM_INTEL, as appropriate. Let vmx_l1d_flush() handle kvm_cpu_l1tf_flush_l1d in the same way as l1tf_flush_l1d. Suggested-by: Paolo Bonzini <[email protected]> Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Nicolai Stange <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Paolo Bonzini <[email protected]>
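A simplified model of the per-cpu flag and its helpers (the real flag lives inside irq_cpustat_t and is accessed via irq_stat; kernel context assumed, names here are stand-ins):

static DEFINE_PER_CPU(unsigned char, demo_kvm_cpu_l1tf_flush_l1d);

static inline void demo_set_cpu_l1tf_flush_l1d(void)
{
	__this_cpu_write(demo_kvm_cpu_l1tf_flush_l1d, 1);
}

static inline void demo_clear_cpu_l1tf_flush_l1d(void)
{
	__this_cpu_write(demo_kvm_cpu_l1tf_flush_l1d, 0);
}

static inline bool demo_get_cpu_l1tf_flush_l1d(void)
{
	return __this_cpu_read(demo_kvm_cpu_l1tf_flush_l1d);
}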
2018-08-05 x86/irq: Demote irq_cpustat_t::__softirq_pending to u16 (Nicolai Stange, 1 file, -1/+1)
An upcoming patch will extend KVM's L1TF mitigation in conditional mode to also cover interrupts after VMEXITs. For tracking those, stores to a new per-cpu flag from interrupt handlers will become necessary. In order to improve cache locality, this new flag will be added to x86's irq_cpustat_t. Make some space available there by shrinking the ->softirq_pending bitfield from 32 to 16 bits: the number of bits actually used is only NR_SOFTIRQS, i.e. 10. Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Nicolai Stange <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Paolo Bonzini <[email protected]>
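The resulting layout, abbreviated (the real irq_cpustat_t carries more statistics members; only the two relevant here are shown, and the type is renamed to mark it as a sketch):

typedef struct {
	u16 __softirq_pending;		/* shrunk from u32; only 10 bits used */
	u8  kvm_cpu_l1tf_flush_l1d;	/* added by the follow-up patch */
	/* ... remaining members elided ... */
} ____cacheline_aligned demo_irq_cpustat_t;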
2018-08-05 x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush() (Nicolai Stange, 1 file, -4/+6)
Currently, vmx_vcpu_run() checks if l1tf_flush_l1d is set and invokes vmx_l1d_flush() if so. This test is unnecessary for the "always flush L1D" mode. Move the check to vmx_l1d_flush()'s conditional mode code path.
Notes:
- vmx_l1d_flush() is likely to get inlined anyway and thus, there's no extra function call.
- This inverts the (static) branch prediction, but there hadn't been any explicit likely()/unlikely() annotations before and so it stays as is.
Signed-off-by: Nicolai Stange <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
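After the move, the function reads roughly like this (a simplified sketch; vmx_l1d_flush_cond is the static key introduced earlier in the series, the demo_ names are stand-ins):

struct demo_vcpu_vmx {
	bool l1tf_flush_l1d;
};

static void demo_do_l1d_flush(void)
{
	/* MSR write or cache-fill loop, elided */
}

static void demo_vmx_l1d_flush(struct demo_vcpu_vmx *vmx)
{
	/* Only conditional mode consults the flag; "always" mode falls
	 * straight through to the flush, with no test on the run path. */
	if (static_branch_likely(&vmx_l1d_flush_cond)) {
		if (!vmx->l1tf_flush_l1d)
			return;		/* nothing unconfined ran */
		vmx->l1tf_flush_l1d = false;
	}
	demo_do_l1d_flush();
}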
2018-08-05 x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond' (Nicolai Stange, 1 file, -5/+5)
The vmx_l1d_flush_always static key is only ever evaluated if vmx_l1d_should_flush is enabled. In that case however, there are only two L1d flushing modes possible: "always" and "conditional". The "conditional" mode's implementation tends to require more sophisticated logic than the "always" mode. Avoid inverted logic by replacing the 'vmx_l1d_flush_always' static key with a 'vmx_l1d_flush_cond' one. There is no change in functionality. Signed-off-by: Nicolai Stange <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2018-08-05 x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush() (Nicolai Stange, 1 file, -7/+7)
vmx_l1d_flush() gets invoked only if l1tf_flush_l1d is true. There's no point in setting l1tf_flush_l1d to true from there again. Signed-off-by: Nicolai Stange <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2018-08-03 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm (Linus Torvalds, 1 file, -12/+10)
Pull KVM fixes from Paolo Bonzini: "Two vmx bugfixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: kvm: x86: vmx: fix vpid leak KVM: vmx: use local variable for current_vmptr when emulating VMPTRST
2018-08-03 x86/intel_rdt: Disable PMU access (Thomas Gleixner, 1 file, -1/+1)
Peter is objecting to the direct PMU access in RDT. Right now the PMU usage is broken anyway as it is not coordinated with perf. Until this discussion is settled, disable the PMU mechanics by simply rejecting the type '2' measurement in the resctrl file. Reported-by: Peter Zijlstra <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Reinette Chatre <[email protected]> Cc: Dave Hansen <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected]