path: root/arch/x86
Age | Commit message | Author | Files | Lines (+/-)
2020-11-15kvm: x86: Sink cpuid update into vendor-specific set_cr4 functionsJim Mattson4-8/+8
On emulated VM-entry and VM-exit, update the CPUID bits that reflect CR4.OSXSAVE and CR4.PKE. This fixes a bug where the CPUID bits could continue to reflect L2 CR4 values after emulated VM-exit to L1. It also fixes a related bug where the CPUID bits could continue to reflect L1 CR4 values after emulated VM-entry to L2. The latter bug is mainly relevant to SVM, wherein CPUID is not a required intercept. However, it could also be relevant to VMX, because the code to conditionally update these CPUID bits assumes that the guest CPUID and the guest CR4 are always in sync. Fixes: 8eb3f87d903168 ("KVM: nVMX: fix guest CR4 loading when emulating L2 to L1 exit") Fixes: 2acf923e38fb6a ("KVM: VMX: Enable XSAVE/XRSTOR for guest") Fixes: b9baba86148904 ("KVM, pkeys: expose CPUID/CR4 to guest") Reported-by: Abhiroop Dabral <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Reviewed-by: Ricardo Koller <[email protected]> Reviewed-by: Peter Shier <[email protected]> Cc: Haozhong Zhang <[email protected]> Cc: Dexuan Cui <[email protected]> Cc: Huaitong Han <[email protected]> Message-Id: <[email protected]>
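For reference, a minimal standalone sketch of the CR4-to-CPUID synchronization this kind of change relies on; the CR4 and CPUID bit positions are per the SDM, but the struct names are hypothetical stand-ins, not KVM's:

    #define X86_CR4_OSXSAVE (1ul << 18)
    #define X86_CR4_PKE     (1ul << 22)

    struct cpuid_leaf1 { unsigned int ecx; };   /* hypothetical stand-in */
    struct cpuid_leaf7 { unsigned int ecx; };   /* hypothetical stand-in */

    /* Recompute the CPUID bits that mirror CR4 after any change to the
     * guest's effective CR4, including emulated VM-entry and VM-exit. */
    static void sync_dynamic_cpuid_bits(unsigned long cr4,
                                        struct cpuid_leaf1 *l1,
                                        struct cpuid_leaf7 *l7)
    {
            /* CPUID.1:ECX.OSXSAVE[bit 27] mirrors CR4.OSXSAVE */
            if (cr4 & X86_CR4_OSXSAVE)
                    l1->ecx |= 1u << 27;
            else
                    l1->ecx &= ~(1u << 27);

            /* CPUID.7:ECX.OSPKE[bit 4] mirrors CR4.PKE */
            if (cr4 & X86_CR4_PKE)
                    l7->ecx |= 1u << 4;
            else
                    l7->ecx &= ~(1u << 4);
    }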
2020-11-15KVM: Don't allocate dirty bitmap if dirty ring is enabledPeter Xu1-1/+1
Because KVM dirty rings and the KVM dirty log are used in an exclusive way, let's avoid creating the dirty_bitmap when the KVM dirty ring is enabled. In the meantime, since the dirty_bitmap will now be created conditionally, we can't use it as a sign of "whether this memory slot has dirty tracking enabled". Change users like that to check against the KVM memory slot flags. Note that there can still be cases where a KVM memory slot ends up with its dirty_bitmap allocated: _if_ the memory slots are created before the dirty rings are enabled, and at the same time with the dirty tracking capability set, they'll still have the dirty_bitmap. However it should not hurt much (e.g., the bitmaps will always be freed if they are there), and real users normally won't trigger this because the dirty bit tracking flag should in most cases only be applied to KVM slots right before migration starts, which is far later than KVM initialization (VM start). Signed-off-by: Peter Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
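A tiny sketch of the check this implies; the flag value matches the UAPI KVM_MEM_LOG_DIRTY_PAGES, while the struct below is a simplified stand-in for KVM's memslot:

    #define KVM_MEM_LOG_DIRTY_PAGES (1u << 0)

    struct memslot {
            unsigned int flags;
            unsigned long *dirty_bitmap;   /* may be NULL when rings are used */
    };

    /* With dirty rings, dirty_bitmap may legitimately be NULL, so dirty
     * tracking must be detected from the slot flags, not the pointer. */
    static int slot_dirty_tracking_enabled(const struct memslot *slot)
    {
            /* before: return slot->dirty_bitmap != NULL;  (wrong with rings) */
            return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
    }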
2020-11-15KVM: X86: Implement ring-based dirty memory trackingPeter Xu7-2/+31
This patch is heavily based on previous work from Lei Cao <[email protected]> and Paolo Bonzini <[email protected]>. [1] KVM currently uses large bitmaps to track dirty memory. These bitmaps are copied to userspace when userspace queries KVM for its dirty page information. The use of bitmaps is mostly sufficient for live migration, as large parts of memory are dirtied from one log-dirty pass to another. However, in a checkpointing system, the number of dirty pages is small and in fact it is often bounded---the VM is paused when it has dirtied a pre-defined number of pages. Traversing a large, sparsely populated bitmap to find set bits is time-consuming, as is copying the bitmap to user-space. A similar issue exists for live migration when the guest memory is huge while only a trivial number of pages is dirtied. In that case, for each dirty sync we need to pull the whole dirty bitmap to userspace and analyse every bit even if it's mostly zeros. The preferred data structure for the above scenarios is a dense list of guest frame numbers (GFN). This patch series stores the dirty list in kernel memory that can be memory mapped into userspace to allow speedy harvesting. This patch enables the dirty ring for X86 only. However, it should be easy to extend it to other archs as well. [1] https://patchwork.kernel.org/patch/10471409/ Signed-off-by: Lei Cao <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Signed-off-by: Peter Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
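As an illustration of the userspace side, here is a sketch of harvesting such a ring; the entry layout and flag names below are simplified approximations of the KVM dirty-ring ABI, not the authoritative definitions:

    #include <stdint.h>

    struct dirty_gfn {
            uint32_t flags;     /* DIRTY set by KVM, RESET set by userspace */
            uint32_t slot;      /* memslot id */
            uint64_t offset;    /* gfn offset within the slot */
    };

    #define DIRTY_GFN_F_DIRTY (1u << 0)
    #define DIRTY_GFN_F_RESET (1u << 1)

    /* Walk the mmap()ed ring from the software tail, collect dirty GFNs and
     * hand the entries back; a later reset ioctl lets the kernel recycle them. */
    static unsigned int harvest_ring(struct dirty_gfn *ring, unsigned int size,
                                     unsigned int *tail)
    {
            unsigned int collected = 0;

            for (;;) {
                    struct dirty_gfn *e = &ring[*tail % size];

                    if (!(e->flags & DIRTY_GFN_F_DIRTY) ||
                        (e->flags & DIRTY_GFN_F_RESET))
                            break;
                    /* record (e->slot, e->offset) as dirty here ... */
                    e->flags |= DIRTY_GFN_F_RESET;
                    (*tail)++;
                    collected++;
            }
            return collected;
    }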
2020-11-15KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]Peter Xu4-56/+81
Originally, we have three code paths that can dirty a page without vcpu context for X86: - init_rmode_identity_map - init_rmode_tss - kvmgt_rw_gpa init_rmode_identity_map and init_rmode_tss will be set up on the destination VM no matter what (and the guest cannot even see them), so it does not make sense to track them at all. To do this, allow __x86_set_memory_region() to return the userspace address that was just allocated to the caller. Then in both of the functions we directly write to the userspace address instead of calling the kvm_write_*() APIs. Another trivial change is that we don't need to explicitly clear the identity page table root in init_rmode_identity_map(), because we'll write the whole page with 4M huge page entries no matter what. Suggested-by: Paolo Bonzini <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Peter Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-11-15KVM: x86: hyper-v: allow KVM_GET_SUPPORTED_HV_CPUID as a system ioctlVitaly Kuznetsov4-25/+33
KVM_GET_SUPPORTED_HV_CPUID is a vCPU ioctl, but its output is now independent of the vCPU and in some cases VMMs may want to use it as a system ioctl instead. In particular, QEMU does CPU feature expansion before any vCPU gets created, so KVM_GET_SUPPORTED_HV_CPUID can't be used. Convert KVM_GET_SUPPORTED_HV_CPUID to a 'dual' system/vCPU ioctl with the same meaning. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-11-15KVM: x86: emulate wait-for-SIPI and SIPI-VMExitYadong Qi3-14/+44
Background: We have a lightweight HV which needs INIT-VMExit and SIPI-VMExit to wake up APs for guests, since it does not monitor the Local APIC. But currently the virtual wait-for-SIPI (WFS) state is not supported in nVMX, so when running on top of KVM, the L1 HV cannot receive the INIT-VMExit and SIPI-VMExit, which means the L2 guest cannot wake up the APs. According to Intel SDM Chapter 25.2 Other Causes of VM Exits, SIPIs cause VM exits when a logical processor is in wait-for-SIPI state. In this patch: 1. introduce the SIPI exit reason, 2. introduce the wait-for-SIPI state for nVMX, 3. advertise wait-for-SIPI support to the guest. When the L1 hypervisor is not monitoring the Local APIC, L0 needs to emulate INIT-VMExit and SIPI-VMExit to L1 in order to emulate INIT-SIPI-SIPI for L2. An L2 LAPIC write is trapped by the L0 hypervisor (KVM); L0 should emulate the INIT/SIPI vmexit to the L1 hypervisor to set the proper state for L2's vCPU. Handling procedure: Source vCPU: L2 writes LAPIC.ICR(INIT). L0 traps the LAPIC.ICR write (INIT): inject a latched INIT event to the target vCPU. Target vCPU: L0 emulates an INIT VMExit to L1 if in guest mode. L1 sets up the guest VMCS, guest_activity_state=WAIT_SIPI, vmresume. L0 sets vcpu.mp_state to INIT_RECEIVED if (vmcs12.guest_activity_state == WAIT_SIPI). Source vCPU: L2 writes LAPIC.ICR(SIPI). L0 traps the LAPIC.ICR write (SIPI): inject a latched SIPI event to the target vCPU. Target vCPU: L0 emulates a SIPI VMExit to L1 if (vcpu.mp_state == INIT_RECEIVED). L1 sets CS:IP, guest_activity_state=ACTIVE, vmresume. L0 resumes to L2. L2 starts up. Signed-off-by: Yadong Qi <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
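The target-vCPU half of that procedure, condensed into a sketch; the helper names here (nested_vmexit(), the pending-event bits, the activity-state accessors) are illustrative placeholders rather than the actual KVM symbols:

    /* Sketch: deliver latched INIT/SIPI to L1 as VM exits when L1 does not
     * intercept LAPIC writes, so L1 can drive L2 through wait-for-SIPI. */
    static void handle_latched_init_sipi(struct vcpu *vcpu)
    {
            if (test_and_clear_bit(PENDING_INIT, &vcpu->apic_events)) {
                    if (vcpu->is_guest_mode)        /* currently running L2 */
                            nested_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL);
                    /* L1 sets guest_activity_state = WAIT_SIPI and vmresumes */
                    if (vmcs12_activity_state(vcpu) == ACTIVITY_WAIT_SIPI)
                            vcpu->mp_state = MP_STATE_INIT_RECEIVED;
            }

            if (test_and_clear_bit(PENDING_SIPI, &vcpu->apic_events) &&
                vcpu->mp_state == MP_STATE_INIT_RECEIVED) {
                    nested_vmexit(vcpu, EXIT_REASON_SIPI_SIGNAL);
                    /* L1 sets CS:IP, flips the activity state to ACTIVE and
                     * vmresumes; L0 then resumes L2 and the AP starts up. */
            }
    }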
2020-11-15KVM: x86: fix apic_accept_events vs check_nested_eventsPaolo Bonzini2-5/+27
vmx_apic_init_signal_blocked is buggy in that it returns true even in VMX non-root mode. In non-root mode, however, INITs are not latched; they just cause a vmexit. Previously, KVM was waiting for them to be processed when kvm_apic_accept_events was called, and in the meantime it ate the SIPIs that the processor received. However, in order to implement the wait-for-SIPI activity state, KVM will have to process KVM_APIC_SIPI in vmx_check_nested_events, and it will not be possible anymore to disregard SIPIs in non-root mode as the code is currently doing. By calling kvm_x86_ops.nested_ops->check_events, we can force a vmexit (with the side effect of latching INITs) before incorrectly injecting an INIT or SIPI in a guest, and therefore vmx_apic_init_signal_blocked can do the right thing. Signed-off-by: Paolo Bonzini <[email protected]>
2020-11-15KVM: x86: Return bool instead of int for CR4 and SREGS validity checksSean Christopherson4-19/+15
Rework the common CR4 and SREGS checks to return a bool instead of an int, i.e. true/false instead of 0/-EINVAL, and add "is" to the name to clarify the polarity of the return value (which is effectively inverted by this change). No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-11-15KVM: x86: Move vendor CR4 validity check to dedicated kvm_x86_ops hookSean Christopherson7-21/+34
Split out VMX's checks on CR4.VMXE to a dedicated hook, .is_valid_cr4(), and invoke the new hook from kvm_valid_cr4(). This fixes an issue where KVM_SET_SREGS would return success while failing to actually set CR4. Fixing the issue by explicitly checking kvm_x86_ops.set_cr4()'s return in __set_sregs() is not a viable option as KVM has already stuffed a variety of vCPU state. Note, kvm_valid_cr4() and is_valid_cr4() have different return types and inverted semantics. This will be remedied in a future patch. Fixes: 5e1746d6205d ("KVM: nVMX: Allow setting the VMXE bit in CR4") Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
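The shape of the resulting split, sketched with placeholder names (vendor_ops stands in for kvm_x86_ops, and the reserved-bits field is simplified):

    /* Validate first, commit second: the vendor hook only checks, so a
     * failure can be reported before any vCPU state has been modified. */
    static bool kvm_is_valid_cr4(struct vcpu *vcpu, unsigned long cr4)
    {
            if (cr4 & vcpu->cr4_guest_rsvd_bits)        /* common checks */
                    return false;
            return vendor_ops->is_valid_cr4(vcpu, cr4); /* e.g. VMX: CR4.VMXE */
    }

    static int kvm_set_cr4(struct vcpu *vcpu, unsigned long cr4)
    {
            if (!kvm_is_valid_cr4(vcpu, cr4))
                    return 1;                       /* reject, nothing touched */
            vendor_ops->set_cr4(vcpu, cr4);         /* now cannot fail */
            return 0;
    }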
2020-11-15KVM: SVM: Drop VMXE check from svm_set_cr4()Sean Christopherson1-3/+0
Drop svm_set_cr4()'s explicit check on CR4.VMXE now that common x86 handles the check by incorporating VMXE into the CR4 reserved bits, via kvm_cpu_caps. SVM obviously does not set X86_FEATURE_VMX. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-11-15KVM: VMX: Drop explicit 'nested' check from vmx_set_cr4()Sean Christopherson1-12/+7
Drop vmx_set_cr4()'s explicit check on the 'nested' module param now that common x86 handles the check by incorporating VMXE into the CR4 reserved bits, via kvm_cpu_caps. X86_FEATURE_VMX is set in kvm_cpu_caps (by vmx_set_cpu_caps()), if and only if 'nested' is true. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-11-15KVM: VMX: Drop guest CPUID check for VMXE in vmx_set_cr4()Sean Christopherson1-2/+3
Drop vmx_set_cr4()'s somewhat hidden guest_cpuid_has() check on VMXE now that common x86 handles the check by incorporating VMXE into the CR4 reserved bits, i.e. in cr4_guest_rsvd_bits. This fixes a bug where KVM incorrectly rejects KVM_SET_SREGS with CR4.VMXE=1 if it's executed before KVM_SET_CPUID{,2}. Fixes: 5e1746d6205d ("KVM: nVMX: Allow setting the VMXE bit in CR4") Reported-by: Stas Sergeev <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-11-15kvm: mmu: fix is_tdp_mmu_check when the TDP MMU is not in usePaolo Bonzini1-0/+7
In some cases where shadow paging is in use, the root page will be either mmu->pae_root or vcpu->arch.mmu->lm_root. Then it will not have an associated struct kvm_mmu_page, because it is allocated with alloc_page instead of kvm_mmu_alloc_page. Just return false quickly from is_tdp_mmu_root if the TDP MMU is not in use, which also includes the case where shadow paging is enabled. Signed-off-by: Paolo Bonzini <[email protected]>
2020-11-13livepatch: Use the default ftrace_ops instead of REGS when ARGS is availableSteven Rostedt (VMware)3-2/+9
When CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS is available, the ftrace call will be able to set the ip of the calling function. This will improve the performance of live kernel patching, which does not need all the regs to be stored just to change the instruction pointer. If all archs that support live kernel patching also support HAVE_DYNAMIC_FTRACE_WITH_ARGS, then the architecture-specific function klp_arch_set_pc() could be made generic. It is possible that an arch can support HAVE_DYNAMIC_FTRACE_WITH_ARGS but not HAVE_DYNAMIC_FTRACE_WITH_REGS and still have access to live patching. Cc: Josh Poimboeuf <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: [email protected] Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Miroslav Benes <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-11-13ftrace/x86: Allow for arguments to be passed in to ftrace_regs by defaultSteven Rostedt (VMware)3-2/+25
Currently, the only way to get access to the registers of a function via a ftrace callback is to set the "FL_SAVE_REGS" bit in the ftrace_ops. But as this saves all regs as if a breakpoint were to trigger (for use with kprobes), it is expensive. The regs are already saved on the stack for the default ftrace callbacks, as that is required; otherwise a function being traced would get the wrong arguments and possibly crash. And on x86, since the arguments are already stored where they would be in a pt_regs structure, so that the same code can serve both the regs version of a callback and the default one, it makes sense to always pass that information to all functions. If an architecture does this (as x86_64 now does), it sets HAVE_DYNAMIC_FTRACE_WITH_ARGS, which lets the generic code know that it can have access to arguments without having to set the flags. This also includes having the stack pointer saved, which could be used for accessing arguments on the stack, as well as allowing the function graph tracer to not require its own trampoline! Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-11-13ftrace: Have the callbacks receive a struct ftrace_regs instead of pt_regsSteven Rostedt (VMware)1-1/+2
In preparation to have arguments of a function passed to callbacks attached to functions by default, change the default callback prototype to receive a struct ftrace_regs as the fourth parameter instead of a pt_regs. Callbacks that set the FL_SAVE_REGS flag in their ftrace_ops will now need to get the pt_regs via the ftrace_get_regs() helper call. If the helper is called by a callback whose ftrace_ops did not have the FL_SAVE_REGS flag set, it will return NULL. This will allow ftrace_regs to hold just enough to get the parameters and stack pointer, without the worry that callbacks may receive a pt_regs that is not completely filled. Acked-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
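A sketch of what a callback looks like under the new prototype; the ops registration is omitted, and only ops that set FTRACE_OPS_FL_SAVE_REGS can expect the helper to return a full pt_regs:

    #include <linux/ftrace.h>

    static void my_trace_func(unsigned long ip, unsigned long parent_ip,
                              struct ftrace_ops *op, struct ftrace_regs *fregs)
    {
            struct pt_regs *regs = ftrace_get_regs(fregs);

            if (!regs)
                    return;     /* this ops did not request full pt_regs */

            /* full register state is available here (FL_SAVE_REGS was set) */
    }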
2020-11-13KVM: SVM: Update cr3_lm_rsvd_bits for AMD SEV guestsBabu Moger1-0/+8
For AMD SEV guests, update the cr3_lm_rsvd_bits to mask the memory encryption bit in reserved bits. Signed-off-by: Babu Moger <[email protected]> Message-Id: <160521948301.32054.5783800787423231162.stgit@bmoger-ubuntu> Signed-off-by: Paolo Bonzini <[email protected]>
2020-11-13KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_archBabu Moger3-1/+4
SEV guests fail to boot on a system that supports the PCID feature. While emulating the RSM instruction, KVM reads the guest CR3 and calls kvm_set_cr3(). If the vCPU is in long mode, kvm_set_cr3() does a sanity check on the CR3 value. In this case, it validates whether the value has any reserved bits set. The reserved bit range is 63:cpuid_maxphyaddr(). When AMD memory encryption is enabled, the memory encryption bit is set in the CR3 value. The memory encryption bit may fall within the KVM reserved bit range, causing the KVM emulation failure. Introduce a new field cr3_lm_rsvd_bits in kvm_vcpu_arch which will cache the reserved bits in the CR3 value. This will be initialized to rsvd_bits(cpuid_maxphyaddr(vcpu), 63). If the architecture has any special bits (like the AMD SEV encryption bit) that need to be masked from the reserved bits, they should be cleared in the vendor-specific kvm_x86_ops.vcpu_after_set_cpuid handler. Fixes: a780a3ea628268b2 ("KVM: X86: Fix reserved bits check for MOV to CR3") Signed-off-by: Babu Moger <[email protected]> Message-Id: <160521947657.32054.3264016688005356563.stgit@bmoger-ubuntu> Signed-off-by: Paolo Bonzini <[email protected]>
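In sketch form, the initialization and the vendor-side masking this describes; sev_cbit_position() is a hypothetical helper standing in for however the C-bit location (CPUID 0x8000001F:EBX[5:0]) is obtained:

    /* Common code: cache the long-mode CR3 reserved-bit mask once. */
    vcpu->arch.cr3_lm_rsvd_bits = rsvd_bits(cpuid_maxphyaddr(vcpu), 63);

    /* SVM's vcpu_after_set_cpuid hook, for an SEV guest: the C-bit is a
     * legal CR3 bit, so clear it from the cached reserved-bit mask. */
    if (sev_guest(vcpu->kvm)) {
            unsigned int cbit = sev_cbit_position(vcpu->kvm);   /* hypothetical helper */

            vcpu->arch.cr3_lm_rsvd_bits &= ~(1UL << cbit);
    }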
2020-11-13KVM: x86: clflushopt should be treated as a no-op by emulationDavid Edmondson1-1/+7
The instruction emulator ignores clflush instructions, yet fails to support clflushopt. Treat both similarly. Fixes: 13e457e0eebf ("KVM: x86: Emulator does not decode clflush well") Signed-off-by: David Edmondson <[email protected]> Message-Id: <[email protected]> Reviewed-by: Joao Martins <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-11-13x86/platform/uv: Fix copied UV5 output archtypeMike Travis1-3/+3
A test shows that the output contains a space: # cat /proc/sgi_uv/archtype NSGI4 U/UVX Remove that embedded space by copying the "trimmed" buffer instead of the untrimmed input character list. Use sizeof to remove size dependency on copy out length. Increase output buffer size by one character just in case BIOS sends an 8 character string for archtype. Fixes: 1e61f5a95f19 ("Add and decode Arch Type in UVsystab") Signed-off-by: Mike Travis <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Steve Wahl <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-11-11x86/fpu: Make kernel FPU protection RT friendlyThomas Gleixner1-2/+16
Non-RT kernels need to protect the FPU against preemption and bottom half processing. This is achieved by disabling bottom halves via local_bh_disable(), which implicitly disables preemption. On RT kernels this protection mechanism is not sufficient because local_bh_disable() does not disable preemption. It serializes bottom half related processing via a CPU local lock. As bottom halves always run in thread context on RT kernels, disabling preemption is the proper choice, as it implicitly prevents bottom half processing. Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
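The resulting protection, roughly; this mirrors the fpregs_lock()/fpregs_unlock() pattern the change introduces, but treat it as a sketch of the idea rather than the exact diff:

    static inline void fpregs_lock(void)
    {
            if (!IS_ENABLED(CONFIG_PREEMPT_RT))
                    local_bh_disable();     /* also disables preemption on !RT */
            else
                    preempt_disable();      /* sufficient on RT: BHs run in threads */
    }

    static inline void fpregs_unlock(void)
    {
            if (!IS_ENABLED(CONFIG_PREEMPT_RT))
                    local_bh_enable();
            else
                    preempt_enable();
    }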
2020-11-11x86/fpu: Simplify fpregs_[un]lock()Thomas Gleixner1-2/+3
There is no point in disabling preemption and then disabling bottom halves. Just disabling bottom halves is sufficient, as it implicitly disables preemption on !RT kernels. Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-11-11x86/platform/uv: Drop last traces of uv_flush_tlb_othersJiri Slaby1-10/+0
Commit 39297dde7390 ("x86/platform/uv: Remove UV BAU TLB Shootdown Handler") removed uv_flush_tlb_others. Its declaration was also removed from asm/uv/uv.h, but only for the CONFIG_X86_UV=y case. The inline definition (!X86_UV case) is still in place. So remove this implementation along with everything that was added to support uv_flush_tlb_others: * include of asm/tlbflush.h * forward declarations of struct cpumask, mm_struct, and flush_tlb_info Signed-off-by: Jiri Slaby <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Mike Travis <[email protected]> Acked-by: Steve Wahl <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-11-10powercap: Add AMD Fam17h RAPL supportVictor Ding1-0/+1
Enable AMD Fam17h RAPL support for the power capping framework. The support is as per AMD Fam17h Model31h (Zen2) and model 00-ffh (Zen1) PPR. Tested by comparing the results of following two sysfs entries and the values directly read from corresponding MSRs via /dev/cpu/[x]/msr: /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj /sys/class/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/energy_uj Signed-off-by: Victor Ding <[email protected]> Acked-by: Kim Phillips <[email protected]> [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2020-11-10x86/msr-index: sort AMD RAPL MSRs by addressVictor Ding1-1/+1
MSRs in the rest of this file are sorted by their addresses; fixing the two outliers. No functional changes. Signed-off-by: Victor Ding <[email protected]> Acked-by: Kim Phillips <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2020-11-10efi/x86: Free efi_pgd with free_pages()Arvind Sankar1-11/+13
Commit d9e9a6418065 ("x86/mm/pti: Allocate a separate user PGD") changed the PGD allocation to allocate PGD_ALLOCATION_ORDER pages, so in the error path it should be freed using free_pages() rather than free_page(). Commit 06ace26f4e6f ("x86/efi: Free efi_pgd with free_pages()") fixed one instance of this, but missed another. Move the freeing out-of-line to avoid code duplication and fix this bug. Fixes: d9e9a6418065 ("x86/mm/pti: Allocate a separate user PGD") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arvind Sankar <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]>
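The fixed error path boils down to freeing with the matching order; a sketch of such an out-of-line helper (the helper name here is illustrative):

    /* The PTI case allocates 1 << PGD_ALLOCATION_ORDER pages for the PGD,
     * so teardown must use free_pages() with the same order, not free_page(). */
    static void free_efi_pgd(pgd_t *efi_pgd)
    {
            free_pages((unsigned long)efi_pgd, PGD_ALLOCATION_ORDER);
    }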
2020-11-10x86/ioapic: Correct the PCI/ISA trigger type selectionThomas Gleixner1-2/+2
PCI's default trigger type is level and ISA's is edge. The recent refactoring made it the other way round, which went unnoticed as it seems only to cause havoc on some AMD systems. Make the comment and code do the right thing again. Fixes: a27dca645d2c ("x86/io_apic: Cleanup trigger/polarity helpers") Reported-by: Tom Lendacky <[email protected]> Reported-by: Borislav Petkov <[email protected]> Reported-by: Qian Cai <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Tom Lendacky <[email protected]> Cc: David Woodhouse <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-11-10perf/x86/intel/uncore: Fix Add BW copypastaArnd Bergmann1-1/+1
gcc -Wextra points out a duplicate initialization of one array member: arch/x86/events/intel/uncore_snb.c:478:37: warning: initialized field overwritten [-Woverride-init] 478 | [SNB_PCI_UNCORE_IMC_DATA_READS] = { SNB_UNCORE_PCI_IMC_DATA_WRITES_BASE, The only sensible explanation is that a duplicate 'READS' was used instead of the correct 'WRITES', so change it back. Fixes: 24633d901ea4 ("perf/x86/intel/uncore: Add BW counters for GT, IA and IO breakdown") Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-11-09Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds5-30/+86
Pull kvm fixes from Paolo Bonzini: "ARM: - fix compilation error when PMD and PUD are folded - fix regression in reads-as-zero behaviour of ID_AA64ZFR0_EL1 - add aarch64 get-reg-list test x86: - fix semantic conflict between two series merged for 5.10 - fix (and test) enforcement of paravirtual cpuid features selftests: - various cleanups to memory management selftests - new selftests testcase for performance of dirty logging" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (30 commits) KVM: selftests: allow two iterations of dirty_log_perf_test KVM: selftests: Introduce the dirty log perf test KVM: selftests: Make the number of vcpus global KVM: selftests: Make the per vcpu memory size global KVM: selftests: Drop pointless vm_create wrapper KVM: selftests: Add wrfract to common guest code KVM: selftests: Simplify demand_paging_test with timespec_diff_now KVM: selftests: Remove address rounding in guest code KVM: selftests: Factor code out of demand_paging_test KVM: selftests: Use a single binary for dirty/clear log test KVM: selftests: Always clear dirty bitmap after iteration KVM: selftests: Add blessed SVE registers to get-reg-list KVM: selftests: Add aarch64 get-reg-list test selftests: kvm: test enforcement of paravirtual cpuid features selftests: kvm: Add exception handling to selftests selftests: kvm: Clear uc so UCALL_NONE is being properly reported selftests: kvm: Fix the segment descriptor layout to match the actual layout KVM: x86: handle MSR_IA32_DEBUGCTLMSR with report_ignored_msrs kvm: x86: request masterclock update any time guest uses different msr kvm: x86: ensure pv_cpuid.features is initialized when enabling cap ...
2020-11-09perf/x86/intel: Make anythread filter support conditionalStephane Eranian4-2/+17
Starting with Arch Perfmon v5, the anythread filter on generic counters may be deprecated. The current kernel was exporting the any filter without checking. On Icelake, it means you could do cpu/event=0x3c,any/ even though the filter does not exist. This patch corrects the problem by relying on the CPUID 0xa leaf function to determine if anythread is supported or not as described in the Intel SDM Vol3b 18.2.5.1 AnyThread Deprecation section. Signed-off-by: Stephane Eranian <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
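A sketch of the detection step; per the SDM section cited above, CPUID leaf 0xA reports AnyThread deprecation in EDX bit 15, and the x86_pmu flag name used here is illustrative rather than the actual field:

    unsigned int eax, ebx, ecx, edx;

    cpuid(0xa, &eax, &ebx, &ecx, &edx);
    if (edx & (1u << 15))                       /* AnyThread deprecated */
            x86_pmu.anythread_deprecated = 1;   /* illustrative flag name */

    /* ...and only expose/accept the "any" event filter when the flag is 0. */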
2020-11-09perf/x86: Make dummy_iregs staticPeter Zijlstra1-1/+1
Having pt_regs on-stack is unfortunate, it's 168 bytes. Since it isn't actually used, make it a static variable. This both gets it off the stack and ensures it gets 0 initialized, just in case someone does look at it. Reported-by: Steven Rostedt <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-11-09perf/arch: Remove perf_sample_data::regs_user_copyPeter Zijlstra1-4/+11
struct perf_sample_data lives on-stack, so we should be careful about its size. Furthermore, the pt_regs copy in there is only because x86_64 is a trainwreck, solve it differently. Reported-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Steven Rostedt <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-11-09perf/x86: Reduce stack usage for x86_pmu::drain_pebs()Peter Zijlstra3-24/+27
intel_pmu_drain_pebs_*() is typically called from handle_pmi_common(), both have an on-stack struct perf_sample_data, which is *big*. Rewire things so that drain_pebs() can use the one handle_pmi_common() has. Reported-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-11-09perf: Reduce stack usage of perf_output_begin()Peter Zijlstra1-2/+2
__perf_output_begin() has an on-stack struct perf_sample_data in the unlikely case it needs to generate a LOST record. However, every call to perf_output_begin() must already have a perf_sample_data on-stack. Reported-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-11-09x86/xen: don't unbind uninitialized lock_kicker_irqBrian Masney1-1/+11
When booting a hyperthreaded system with the kernel parameter 'mitigations=auto,nosmt', the following warning occurs: WARNING: CPU: 0 PID: 1 at drivers/xen/events/events_base.c:1112 unbind_from_irqhandler+0x4e/0x60 ... Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006 ... Call Trace: xen_uninit_lock_cpu+0x28/0x62 xen_hvm_cpu_die+0x21/0x30 takedown_cpu+0x9c/0xe0 ? trace_suspend_resume+0x60/0x60 cpuhp_invoke_callback+0x9a/0x530 _cpu_up+0x11a/0x130 cpu_up+0x7e/0xc0 bringup_nonboot_cpus+0x48/0x50 smp_init+0x26/0x79 kernel_init_freeable+0xea/0x229 ? rest_init+0xaa/0xaa kernel_init+0xa/0x106 ret_from_fork+0x35/0x40 The secondary CPUs are not activated with the nosmt mitigations and only the primary thread on each CPU core is used. In this situation, xen_hvm_smp_prepare_cpus(), and more importantly xen_init_lock_cpu(), is not called, so the lock_kicker_irq is not initialized for the secondary CPUs. Let's fix this by exiting early in xen_uninit_lock_cpu() if the irq is not set to avoid the warning from above for each secondary CPU. Signed-off-by: Brian Masney <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Juergen Gross <[email protected]> Signed-off-by: Boris Ostrovsky <[email protected]>
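The early-exit guard amounts to something like the following sketch; lock_kicker_irq defaults to -1 in the Xen spinlock code, and the rest of the teardown is elided here:

    void xen_uninit_lock_cpu(int cpu)
    {
            int irq = per_cpu(lock_kicker_irq, cpu);

            /* nosmt: xen_init_lock_cpu() never ran for this CPU, so there is
             * no IRQ to unbind; bail out before triggering the warning. */
            if (irq == -1)
                    return;

            unbind_from_irqhandler(irq, NULL);
            per_cpu(lock_kicker_irq, cpu) = -1;
    }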
2020-11-08Merge tag 'x86-urgent-2020-11-08' of ↵Linus Torvalds5-32/+54
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A set of x86 fixes: - Use SYM_FUNC_START_WEAK in the mem* ASM functions instead of a combination of .weak and SYM_FUNC_START_LOCAL, which makes LLVM's integrated assembler upset - Correct the mitigation selection logic which prevented the related prctl from working correctly - Make the UV5 hubless system work correctly by fixing up the malformed table entries and adding the missing ones" * tag 'x86-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/platform/uv: Recognize UV5 hubless system identifier x86/platform/uv: Remove spaces from OEM IDs x86/platform/uv: Fix missing OEM_TABLE_ID x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP x86/lib: Change .weak to SYM_FUNC_START_WEAK for arch/x86/lib/mem*_64.S
2020-11-08KVM: x86: handle MSR_IA32_DEBUGCTLMSR with report_ignored_msrsPankaj Gupta1-3/+3
A Windows 2016 guest tries to enable LBR by setting the corresponding bits in MSR_IA32_DEBUGCTLMSR. KVM does not emulate MSR_IA32_DEBUGCTLMSR and spams the host kernel logs with error messages like: kvm [...]: vcpu1, guest rIP: 0xfffff800a8b687d3 kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop This patch fixes this by enabling error logging only with 'report_ignored_msrs=1'. Signed-off-by: Pankaj Gupta <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-11-08kvm: x86: request masterclock update any time guest uses different msrOliver Upton1-1/+1
Commit 5b9bb0ebbcdc ("kvm: x86: encapsulate wrmsr(MSR_KVM_SYSTEM_TIME) emulation in helper fn", 2020-10-21) subtly changed the behavior of guest writes to MSR_KVM_SYSTEM_TIME(_NEW). Restore the previous behavior; update the masterclock any time the guest uses a different msr than before. Fixes: 5b9bb0ebbcdc ("kvm: x86: encapsulate wrmsr(MSR_KVM_SYSTEM_TIME) emulation in helper fn", 2020-10-21) Signed-off-by: Oliver Upton <[email protected]> Reviewed-by: Peter Shier <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-11-08kvm: x86: ensure pv_cpuid.features is initialized when enabling capOliver Upton3-7/+19
Make the paravirtual cpuid enforcement mechanism idempotent to ioctl() ordering by updating pv_cpuid.features whenever userspace requests the capability. Extract this update out of kvm_update_cpuid_runtime() into a new helper function and move its other call site into kvm_vcpu_after_set_cpuid() where it more likely belongs. Fixes: 66570e966dd9 ("kvm: x86: only provide PV features if enabled in guest's CPUID") Signed-off-by: Oliver Upton <[email protected]> Reviewed-by: Peter Shier <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-11-08kvm: x86: reads of restricted pv msrs should also result in #GPOliver Upton1-0/+34
commit 66570e966dd9 ("kvm: x86: only provide PV features if enabled in guest's CPUID") only protects against disallowed guest writes to KVM paravirtual msrs, leaving msr reads unchecked. Fix this by enforcing KVM_CPUID_FEATURES for msr reads as well. Fixes: 66570e966dd9 ("kvm: x86: only provide PV features if enabled in guest's CPUID") Signed-off-by: Oliver Upton <[email protected]> Reviewed-by: Peter Shier <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-11-08KVM: x86: use positive error values for msr emulation that causes #GPMaxim Levitsky2-14/+22
The recent introduction of userspace msr filtering added code that uses negative error codes for cases that result either in #GP delivery to the guest or in handling by the userspace msr filter. This breaks the assumption that a negative error code returned from the msr emulation code is a semi-fatal error which should be returned to userspace via the KVM_RUN ioctl and usually kills the guest. Fix this by reusing the already existing KVM_MSR_RET_INVALID error code, and by adding a new KVM_MSR_RET_FILTERED error code for the userspace-filtered msrs. Fixes: 291f35fb2c1d1 ("KVM: x86: report negative values from wrmsr emulation to userspace") Reported-by: Qian Cai <[email protected]> Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
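The convention this establishes, as a hedged sketch of an MSR-read path; the wrapper and filter-exit helpers here are placeholders, not the actual KVM functions:

    r = emulate_rdmsr(vcpu, index, &data);           /* placeholder wrapper */
    if (r < 0)
            return r;                        /* semi-fatal: surfaces via KVM_RUN */
    if (r == KVM_MSR_RET_FILTERED)
            return exit_to_userspace_filter(vcpu, index);   /* placeholder */
    if (r == KVM_MSR_RET_INVALID) {
            kvm_inject_gp(vcpu, 0);          /* guest sees #GP */
            return 1;
    }
    /* r == 0: emulated successfully, 'data' holds the value */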
2020-11-08KVM: x86/mmu: fix counting of rmap entries in pte_list_addLi RongQing1-5/+7
Fix an off-by-one style bug in pte_list_add() where it failed to account the last full set of SPTEs, i.e. when desc->sptes is full and desc->more is NULL. Merge the two "PTE_LIST_EXT-1" checks as part of the fix to avoid an extra comparison. Signed-off-by: Li RongQing <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
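A standalone sketch of the counting invariant the fix restores: the last descriptor must be accounted even when it is completely full and desc->more is NULL (PTE_LIST_EXT shown with its historical value of 3; the structure is simplified from KVM's rmap descriptors):

    #define PTE_LIST_EXT 3

    struct pte_list_desc {
            unsigned long *sptes[PTE_LIST_EXT];
            struct pte_list_desc *more;
    };

    /* Count every SPTE in the rmap descriptor chain, including those in a
     * final descriptor that is full but has no successor. */
    static int count_spte_entries(struct pte_list_desc *head)
    {
            struct pte_list_desc *desc;
            int i, count = 0;

            for (desc = head; desc; desc = desc->more)
                    for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; i++)
                            count++;
            return count;
    }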
2020-11-07Merge branch 'linus' into perf/kprobesIngo Molnar316-12506/+18201
Conflicts: include/asm-generic/atomic-instrumented.h kernel/kprobes.c Use the upstream atomic-instrumented.h checksum, and pick the kprobes version of kernel/kprobes.c, which effectively reverts this upstream workaround: 645f224e7ba2: ("kprobes: Tell lockdep about kprobe nesting") Since the new code *should* be fine without nesting. Knock on wood ... Signed-off-by: Ingo Molnar <[email protected]>
2020-11-07Merge branch 'linus' into perf/kprobesIngo Molnar10-32/+83
Merge recent kprobes updates into perf/kprobes that came from -mm. Signed-off-by: Ingo Molnar <[email protected]>
2020-11-07x86/platform/uv: Recognize UV5 hubless system identifierMike Travis1-3/+10
Testing shows a problem in that UV5 hubless systems were not being recognized. Add them to the list of OEM IDs checked. Fixes: 6c7794423a998 ("Add UV5 direct references") Signed-off-by: Mike Travis <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-11-07x86/platform/uv: Remove spaces from OEM IDsMike Travis1-0/+3
Testing shows that trailing spaces caused problems with the OEM_ID and the OEM_TABLE_ID. One is that the OEM_ID would not string-compare correctly. Another is that the OEM_ID and OEM_TABLE_ID would be concatenated in the printout. Remove any trailing spaces. Fixes: 1e61f5a95f191 ("Add and decode Arch Type in UVsystab") Signed-off-by: Mike Travis <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-11-07x86/platform/uv: Fix missing OEM_TABLE_IDMike Travis1-2/+5
Testing shows a problem in that the OEM_TABLE_ID was missing for hubless systems. This is used to determine the APIC type (legacy or extended). Add the OEM_TABLE_ID to the early hubless processing. Fixes: 1e61f5a95f191 ("Add and decode Arch Type in UVsystab") Signed-off-by: Mike Travis <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-11-06x86/cpu: Avoid cpuinfo-induced IPIing of idle CPUsPaul E. McKenney1-0/+6
Currently, accessing /proc/cpuinfo sends IPIs to idle CPUs in order to learn their clock frequency. Which is a bit strange, given that waking them from idle likely significantly changes their clock frequency. This commit therefore avoids sending /proc/cpuinfo-induced IPIs to idle CPUs. [ paulmck: Also check for idle in arch_freq_prepare_all(). ] Signed-off-by: Paul E. McKenney <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: <[email protected]>
2020-11-06x86/cpu: Avoid cpuinfo-induced IPI pileupsPaul E. McKenney1-1/+9
The aperfmperf_snapshot_cpu() function is invoked upon access to /proc/cpuinfo, and it does do an early exit if the specified CPU has recently done a snapshot. Unfortunately, the indication that a snapshot has been completed is set in an IPI handler, and the execution of this handler can be delayed by any number of unfortunate events. This means that a system that starts a number of applications, each of which parses /proc/cpuinfo, can suffer from an smp_call_function_single() storm, especially given that each access to /proc/cpuinfo invokes smp_call_function_single() for all CPUs. Please note that this is not theoretical speculation. Note also that one CPU's pending IPI serves all requests, so there is no point in ever having more than one IPI pending to a given CPU. This commit therefore suppresses duplicate IPIs to a given CPU via a new ->scfpending field in the aperfmperf_sample structure. This field is set to the value one if an IPI is pending to the corresponding CPU and to zero otherwise. The aperfmperf_snapshot_cpu() function uses atomic_xchg() to set this field to the value one and sample the old value. If this function's "wait" parameter is zero, smp_call_function_single() is called only if the old value of the ->scfpending field was zero. The IPI handler uses atomic_set_release() to set this new field to zero just before returning, so that the prior stores into the aperfmperf_sample structure are seen by future requests that get to the atomic_xchg(). Future requests that pass the elapsed-time check are ordered by the fact that on x86 loads act as acquire loads, just as was the case prior to this change. The return value is based off of the age of the prior snapshot, just as before. Reported-by: Dave Jones <[email protected]> [ paulmck: Allow /proc/cpuinfo to take advantage of arch_freq_get_on_cpu(). ] [ paulmck: Add comment on memory barrier. ] Signed-off-by: Paul E. McKenney <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: <[email protected]>
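In sketch form, the deduplication scheme; the struct and function names approximate the aperfmperf code, so treat the details as illustrative:

    struct aperfmperf_sample {
            atomic_t scfpending;    /* 1: a snapshot IPI is already queued */
            /* aperf, mperf, timestamp ... */
    };

    static void snapshot_ipi_handler(void *arg)
    {
            struct aperfmperf_sample *s = arg;

            /* read APERF/MPERF MSRs and store them into *s here ... */

            /* Release ordering publishes the sample data before the flag
             * clear becomes visible to the next atomic_xchg(). */
            atomic_set_release(&s->scfpending, 0);
    }

    static void snapshot_request(struct aperfmperf_sample *s, int cpu, int wait)
    {
            /* At most one snapshot IPI in flight per CPU: only send if no
             * request was pending, unless the caller insists on waiting. */
            if (atomic_xchg(&s->scfpending, 1) == 0 || wait)
                    smp_call_function_single(cpu, snapshot_ipi_handler, s, wait);
    }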
2020-11-06io-mapping: Cleanup atomic iomapThomas Gleixner2-12/+3
Switch the atomic iomap implementation over to kmap_local and stick the preempt/pagefault mechanics into the generic code similar to the kmap_atomic variants. Rename the x86 map function in preparation for a non-atomic variant. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Andrew Morton <[email protected]> Link: https://lore.kernel.org/r/[email protected]