aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2018-03-16KVM: VMX: don't configure RM TSS for unrestricted guestSean Christopherson1-0/+3
An unrestricted guest can run with CR0.PG==0 and/or CR0.PE==0, e.g. it can run in Real Mode without requiring host emulation. The RM TSS is only used for emulating RM, i.e. it will never be used when unrestricted guest is enabled and so doesn't need to be configured. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-16x86/kvm/hyper-v: inject #GP only when invalid SINTx vector is unmaskedVitaly Kuznetsov1-1/+9
Hyper-V 2016 on KVM with SynIC enabled doesn't boot with the following trace: kvm_entry: vcpu 0 kvm_exit: reason MSR_WRITE rip 0xfffff8000131c1e5 info 0 0 kvm_hv_synic_set_msr: vcpu_id 0 msr 0x40000090 data 0x10000 host 0 kvm_msr: msr_write 40000090 = 0x10000 (#GP) kvm_inj_exception: #GP (0x0) KVM acts according to the following statement from TLFS: " 11.8.4 SINTx Registers ... Valid values for vector are 16-255 inclusive. Specifying an invalid vector number results in #GP. " However, I checked and genuine Hyper-V doesn't #GP when we write 0x10000 to SINTx. I checked with Microsoft and they confirmed that if either the Masked bit (bit 16) or the Polling bit (bit 18) is set to 1, then they ignore the value of Vector. Make KVM act accordingly. Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Roman Kagan <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-16x86/kvm/hyper-v: remove stale entries from vec_bitmap/auto_eoi_bitmap on ↵Vitaly Kuznetsov2-10/+24
vector change When a new vector is written to SINx we update vec_bitmap/auto_eoi_bitmap but we forget to remove old vector from these masks (in case it is not present in some other SINTx). Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Roman Kagan <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-16x86/kvm/hyper-v: add reenlightenment MSRs supportVitaly Kuznetsov3-1/+36
Nested Hyper-V/Windows guest running on top of KVM will use TSC page clocksource in two cases: - L0 exposes invariant TSC (CPUID.80000007H:EDX[8]). - L0 provides Hyper-V Reenlightenment support (CPUID.40000003H:EAX[13]). Exposing invariant TSC effectively blocks migration to hosts with different TSC frequencies, providing reenlightenment support will be needed when we start migrating nested workloads. Implement rudimentary support for reenlightenment MSRs. For now, these are just read/write MSRs with no effect. Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Roman Kagan <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-16KVM: x86: Update the exit_qualification access bits while walking an addressKarimAllah Ahmed1-2/+9
... to avoid having a stale value when handling an EPT misconfig for MMIO regions. MMIO regions that are not passed-through to the guest are handled through EPT misconfigs. The first time a certain MMIO page is touched it causes an EPT violation, then KVM marks the EPT entry to cause an EPT misconfig instead. Any subsequent accesses to the entry will generate an EPT misconfig. Things gets slightly complicated with nested guest handling for MMIO regions that are not passed through from L0 (i.e. emulated by L0 user-space). An EPT violation for one of these MMIO regions from L2, exits to L0 hypervisor. L0 would then look at the EPT12 mapping for L1 hypervisor and realize it is not present (or not sufficient to serve the request). Then L0 injects an EPT violation to L1. L1 would then update its EPT mappings. The EXIT_QUALIFICATION value for L1 would come from exit_qualification variable in "struct vcpu". The problem is that this variable is only updated on EPT violation and not on EPT misconfig. So if an EPT violation because of a read happened first, then an EPT misconfig because of a write happened afterwards. The L0 hypervisor will still contain exit_qualification value from the previous read instead of the write and end up injecting an EPT violation to the L1 hypervisor with an out of date EXIT_QUALIFICATION. The EPT violation that is injected from L0 to L1 needs to have the correct EXIT_QUALIFICATION specially for the access bits because the individual access bits for MMIO EPTs are updated only on actual access of this specific type. So for the example above, the L1 hypervisor will keep updating only the read bit in the EPT then resume the L2 guest. The L2 guest would end up causing another exit where the L0 *again* will inject another EPT violation to L1 hypervisor with *again* an out of date exit_qualification which indicates a read and not a write. Then this ping-pong just keeps happening without making any forward progress. The behavior of mapping MMIO regions changed in: commit a340b3e229b24 ("kvm: Map PFN-type memory regions as writable (if possible)") ... where an EPT violation for a read would also fixup the write bits to avoid another EPT violation which by acciddent would fix the bug mentioned above. This commit fixes this situation and ensures that the access bits for the exit_qualifcation is up to date. That ensures that even L1 hypervisor running with a KVM version before the commit mentioned above would still work. ( The description above assumes EPT to be available and used by L1 hypervisor + the L1 hypervisor is passing through the MMIO region to the L2 guest while this MMIO region is emulated by the L0 user-space ). Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: KarimAllah Ahmed <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-16KVM: x86: Make enum conversion explicit in kvm_pdptr_read()Matthias Kaehlcke1-1/+1
The type 'enum kvm_reg_ex' is an extension of 'enum kvm_reg', however the extension is only semantical and the compiler doesn't know about the relationship between the two types. In kvm_pdptr_read() a value of the extended type is passed to kvm_x86_ops->cache_reg(), which expects a value of the base type. Clang raises the following warning about the type mismatch: arch/x86/kvm/kvm_cache_regs.h:44:32: warning: implicit conversion from enumeration type 'enum kvm_reg_ex' to different enumeration type 'enum kvm_reg' [-Wenum-conversion] kvm_x86_ops->cache_reg(vcpu, VCPU_EXREG_PDPTR); Cast VCPU_EXREG_PDPTR to 'enum kvm_reg' to make the compiler happy. Signed-off-by: Matthias Kaehlcke <[email protected]> Reviewed-by: Guenter Roeck <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-16KVM: lapic: stop advertising DIRECTED_EOI when in-kernel IOAPIC is in useVitaly Kuznetsov1-1/+9
Devices which use level-triggered interrupts under Windows 2016 with Hyper-V role enabled don't work: Windows disables EOI broadcast in SPIV unconditionally. Our in-kernel IOAPIC implementation emulates an old IOAPIC version which has no EOI register so EOI never happens. The issue was discovered and discussed a while ago: https://www.spinics.net/lists/kvm/msg148098.html While this is a guest OS bug (it should check that IOAPIC has the required capabilities before disabling EOI broadcast) we can workaround it in KVM: advertising DIRECTED_EOI with in-kernel IOAPIC makes little sense anyway. Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-16KVM: x86: Add support for AMD Core Perf Extension in guestJanakarajan Natarajan3-15/+130
Add support for AMD Core Performance counters in the guest. The base event select and counter MSRs are changed. In addition, with the core extension, there are 2 extra counters available for performance measurements for a total of 6. With the new MSRs, the logic to map them to the gp_counters[] is changed. New functions are added to check the validity of the get/set MSRs. If the guest has the X86_FEATURE_PERFCTR_CORE cpuid flag set, the number of counters available to the vcpu is set to 6. It the flag is not set then it is 4. Signed-off-by: Janakarajan Natarajan <[email protected]> [Squashed "Expose AMD Core Perf Extension flag to guests" - Radim.] Signed-off-by: Radim Krčmář <[email protected]>
2018-03-16x86/msr: Add AMD Core Perf Extension MSRsJanakarajan Natarajan1-0/+14
Add the EventSelect and Counter MSRs for AMD Core Perf Extension. Signed-off-by: Janakarajan Natarajan <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-14KVM: s390: provide counters for all interrupt injects/deliveryChristian Borntraeger3-11/+63
For testing the exitless interrupt support it turned out useful to have separate counters for inject and delivery of I/O interrupt. While at it do the same for all interrupt types. For timer related interrupts (clock comparator and cpu timer) we even had no delivery counters. Fix this as well. On this way some counters are being renamed to have a similar name. Signed-off-by: Christian Borntraeger <[email protected]> Reviewed-by: Cornelia Huck <[email protected]>
2018-03-14KVM: add machine check counter to kvm_statQingFeng Hao3-0/+3
This counter can be used for administration, debug or test purposes. Suggested-by: Vladislav Mironov <[email protected]> Signed-off-by: QingFeng Hao <[email protected]> Reviewed-by: Christian Borntraeger <[email protected]> Reviewed-by: Cornelia Huck <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Christian Borntraeger <[email protected]>
2018-03-14s390/setup: enable display support for KVM guestFarhan Ali3-2/+4
The S390 architecture does not support any graphics hardware, but with the latest support for Virtio GPU in Linux and Virtio GPU emulation in QEMU, it's possible to enable graphics for S390 using the Virtio GPU device. To enable display we need to enable the Linux Virtual Terminal (VT) layer for S390. But the VT subsystem initializes quite early at boot so we need a dummy console driver till the Virtio GPU driver is initialized and we can run the framebuffer console. The framebuffer console over a Virtio GPU device can be run in combination with the serial SCLP console (default on S390). The SCLP console can still be accessed by management applications (eg: via Libvirt's virsh console). Signed-off-by: Farhan Ali <[email protected]> Acked-by: Christian Borntraeger <[email protected]> Reviewed-by: Thomas Huth <[email protected]> Message-Id: <e23b61f4f599ba23881727a1e8880e9d60cc6a48.1519315352.git.alifm@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]>
2018-03-14s390/char: Rename EBCDIC keymap variablesFarhan Ali3-48/+61
The Linux Virtual Terminal (VT) layer provides a default keymap which is compiled when VT layer is enabled. But at the same time we are also compiling the EBCDIC keymap and this causes the linker to complain. So let's rename the EBCDIC keymap variables to prevent linker conflict. Signed-off-by: Farhan Ali <[email protected]> Acked-by: Christian Borntraeger <[email protected]> Acked-by: Heiko Carstens <[email protected]> Reviewed-by: Thomas Huth <[email protected]> Message-Id: <f670a2698d2372e1e990c48a29334ffe894804b1.1519315352.git.alifm@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <[email protected]>
2018-03-14Kconfig: Remove HAS_IOMEM dependency for Graphics supportFarhan Ali2-4/+7
The 'commit e25df1205f37 ("[S390] Kconfig: menus with depends on HAS_IOMEM.")' added the HAS_IOMEM dependecy for "Graphics support". This disabled the "Graphics support" menu for S390. But if we enable VT layer for S390, we would also need to enable the dummy console. So let's remove the HAS_IOMEM dependency. Move this dependency to sub menu items and console drivers that use io memory. Signed-off-by: Farhan Ali <[email protected]> Reviewed-by: Thomas Huth <[email protected]> Message-Id: <6e8ef238162df5be4462126be155975c722e9863.1519315352.git.alifm@linux.vnet.ibm.com> Acked-by: Bartlomiej Zolnierkiewicz <[email protected]> Signed-off-by: Christian Borntraeger <[email protected]>
2018-03-14KVM: s390: fix fallthrough annotationSebastian Ott1-6/+3
A case statement in kvm_s390_shadow_tables uses fallthrough annotations which are not recognized by gcc because they are hidden within a block. Move these annotations out of the block to fix (W=1) warnings like below: arch/s390/kvm/gaccess.c: In function 'kvm_s390_shadow_tables': arch/s390/kvm/gaccess.c:1029:26: warning: this statement may fall through [-Wimplicit-fallthrough=] case ASCE_TYPE_REGION1: { ^ Signed-off-by: Sebastian Ott <[email protected]> Reviewed-by: Cornelia Huck <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Christian Borntraeger <[email protected]>
2018-03-14KVM: s390: add exit io request stats and simplify codeChristian Borntraeger3-13/+6
We want to count IO exit requests in kvm_stat. At the same time we can get rid of the handle_noop function. Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Cornelia Huck <[email protected]> Signed-off-by: Christian Borntraeger <[email protected]>
2018-03-14KVM: document KVM_CAP_S390_[BPB|PSW|GMAP|COW]Christian Borntraeger1-0/+30
commit 35b3fde6203b ("KVM: s390: wire up bpb feature") has no documentation for KVM_CAP_S390_BPB. While adding this let's also add other missing capabilities like KVM_CAP_S390_PSW, KVM_CAP_S390_GMAP and KVM_CAP_S390_COW. Reviewed-by: Cornelia Huck <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Janosch Frank <[email protected]> Signed-off-by: Christian Borntraeger <[email protected]>
2018-03-14kvm: arm/arm64: vgic-v3: Tighten synchronization for guests using v2 on v3Marc Zyngier1-1/+2
On guest exit, and when using GICv2 on GICv3, we use a dsb(st) to force synchronization between the memory-mapped guest view and the system-register view that the hypervisor uses. This is incorrect, as the spec calls out the need for "a DSB whose required access type is both loads and stores with any Shareability attribute", while we're only synchronizing stores. We also lack an isb after the dsb to ensure that the latter has actually been executed before we start reading stuff from the sysregs. The fix is pretty easy: turn dsb(st) into dsb(sy), and slap an isb() just after. Cc: [email protected] Fixes: f68d2b1b73cc ("arm64: KVM: Implement vgic-v3 save/restore") Acked-by: Christoffer Dall <[email protected]> Reviewed-by: Andre Przywara <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-14KVM: arm/arm64: vgic: Don't populate multiple LRs with the same vintidMarc Zyngier6-16/+67
The vgic code is trying to be clever when injecting GICv2 SGIs, and will happily populate LRs with the same interrupt number if they come from multiple vcpus (after all, they are distinct interrupt sources). Unfortunately, this is against the letter of the architecture, and the GICv2 architecture spec says "Each valid interrupt stored in the List registers must have a unique VirtualID for that virtual CPU interface.". GICv3 has similar (although slightly ambiguous) restrictions. This results in guests locking up when using GICv2-on-GICv3, for example. The obvious fix is to stop trying so hard, and inject a single vcpu per SGI per guest entry. After all, pending SGIs with multiple source vcpus are pretty rare, and are mostly seen in scenario where the physical CPUs are severely overcomitted. But as we now only inject a single instance of a multi-source SGI per vcpu entry, we may delay those interrupts for longer than strictly necessary, and run the risk of injecting lower priority interrupts in the meantime. In order to address this, we adopt a three stage strategy: - If we encounter a multi-source SGI in the AP list while computing its depth, we force the list to be sorted - When populating the LRs, we prevent the injection of any interrupt of lower priority than that of the first multi-source SGI we've injected. - Finally, the injection of a multi-source SGI triggers the request of a maintenance interrupt when there will be no pending interrupt in the LRs (HCR_NPIE). At the point where the last pending interrupt in the LRs switches from Pending to Active, the maintenance interrupt will be delivered, allowing us to add the remaining SGIs using the same process. Cc: [email protected] Fixes: 0919e84c0fc1 ("KVM: arm/arm64: vgic-new: Add IRQ sync/flush framework") Acked-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-14KVM: arm/arm64: Reduce verbosity of KVM init logArd Biesheuvel3-5/+5
On my GICv3 system, the following is printed to the kernel log at boot: kvm [1]: 8-bit VMID kvm [1]: IDMAP page: d20e35000 kvm [1]: HYP VA range: 800000000000:ffffffffffff kvm [1]: vgic-v2@2c020000 kvm [1]: GIC system register CPU interface enabled kvm [1]: vgic interrupt IRQ1 kvm [1]: virtual timer IRQ4 kvm [1]: Hyp mode initialized successfully The KVM IDMAP is a mapping of a statically allocated kernel structure, and so printing its physical address leaks the physical placement of the kernel when physical KASLR in effect. So change the kvm_info() to kvm_debug() to remove it from the log output. While at it, trim the output a bit more: IRQ numbers can be found in /proc/interrupts, and the HYP VA and vgic-v2 lines are not highly informational either. Cc: <[email protected]> Acked-by: Will Deacon <[email protected]> Acked-by: Christoffer Dall <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-14KVM: arm/arm64: Reset mapped IRQs on VM resetChristoffer Dall3-0/+31
We currently don't allow resetting mapped IRQs from userspace, because their state is controlled by the hardware. But we do need to reset the state when the VM is reset, so we provide a function for the 'owner' of the mapped interrupt to reset the interrupt state. Currently only the timer uses mapped interrupts, so we call this function from the timer reset logic. Cc: [email protected] Fixes: 4c60e360d6df ("KVM: arm/arm64: Provide a get_input_level for the arch timer") Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-14KVM: arm/arm64: Avoid vcpu_load for other vcpu ioctls than KVM_RUNChristoffer Dall2-12/+0
Calling vcpu_load() registers preempt notifiers for this vcpu and calls kvm_arch_vcpu_load(). The latter will soon be doing a lot of heavy lifting on arm/arm64 and will try to do things such as enabling the virtual timer and setting us up to handle interrupts from the timer hardware. Loading state onto hardware registers and enabling hardware to signal interrupts can be problematic when we're not actually about to run the VCPU, because it makes it difficult to establish the right context when handling interrupts from the timer, and it makes the register access code difficult to reason about. Luckily, now when we call vcpu_load in each ioctl implementation, we can simply remove the call from the non-KVM_RUN vcpu ioctls, and our kvm_arch_vcpu_load() is only used for loading vcpu content to the physical CPU when we're actually going to run the vcpu. Cc: [email protected] Fixes: 9b062471e52a ("KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl") Reviewed-by: Julien Grall <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-14KVM: arm/arm64: vgic: Add missing irq_lock to vgic_mmio_read_pendingAndre Przywara2-0/+4
Our irq_is_pending() helper function accesses multiple members of the vgic_irq struct, so we need to hold the lock when calling it. Add that requirement as a comment to the definition and take the lock around the call in vgic_mmio_read_pending(), where we were missing it before. Fixes: 96b298000db4 ("KVM: arm/arm64: vgic-new: Add PENDING registers handlers") Signed-off-by: Andre Przywara <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-14KVM: PPC: Book3S HV: Fix trap number return from __kvmppc_vcore_entryPaul Mackerras1-5/+5
This fixes a bug where the trap number that is returned by __kvmppc_vcore_entry gets corrupted. The effect of the corruption is that IPIs get ignored on POWER9 systems when the IPI is sent via a doorbell interrupt to a CPU which is executing in a KVM guest. The effect of the IPI being ignored is often that another CPU locks up inside smp_call_function_many() (and if that CPU is holding a spinlock, other CPUs then lock up inside raw_spin_lock()). The trap number is currently held in register r12 for most of the assembly-language part of the guest exit path. In that path, we call kvmppc_subcore_exit_guest(), which is a C function, without restoring r12 afterwards. Depending on the kernel config and the compiler, it may modify r12 or it may not, so some config/compiler combinations see the bug and others don't. To fix this, we arrange for the trap number to be stored on the stack from the 'guest_bypass:' label until the end of the function, then the trap number is loaded and returned in r12 as before. Cc: [email protected] # v4.8+ Fixes: fd7bacbca47a ("KVM: PPC: Book3S HV: Fix TB corruption in guest exit path on HMI interrupt") Signed-off-by: Paul Mackerras <[email protected]>
2018-03-09KVM: s390: Refactor host cmma and pfmfi interpretation controlsJanosch Frank5-16/+18
use_cmma in kvm_arch means that the KVM hypervisor is allowed to use cmma, whereas use_cmma in the mm context means cmm has been used before. Let's rename the context one to uses_cmm, as the vm does use collaborative memory management but the host uses the cmm assist (interpretation facility). Also let's introduce use_pfmfi, so we can remove the pfmfi disablement when we activate cmma and rather not activate it in the first place. Signed-off-by: Janosch Frank <[email protected]> Message-Id: <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Christian Borntraeger <[email protected]> Signed-off-by: Christian Borntraeger <[email protected]>
2018-03-09KVM: s390: implement CPU model only facilitiesChristian Borntraeger3-21/+54
Some facilities should only be provided to the guest, if they are enabled by a CPU model. This allows us to avoid capabilities and to simply fall back to the cpumodel for deciding about a facility without enabling it for older QEMUs or QEMUs without a CPU model. Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Cornelia Huck <[email protected]> Signed-off-by: Christian Borntraeger <[email protected]>
2018-03-08KVM: nVMX: Enforce NMI controls on vmentry of L2 guestsKrish Sadhukhan1-2/+27
According to Intel SDM 26.2.1.1, the following rules should be enforced on vmentry: * If the "NMI exiting" VM-execution control is 0, "Virtual NMIs" VM-execution control must be 0. * If the “virtual NMIs” VM-execution control is 0, the “NMI-window exiting” VM-execution control must be 0. This patch enforces these rules when entering an L2 guest. Signed-off-by: Krish Sadhukhan <[email protected]> Reviewed-by: Liran Alon <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-06KVM: nVMX: expose VMX capabilities for nested hypervisors to userspacePaolo Bonzini2-5/+40
Use the new MSR feature framework to tell userspace which VMX capabilities are available for nested hypervisors. Before, these were only accessible with the KVM_GET_MSR VCPU ioctl, after VCPUs had been created. Signed-off-by: Paolo Bonzini <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-06KVM: nVMX: introduce struct nested_vmx_msrsPaolo Bonzini1-174/+178
Move the MSRs to a separate struct, so that we can introduce a global instance and return it from the /dev/kvm KVM_GET_MSRS ioctl. Signed-off-by: Paolo Bonzini <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-06KVM: X86: Don't use PV TLB flush with dedicated physical CPUsWanpeng Li1-0/+2
vCPUs are very unlikely to get preempted when they are the only task running on a CPU. PV TLB flush is slower that the native flush in that case, so disable it. Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Eduardo Habkost <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-06KVM: X86: Choose qspinlock when dedicated physical CPUs are availableWanpeng Li1-0/+5
Waiman Long mentioned that: > Generally speaking, unfair lock performs well for VMs with a small > number of vCPUs. Native qspinlock may perform better than pvqspinlock > if there is vCPU pinning and there is no vCPU over-commitment. This patch uses the KVM_HINTS_DEDICATED performance hint, which is provided by the hypervisor admin, to choose the qspinlock algorithm when a dedicated physical CPU is available. PV_DEDICATED = 1, PV_UNHALT = anything: default is qspinlock PV_DEDICATED = 0, PV_UNHALT = 1: default is Hybrid PV queued/unfair lock PV_DEDICATED = 0, PV_UNHALT = 0: default is tas Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Eduardo Habkost <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-06KVM: Introduce paravirtualization hints and KVM_HINTS_DEDICATEDWanpeng Li9-4/+55
This patch introduces kvm_para_has_hint() to query for hints about the configuration of the guests. The first hint KVM_HINTS_DEDICATED, is set if the guest has dedicated physical CPUs for each vCPU (i.e. pinning and no over-commitment). This allows optimizing spinlocks and tells the guest to avoid PV TLB flush. Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Eduardo Habkost <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-06kvm: use insert sort in kvm_io_bus_register_dev functionGal Hammer1-18/+18
The loading time of a VM is quite significant with a CPU usage reaching 100% when loading a VM that its virtio devices use a large amount of virt-queues (e.g. a virtio-serial device with max_ports=511). Most of the time is spend in re-sorting the kvm_io_bus kvm_io_range array when a new eventfd is registered. The patch replaces the existing method with an insert sort. Reviewed-by: Marcel Apfelbaum <[email protected]> Reviewed-by: Uri Lublin <[email protected]> Signed-off-by: Gal Hammer <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-06KVM: x86: KVM_CAP_SYNC_REGSKen Hofsass3-16/+145
This commit implements an enhanced x86 version of S390 KVM_CAP_SYNC_REGS functionality. KVM_CAP_SYNC_REGS "allow[s] userspace to access certain guest registers without having to call SET/GET_*REGS”. This reduces ioctl overhead which is particularly important when userspace is making synchronous guest state modifications (e.g. when emulating and/or intercepting instructions). Originally implemented upstream for the S390, the x86 differences follow: - userspace can select the register sets to be synchronized with kvm_run using bit-flags in the kvm_valid_registers and kvm_dirty_registers fields. - vcpu_events is available in addition to the regs and sregs register sets. Signed-off-by: Ken Hofsass <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> [Removed wrapper around check for reserved kvm_valid_regs. - Radim] Signed-off-by: Radim Krčmář <[email protected]>
2018-03-06KVM: x86: add SYNC_REGS_SIZE_BYTES #define.Ken Hofsass2-2/+6
Replace hardcoded padding size value for struct kvm_sync_regs with #define SYNC_REGS_SIZE_BYTES. Also update the value specified in api.txt from outdated hardcoded value to SYNC_REGS_SIZE_BYTES. Signed-off-by: Ken Hofsass <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Christian Borntraeger <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-06kvm: x86: hyperv: guest->host event signaling via eventfdRoman Kagan7-1/+164
In Hyper-V, the fast guest->host notification mechanism is the SIGNAL_EVENT hypercall, with a single parameter of the connection ID to signal. Currently this hypercall incurs a user exit and requires the userspace to decode the parameters and trigger the notification of the potentially different I/O context. To avoid the costly user exit, process this hypercall and signal the corresponding eventfd in KVM, similar to ioeventfd. The association between the connection id and the eventfd is established via the newly introduced KVM_HYPERV_EVENTFD ioctl, and maintained in an (srcu-protected) IDR. Signed-off-by: Roman Kagan <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> [asm/hyperv.h changes approved by KY Srinivasan. - Radim] Signed-off-by: Radim Krčmář <[email protected]>
2018-03-06kvm: x86: factor out kvm.arch.hyperv (de)initRoman Kagan3-1/+14
Move kvm.arch.hyperv initialization and cleanup to separate functions. For now only a mutex is inited in the former, and the latter is empty; more stuff will go in there in a followup patch. Signed-off-by: Roman Kagan <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-06Merge tag 'kvm-s390-master-4.16-3' of ↵Radim Krčmář1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux KVM: s390: Fixes - Fix random memory corruption when running as guest2 (e.g. KVM in LPAR) and starting guest3 (nested KVM) with many CPUs (e.g. a nested guest with 200 vcpu) - io interrupt delivery counter was not exported
2018-03-06Merge tag 'kvm-ppc-fixes-4.16-1' of ↵Radim Krčmář3-35/+55
git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc Fixes for PPC KVM: - Fix guest time accounting in the host - Fix large-page backing for radix guests on POWER9 - Fix HPT guests on POWER9 backed by 2M or 1G pages - Compile fixes for some configs and gcc versions
2018-03-06KVM: s390: fix memory overwrites when not using SCA entriesDavid Hildenbrand1-0/+1
Even if we don't have extended SCA support, we can have more than 64 CPUs if we don't enable any HW features that might use the SCA entries. Now, this works just fine, but we missed a return, which is why we would actually store the SCA entries. If we have more than 64 CPUs, this means writing outside of the basic SCA - bad. Let's fix this. This allows > 64 CPUs when running nested (under vSIE) without random crashes. Fixes: a6940674c384 ("KVM: s390: allow 255 VCPUs when sca entries aren't used") Reported-by: Christian Borntraeger <[email protected]> Tested-by: Christian Borntraeger <[email protected]> Signed-off-by: David Hildenbrand <[email protected]> Message-Id: <[email protected]> Cc: [email protected] Reviewed-by: Cornelia Huck <[email protected]> Signed-off-by: Christian Borntraeger <[email protected]>
2018-03-04Linux 4.16-rc4Linus Torvalds1-1/+1
2018-03-04Merge branch 'x86/urgent' of ↵Linus Torvalds6-18/+21
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A small set of fixes for x86: - Add missing instruction suffixes to assembly code so it can be compiled by newer GAS versions without warnings. - Switch refcount WARN exceptions to UD2 as we did in general - Make the reboot on Intel Edison platforms work - A small documentation update so text and sample command match" * 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Documentation, x86, resctrl: Make text and sample command match x86/platform/intel-mid: Handle Intel Edison reboot correctly x86/asm: Add instruction suffixes to bitops x86/entry/64: Add instruction suffix x86/refcounts: Switch to UD2 for exceptions
2018-03-04Merge branch 'x86-pti-for-linus' of ↵Linus Torvalds8-26/+53
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/pti fixes from Thomas Gleixner: "Three fixes related to melted spectrum: - Sync the cpu_entry_area page table to initial_page_table on 32 bit. Otherwise suspend/resume fails because resume uses initial_page_table and triggers a triple fault when accessing the cpu entry area. - Zero the SPEC_CTL MRS on XEN before suspend to address a shortcoming in the hypervisor. - Fix another switch table detection issue in objtool" * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu_entry_area: Sync cpu_entry_area to initial_page_table objtool: Fix another switch table detection issue x86/xen: Zero MSR_IA32_SPEC_CTRL before suspend
2018-03-04Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds4-5/+16
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "A small set of fixes from the timer departement: - Add a missing timer wheel clock forward when migrating timers off a unplugged CPU to prevent operating on a stale clock base and missing timer deadlines. - Use the proper shift count to extract data from a register value to prevent evaluating unrelated bits - Make the error return check in the FSL timer driver work correctly. Checking an unsigned variable for less than zero does not really work well. - Clarify the confusing comments in the ARC timer code" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers: Forward timer base before migrating timers clocksource/drivers/arc_timer: Update some comments clocksource/drivers/mips-gic-timer: Use correct shift count to extract data clocksource/drivers/fsl_ftm_timer: Fix error return checking
2018-03-04Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixlet from Thomas Gleixner: "Just a documentation update for the missing device tree property of the R-Car M3N interrupt controller" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: dt-bindings/irqchip/renesas-irqc: Document R-Car M3-N support
2018-03-04Merge tag 'for-4.16-rc3-tag' of ↵Linus Torvalds10-47/+191
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - when NR_CPUS is large, a SRCU structure can significantly inflate size of the main filesystem structure that would not be possible to allocate by kmalloc, so the kvalloc fallback is used - improved error handling - fix endiannes when printing some filesystem attributes via sysfs, this is could happen when a filesystem is moved between different endianity hosts - send fixes: the NO_HOLE mode should not send a write operation for a file hole - fix log replay for for special files followed by file hardlinks - fix log replay failure after unlink and link combination - fix max chunk size calculation for DUP allocation * tag 'for-4.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix log replay failure after unlink and link combination Btrfs: fix log replay failure after linking special file and fsync Btrfs: send, fix issuing write op when processing hole in no data mode btrfs: use proper endianness accessors for super_copy btrfs: alloc_chunk: fix DUP stripe size handling btrfs: Handle btrfs_set_extent_delalloc failure in relocate_file_extent_cluster btrfs: handle failure of add_pending_csums btrfs: use kvzalloc to allocate btrfs_fs_info
2018-03-03Merge branch 'i2c/for-current-fixed' of ↵Linus Torvalds3-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "A driver fix and a documentation fix (which makes dependency handling for the next cycle easier)" * 'i2c/for-current-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: octeon: Prevent error message on bus error dt-bindings: at24: sort manufacturers alphabetically
2018-03-03Merge branch 'libnvdimm-fixes' of ↵Linus Torvalds5-17/+27
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm fixes from Dan Williams: "A 4.16 regression fix, three fixes for -stable, and a cleanup fix: - During the merge window support for the new ACPI NVDIMM Platform Capabilities structure disabled support for "deep flush", a force-unit- access like mechanism for persistent memory. Restore that mechanism. - VFIO like RDMA is yet one more memory registration / pinning interface that is incompatible with Filesystem-DAX. Disable long term pins of Filesystem-DAX mappings via VFIO. - The Filesystem-DAX detection to prevent long terms pins mistakenly also disabled Device-DAX pins which are not subject to the same block- map collision concerns. - Similar to the setup path, softlockup warnings can trigger in the shutdown path for large persistent memory namespaces. Teach for_each_device_pfn() to perform cond_resched() in all cases. - Boaz noticed that the might_sleep() in dax_direct_access() is stale as of the v4.15 kernel. These have received a build success notification from the 0day robot, and the longterm pin fixes have appeared in -next. However, I recently rebased the tree to remove some other fixes that need to be reworked after review feedback. * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: memremap: fix softlockup reports at teardown libnvdimm: re-enable deep flush for pmem devices via fsync() vfio: disable filesystem-dax page pinning dax: fix vma_is_fsdax() helper dax: ->direct_access does not sleep anymore
2018-03-03Merge tag 'kbuild-fixes-v4.16' of ↵Linus Torvalds15-27/+40
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - suppress sparse warnings about unknown attributes - fix typos and stale comments - fix build error of arch/sh - fix wrong use of ld-option vs cc-ldoption - remove redundant GCC_PLUGINS_CFLAGS assignment - fix another memory leak of Kconfig - fix line number in error messages of Kconfig - do not write confusing CONFIG_DEFCONFIG_LIST out to .config - add xstrdup() to Kconfig to handle memory shortage errors - show also a Debian package name if ncurses is missing * tag 'kbuild-fixes-v4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: MAINTAINERS: take over Kconfig maintainership kconfig: fix line number in recursive inclusion error message Coccinelle: memdup: Fix typo in warning messages kconfig: Update ncurses package names for menuconfig kbuild/kallsyms: trivial typo fix kbuild: test --build-id linker flag by ld-option instead of cc-ldoption kbuild: drop superfluous GCC_PLUGINS_CFLAGS assignment kconfig: Don't leak choice names during parsing sh: fix build error for empty CONFIG_BUILTIN_DTB_SOURCE kconfig: set SYMBOL_AUTO to the symbol marked with defconfig_list kconfig: add xstrdup() helper kbuild: disable sparse warnings about unknown attributes Makefile: Fix lying comment re. silentoldconfig
2018-03-03Merge tag 'media/v4.16-3' of ↵Linus Torvalds24-175/+329
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media fixes from Mauro Carvalho Chehab: - some build fixes with randconfigs - an m88ds3103 fix to prevent an OOPS if the chip doesn't provide the right version during probe (with can happen if the hardware hangs) - a potential out of array bounds reference in tvp5150 - some fixes and improvements in the DVB memory mapped API (added for kernel 4.16) * tag 'media/v4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: media: vb2: Makefile: place vb2-trace together with vb2-core media: Don't let tvp5150_get_vbi() go out of vbi_ram_default array media: dvb: update buffer mmaped flags and frame counter media: dvb: add continuity error indicators for memory mapped buffers media: dmxdev: Fix the logic that enables DMA mmap support media: dmxdev: fix error code for invalid ioctls media: m88ds3103: don't call a non-initalized function media: au0828: add VIDEO_V4L2 dependency media: dvb: fix DVB_MMAP dependency media: dvb: fix DVB_MMAP symbol name media: videobuf2: fix build issues with vb2-trace media: videobuf2: Add VIDEOBUF2_V4L2 Kconfig option for VB2 V4L2 part