aboutsummaryrefslogtreecommitdiff
path: root/arch/powerpc/kvm
AgeCommit message (Collapse)AuthorFilesLines
2018-05-17KVM: PPC: Book3S: Check KVM_CREATE_SPAPR_TCE_64 parametersAlexey Kardashevskiy1-1/+2
Although it does not seem possible to break the host by passing bad parameters when creating a TCE table in KVM, it is still better to get an early clear indication of that than debugging weird effect this might bring. This adds some sanity checks that the page size is 4KB..16GB as this is what the actual LoPAPR supports and that the window actually fits 64bit space. Signed-off-by: Alexey Kardashevskiy <[email protected]> Acked-by: Balbir Singh <[email protected]> Signed-off-by: Paul Mackerras <[email protected]>
2018-05-17KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages with smaller ↵Alexey Kardashevskiy2-20/+94
physical pages At the moment we only support in the host the IOMMU page sizes which the guest is aware of, which is 4KB/64KB/16MB. However P9 does not support 16MB IOMMU pages, 2MB and 1GB pages are supported instead. We can still emulate bigger guest pages (for example 16MB) with smaller host pages (4KB/64KB/2MB). This allows the physical IOMMU pages to use a page size smaller or equal than the guest visible IOMMU page size. Signed-off-by: Alexey Kardashevskiy <[email protected]> Reviewed-by: David Gibson <[email protected]> Signed-off-by: Paul Mackerras <[email protected]>
2018-05-17KVM: PPC: Book3S: Use correct page shift in H_STUFF_TCEAlexey Kardashevskiy2-2/+2
The other TCE handlers use page shift from the guest visible TCE table (described by kvmppc_spapr_tce_iommu_table) so let's make H_STUFF_TCE handlers do the same thing. This should cause no behavioral change now but soon we will allow the iommu_table::it_page_shift being different from from the emulated table page size so this will play a role. Signed-off-by: Alexey Kardashevskiy <[email protected]> Reviewed-by: David Gibson <[email protected]> Acked-by: Balbir Singh <[email protected]> Signed-off-by: Paul Mackerras <[email protected]>
2018-05-17KVM: PPC: Book3S HV: Fix inaccurate commentPaul Mackerras1-1/+1
We now have interrupts hard-disabled when coming back from kvmppc_hv_entry_trampoline, so this changes the comment to reflect that. Signed-off-by: Paul Mackerras <[email protected]>
2018-05-17KVM: PPC: Book3S HV: Set RWMR on POWER8 so PURR/SPURR count correctlyPaul Mackerras1-1/+60
Although Linux doesn't use PURR and SPURR ((Scaled) Processor Utilization of Resources Register), other OSes depend on them. On POWER8 they count at a rate depending on whether the VCPU is idle or running, the activity of the VCPU, and the value in the RWMR (Region-Weighting Mode Register). Hardware expects the hypervisor to update the RWMR when a core is dispatched to reflect the number of online VCPUs in the vcore. This adds code to maintain a count in the vcore struct indicating how many VCPUs are online. In kvmppc_run_core we use that count to set the RWMR register on POWER8. If the core is split because of a static or dynamic micro-threading mode, we use the value for 8 threads. The RWMR value is not relevant when the host is executing because Linux does not use the PURR or SPURR register, so we don't bother saving and restoring the host value. For the sake of old userspace which does not set the KVM_REG_PPC_ONLINE register, we set online to 1 if it was 0 at the time of a KVM_RUN ioctl. Signed-off-by: Paul Mackerras <[email protected]>
2018-05-17KVM: PPC: Book3S HV: Add 'online' register to ONE_REG interfacePaul Mackerras1-0/+6
This adds a new KVM_REG_PPC_ONLINE register which userspace can set to 0 or 1 via the GET/SET_ONE_REG interface to indicate whether it considers the VCPU to be offline (0), that is, not currently running, or online (1). This will be used in a later patch to configure the register which controls PURR and SPURR accumulation. Signed-off-by: Paul Mackerras <[email protected]>
2018-05-17KVM: PPC: Book 3S HV: Do ptesync in radix guest exit pathPaul Mackerras1-0/+8
A radix guest can execute tlbie instructions to invalidate TLB entries. After a tlbie or a group of tlbies, it must then do the architected sequence eieio; tlbsync; ptesync to ensure that the TLB invalidation has been processed by all CPUs in the system before it can rely on no CPU using any translation that it just invalidated. In fact it is the ptesync which does the actual synchronization in this sequence, and hardware has a requirement that the ptesync must be executed on the same CPU thread as the tlbies which it is expected to order. Thus, if a vCPU gets moved from one physical CPU to another after it has done some tlbies but before it can get to do the ptesync, the ptesync will not have the desired effect when it is executed on the second physical CPU. To fix this, we do a ptesync in the exit path for radix guests. If there are any pending tlbies, this will wait for them to complete. If there aren't, then ptesync will just do the same as sync. Signed-off-by: Paul Mackerras <[email protected]>
2018-05-17KVM: PPC: Book3S HV: XIVE: Resend re-routed interrupts on CPU priority changeBenjamin Herrenschmidt1-7/+101
When a vcpu priority (CPPR) is set to a lower value (masking more interrupts), we stop processing interrupts already in the queue for the priorities that have now been masked. If those interrupts were previously re-routed to a different CPU, they might still be stuck until the older one that has them in its queue processes them. In the case of guest CPU unplug, that can be never. To address that without creating additional overhead for the normal interrupt processing path, this changes H_CPPR handling so that when such a priority change occurs, we scan the interrupt queue for that vCPU, and for any interrupt in there that has been re-routed, we replace it with a dummy and force a re-trigger. Signed-off-by: Benjamin Herrenschmidt <[email protected]> Tested-by: Alexey Kardashevskiy <[email protected]> Signed-off-by: Paul Mackerras <[email protected]>
2018-05-17KVM: PPC: Book3S HV: Make radix clear pte when unmappingNicholas Piggin1-1/+1
The current partition table unmap code clears the _PAGE_PRESENT bit out of the pte, which leaves pud_huge/pmd_huge true and does not clear pud_present/pmd_present. This can confuse subsequent page faults and possibly lead to the guest looping doing continual hypervisor page faults. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Paul Mackerras <[email protected]>
2018-05-17KVM: PPC: Book3S HV: Make radix use correct tlbie sequence in ↵Nicholas Piggin1-2/+2
kvmppc_radix_tlbie_page The standard eieio ; tlbsync ; ptesync must follow tlbie to ensure it is ordered with respect to subsequent operations. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Paul Mackerras <[email protected]>
2018-05-17KVM: PPC: Book3S HV: Snapshot timebase offset on guest entryPaul Mackerras2-45/+45
Currently, the HV KVM guest entry/exit code adds the timebase offset from the vcore struct to the timebase on guest entry, and subtracts it on guest exit. Which is fine, except that it is possible for userspace to change the offset using the SET_ONE_REG interface while the vcore is running, as there is only one timebase offset per vcore but potentially multiple VCPUs in the vcore. If that were to happen, KVM would subtract a different offset on guest exit from that which it had added on guest entry, leading to the timebase being out of sync between cores in the host, which then leads to bad things happening such as hangs and spurious watchdog timeouts. To fix this, we add a new field 'tb_offset_applied' to the vcore struct which stores the offset that is currently applied to the timebase. This value is set from the vcore tb_offset field on guest entry, and is what is subtracted from the timebase on guest exit. Since it is zero when the timebase offset is not applied, we can simplify the logic in kvmhv_start_timing and kvmhv_accumulate_time. In addition, we had secondary threads reading the timebase while running concurrently with code on the primary thread which would eventually add or subtract the timebase offset from the timebase. This occurred while saving or restoring the DEC register value on the secondary threads. Although no specific incorrect behaviour has been observed, this is a race which should be fixed. To fix it, we move the DEC saving code to just before we call kvmhv_commence_exit, and the DEC restoring code to after the point where we have waited for the primary thread to switch the MMU context and add the timebase offset. That way we are sure that the timebase contains the guest timebase value in both cases. Signed-off-by: Paul Mackerras <[email protected]>
2018-05-15Merge branch 'topic/ppc-kvm' into nextMichael Ellerman1-5/+31
This brings in one commit that we may want to share with the kvm-ppc tree, to avoid merge conflicts and get wider testing.
2018-05-15powerpc/kvm: Switch kvm pmd allocator to custom allocatorAneesh Kumar K.V1-5/+31
In the next set of patches, we will switch pmd allocator to use page fragments and the locking will be updated to split pmd ptlock. We want to avoid using fragments for partition-scoped table. Use slab cache similar to level 4 table Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-05-03powerpc64/ftrace: Disable ftrace during kvm entry/exitNaveen N. Rao2-0/+7
During guest entry/exit, we switch over to/from the guest MMU context and we cannot take exceptions in the hypervisor code. Since ftrace may be enabled and since it can result in us taking a trap, disable ftrace by setting paca->ftrace_enabled to zero. There are two paths through which we enter/exit a guest: 1. If we are the vcore runner, then we enter the guest via __kvmppc_vcore_entry() and we disable ftrace around this. This is always the case for Power9, and for the primary thread on Power8. 2. If we are a secondary thread in Power8, then we would be in nap due to SMT being disabled. We are woken up by an IPI to enter the guest. In this scenario, we enter the guest through kvm_start_guest(). We disable ftrace at this point. In this scenario, ftrace would only get re-enabled on the secondary thread when SMT is re-enabled (via start_secondary()). Signed-off-by: Naveen N. Rao <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-04-27powerpc/kvm/booke: Fix altivec related build breakLaurentiu Tudor1-0/+7
Add missing "altivec unavailable" interrupt injection helper thus fixing the linker error below: arch/powerpc/kvm/emulate_loadstore.o: In function `kvmppc_check_altivec_disabled': arch/powerpc/kvm/emulate_loadstore.c: undefined reference to `.kvmppc_core_queue_vec_unavail' Fixes: 09f984961c137c4b ("KVM: PPC: Book3S: Add MMIO emulation for VMX instructions") Signed-off-by: Laurentiu Tudor <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-04-15Merge tag 'powerpc-4.17-2' of ↵Linus Torvalds1-4/+0
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix crashes when loading modules built with a different CONFIG_RELOCATABLE value by adding CONFIG_RELOCATABLE to vermagic. - Fix busy loops in the OPAL NVRAM driver if we get certain error conditions from firmware. - Remove tlbie trace points from KVM code that's called in real mode, because it causes crashes. - Fix checkstops caused by invalid tlbiel on Power9 Radix. - Ensure the set of CPU features we "know" are always enabled is actually the minimal set when we build with support for firmware supplied CPU features. Thanks to: Aneesh Kumar K.V, Anshuman Khandual, Nicholas Piggin. * tag 'powerpc-4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/64s: Fix CPU_FTRS_ALWAYS vs DT CPU features powerpc/mm/radix: Fix checkstops caused by invalid tlbiel KVM: PPC: Book3S HV: trace_tlbie must not be called in realmode powerpc/8xx: Fix build with hugetlbfs enabled powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops powerpc/powernv: define a standard delay for OPAL_BUSY type retry loops powerpc/fscr: Enable interrupts earlier before calling get_user() powerpc/64s: Fix section mismatch warnings from setup_rfi_flush() powerpc/modules: Fix crashes by adding CONFIG_RELOCATABLE to vermagic
2018-04-11KVM: PPC: Book3S HV: trace_tlbie must not be called in realmodeNicholas Piggin1-4/+0
This crashes with a "Bad real address for load" attempting to load from the vmalloc region in realmode (faulting address is in DAR). Oops: Bad interrupt in KVM entry/exit code, sig: 6 [#1] LE SMP NR_CPUS=2048 NUMA PowerNV CPU: 53 PID: 6582 Comm: qemu-system-ppc Not tainted 4.16.0-01530-g43d1859f0994 NIP: c0000000000155ac LR: c0000000000c2430 CTR: c000000000015580 REGS: c000000fff76dd80 TRAP: 0200 Not tainted (4.16.0-01530-g43d1859f0994) MSR: 9000000000201003 <SF,HV,ME,RI,LE> CR: 48082222 XER: 00000000 CFAR: 0000000102900ef0 DAR: d00017fffd941a28 DSISR: 00000040 SOFTE: 3 NIP [c0000000000155ac] perf_trace_tlbie+0x2c/0x1a0 LR [c0000000000c2430] do_tlbies+0x230/0x2f0 I suspect the reason is the per-cpu data is not in the linear chunk. This could be restored if that was able to be fixed, but for now, just remove the tracepoints. Fixes: 0428491cba92 ("powerpc/mm: Trace tlbie(l) instructions") Cc: [email protected] # v4.13+ Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-04-09Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds9-169/+210
Pull kvm updates from Paolo Bonzini: "ARM: - VHE optimizations - EL2 address space randomization - speculative execution mitigations ("variant 3a", aka execution past invalid privilege register access) - bugfixes and cleanups PPC: - improvements for the radix page fault handler for HV KVM on POWER9 s390: - more kvm stat counters - virtio gpu plumbing - documentation - facilities improvements x86: - support for VMware magic I/O port and pseudo-PMCs - AMD pause loop exiting - support for AMD core performance extensions - support for synchronous register access - expose nVMX capabilities to userspace - support for Hyper-V signaling via eventfd - use Enlightened VMCS when running on Hyper-V - allow userspace to disable MWAIT/HLT/PAUSE vmexits - usual roundup of optimizations and nested virtualization bugfixes Generic: - API selftest infrastructure (though the only tests are for x86 as of now)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (174 commits) kvm: x86: fix a prototype warning kvm: selftests: add sync_regs_test kvm: selftests: add API testing infrastructure kvm: x86: fix a compile warning KVM: X86: Add Force Emulation Prefix for "emulate the next instruction" KVM: X86: Introduce handle_ud() KVM: vmx: unify adjacent #ifdefs x86: kvm: hide the unused 'cpu' variable KVM: VMX: remove bogus WARN_ON in handle_ept_misconfig Revert "KVM: X86: Fix SMRAM accessing even if VM is shutdown" kvm: Add emulation for movups/movupd KVM: VMX: raise internal error for exception during invalid protected mode state KVM: nVMX: Optimization: Dont set KVM_REQ_EVENT when VMExit with nested_run_pending KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending KVM: x86: Fix misleading comments on handling pending exceptions KVM: x86: Rename interrupt.pending to interrupt.injected KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt x86/kvm: use Enlightened VMCS when running on Hyper-V x86/hyper-v: detect nested features x86/hyper-v: define struct hv_enlightened_vmcs and clean field bits ...
2018-04-07Merge tag 'powerpc-4.17-1' of ↵Linus Torvalds9-35/+555
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Notable changes: - Support for 4PB user address space on 64-bit, opt-in via mmap(). - Removal of POWER4 support, which was accidentally broken in 2016 and no one noticed, and blocked use of some modern instructions. - Workarounds so that the hypervisor can enable Transactional Memory on Power9. - A series to disable the DAWR (Data Address Watchpoint Register) on Power9. - More information displayed in the meltdown/spectre_v1/v2 sysfs files. - A vpermxor (Power8 Altivec) implementation for the raid6 Q Syndrome. - A big series to make the allocation of our pacas (per cpu area), kernel page tables, and per-cpu stacks NUMA aware when using the Radix MMU on Power9. And as usual many fixes, reworks and cleanups. Thanks to: Aaro Koskinen, Alexandre Belloni, Alexey Kardashevskiy, Alistair Popple, Andy Shevchenko, Aneesh Kumar K.V, Anshuman Khandual, Balbir Singh, Benjamin Herrenschmidt, Christophe Leroy, Christophe Lombard, Cyril Bur, Daniel Axtens, Dave Young, Finn Thain, Frederic Barrat, Gustavo Romero, Horia Geantă, Jonathan Neuschäfer, Kees Cook, Larry Finger, Laurent Dufour, Laurent Vivier, Logan Gunthorpe, Madhavan Srinivasan, Mark Greer, Mark Hairgrove, Markus Elfring, Mathieu Malaterre, Matt Brown, Matt Evans, Mauricio Faria de Oliveira, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Philippe Bergheaud, Ram Pai, Rob Herring, Sam Bobroff, Segher Boessenkool, Simon Guo, Simon Horman, Stewart Smith, Sukadev Bhattiprolu, Suraj Jitindar Singh, Thiago Jung Bauermann, Vaibhav Jain, Vaidyanathan Srinivasan, Vasant Hegde, Wei Yongjun" * tag 'powerpc-4.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (207 commits) powerpc/64s/idle: Fix restore of AMOR on POWER9 after deep sleep powerpc/64s: Fix POWER9 DD2.2 and above in cputable features powerpc/64s: Fix pkey support in dt_cpu_ftrs, add CPU_FTR_PKEY bit powerpc/64s: Fix dt_cpu_ftrs to have restore_cpu clear unwanted LPCR bits Revert "powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead" powerpc: iomap.c: introduce io{read|write}64_{lo_hi|hi_lo} powerpc: io.h: move iomap.h include so that it can use readq/writeq defs cxl: Fix possible deadlock when processing page faults from cxllib powerpc/hw_breakpoint: Only disable hw breakpoint if cpu supports it powerpc/mm/radix: Update command line parsing for disable_radix powerpc/mm/radix: Parse disable_radix commandline correctly. powerpc/mm/hugetlb: initialize the pagetable cache correctly for hugetlb powerpc/mm/radix: Update pte fragment count from 16 to 256 on radix powerpc/mm/keys: Update documentation and remove unnecessary check powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead powerpc/64s/idle: Consolidate power9_offline_stop()/power9_idle_stop() powerpc/powernv: Always stop secondaries before reboot/shutdown powerpc: hard disable irqs in smp_send_stop loop powerpc: use NMI IPI for smp_send_stop powerpc/powernv: Fix SMT4 forcing idle code ...
2018-04-03KVM: PPC: Book3S HV: Fix ppc_breakpoint_available compile errorNicholas Piggin1-0/+1
arch/powerpc/kvm/book3s_hv.c: In function ‘kvmppc_h_set_mode’: arch/powerpc/kvm/book3s_hv.c:745:8: error: implicit declaration of function ‘ppc_breakpoint_available’ if (!ppc_breakpoint_available()) ^~~~~~~~~~~~~~~~~~~~~~~~ Fixes: 398e712c007f ("KVM: PPC: Book3S HV: Return error from h_set_mode(SET_DAWR) on POWER9") Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-04-01powerpc/64s: Remove POWER4 supportNicholas Piggin1-6/+0
POWER4 has been broken since at least the change 49d09bf2a6 ("powerpc/64s: Optimise MSR handling in exception handling"), which requires mtmsrd L=1 support. This was introduced in ISA v2.01, and POWER4 supports ISA v2.00. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-04-01powerpc/kvm: Fix guest boot failure on Power9 since DAWR changesAneesh Kumar K.V1-1/+1
SLOF checks for 'sc 1' (hypercall) support by issuing a hcall with H_SET_DABR. Since the recent commit e8ebedbf3131 ("KVM: PPC: Book3S HV: Return error from h_set_dabr() on POWER9") changed H_SET_DABR to return H_UNSUPPORTED on Power9, we see guest boot failures, the symptom is the boot seems to just stop in SLOF, eg: SLOF *************************************************************** QEMU Starting Build Date = Sep 24 2017 12:23:07 FW Version = buildd@ release 20170724 <no further output> SLOF can cope if H_SET_DABR returns H_HARDWARE. So wwitch the return value to H_HARDWARE instead of H_UNSUPPORTED so that we don't break the guest boot. That does mean we return a different error to PowerVM in this case, but that's probably not a big concern. Fixes: e8ebedbf3131 ("KVM: PPC: Book3S HV: Return error from h_set_dabr() on POWER9") Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-03-31Merge branch 'topic/paca' into nextMichael Ellerman4-19/+23
Bring in yet another series that touches KVM code, and might need to be merged into the kvm-ppc branch to resolve conflicts. This required some changes in pnv_power9_force_smt4_catch/release() due to the paca array becomming an array of pointers.
2018-03-30Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-17/+18
Pull KVM fixes from Radim Krčmář: "PPC: - Fix a bug causing occasional machine check exceptions on POWER8 hosts (introduced in 4.16-rc1) x86: - Fix a guest crashing regression with nested VMX and restricted guest (introduced in 4.16-rc1) - Fix dependency check for pv tlb flush (the wrong dependency that effectively disabled the feature was added in 4.16-rc4, the original feature in 4.16-rc1, so it got decent testing)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: Fix pv tlb flush dependencies KVM: nVMX: sync vmcs02 segment regs prior to vmx_set_cr0 KVM: PPC: Book3S HV: Fix duplication of host SLB entries
2018-03-30powerpc/64s: Allocate LPPACAs individuallyNicholas Piggin1-1/+2
We no longer allocate lppacas in an array, so this patch removes the 1kB static alignment for the structure, and enforces the PAPR alignment requirements at allocation time. We can not reduce the 1kB allocation size however, due to existing KVM hypervisors. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-03-30powerpc/64: Use array of paca pointers and allocate pacas individuallyNicholas Piggin2-14/+19
Change the paca array into an array of pointers to pacas. Allocate pacas individually. This allows flexibility in where the PACAs are allocated. Future work will allocate them node-local. Platforms that don't have address limits on PACAs would be able to defer PACA allocations until later in boot rather than allocate all possible ones up-front then freeing unused. This is slightly more overhead (one additional indirection) for cross CPU paca references, but those aren't too common. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-03-30powerpc/64s: Do not allocate lppaca if we are not virtualizedNicholas Piggin2-4/+2
The "lppaca" is a structure registered with the hypervisor. This is unnecessary when running on non-virtualised platforms. One field from the lppaca (pmcregs_in_use) is also used by the host, so move the host part out into the paca (lppaca field is still updated in guest mode). Signed-off-by: Nicholas Piggin <[email protected]> [mpe: Fix non-pseries build with some #ifdefs] Signed-off-by: Michael Ellerman <[email protected]>
2018-03-28Merge tag 'powerpc-4.16-6' of ↵Linus Torvalds2-0/+14
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "Some more powerpc fixes for 4.16. Apologies if this is a bit big at rc7, but they're all reasonably important fixes. None are actually for new code, so they aren't indicative of 4.16 being in bad shape from our point of view. - Fix missing AT_BASE_PLATFORM (in auxv) when we're using a new firmware interface for describing CPU features. - Fix lost pending interrupts due to a race in our interrupt soft-masking code. - A workaround for a nest MMU bug with TLB invalidations on Power9. - A workaround for broadcast TLB invalidations on Power9. - Fix a bug in our instruction SLB miss handler, when handling bad addresses (eg. >= TASK_SIZE), which could corrupt non-volatile user GPRs. Thanks to: Aneesh Kumar K.V, Balbir Singh, Benjamin Herrenschmidt, Nicholas Piggin" * tag 'powerpc-4.16-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/64s: Fix i-side SLB miss bad address handler saving nonvolatile GPRs powerpc/mm: Fixup tlbie vs store ordering issue on POWER9 powerpc/mm/radix: Move the functions that does the actual tlbie closer powerpc/mm/radix: Remove unused code powerpc/mm: Workaround Nest MMU bug with TLB invalidations powerpc/mm: Add tracking of the number of coprocessors using a context powerpc/64s: Fix lost pending interrupt due to race causing lost update to irq_happened powerpc/64s: Fix NULL AT_BASE_PLATFORM when using DT CPU features
2018-03-28Merge branch 'fixes' into nextMichael Ellerman2-0/+14
Merge our fixes branch from the 4.16 cycle. There were a number of important fixes merged, in particular some Power9 workarounds that we want in next for testing purposes. There's also been some conflicting changes in the CPU features code which are best merged and tested before going upstream.
2018-03-28KVM: PPC: Book3S HV: Use __gfn_to_pfn_memslot() in page fault handlerPaul Mackerras1-60/+88
This changes the hypervisor page fault handler for radix guests to use the generic KVM __gfn_to_pfn_memslot() function instead of using get_user_pages_fast() and then handling the case of VM_PFNMAP vmas specially. The old code missed the case of VM_IO vmas; with this change, VM_IO vmas will now be handled correctly by code within __gfn_to_pfn_memslot. Currently, __gfn_to_pfn_memslot calls hva_to_pfn, which only uses __get_user_pages_fast for the initial lookup in the cases where either atomic or async is set. Since we are not setting either atomic or async, we do our own __get_user_pages_fast first, for now. This also adds code to check for the KVM_MEM_READONLY flag on the memslot. If it is set and this is a write access, we synthesize a data storage interrupt for the guest. In the case where the page is not normal RAM (i.e. page == NULL in kvmppc_book3s_radix_page_fault(), we read the PTE from the Linux page tables because we need the mapping attribute bits as well as the PFN. (The mapping attribute bits indicate whether accesses have to be non-cacheable and/or guarded.) Signed-off-by: Paul Mackerras <[email protected]>
2018-03-27Merge branch 'topic/ppc-kvm' into nextMichael Ellerman2-1/+19
Merge the DAWR series, which touches arch code and KVM code and may need to be merged into the kvm-ppc tree.
2018-03-27KVM: PPC: Book3S HV: Handle migration with POWER9 disabled DAWRMichael Neuling1-0/+10
POWER9 with the DAWR disabled causes problems for partition migration. Either we have to fail the migration (since we lose the DAWR) or we silently drop the DAWR and allow the migration to pass. This patch does the latter and allows the migration to pass (at the cost of silently losing the DAWR). This is not ideal but hopefully the best overall solution. This approach has been acked by Paulus. With this patch kvmppc_set_one_reg() will store the DAWR in the vcpu but won't actually set it on POWER9 hardware. Signed-off-by: Michael Neuling <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-03-27KVM: PPC: Book3S HV: Return error from h_set_dabr() on POWER9Michael Neuling1-1/+7
POWER7 compat mode guests can use h_set_dabr on POWER9. POWER9 should use the DAWR but since it's disabled there we can't. This returns H_UNSUPPORTED on a h_set_dabr() on POWER9 where the DAWR is disabled. Current Linux guests ignore this error, so they will silently not get the DAWR (sigh). The same error code is being used by POWERVM in this case. Signed-off-by: Michael Neuling <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-03-27KVM: PPC: Book3S HV: Return error from h_set_mode(SET_DAWR) on POWER9Michael Neuling1-0/+2
Return H_P2 on a h_set_mode(SET_DAWR) on POWER9 where the DAWR is disabled. Current Linux guests ignore this error, so they will silently not get the DAWR (sigh). The same error code is being used by POWERVM in this case. Signed-off-by: Michael Neuling <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-03-24Merge branch 'topic/ppc-kvm' into nextMichael Ellerman6-9/+512
This brings in two series from Paul, one of which touches KVM code and may need to be merged into the kvm-ppc tree to resolve conflicts.
2018-03-24KVM: PPC: Book3S HV: Work around TEXASR bug in fake suspend statePaul Mackerras1-7/+10
This works around a hardware bug in "Nimbus" POWER9 DD2.2 processors, where the contents of the TEXASR can get corrupted while a thread is in fake suspend state. The workaround is for the instruction emulation code to use the value saved at the most recent guest exit in real suspend mode. We achieve this by simply not saving the TEXASR into the vcpu struct on an exit in fake suspend state. We also have to take care to set the orig_texasr field only on guest exit in real suspend state. This also means that on guest entry in fake suspend state, TEXASR will be restored to the value it had on the last exit in real suspend state, effectively counteracting any hardware-caused corruption. This works because TEXASR may not be written in suspend state. With this, the guest might see the wrong values in TEXASR if it reads it while in suspend state, but will see the correct value in non-transactional state (e.g. after a treclaim), and treclaim will work correctly. With this workaround, the code will actually run slightly faster, and will operate correctly on systems without the TEXASR bug (since TEXASR may not be written in suspend state, and is only changed by failure recording, which will have already been done before we get into fake suspend state). Therefore these changes are not made subject to a CPU feature bit. Signed-off-by: Paul Mackerras <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-03-24KVM: PPC: Book3S HV: Work around XER[SO] bug in fake suspend modeSuraj Jitindar Singh1-4/+20
This works around a hardware bug in "Nimbus" POWER9 DD2.2 processors, where a treclaim performed in fake suspend mode can cause subsequent reads from the XER register to return inconsistent values for the SO (summary overflow) bit. The inconsistent SO bit state can potentially be observed on any thread in the core. We have to do the treclaim because that is the only way to get the thread out of suspend state (fake or real) and into non-transactional state. The workaround for the bug is to force the core into SMT4 mode before doing the treclaim. This patch adds the code to do that, conditional on the CPU_FTR_P9_TM_XER_SO_BUG feature bit. Signed-off-by: Suraj Jitindar Singh <[email protected]> Signed-off-by: Paul Mackerras <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-03-24KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9Paul Mackerras6-7/+491
POWER9 has hardware bugs relating to transactional memory and thread reconfiguration (changes to hardware SMT mode). Specifically, the core does not have enough storage to store a complete checkpoint of all the architected state for all four threads. The DD2.2 version of POWER9 includes hardware modifications designed to allow hypervisor software to implement workarounds for these problems. This patch implements those workarounds in KVM code so that KVM guests see a full, working transactional memory implementation. The problems center around the use of TM suspended state, where the CPU has a checkpointed state but execution is not transactional. The workaround is to implement a "fake suspend" state, which looks to the guest like suspended state but the CPU does not store a checkpoint. In this state, any instruction that would cause a transition to transactional state (rfid, rfebb, mtmsrd, tresume) or would use the checkpointed state (treclaim) causes a "soft patch" interrupt (vector 0x1500) to the hypervisor so that it can be emulated. The trechkpt instruction also causes a soft patch interrupt. On POWER9 DD2.2, we avoid returning to the guest in any state which would require a checkpoint to be present. The trechkpt in the guest entry path which would normally create that checkpoint is replaced by either a transition to fake suspend state, if the guest is in suspend state, or a rollback to the pre-transactional state if the guest is in transactional state. Fake suspend state is indicated by a flag in the PACA plus a new bit in the PSSCR. The new PSSCR bit is write-only and reads back as 0. On exit from the guest, if the guest is in fake suspend state, we still do the treclaim instruction as we would in real suspend state, in order to get into non-transactional state, but we do not save the resulting register state since there was no checkpoint. Emulation of the instructions that cause a softpatch interrupt is handled in two paths. If the guest is in real suspend mode, we call kvmhv_p9_tm_emulation_early() to handle the cases where the guest is transitioning to transactional state. This is called before we do the treclaim in the guest exit path; because we haven't done treclaim, we can get back to the guest with the transaction still active. If the instruction is a case that kvmhv_p9_tm_emulation_early() doesn't handle, or if the guest is in fake suspend state, then we proceed to do the complete guest exit path and subsequently call kvmhv_p9_tm_emulation() in host context with the MMU on. This handles all the cases including the cases that generate program interrupts (illegal instruction or TM Bad Thing) and facility unavailable interrupts. The emulation is reasonably straightforward and is mostly concerned with checking for exception conditions and updating the state of registers such as MSR and CR0. The treclaim emulation takes care to ensure that the TEXASR register gets updated as if it were the guest treclaim instruction that had done failure recording, not the treclaim done in hypervisor state in the guest exit path. With this, the KVM_CAP_PPC_HTM capability returns true (1) even if transactional memory is not available to host userspace. Signed-off-by: Paul Mackerras <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-03-23powerpc/mm: Fixup tlbie vs store ordering issue on POWER9Aneesh Kumar K.V2-0/+14
On POWER9, under some circumstances, a broadcast TLB invalidation might complete before all previous stores have drained, potentially allowing stale stores from becoming visible after the invalidation. This works around it by doubling up those TLB invalidations which was verified by HW to be sufficient to close the risk window. This will be documented in a yet-to-be-published errata. Fixes: 1a472c9dba6b ("powerpc/mm/radix: Add tlbflush routines") Signed-off-by: Aneesh Kumar K.V <[email protected]> [mpe: Enable the feature in the DT CPU features code for all Power9, rename the feature to CPU_FTR_P9_TLBIE_BUG per benh.] Signed-off-by: Michael Ellerman <[email protected]>
2018-03-23KVM: PPC: Book3S HV: Fix duplication of host SLB entriesPaul Mackerras1-17/+18
Since commit 6964e6a4e489 ("KVM: PPC: Book3S HV: Do SLB load/unload with guest LPCR value loaded", 2018-01-11), we have been seeing occasional machine check interrupts on POWER8 systems when running KVM guests, due to SLB multihit errors. This turns out to be due to the guest exit code reloading the host SLB entries from the SLB shadow buffer when the SLB was not previously cleared in the guest entry path. This can happen because the path which skips from the guest entry code to the guest exit code without entering the guest now does the skip before the SLB is cleared and loaded with guest values, but the host values are loaded after the point in the guest exit path that we skip to. To fix this, we move the code that reloads the host SLB values up so that it occurs just before the point in the guest exit code (the label guest_bypass:) where we skip to from the guest entry path. Reported-by: Alexey Kardashevskiy <[email protected]> Fixes: 6964e6a4e489 ("KVM: PPC: Book3S HV: Do SLB load/unload with guest LPCR value loaded") Tested-by: Alexey Kardashevskiy <[email protected]> Signed-off-by: Paul Mackerras <[email protected]>
2018-03-19KVM: PPC: Book3S HV: Handle 1GB pages in radix page fault handlerPaul Mackerras1-36/+93
This adds code to the radix hypervisor page fault handler to handle the case where the guest memory is backed by 1GB hugepages, and put them into the partition-scoped radix tree at the PUD level. The code is essentially analogous to the code for 2MB pages. This also rearranges kvmppc_create_pte() to make it easier to follow. Signed-off-by: Paul Mackerras <[email protected]>
2018-03-19KVM: PPC: Book3S HV: Streamline setting of reference and change bitsPaul Mackerras1-33/+19
When using the radix MMU, we can get hypervisor page fault interrupts with the DSISR_SET_RC bit set in DSISR/HSRR1, indicating that an attempt to set the R (reference) or C (change) bit in a PTE atomically failed. Previously we would find the corresponding Linux PTE and check the permission and dirty bits there, but this is not really necessary since we only need to do what the hardware was trying to do, namely set R or C atomically. This removes the code that reads the Linux PTE and just update the partition-scoped PTE, having first checked that it is still present, and if the access is a write, that the PTE still has write permission. Furthermore, we now check whether any other relevant bits are set in DSISR, and if there are, then we proceed with the rest of the function in order to handle whatever condition they represent, instead of returning to the guest as we did previously. Signed-off-by: Paul Mackerras <[email protected]>
2018-03-19KVM: PPC: Book3S HV: Radix page fault handler optimizationsPaul Mackerras1-15/+27
This improves the handling of transparent huge pages in the radix hypervisor page fault handler. Previously, if a small page is faulted in to a 2MB region of guest physical space, that means that there is a page table pointer at the PMD level, which could never be replaced by a leaf (2MB) PMD entry. This adds the code to clear the PMD, invlidate the page walk cache and free the page table page in this situation, so that the leaf PMD entry can be created. This also adds code to check whether a PMD or PTE being inserted is the same as is already there (because of a race with another CPU that faulted on the same page) and if so, we don't replace the existing entry, meaning that we don't invalidate the PTE or PMD and do a TLB invalidation. Signed-off-by: Paul Mackerras <[email protected]>
2018-03-19KVM: PPC: Remove unused kvm_unmap_hva callbackPaul Mackerras8-44/+2
Since commit fb1522e099f0 ("KVM: update to new mmu_notifier semantic v2", 2017-08-31), the MMU notifier code in KVM no longer calls the kvm_unmap_hva callback. This removes the PPC implementations of kvm_unmap_hva(). Signed-off-by: Paul Mackerras <[email protected]>
2018-03-14KVM: PPC: Book3S HV: Fix trap number return from __kvmppc_vcore_entryPaul Mackerras1-5/+5
This fixes a bug where the trap number that is returned by __kvmppc_vcore_entry gets corrupted. The effect of the corruption is that IPIs get ignored on POWER9 systems when the IPI is sent via a doorbell interrupt to a CPU which is executing in a KVM guest. The effect of the IPI being ignored is often that another CPU locks up inside smp_call_function_many() (and if that CPU is holding a spinlock, other CPUs then lock up inside raw_spin_lock()). The trap number is currently held in register r12 for most of the assembly-language part of the guest exit path. In that path, we call kvmppc_subcore_exit_guest(), which is a C function, without restoring r12 afterwards. Depending on the kernel config and the compiler, it may modify r12 or it may not, so some config/compiler combinations see the bug and others don't. To fix this, we arrange for the trap number to be stored on the stack from the 'guest_bypass:' label until the end of the function, then the trap number is loaded and returned in r12 as before. Cc: [email protected] # v4.8+ Fixes: fd7bacbca47a ("KVM: PPC: Book3S HV: Fix TB corruption in guest exit path on HMI interrupt") Signed-off-by: Paul Mackerras <[email protected]>
2018-03-13Merge two commits from 'kvm-ppc-fixes' into nextMichael Ellerman1-1/+3
This merges two commits from the `kvm-ppc-fixes` branch into next, as they fix build breaks we are seeing while testing next.
2018-03-06Merge tag 'kvm-ppc-fixes-4.16-1' of ↵Radim Krčmář3-35/+55
git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc Fixes for PPC KVM: - Fix guest time accounting in the host - Fix large-page backing for radix guests on POWER9 - Fix HPT guests on POWER9 backed by 2M or 1G pages - Compile fixes for some configs and gcc versions
2018-03-03KVM: PPC: Book3S HV: Fix guest time accounting with VIRT_CPU_ACCOUNTING_GENLaurent Vivier1-3/+2
Since commit 8b24e69fc47e ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry"), if CONFIG_VIRT_CPU_ACCOUNTING_GEN is set, the guest time is not accounted to guest time and user time, but instead to system time. This is because guest_enter()/guest_exit() are called while interrupts are disabled and the tick counter cannot be updated between them. To fix that, move guest_exit() after local_irq_enable(), and as guest_enter() is called with IRQ disabled, call guest_enter_irqoff() instead. Fixes: 8b24e69fc47e ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry") Signed-off-by: Laurent Vivier <[email protected]> Reviewed-by: Paolo Bonzini <[email protected]> Signed-off-by: Paul Mackerras <[email protected]>
2018-03-02KVM: PPC: Book3S HV: Fix VRMA initialization with 2MB or 1GB memory backingPaul Mackerras1-5/+7
The current code for initializing the VRMA (virtual real memory area) for HPT guests requires the page size of the backing memory to be one of 4kB, 64kB or 16MB. With a radix host we have the possibility that the backing memory page size can be 2MB or 1GB. In these cases, if the guest switches to HPT mode, KVM will not initialize the VRMA and the guest will fail to run. In fact it is not necessary that the VRMA page size is the same as the backing memory page size; any VRMA page size less than or equal to the backing memory page size is acceptable. Therefore we now choose the largest page size out of the set {4k, 64k, 16M} which is not larger than the backing memory page size. Signed-off-by: Paul Mackerras <[email protected]>
2018-03-02KVM: PPC: Book3S HV: Fix handling of large pages in radix page fault handlerPaul Mackerras1-26/+43
This fixes several bugs in the radix page fault handler relating to the way large pages in the memory backing the guest were handled. First, the check for large pages only checked for explicit huge pages and missed transparent huge pages. Then the check that the addresses (host virtual vs. guest physical) had appropriate alignment was wrong, meaning that the code never put a large page in the partition scoped radix tree; it was always demoted to a small page. Fixing this exposed bugs in kvmppc_create_pte(). We were never invalidating a 2MB PTE, which meant that if a page was initially faulted in without write permission and the guest then attempted to store to it, we would never update the PTE to have write permission. If we find a valid 2MB PTE in the PMD, we need to clear it and do a TLB invalidation before installing either the new 2MB PTE or a pointer to a page table page. This also corrects an assumption that get_user_pages_fast would set the _PAGE_DIRTY bit if we are writing, which is not true. Instead we mark the page dirty explicitly with set_page_dirty_lock(). This also means we don't need the dirty bit set on the host PTE when providing write access on a read fault. Signed-off-by: Paul Mackerras <[email protected]>