|
Historically adding devices to their respective iommu group has been
handled by the post-init phb fixup for most devices. This was done
because:
1) The IOMMU group is tied to the PE (usually) so we can only set up the
iommu groups after we've done resource allocation, since the BAR location
determines the device's PE, and:
2) The sysfs directory for the pci_dev needs to be available since
iommu_add_device() wants to add an attribute for the iommu group.
However, since commit 30d87ef8b38d ("powerpc/pci: Fix
pcibios_setup_device() ordering") both conditions are met when
hose->ops->dma_dev_setup() is called so there's no real need to do
this in the fixup.
Moving the call to iommu_add_device() into pnv_pci_ioda_dma_setup_dev()
is a nice cleanup since it puts all the per-device IOMMU setup into one
place. It also results in all (non-nvlink) devices getting their iommu
group via a common path rather than relying on the bus notifier hack
in pnv_tce_iommu_bus_notifier() to handle adding VFs and
hotplugged devices to their group.
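As a rough illustration of the resulting per-device setup path (a hedged
sketch only; the function and helper names follow the commit text and the
existing powernv code, not the actual diff):

  static void pnv_pci_ioda_dma_setup_dev(struct pci_dev *pdev)
  {
          /* The PE is known here: BARs have already been assigned */
          struct pnv_ioda_pe *pe = pnv_ioda_get_pe(pdev);

          if (!pe)
                  return;

          set_iommu_table_base(&pdev->dev, pe->table_group.tables[0]);
          /* The pci_dev sysfs dir exists, so the group attribute can be added */
          iommu_add_device(&pe->table_group, &pdev->dev);
  }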
Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Move the registration of IOMMU groups out of the post-phb init fixup and
into when we configure DMA for a PE. For most devices this doesn't
result in any functional changes, but for NVLink attached GPUs it
requires a bit of care. When the GPU is probed an IOMMU group would be
created for the PE that contains it. We need to ensure that group is
removed before we add the PE to the compound group that's used to keep
the translations seen by the PCIe and NVLink buses the same.
No functional changes. Probably.
Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
In pnv_ioda_setup_vf_PE() we register an iommu group for the VF PE
then call pnv_ioda_setup_bus_iommu_group() to add devices to that group.
However, this function is called before the VFs are scanned, so there are
no devices to add.
Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Re-work the control flow a bit so what's going on is a little clearer.
This also ensures the table_group is only initialised once in the P9
case. This shouldn't be a functional change since all the GPU PCI
devices should have the same table_group configuration, but it does
look strange.
Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Similar to the C code change, make the AMR restore conditional on
whether the register has changed.
Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The AMR update is made conditional on AMR actually changing, which
should be the less common case on most workloads (though kernel page
faults on uaccess could be frequent, this doesn't significantly slow
down that case).
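As a rough C sketch of the idea (illustrative only; the helper name and the
regs->kuap field follow the existing 64s KUAP code, not the exact patch):

  static inline void kuap_restore_amr(struct pt_regs *regs, unsigned long amr)
  {
          if (mmu_has_feature(MMU_FTR_RADIX_KUAP) && unlikely(regs->kuap != amr)) {
                  isync();
                  mtspr(SPRN_AMR, regs->kuap);
                  /* No isync: the context-synchronising rfid at exit suffices */
          }
  }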
Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Writing the AMR register is documented to require context
synchronizing operations before and after, for it to take effect as
expected. The KUAP restore at interrupt exit time deliberately avoids
the isync after the AMR update because it only needs to take effect
after the context synchronizing RFID that soon follows. Add a comment
for this.
The missing isync before the update doesn't have an obvious
justification, and seems it could theoretically allow a rogue user
access to leak past the AMR update. Add isyncs for these.
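For reference, a hedged C sketch of the documented sequence (the set_kuap()
name follows the existing radix KUAP helpers; details are illustrative):

  static inline void set_kuap(unsigned long value)
  {
          if (!mmu_has_feature(MMU_FTR_RADIX_KUAP))
                  return;
          /* AMR updates require context synchronisation before and after */
          isync();
          mtspr(SPRN_AMR, value);
          isync();
  }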
Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Signed-off-by: huhai <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
We have r12 available, use it to keep CR around and don't
save it in SPRN_SPRG_SCRATCH6.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/019f314a98c107c4ca46e46c1cf402e9a44114a7.1590079969.git.christophe.leroy@csgroup.eu
|
|
Let's reduce the number of registers used in TLB miss handlers.
We have both r9 and r12 available for any temporary use.
r9 is enough, avoid using r12.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/7f330e971952abb2645fb9ca4310c0f527e84dcb.1590079969.git.christophe.leroy@csgroup.eu
|
|
This erratum only applies to the IBM 405GP and STB03xxx,
which are now gone.
Remove this erratum.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/44dbc08e9034681eb28324cbabc086e97044c36c.1590079969.git.christophe.leroy@csgroup.eu
|
|
This erratum was for IBM 403GCX, 405EP and STB03xxx which are
now gone.
Remove this erratum.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/1b6c9916514ef3e084bba57925ad9eb444627566.1590079969.git.christophe.leroy@csgroup.eu
|
|
All platforms selecting the obsolete processor are gone now.
Remove support for it.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/906c6a6df710f2826e332b8a0cd5d2859a913a1c.1590079969.git.christophe.leroy@csgroup.eu
|
|
ISS4xx has support for 405GP which is obsolete.
Remove it.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/7380974bf5952af825ae2552d0a987c0c1c8b506.1590079969.git.christophe.leroy@csgroup.eu
|
|
EP405 is an old type of board based on a 405GP which is obsolete.
Remove it.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/e9534caa51f327c841b3db5f48043a47ad70d246.1590079968.git.christophe.leroy@csgroup.eu
|
|
CONFIG_WALNUT is not selected by any config and is based
on 405GP which is obsolete.
Remove it.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/ab46013d8d33346af68faf30a719a586c3befad9.1590079968.git.christophe.leroy@csgroup.eu
|
|
CONFIG_STB03xxx is not user selectable and is not selected
by any config.
Remove it.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/d7d73f9a8ee3a890566abace568101e9b4836016.1590079968.git.christophe.leroy@csgroup.eu
|
|
CONFIG_403GCX is not user selectable and is not
selected by any platform.
Remove it.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/635f8f5ce9d1f761b3bd8dc3e8ddad500cea26c4.1590079968.git.christophe.leroy@csgroup.eu
|
|
40x was the last user of PTE_ATOMIC_UPDATES.
Drop everything related to PTE_ATOMIC_UPDATES.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/dbe8438fd1ed3e500132c8ab70269d4e6cc84531.1590079968.git.christophe.leroy@csgroup.eu
|
|
Commit 1bc54c03117b ("powerpc: rework 4xx PTE access and TLB miss")
reworked 44x PTE access to avoid atomic pte updates, and
left 8xx, 40x and fsl booke with atomic pte updates.
Commit 6cfd8990e27d ("powerpc: rework FSL Book-E PTE access and TLB
miss") removed atomic pte updates on fsl booke.
It went away on 8xx with commit ddfc20a3b9ae ("powerpc/8xx: Remove
PTE_ATOMIC_UPDATES").
40x is the last platform setting PTE_ATOMIC_UPDATES.
Rework PTE access and TLB miss to remove PTE_ATOMIC_UPDATES for 40x:
- Always handle DSI as a fault.
- Bail out of TLB miss handler when CONFIG_SWAP is set and
_PAGE_ACCESSED is not set.
- Bail out of ITLB miss handler when _PAGE_EXEC is not set.
- Only set WR bit when both _PAGE_RW and _PAGE_DIRTY are set.
- Remove _PAGE_HWWRITE
- Don't require PTE_ATOMIC_UPDATES anymore
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/99a0fcd337ef67088140d1647d75fea026a70413.1590079968.git.christophe.leroy@csgroup.eu
|
|
The latest Xilinx design tools, called ISE and EDK, were released in
October 2013. The new tools don't support any new PPC405/PPC440 designs,
and these platforms are no longer supported or tested.
The PowerPC 405/440 port has been orphaned since 2013, per
commit cdeb89943bfc ("MAINTAINERS: Fix incorrect status tag") and
commit 19624236cce1 ("MAINTAINERS: Update Grant's email address and maintainership"),
so it is time to remove support for these platforms.
Signed-off-by: Michal Simek <[email protected]>
Signed-off-by: Christophe Leroy <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/8c593895e2cb57d232d85ce4d8c3a1aa7f0869cc.1590079968.git.christophe.leroy@csgroup.eu
|
|
The same complicated sequence for juggling EE, RI, soft mask, and
irq tracing is repeated 3 times; tidy these up into one function.
This differs quite a bit between sub-architectures, so this makes
the ppc32 port cleaner as well.
Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The idea behind this prefetch was to kick off a page table walk before
returning from the fault, getting some pipelining advantage.
But this never showed any noticeable performance advantage, and in
fact with KUAP the prefetches are actually blocked and cause some
kind of micro-architectural fault. Removing this improves page fault
microbenchmark performance by about 9%.
Signed-off-by: Nicholas Piggin <[email protected]>
[mpe: Keep the early return in update_mmu_cache()]
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The XIVE interrupt mode can be disabled with the "xive=off" kernel
parameter, in which case there is nothing to present to the user in the
associated /sys/kernel/debug/powerpc/xive file.
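A minimal sketch of the guard (the init function and fops names follow the
debugfs file added by the commit below; illustrative, not the exact fix):

  static int __init xive_core_debug_init(void)
  {
          if (xive_enabled())
                  debugfs_create_file("xive", 0400, powerpc_debugfs_root,
                                      NULL, &xive_core_debug_fops);
          return 0;
  }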
Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state")
Signed-off-by: Cédric Le Goater <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
We are going to rely on the loosening of RCU callback semantics,
introduced by this commit:
806f04e9fd2c ("rcu: Allow for smp_call_function() running callbacks from idle")
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
There is a potential race condition between hypervisor page faults
and flushing a memslot. It is possible for a page fault to read the
memslot before a memslot is updated and then write a PTE to the
partition-scoped page tables after kvmppc_radix_flush_memslot has
completed. (Note that this race has never been explicitly observed.)
To close this race, it is sufficient to increment the MMU sequence
number while the kvm->mmu_lock is held. That will cause
mmu_notifier_retry() to return true, and the page fault will then
return to the guest without inserting a PTE.
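A minimal sketch of that fix (placed in kvmppc_radix_flush_memslot(); the
field names follow the KVM code of this era, the snippet is illustrative):

  spin_lock(&kvm->mmu_lock);
  /*
   * Bump the sequence number so a fault that sampled the stale memslot
   * sees mmu_notifier_retry() return true and backs out without
   * installing a PTE.
   */
  kvm->mmu_notifier_seq++;
  spin_unlock(&kvm->mmu_lock);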
Signed-off-by: Paul Mackerras <[email protected]>
|
|
Although in general we do not expect valid PTEs to be found in
kvmppc_create_pte when we are inserting a large page mapping, there
is one situation where this can occur. That is when dirty page
logging is turned off for a memslot while the VM is running.
Because the new memslots are installed before the old memslot is
flushed in kvmppc_core_commit_memory_region_hv(), there is a
window where a hypervisor page fault can try to install a 2MB
(or 1GB) page where there are already small page mappings which
were installed while dirty page logging was enabled and which
have not yet been flushed.
Since we have a situation where valid PTEs can legitimately be
found by kvmppc_unmap_free_pte, and which can be triggered by
userspace, just remove the WARN_ON_ONCE, since it is undesirable
to have userspace able to trigger a kernel warning.
Signed-off-by: Paul Mackerras <[email protected]>
|
|
Commit 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_*
Hcalls") added checks of the secure bit of SRR1 to filter out the Hcalls
reserved to the Ultravisor.
However, the Hcall H_SVM_INIT_ABORT is made by the Ultravisor passing the
context of the VM calling UV_ESM. This allows the Hypervisor to return to
the guest without going through the Ultravisor. Thus the Secure bit of SRR1
is not set in that particular case.
In the case a regular VM is calling H_SVM_INIT_ABORT, this hcall will be
filtered out in kvmppc_h_svm_init_abort() because kvm->arch.secure_guest is
not set in that case.
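For illustration, a hedged sketch of the filtering done by the handler
itself (the KVMPPC_SECURE_INIT_START flag and return values come from the
existing secure-guest code; the body is simplified):

  unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)
  {
          /* A regular VM that never started the secure transition is rejected */
          if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
                  return H_UNSUPPORTED;

          /* ... tear down the aborted secure transition ... */
          return H_PARAMETER;     /* illustrative return value */
  }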
Fixes: 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls")
Signed-off-by: Laurent Dufour <[email protected]>
Reviewed-by: Greg Kurz <[email protected]>
Reviewed-by: Ram Pai <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
It is unsafe to traverse kvm->arch.spapr_tce_tables and
stt->iommu_tables without the RCU read lock held. Also, add
cond_resched_rcu() in places with the RCU read lock held that could take
a while to finish.
arch/powerpc/kvm/book3s_64_vio.c:76 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
no locks held by qemu-kvm/4265.
stack backtrace:
CPU: 96 PID: 4265 Comm: qemu-kvm Not tainted 5.7.0-rc4-next-20200508+ #2
Call Trace:
[c000201a8690f720] [c000000000715948] dump_stack+0xfc/0x174 (unreliable)
[c000201a8690f770] [c0000000001d9470] lockdep_rcu_suspicious+0x140/0x164
[c000201a8690f7f0] [c008000010b9fb48] kvm_spapr_tce_release_iommu_group+0x1f0/0x220 [kvm]
[c000201a8690f870] [c008000010b8462c] kvm_spapr_tce_release_vfio_group+0x54/0xb0 [kvm]
[c000201a8690f8a0] [c008000010b84710] kvm_vfio_destroy+0x88/0x140 [kvm]
[c000201a8690f8f0] [c008000010b7d488] kvm_put_kvm+0x370/0x600 [kvm]
[c000201a8690f990] [c008000010b7e3c0] kvm_vm_release+0x38/0x60 [kvm]
[c000201a8690f9c0] [c0000000005223f4] __fput+0x124/0x330
[c000201a8690fa20] [c000000000151cd8] task_work_run+0xb8/0x130
[c000201a8690fa70] [c0000000001197e8] do_exit+0x4e8/0xfa0
[c000201a8690fb70] [c00000000011a374] do_group_exit+0x64/0xd0
[c000201a8690fbb0] [c000000000132c90] get_signal+0x1f0/0x1200
[c000201a8690fcc0] [c000000000020690] do_notify_resume+0x130/0x3c0
[c000201a8690fda0] [c000000000038d64] syscall_exit_prepare+0x1a4/0x280
[c000201a8690fe20] [c00000000000c8f8] system_call_common+0xf8/0x278
====
arch/powerpc/kvm/book3s_64_vio.c:368 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
2 locks held by qemu-kvm/4264:
#0: c000201ae2d000d8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0xdc/0x950 [kvm]
#1: c000200c9ed0c468 (&kvm->srcu){....}-{0:0}, at: kvmppc_h_put_tce+0x88/0x340 [kvm]
====
arch/powerpc/kvm/book3s_64_vio.c:108 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by qemu-kvm/4257:
#0: c000200b1b363a40 (&kv->lock){+.+.}-{3:3}, at: kvm_vfio_set_attr+0x598/0x6c0 [kvm]
====
arch/powerpc/kvm/book3s_64_vio.c:146 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by qemu-kvm/4257:
#0: c000200b1b363a40 (&kv->lock){+.+.}-{3:3}, at: kvm_vfio_set_attr+0x598/0x6c0 [kvm]
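A minimal sketch of the traversal pattern described above (the list head
name follows book3s_64_vio.c; the loop body is illustrative):

  struct kvmppc_spapr_tce_table *stt;

  rcu_read_lock();
  list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
          /* ... per-table work ... */
          cond_resched_rcu();     /* may briefly drop the RCU read lock */
  }
  rcu_read_unlock();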
Signed-off-by: Qian Cai <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
kvmppc_pmd_alloc() and kvmppc_pte_alloc() allocate some memory but then
pud_populate() and pmd_populate() will use __pa() to reference the newly
allocated memory.
Since kmemleak is unable to track physical memory references made via
__pa(), this results in false positives; silence those by using
kmemleak_ignore().
unreferenced object 0xc000201c382a1000 (size 4096):
comm "qemu-kvm", pid 124828, jiffies 4295733767 (age 341.250s)
hex dump (first 32 bytes):
c0 00 20 09 f4 60 03 87 c0 00 20 10 72 a0 03 87 .. ..`.... .r...
c0 00 20 0e 13 a0 03 87 c0 00 20 1b dc c0 03 87 .. ....... .....
backtrace:
[<000000004cc2790f>] kvmppc_create_pte+0x838/0xd20 [kvm_hv]
kvmppc_pmd_alloc at arch/powerpc/kvm/book3s_64_mmu_radix.c:366
(inlined by) kvmppc_create_pte at arch/powerpc/kvm/book3s_64_mmu_radix.c:590
[<00000000d123c49a>] kvmppc_book3s_instantiate_page+0x2e0/0x8c0 [kvm_hv]
[<00000000bb549087>] kvmppc_book3s_radix_page_fault+0x1b4/0x2b0 [kvm_hv]
[<0000000086dddc0e>] kvmppc_book3s_hv_page_fault+0x214/0x12a0 [kvm_hv]
[<000000005ae9ccc2>] kvmppc_vcpu_run_hv+0xc5c/0x15f0 [kvm_hv]
[<00000000d22162ff>] kvmppc_vcpu_run+0x34/0x48 [kvm]
[<00000000d6953bc4>] kvm_arch_vcpu_ioctl_run+0x314/0x420 [kvm]
[<000000002543dd54>] kvm_vcpu_ioctl+0x33c/0x950 [kvm]
[<0000000048155cd6>] ksys_ioctl+0xd8/0x130
[<0000000041ffeaa7>] sys_ioctl+0x28/0x40
[<000000004afc4310>] system_call_exception+0x114/0x1e0
[<00000000fb70a873>] system_call_common+0xf0/0x278
unreferenced object 0xc0002001f0c03900 (size 256):
comm "qemu-kvm", pid 124830, jiffies 4295735235 (age 326.570s)
hex dump (first 32 bytes):
c0 00 20 10 fa a0 03 87 c0 00 20 10 fa a1 03 87 .. ....... .....
c0 00 20 10 fa a2 03 87 c0 00 20 10 fa a3 03 87 .. ....... .....
backtrace:
[<0000000023f675b8>] kvmppc_create_pte+0x854/0xd20 [kvm_hv]
kvmppc_pte_alloc at arch/powerpc/kvm/book3s_64_mmu_radix.c:356
(inlined by) kvmppc_create_pte at arch/powerpc/kvm/book3s_64_mmu_radix.c:593
[<00000000d123c49a>] kvmppc_book3s_instantiate_page+0x2e0/0x8c0 [kvm_hv]
[<00000000bb549087>] kvmppc_book3s_radix_page_fault+0x1b4/0x2b0 [kvm_hv]
[<0000000086dddc0e>] kvmppc_book3s_hv_page_fault+0x214/0x12a0 [kvm_hv]
[<000000005ae9ccc2>] kvmppc_vcpu_run_hv+0xc5c/0x15f0 [kvm_hv]
[<00000000d22162ff>] kvmppc_vcpu_run+0x34/0x48 [kvm]
[<00000000d6953bc4>] kvm_arch_vcpu_ioctl_run+0x314/0x420 [kvm]
[<000000002543dd54>] kvm_vcpu_ioctl+0x33c/0x950 [kvm]
[<0000000048155cd6>] ksys_ioctl+0xd8/0x130
[<0000000041ffeaa7>] sys_ioctl+0x28/0x40
[<000000004afc4310>] system_call_exception+0x114/0x1e0
[<00000000fb70a873>] system_call_common+0xf0/0x278
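A hedged sketch of the change (kvmppc_pte_alloc() and kvm_pte_cache live in
book3s_64_mmu_radix.c; the exact placement in the real patch may differ):

  static pte_t *kvmppc_pte_alloc(void)
  {
          pte_t *pte = kmem_cache_alloc(kvm_pte_cache, GFP_KERNEL);

          /*
           * Only referenced via __pa() from pmd_populate(), which kmemleak
           * cannot track, so exclude it to avoid the false positives above.
           */
          if (pte)
                  kmemleak_ignore(pte);
          return pte;
  }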
Signed-off-by: Qian Cai <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related functions still take
both a 'kvm_run' and a 'kvm_vcpu' parameter. This patch does a unified
cleanup of these remaining redundant parameters.
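The shape of the cleanup, as a hedged before/after sketch (kvmppc_vcpu_run()
is one of the affected functions; the full list is in the patch):

  /* before: both are passed although one can be derived from the other */
  int kvmppc_vcpu_run(struct kvm_run *run, struct kvm_vcpu *vcpu);

  /* after: only the vcpu is passed, callees use vcpu->run where needed */
  int kvmppc_vcpu_run(struct kvm_vcpu *vcpu);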
Signed-off-by: Tianjia Zhang <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Paul Mackerras <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
The 'kvm_run' field already exists in the 'vcpu' structure and refers to
the same structure as the 'kvm_run' in 'vcpu_arch', so the latter should
be deleted.
Signed-off-by: Tianjia Zhang <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Paul Mackerras <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
The newly introduced ibm,secure-memory nodes supersede the
ibm,uv-firmware node's secure-memory-ranges property.
Firmware will no longer expose the secure-memory-ranges property, so
read the new nodes first and, if they are not found, fall back to the
older property.
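A hedged sketch of the lookup order (the compatible string and property
name come from the text above; how the older node is located is an
assumption, and the parsing itself is elided):

  struct device_node *np;
  bool found = false;

  /* New firmware: one ibm,secure-memory node per secure range */
  for_each_compatible_node(np, NULL, "ibm,secure-memory") {
          found = true;
          /* ... parse the node's reg property ... */
  }

  /* Older firmware: fall back to the ibm,uv-firmware property */
  if (!found) {
          np = of_find_compatible_node(NULL, NULL, "ibm,uv-firmware");
          if (np) {
                  /* ... parse the secure-memory-ranges property ... */
                  of_node_put(np);
          }
  }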
Signed-off-by: Laurent Dufour <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
The free function kfree() already checks for NULL, so the additional
check is unnecessary; just remove it.
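For illustration only (the pointer name is hypothetical), since kfree(NULL)
is a no-op:

  if (ptr)        /* redundant check */
          kfree(ptr);

  /* can simply become */
  kfree(ptr);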
Signed-off-by: Chen Zhou <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
Commit 1ca3dec2b2df ("powerpc/xive: Prevent page fault issues in the
machine crash handler") fixed an issue in the FW assisted dump of
machines using hash MMU and the XIVE interrupt mode under the POWER
hypervisor. It forced the mapping of the ESB pages of interrupts mapped
in the Linux IRQ number space to make sure the 'crash kexec' sequence
worked during such an event, but it didn't handle the un-mapping.
This mapping is now blocking the removal of a passthrough IO adapter
under the POWER hypervisor because it expects the guest OS to have
cleared all page table entries related to the adapter. If some are
still present, the RTAS call which isolates the PCI slot returns error
9001 "valid outstanding translations".
Remove these mappings in the IRQ data cleanup routine.
Under KVM, this cleanup is not required because the ESB pages for the
adapter interrupts are un-mapped from the guest by the hypervisor in
the KVM XIVE native device. This is now redundant but it's harmless.
Fixes: 1ca3dec2b2df ("powerpc/xive: Prevent page fault issues in the machine crash handler")
Cc: [email protected] # v5.5+
Signed-off-by: Cédric Le Goater <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The code patching code wants to get the value of a struct ppc_inst as
a u64 when the instruction is prefixed, so we can pass the u64 down to
__put_user_asm() and write it with a single store.
The optprobes code wants to load a struct ppc_inst as an immediate
into a register so it is useful to have it as a u64 to use the
existing helper function.
Currently this is a bit awkward because the value differs based on the
CPU endianness, so add a helper to do the conversion.
This fixes the usage in arch_prepare_optimized_kprobe() which was
previously incorrect on big endian.
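A sketch of such a helper (the name and the use of the existing
ppc_inst_val()/ppc_inst_suffix() accessors follow the ppc_inst code of this
era; treat the exact form as illustrative):

  static inline u64 ppc_inst_as_u64(struct ppc_inst x)
  {
  #ifdef CONFIG_CPU_LITTLE_ENDIAN
          return (u64)ppc_inst_suffix(x) << 32 | ppc_inst_val(x);
  #else
          return (u64)ppc_inst_val(x) << 32 | ppc_inst_suffix(x);
  #endif
  }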
Fixes: 650b55b707fd ("powerpc: Add prefixed instructions to instruction data type")
Signed-off-by: Michael Ellerman <[email protected]>
Tested-by: Jordan Niethe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
In a few places we want to calculate the address of the next
instruction. Previously that was simple, we just added 4 bytes, or if
using a u32 * we incremented that pointer by 1.
But prefixed instructions make it more complicated, we need to advance
by either 4 or 8 bytes depending on the actual instruction. We also
can't do pointer arithmetic using struct ppc_inst, because it is
always 8 bytes in size on 64-bit, even though we might only need to
advance by 4 bytes.
So add a ppc_inst_next() helper which calculates the location of the
next instruction, if the given instruction was located at the given
address. Note the instruction doesn't need to actually be at the
address in memory.
Although it would seem natural for the value to be passed by value,
that makes it too easy to write a loop that will read off the end of a
page, eg:
for (; src < end; src = ppc_inst_next(src, *src),
dest = ppc_inst_next(dest, *dest))
As noticed by Christophe and Jordan, if end is the exact end of a
page, and the next page is not mapped, this will fault, because *dest
will read 8 bytes, 4 bytes into the next page.
So instead the value is passed by reference, and the helper can be
careful to use ppc_inst_read() on it.
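A sketch of the helper (macro form so it works with any pointer type;
ppc_inst_read() and ppc_inst_len() are existing helpers, the exact
expansion is illustrative):

  #define ppc_inst_next(location, value)                           \
  ({                                                               \
          struct ppc_inst _tmp;                                    \
          _tmp = ppc_inst_read((struct ppc_inst *)(value));        \
          (void *)(location) + ppc_inst_len(_tmp);                 \
  })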
Signed-off-by: Michael Ellerman <[email protected]>
Reviewed-by: Jordan Niethe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Merge our fixes branch from this cycle. It contains several important
fixes we need in next for testing purposes, and also some that will
conflict with upcoming changes.
|
|
Merge Christophe's large series to use huge pages for the linear
mapping on 8xx.
From his cover letter:
The main purpose of this big series is to:
- reorganise huge page handling to avoid using mm_slices.
- use huge pages to map kernel memory on the 8xx.
The 8xx supports 4 page sizes: 4k, 16k, 512k and 8M.
It uses 2-level page tables, the PGD having 1024 entries, each entry
covering a 4M address space. Each page table then has 1024 entries.
At the time being, page sizes are managed in PGD entries, implying
the use of mm_slices as several page sizes can't be mixed in one
page table.
The first purpose of this series is to reorganise things so that
standard page tables can also handle 512k pages. This is done by
adding a new _PAGE_HUGE flag which will be copied into the Level 1
entry in the TLB miss handler. That done, we have 2 types of PGD entries:
- PGD entries to regular page tables handling 4k/16k and 512k pages
- PGD entries to hugepd tables handling 8M pages.
There is no need to mix 8M pages with other sizes, because an 8M page
will use more than what a single PGD entry covers.
Then comes the second purpose of this series. At the time being, the
8xx has implemented special handling in the TLB miss handlers in order
to transparently map kernel linear address space and the IMMR using
huge pages by building the TLB entries in assembly at the time of the
exception.
As mm_slices is only for user space pages, and also because it would
anyway not be convenient to slice kernel address space, it was not
possible to use huge pages for kernel address space. But after step
one of the series, it is now more flexible to use huge pages.
This series drops all assembly 'just in time' handling of huge pages
and uses huge pages in page tables instead.
Once the above is done, then comes the icing on the cake:
- Use huge pages for KASAN shadow mapping
- Allow pinned TLBs with strict kernel rwx
- Allow pinned TLBs with debug pagealloc
Then, last but not least, those modifications for the 8xx allow the
following improvements on book3s/32:
- Mapping KASAN shadow with BATs
- Allowing BATs with debug pagealloc
All this considerably simplifies the TLB miss handlers and the associated
initialisation. The overhead of reading page tables is negligible
compared to the reduction of the miss handlers.
While we were touching pte_update(), some cleanup was done
there too.
Tested widely on 8xx and 832x. Boot tested on QEMU MAC99.
|
|
Implement a kasan_init_region() dedicated to book3s/32 that
allocates KASAN regions using BATs.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/709e821602b48a1d7c211a9b156da26db98c3e9d.1589866984.git.christophe.leroy@csgroup.eu
|
|
DEBUG_PAGEALLOC only manages RW data.
Text and RO data can still be mapped with BATs.
In order to map with BATs, also enforce data alignment, set
by default to 256M, which is a good compromise for keeping
enough BATs for KASAN and the IMMR as well.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/fd29c1718ee44d82115d0e835ced808eb4ccbf51.1589866984.git.christophe.leroy@csgroup.eu
|
|
Implement a kasan_init_region() dedicated to 8xx that
allocates KASAN regions using huge pages.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/d2d60202a8821dc81cffe6ff59cc13c15b7e4bb6.1589866984.git.christophe.leroy@csgroup.eu
|
|
DEBUG_PAGEALLOC only manages RW data.
Text and RO data can still be mapped with hugepages and pinned TLB.
In order to map with hugepages, also enforce a 512kB data alignment
minimum. That's a trade-off between size and speed, taking into
account that DEBUG_PAGEALLOC is a debug option. Anyway the alignment
is still tunable.
We also allow tuning of alignment for book3s to limit the complexity
of the test in Kconfig that will anyway disappear in the following
patches once DEBUG_PAGEALLOC is handled together with BATs.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/c13256f2d356a316715da61fe089b3623ef217a5.1589866984.git.christophe.leroy@csgroup.eu
|
|
Pinned TLBs are 8M. Now that there is no strict boundary anymore
between text and RO data, it is possible to use an 8M pinned executable
TLB that covers both text and RO data.
When PIN_TLB_DATA or PIN_TLB_TEXT is selected, enforce 8M RW data
alignment and allow STRICT_KERNEL_RWX.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/c535fc97bf0dd8693192e25feeed8088701e00c6.1589866984.git.christophe.leroy@csgroup.eu
|
|
Map linear memory space with 512k and 8M pages whenever
possible.
Three mappings are performed:
- One for kernel text
- One for RO data
- One for the rest
Separating the mappings is done to be able to update the
protection later when using STRICT_KERNEL_RWX.
The ITLB miss handler now needs to also handle huge TLBs
unless kernel text is pinned.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/c44f0ab5510474f25123d904cd1f4e5c6aa3c1ac.1589866984.git.christophe.leroy@csgroup.eu
|
|
Map the IMMR area with a single 512k huge page.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/9495dba06669da40e133f24607758fa6dcc65f66.1589866984.git.christophe.leroy@csgroup.eu
|
|
Add a function to early map kernel memory using huge pages.
For 512k pages, just use a standard page table and map them in using
512k pages.
For 8M pages, create a hugepd table and populate the two PGD
entries with it.
This function can only be used to create page tables at startup. Once
the regular SLAB allocation functions replace memblock functions,
this function cannot allocate new pages anymore. However it can still
update existing mappings with new protections.
The hugepd_none() macro is moved into asm/hugetlb.h to be usable outside
of mm/hugetlbpage.c.
early_pte_alloc_kernel() is made visible.
_PAGE_HUGE flag is now displayed by ptdump.
Signed-off-by: Christophe Leroy <[email protected]>
[mpe: Change ptdump display to use "huge"]
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/68325bcd3b6f93127f7810418a2352c3519066d6.1589866984.git.christophe.leroy@csgroup.eu
|
|
Now that linear and IMMR dedicated TLB handling is gone, kernel
boundary address comparison is similar in ITLB miss handler and
in DTLB miss handler.
Create a macro named compare_to_kernel_boundary.
When TASK_SIZE is strictly below 0x80000000 and PAGE_OFFSET is
above 0x80000000, it is enough to compare to 0x80000000, and this
can be done with a single instruction.
Using the not. instruction, we get to use the 'blt' conditional branch as
when doing a regular comparison:
0x00000000 <= addr <= 0x7fffffff ==>
0xffffffff >= NOT(addr) >= 0x80000000
The above test corresponds to a 'blt'
Otherwise, do a regular comparison using two instructions.
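An illustrative C equivalent of the single-instruction case (for
explanation only; the real code is the assembly macro):

  /* addr is below the 0x80000000 boundary iff ~addr, viewed as a signed
   * 32-bit value, is negative -- exactly what not. followed by blt tests */
  static inline bool below_kernel_boundary(unsigned long addr)
  {
          return (long)~addr < 0;
  }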
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/6312575d06a8813105e6564a3b12e1d373aa1b2f.1589866984.git.christophe.leroy@csgroup.eu
|
|
Similarly to PPC64, accept mapping RO data as ROX as a trade-off
between security and memory usage.
Having RO data executable is not a high risk as RO data can't be
modified to forge an exploit.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/8c4a0d89d944eed984dd941e509614031a5ace2b.1589866984.git.christophe.leroy@csgroup.eu
|