Age | Commit message (Collapse) | Author | Files | Lines |
|
There are several early memory allocations in arch/ code that use
memblock_phys_alloc() to allocate memory, convert the returned physical
address to the virtual address and then set the allocated memory to
zero.
Exactly the same behaviour can be achieved simply by calling
memblock_alloc(): it allocates the memory in the same way as
memblock_phys_alloc(), then it performs the phys_to_virt() conversion
and clears the allocated memory.
Replace the longer sequence with a simpler call to memblock_alloc().
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Stefan Kristiansson <[email protected]>
Cc: Vincent Chen <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- Enable THREAD_INFO_IN_TASK to move thread_info off the stack.
- A big series from Christoph reworking our DMA code to use more of
the generic infrastructure, as he said:
"This series switches the powerpc port to use the generic swiotlb
and noncoherent dma ops, and to use more generic code for the
coherent direct mapping, as well as removing a lot of dead
code."
- Increase our vmalloc space to 512T with the Hash MMU on modern
CPUs, allowing us to support machines with larger amounts of total
RAM or distance between nodes.
- Two series from Christophe, one to optimise TLB miss handlers on
6xx, and another to optimise the way STRICT_KERNEL_RWX is
implemented on some 32-bit CPUs.
- Support for KCOV coverage instrumentation which means we can run
syzkaller and discover even more bugs in our code.
And as always many clean-ups, reworks and minor fixes etc.
Thanks to: Alan Modra, Alexey Kardashevskiy, Alistair Popple, Andrea
Arcangeli, Andrew Donnellan, Aneesh Kumar K.V, Aravinda Prasad, Balbir
Singh, Brajeswar Ghosh, Breno Leitao, Christian Lamparter, Christian
Zigotzky, Christophe Leroy, Christoph Hellwig, Corentin Labbe, Daniel
Axtens, David Gibson, Diana Craciun, Firoz Khan, Gustavo A. R. Silva,
Igor Stoppa, Joe Lawrence, Joel Stanley, Jonathan Neuschäfer, Jordan
Niethe, Laurent Dufour, Madhavan Srinivasan, Mahesh Salgaonkar, Mark
Cave-Ayland, Masahiro Yamada, Mathieu Malaterre, Matteo Croce, Meelis
Roos, Michael W. Bringmann, Nathan Chancellor, Nathan Fontenot,
Nicholas Piggin, Nick Desaulniers, Nicolai Stange, Oliver O'Halloran,
Paul Mackerras, Peter Xu, PrasannaKumar Muralidharan, Qian Cai,
Rashmica Gupta, Reza Arbab, Robert P. J. Day, Russell Currey,
Sabyasachi Gupta, Sam Bobroff, Sandipan Das, Sergey Senozhatsky,
Souptick Joarder, Stewart Smith, Tyrel Datwyler, Vaibhav Jain,
YueHaibing"
* tag 'powerpc-5.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (200 commits)
powerpc/32: Clear on-stack exception marker upon exception return
powerpc: Remove export of save_stack_trace_tsk_reliable()
powerpc/mm: fix "section_base" set but not used
powerpc/mm: Fix "sz" set but not used warning
powerpc/mm: Check secondary hash page table
powerpc: remove nargs from __SYSCALL
powerpc/64s: Fix unrelocated interrupt trampoline address test
powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tables
powerpc/fsl: Fix the flush of branch predictor.
powerpc/powernv: Make opal log only readable by root
powerpc/xmon: Fix opcode being uninitialized in print_insn_powerpc
powerpc/powernv: move OPAL call wrapper tracing and interrupt handling to C
powerpc/64s: Fix data interrupts vs d-side MCE reentrancy
powerpc/64s: Prepare to handle data interrupts vs d-side MCE reentrancy
powerpc/64s: system reset interrupt preserve HSRRs
powerpc/64s: Fix HV NMI vs HV interrupt recoverability test
powerpc/mm/hash: Handle mmap_min_addr correctly in get_unmapped_area topdown search
powerpc/hugetlb: Handle mmap_min_addr correctly in get_unmapped_area callback
selftests/powerpc: Remove duplicate header
powerpc sstep: Add support for modsd, modud instructions
...
|
|
Patch series "Replace all open encodings for NUMA_NO_NODE", v3.
All these places for replacement were found by running the following
grep patterns on the entire kernel code. Please let me know if this
might have missed some instances. This might also have replaced some
false positives. I will appreciate suggestions, inputs and review.
1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"
This patch (of 2):
At present there are multiple places where invalid node number is
encoded as -1. Even though implicitly understood it is always better to
have macros in there. Replace these open encodings for an invalid node
number with the global macro NUMA_NO_NODE. This helps remove NUMA
related assumptions like 'invalid node' from various places redirecting
them to a common definition.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Jeff Kirsher <[email protected]> [ixgbe]
Acked-by: Jens Axboe <[email protected]> [mtip32xx]
Acked-by: Vinod Koul <[email protected]> [dmaengine.c]
Acked-by: Michael Ellerman <[email protected]> [powerpc]
Acked-by: Doug Ledford <[email protected]> [drivers/infiniband]
Cc: Joseph Qi <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We store 2 multilevel tables in iommu_table - one for the hardware and
one with the corresponding userspace addresses. Before allocating
the tables, the iommu_table_group_ops::get_table_size() hook returns
the combined size of the two and VFIO SPAPR TCE IOMMU driver adjusts
the locked_vm counter correctly. When the table is actually allocated,
the amount of allocated memory is stored in iommu_table::it_allocated_size
and used to decrement the locked_vm counter when we release the memory
used by the table; .get_table_size() and .create_table() calculate it
independently but the result is expected to be the same.
However the allocator does not add the userspace table size to
.it_allocated_size so when we destroy the table because of VFIO PCI
unplug (i.e. VFIO container is gone but the userspace keeps running),
we decrement locked_vm by just a half of size of memory we are
releasing.
To make things worse, since we enabled on-demand allocation of
indirect levels, it_allocated_size contains only the amount of memory
actually allocated at the table creation time which can just be a
fraction. It is not a problem with incrementing locked_vm (as
get_table_size() value is used) but it is with decrementing.
As the result, we leak locked_vm and may not be able to allocate more
IOMMU tables after few iterations of hotplug/unplug.
This sets it_allocated_size in the pnv_pci_ioda2_ops::create_table()
hook to what pnv_pci_ioda2_get_table_size() returns so from now on we
have a single place which calculates the maximum memory a table can
occupy. The original meaning of it_allocated_size is somewhat lost now
though.
We do not ditch it_allocated_size whatsoever here and we do not call
get_table_size() from vfio_iommu_spapr_tce.c when decrementing
locked_vm as we may have multiple IOMMU groups per container and even
though they all are supposed to have the same get_table_size()
implementation, there is a small chance for failure or confusion.
Fixes: 090bad39b237 ("powerpc/powernv: Add indirect levels to it_userspace")
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Currently the opal log is globally readable. It is kernel policy to
limit the visibility of physical addresses / kernel pointers to root.
Given this and the fact the opal log may contain this information it
would be better to limit the readability to root.
Fixes: bfc36894a48b ("powerpc/powernv: Add OPAL message log interface")
Cc: [email protected] # v3.15+
Signed-off-by: Jordan Niethe <[email protected]>
Reviewed-by: Stewart Smith <[email protected]>
Reviewed-by: Andrew Donnellan <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
The OPAL call wrapper gets interrupt disabling wrong. It disables
interrupts just by clearing MSR[EE], which has two problems:
- It doesn't call into the IRQ tracing subsystem, which means tracing
across OPAL calls does not always notice IRQs have been disabled.
- It doesn't go through the IRQ soft-mask code, which causes a minor
bug. MSR[EE] can not be restored by saving the MSR then clearing
MSR[EE], because a racing interrupt while soft-masked could clear
MSR[EE] between the two steps. This can cause MSR[EE] to be
incorrectly enabled when the OPAL call returns. Fortunately that
should only result in another masked interrupt being taken to
disable MSR[EE] again, but it's a bit sloppy.
The existing code also saves MSR to PACA, which is not re-entrant if
there is a nested OPAL call from different MSR contexts, which can
happen these days with SRESET interrupts on bare metal.
To fix these issues, move the tracing and IRQ handling code to C, and
call into asm just for the low level call when everything is ready to
go. Save the MSR on stack rather than PACA.
Performance cost is kept to a minimum with a few optimisations:
- The endian switch upon return is combined with the MSR restore,
which avoids an expensive context synchronizing operation for LE
kernels. This makes up for the additional mtmsrd to enable
interrupts with local_irq_enable().
- blr is now used to return from the opal_* functions that are called
as C functions, to avoid link stack corruption. This requires a
skiboot fix as well to keep the call stack balanced.
A NULL call is more costly after this, (410ns->430ns on POWER9), but
OPAL calls are generally not performance critical at this scale.
Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Commit 24be85a23d1f ("powerpc/powernv: Clear PECE1 in LPCR via stop-api
only on Hotplug", 2017-07-21) added two calls to opal_slw_set_reg()
inside pnv_cpu_offline(), with the aim of changing the LPCR value in
the SLW image to disable wakeups from the decrementer while a CPU is
offline. However, pnv_cpu_offline() gets called each time a secondary
CPU thread is woken up to participate in running a KVM guest, that is,
not just when a CPU is offlined.
Since opal_slw_set_reg() is a very slow operation (with observed
execution times around 20 milliseconds), this means that an offline
secondary CPU can often be busy doing the opal_slw_set_reg() call
when the primary CPU wants to grab all the secondary threads so that
it can run a KVM guest. This leads to messages like "KVM: couldn't
grab CPU n" being printed and guest execution failing.
There is no need to reprogram the SLW image on every KVM guest entry
and exit. So that we do it only when a CPU is really transitioning
between online and offline, this moves the calls to
pnv_program_cpu_hotplug_lpcr() into pnv_smp_cpu_kill_self().
Fixes: 24be85a23d1f ("powerpc/powernv: Clear PECE1 in LPCR via stop-api only on Hotplug")
Cc: [email protected] # v4.14+
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
The change_pte() notifier was designed to use as a quick path to
update secondary MMU PTEs on write permission changes or PFN changes.
For KVM, it could reduce the vm-exits when vcpu faults on the pages
that was touched up by KSM. It's not used to do cache invalidations,
for example, if we see the notifier will be called before the real PTE
update after all (please see set_pte_at_notify that set_pte_at was
called later).
All the necessary cache invalidation should all be done in
invalidate_range() already.
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Reviewed-by: Balbir Singh <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Merge commits we're sharing with kvm-ppc tree.
|
|
This adds an "in_guest" parameter to machine_check_print_event_info()
so that we can avoid trying to translate guest NIP values into
symbolic form using the host kernel's symbol table.
Reviewed-by: Aravinda Prasad <[email protected]>
Reviewed-by: Mahesh Salgaonkar <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Merge hch's big DMA rework series. This is in a topic branch in case he
wants to merge it to minimise conflicts.
|
|
There's a few important fixes in our fixes branch, in particular the
pgd/pud_present() one, so merge it now.
|
|
The compound IOMMU group rework moved iommu_register_group() together
in pnv_pci_ioda_setup_iommu_api() (which is a part of
ppc_md.pcibios_fixup). As the result, pnv_ioda_setup_bus_iommu_group()
does not create groups any more, it only adds devices to groups.
This works fine for boot time devices. However IOMMU groups for
SRIOV's VFs were added by pnv_ioda_setup_bus_iommu_group() so this got
broken: pnv_tce_iommu_bus_notifier() expects a group to be registered
for VF and it is not.
This adds missing group registration and adds a NULL pointer check
into the bus notifier so we won't crash if there is no group, although
it is not expected to happen now because of the change above.
Example oops seen prior to this patch:
$ echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
Unable to handle kernel paging request for data at address 0x00000030
Faulting instruction address: 0xc0000000004a6018
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
CPU: 46 PID: 7006 Comm: bash Not tainted 4.15-ish
NIP: c0000000004a6018 LR: c0000000004a6014 CTR: 0000000000000000
REGS: c000008fc876b400 TRAP: 0300 Not tainted (4.15-ish)
MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
CFAR: c000000000d0be20 DAR: 0000000000000030 DSISR: 40000000 SOFTE: 1
...
NIP sysfs_do_create_link_sd.isra.0+0x68/0x150
LR sysfs_do_create_link_sd.isra.0+0x64/0x150
Call Trace:
pci_dev_type+0x0/0x30 (unreliable)
iommu_group_add_device+0x8c/0x600
iommu_add_device+0xe8/0x180
pnv_tce_iommu_bus_notifier+0xb0/0xf0
notifier_call_chain+0x9c/0x110
blocking_notifier_call_chain+0x64/0xa0
device_add+0x524/0x7d0
pci_device_add+0x248/0x450
pci_iov_add_virtfn+0x294/0x3e0
pci_enable_sriov+0x43c/0x580
mlx5_core_sriov_configure+0x15c/0x2f0 [mlx5_core]
sriov_numvfs_store+0x180/0x240
dev_attr_store+0x3c/0x60
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1ac/0x240
__vfs_write+0x3c/0x70
vfs_write+0xd8/0x220
SyS_write+0x6c/0x110
system_call+0x58/0x6c
Fixes: 0bd971676e68 ("powerpc/powernv/npu: Add compound IOMMU groups")
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reported-by: Santwana Samantray <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
There is no good reason for this helper, just opencode it.
Signed-off-by: Christoph Hellwig <[email protected]>
Tested-by: Christian Zigotzky <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Now that we've switched all the powerpc nommu and swiotlb methods to
use the generic dma_direct_* calls we can remove these ops vectors
entirely and rely on the common direct mapping bypass that avoids
indirect function calls entirely. This also allows to remove a whole
lot of boilerplate code related to setting up these operations.
Signed-off-by: Christoph Hellwig <[email protected]>
Tested-by: Christian Zigotzky <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Use the generic iommu bypass code instead of overriding set_dma_mask.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
These devices are not PCIe devices and do not have associated dma map
ops, so this is just dead code.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
This function is completely bogus - the fact that two PCIe devices come
from the same vendor has absolutely nothing to say about the DMA
capabilities and characteristics.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
The IODA reset is used to flush out any OS controlled state from the PHB.
This reset can fail if a PHB fatal error has occurred in early boot,
probably due to a because of a bad device. We already do a fundemental
reset of the device in some cases, so this patch just adds a test to force
a full reset if firmware reports an error when performing the IODA reset.
Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Remove linux/printk.h which is included more than once.
Signed-off-by: Sabyasachi Gupta <[email protected]>
Acked-by: Souptick Joarder <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
TCE_KILL_INVAL_ALL has moved long ago but the comment was forgotted so
finish the move and remove the comment.
Fixes: 0bbcdb437da0c4a "powerpc/powernv/npu: TCE Kill helpers cleanup"
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
With a recent change around IOMMU group, a system with an opencapi
adapter is no longer booting and we get a kernel oops:
BUG: Kernel NULL pointer dereference at 0x00000028
Faulting instruction address: 0xc0000000000aa38c
...
NIP pnv_try_setup_npu_table_group+0x1c/0x1a0
LR pnv_pci_ioda_fixup+0x1f8/0x660
Call Trace:
pnv_try_setup_npu_table_group+0x60/0x
pnv_pci_ioda_fixup+0x20c/0x660
pcibios_resource_survey+0x2c8/0x31c
pcibios_init+0xb0/0xe4
do_one_initcall+0x64/0x264
kernel_init_freeable+0x36c/0x468
kernel_init+0x2c/0x148
ret_from_kernel_thread+0x5c/0x68
An opencapi device is using a device PE, so the current code breaks
because pe->pbus is not defined.
More generally, there's no need to define an IOMMU group for opencapi,
as the device sends real addresses directly (admittedly, the
virtualization story is yet to be written). So let's fix it by
skipping the IOMMU group setup for opencapi PHBs.
Fixes: 0bd971676e68 ("powerpc/powernv/npu: Add compound IOMMU groups")
Signed-off-by: Frederic Barrat <[email protected]>
Reviewed-by: Greg Kurz <[email protected]>
Reviewed-by: Andrew Donnellan <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
There is a typo so we accidentally allocate enough memory for a pointer
when we wanted to allocate enough for a struct.
Fixes: 0bd971676e68 ("powerpc/powernv/npu: Add compound IOMMU groups")
Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".
The jump label is controlled by HAVE_JUMP_LABEL, which is defined
like this:
#if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
# define HAVE_JUMP_LABEL
#endif
We can improve this by testing 'asm goto' support in Kconfig, then
make JUMP_LABEL depend on CC_HAS_ASM_GOTO.
Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
match to the real kernel capability.
Signed-off-by: Masahiro Yamada <[email protected]>
Acked-by: Michael Ellerman <[email protected]> (powerpc)
Tested-by: Sedat Dilek <[email protected]>
|
|
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.
It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.
A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.
This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
There were a couple of notable cases:
- csky still had the old "verify_area()" name as an alias.
- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)
- microblaze used the type argument for a debug printout
but other than those oddities this should be a total no-op patch.
I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kconfig file consolidation from Masahiro Yamada:
"Consolidation of bus (PCI, PCMCIA, EISA, RapidIO) config entries by
Christoph Hellwig.
Currently, every architecture that wants to provide common peripheral
busses needs to add some boilerplate code and include the right
Kconfig files. This series instead just selects the presence (when
needed) and then handles everything in the bus-specific Kconfig file
under drivers/"
* tag 'kconfig-v4.21-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
pcmcia: remove per-arch PCMCIA config entry
eisa: consolidate EISA Kconfig entry in drivers/eisa
rapidio: consolidate RAPIDIO config entry in drivers/rapidio
pcmcia: allow PCMCIA support independent of the architecture
PCI: consolidate the PCI_SYSCALL symbol
PCI: consolidate the PCI_DOMAINS and PCI_DOMAINS_GENERIC config options
PCI: consolidate PCI config entry in drivers/pci
MIPS: remove the HT_PCI config option
|
|
Convert string compares of DT node names to use of_node_name_eq helper
instead. This removes direct access to the node name pointer.
A couple of open coded iterating thru the child node names are converted
to use for_each_child_of_node() instead.
Signed-off-by: Rob Herring <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
When a page fault happens in a GPU, the GPU signals the OS and the GPU
driver calls the fault handler which populated a page table; this allows
the GPU to complete an ATS request.
On the bare metal get_user_pages() is enough as it adds a pte to
the kernel page table but under KVM the partition scope tree does not get
updated so ATS will still fail.
This reads a byte from an effective address which causes HV storage
interrupt and KVM updates the partition scope tree.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
A broken device tree might contain more than 8 values and introduce hard
to debug memory corruption bug. This adds the boundary check.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
In order to make ATS work and translate addresses for arbitrary
LPID and PID, we need to program an NPU with LPID and allow PID wildcard
matching with a specific MSR mask.
This implements a helper to assign a GPU to LPAR and program the NPU
with a wildcard for PID and a helper to do clean-up. The helper takes
MSR (only DR/HV/PR/SF bits are allowed) to program them into NPU2 for
ATS checkout requests support.
This exports pnv_npu2_unmap_lpar_dev() as following patches will use it
from the VFIO driver.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
At the moment the powernv platform registers an IOMMU group for each
PE. There is an exception though: an NVLink bridge which is attached
to the corresponding GPU's IOMMU group making it a master.
Now we have POWER9 systems with GPUs connected to each other directly
bypassing PCI. At the moment we do not control state of these links so
we have to put such interconnected GPUs to one IOMMU group which means
that the old scheme with one GPU as a master won't work - there will
be up to 3 GPUs in such group.
This introduces a npu_comp struct which represents a compound IOMMU
group made of multiple PEs - PCI PEs (for GPUs) and NPU PEs (for
NVLink bridges). This converts the existing NVLink1 code to use the
new scheme. >From now on, each PE must have a valid
iommu_table_group_ops which will either be called directly (for a
single PE group) or indirectly from a compound group handlers.
This moves IOMMU group registration for NVLink-connected GPUs to
npu-dma.c. For POWER8, this stores a new compound group pointer in the
PE (so a GPU is still a master); for POWER9 the new group pointer is
stored in an NPU (which is allocated per a PCI host controller).
Signed-off-by: Alexey Kardashevskiy <[email protected]>
[mpe: Initialise npdev to NULL in pnv_try_setup_npu_table_group()]
Signed-off-by: Michael Ellerman <[email protected]>
|
|
At the moment NPU IOMMU is manipulated directly from the IODA2 PCI
PE code; PCI PE acts as a master to NPU PE. Soon we will have compound
IOMMU groups with several PEs from several different PHB (such as
interconnected GPUs and NPUs) so there will be no single master but
a one big IOMMU group.
This makes a first step and converts an NPU PE with a set of extern
function to a table group.
This should cause no behavioral change. Note that
pnv_npu_release_ownership() has never been implemented.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Normal PCI PEs have 2 TVEs, one per a DMA window; however NPU PE has only
one which points to one of two tables of the corresponding PCI PE.
So whenever a new DMA window is programmed to PEs, the NPU PE needs to
release old table in order to use the new one.
Commit d41ce7b1bcc3e ("powerpc/powernv/npu: Do not try invalidating 32bit
table when 64bit table is enabled") did just that but in pci-ioda.c
while it actually belongs to npu-dma.c.
This moves the single TVE handling to npu-dma.c. This does not implement
restoring though as it is highly unlikely that we can set the table to
PCI PE and cannot to NPU PE and if that fails, we could only set 32bit
table to NPU PE and this configuration is not really supported or wanted.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
The iommu_table pointer stored in iommu_table_group may get stale
by accident, this adds referencing and removes a redundant comment
about this.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Registering new IOMMU groups and adding devices to them are separated in
code and the latter is dug in the DMA setup code which it does not
really belong to.
This moved IOMMU groups setup to a separate helper which registers a group
and adds devices as before. This does not make a difference as IOMMU
groups are not used anyway; the only dependency here is that
iommu_add_device() requires a valid pointer to an iommu_table
(set by set_iommu_table_base()).
To keep the old behaviour, this does not add new IOMMU groups for PEs
with no DMA weight and also skips NVLink bridges which do not have
pci_controller_ops::setup_bridge (the normal way of adding PEs).
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
The powernv platform registers IOMMU groups and adds devices to them
from the pci_controller_ops::setup_bridge() hook except one case when
virtual functions (SRIOV VFs) are added from a bus notifier.
The pseries platform registers IOMMU groups from
the pci_controller_ops::dma_bus_setup() hook and adds devices from
the pci_controller_ops::dma_dev_setup() hook. The very same bus notifier
used for powernv does not add devices for pseries though as
__of_scan_bus() adds devices first, then it does the bus/dev DMA setup.
Both platforms use iommu_add_device() which takes a device and expects
it to have a valid IOMMU table struct with an iommu_table_group pointer
which in turn points the iommu_group struct (which represents
an IOMMU group). Although the helper seems easy to use, it relies on
some pre-existing device configuration and associated data structures
which it does not really need.
This simplifies iommu_add_device() to take the table_group pointer
directly. Pseries already has a table_group pointer handy and the bus
notified is not used anyway. For powernv, this copies the existing bus
notifier, makes it work for powernv only which means an easy way of
getting to the table_group pointer. This was tested on VFs but should
also support physical PCI hotplug.
Since iommu_add_device() receives the table_group pointer directly,
pseries does not do TCE cache invalidation (the hypervisor does) nor
allow multiple groups per a VFIO container (in other words sharing
an IOMMU table between partitionable endpoints), this removes
iommu_table_group_link from pseries.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
When introduced, the NPU context init/destroy helpers called OPAL which
enabled/disabled PID (a userspace memory context ID) filtering in an NPU
per a GPU; this was a requirement for P9 DD1.0. However newer chip
revision added a PID wildcard support so there is no more need to
call OPAL every time a new context is initialized. Also, since the PID
wildcard support was added, skiboot does not clear wildcard entries
in the NPU so these remain in the hardware till the system reboot.
This moves LPID and wildcard programming to the PE setup code which
executes once during the booting process so NPU2 context init/destroy
won't need to do additional configuration.
This replaces the check for FW_FEATURE_OPAL with a check for npu!=NULL as
this is the way to tell if the NPU support is present and configured.
This moves pnv_npu2_init() declaration as pseries should be able to use it.
This keeps pnv_npu2_map_lpar() in powernv as pseries is not allowed to
call that. This exports pnv_npu2_map_lpar_dev() as following patches
will use it from the VFIO driver.
While at it, replace redundant list_for_each_entry_safe() with
a simpler list_for_each_entry().
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
The powernv PCI code stores NPU data in the pnv_phb struct. The latter
is referenced by pci_controller::private_data. We are going to have NPU2
support in the pseries platform as well but it does not store any
private_data in in the pci_controller struct; and even if it did,
it would be a different data structure.
This makes npu a pointer and stores it one level higher in
the pci_controller struct.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
The skiboot firmware has a hot reset handler which fences the NVIDIA V100
GPU RAM on Witherspoons and makes accesses no-op instead of throwing HMIs:
https://github.com/open-power/skiboot/commit/fca2b2b839a67
Now we are going to pass V100 via VFIO which most certainly involves
KVM guests which are often terminated without getting a chance to offline
GPU RAM so we end up with a running machine with misconfigured memory.
Accessing this memory produces hardware management interrupts (HMI)
which bring the host down.
To suppress HMIs, this wires up this hot reset hook to vfio_pci_disable()
via pci_disable_device() which switches NPU2 to a safe mode and prevents
HMIs.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Acked-by: Alistair Popple <[email protected]>
Reviewed-by: David Gibson <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
opal_power_control_init() depends on opal message notifier to be
initialized, which is done in opal_init()->opal_message_init(). But both
these initialization are called through machine initcalls and it all
depends on in which order they being called. So far these are called in
correct order (may be we got lucky) and never saw any issue. But it is
clearer to control initialization order explicitly by moving
opal_power_control_init() into opal_init().
Signed-off-by: Mahesh Salgaonkar <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
CONFIG_PCI_MSI was made mandatory by commit a311e738b6d8
("powerpc/powernv: Make PCI non-optional") so the #ifdef
checks around CONFIG_PCI_MSI here can be removed entirely.
Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Joel Stanley <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
opal_pci_eeh_freeze_status
The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS call in
skiboot's NPU driver does not touch the pci_error_type parameter so
it might have garbage but the powernv code analyzes it nevertheless.
This initializes pcierr and fstate to zero in all call sites.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: Sam Bobroff <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
fixup_phb() is never used, this removes it.
pick_m64_pe() and reserve_m64_pe() are always defined for all powernv
PHBs: they are initialized by pnv_ioda_parse_m64_window() which is
called unconditionally from pnv_pci_init_ioda_phb() which initializes
all known PHB types on powernv so we can open code them.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
At the moment PNV_IODA_PE_DEV is only used for NPU PEs which are not
present on IODA1 machines (i.e. POWER7) so let's remove a piece of
dead code.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: Sam Bobroff <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
The macro and few headers are not used so remove them.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Acked-by: Alistair Popple <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
addresses on demand
The powernv platform maintains 2 TCE tables for VFIO - a hardware TCE
table and a table with userspace addresses; the latter is used for
marking pages dirty when corresponging TCEs are unmapped from
the hardware table.
a68bd1267b72 ("powerpc/powernv/ioda: Allocate indirect TCE levels
on demand") enabled on-demand allocation of the hardware table,
however it missed the other table so it has still been fully allocated
at the boot time. This fixes the issue by allocating a single level,
just like we do for the hardware table.
Fixes: a68bd1267b72 ("powerpc/powernv/ioda: Allocate indirect TCE levels on demand")
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.
Signed-off-by: Yangtao Li <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
There is no good reason to duplicate the PCI menu in every architecture.
Instead provide a selectable HAVE_PCI symbol that indicates availability
of PCI support, and a FORCE_PCI symbol to for PCI on and the handle the
rest in drivers/pci.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
Acked-by: Max Filippov <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]>
Acked-by: Paul Burton <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
|
|
The NPU IOMMU is setup to mirror the parent PCIe device IOMMU
setup. Therefore it does not make sense to call dma operations such as
dma_map_page(), etc. directly on these devices. The existing dma_ops
simply print a warning if they are ever called, however this is
unnecessary and the warnings are likely to go unnoticed.
It is instead simpler to remove these operations and let the generic
DMA code print warnings (eg. via a NULL pointer deref) in cases of
buggy drivers attempting dma operations on NVLink devices.
Signed-off-by: Alistair Popple <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Let's perform all checking + offlining + removing under
device_hotplug_lock, so nobody can mess with these devices via sysfs
concurrently.
[[email protected]: take device_hotplug_lock outside of loop]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Pavel Tatashin <[email protected]>
Reviewed-by: Rashmica Gupta <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Rashmica Gupta <[email protected]>
Cc: Michael Neuling <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: John Allen <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Kate Stewart <[email protected]>
Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Mathieu Malaterre <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Nathan Fontenot <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Philippe Ombredanne <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: YASUAKI ISHIMATSU <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|