Age | Commit message (Collapse) | Author | Files | Lines |
|
Note: compat variant of REGSET_TM_CGPR is almost certainly wrong;
it claims to be 48*64bit, but just as compat REGSET_GPR it stores
44*32bit of (truncated) registers + 4 32bit zeros... followed by
48 more 32bit zeroes. Might be too late to change - it's a userland
ABI, after all ;-/
Signed-off-by: Al Viro <[email protected]>
|
|
all uses are conditional upon ELF_CORE_COPY_XFPREGS, which has not
been defined on any architecture since 2010
Signed-off-by: Al Viro <[email protected]>
|
|
|
|
skiroot_defconfig fails:
arch/powerpc/kernel/fadump.c:48:17: error: ‘cpus_in_fadump’ defined but not used
48 | static atomic_t cpus_in_fadump;
Fix it by moving the definition into the #ifdef where it's used.
Fixes: ba608c4fa12c ("powerpc/fadump: fix race between pstore write and fadump crash trigger")
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Commit 2f92447f9f96 ("powerpc/book3s64/hash: Use the pte_t address from the
caller") removed the local_irq_disable from hash_preload, but it was
required for more than just the page table walk: the hash pte busy bit is
effectively a lock which may be taken in interrupt context, and the local
update flag test must not be preempted before it's used.
This solves apparent lockups with perf interrupting __hash_page_64K. If
get_perf_callchain then also takes a hash fault on the same page while it
is already locked, it will loop forever taking hash faults, which looks like
this:
cpu 0x49e: Vector: 100 (System Reset) at [c00000001a4f7d70]
pc: c000000000072dc8: hash_page_mm+0x8/0x800
lr: c00000000000c5a4: do_hash_page+0x24/0x38
sp: c0002ac1cc69ac70
msr: 8000000000081033
current = 0xc0002ac1cc602e00
paca = 0xc00000001de1f280 irqmask: 0x03 irq_happened: 0x01
pid = 20118, comm = pread2_processe
Linux version 5.8.0-rc6-00345-g1fad14f18bc6
49e:mon> t
[c0002ac1cc69ac70] c00000000000c5a4 do_hash_page+0x24/0x38 (unreliable)
--- Exception: 300 (Data Access) at c00000000008fa60 __copy_tofrom_user_power7+0x20c/0x7ac
[link register ] c000000000335d10 copy_from_user_nofault+0xf0/0x150
[c0002ac1cc69af70] c00032bf9fa3c880 (unreliable)
[c0002ac1cc69afa0] c000000000109df0 read_user_stack_64+0x70/0xf0
[c0002ac1cc69afd0] c000000000109fcc perf_callchain_user_64+0x15c/0x410
[c0002ac1cc69b060] c000000000109c00 perf_callchain_user+0x20/0x40
[c0002ac1cc69b080] c00000000031c6cc get_perf_callchain+0x25c/0x360
[c0002ac1cc69b120] c000000000316b50 perf_callchain+0x70/0xa0
[c0002ac1cc69b140] c000000000316ddc perf_prepare_sample+0x25c/0x790
[c0002ac1cc69b1a0] c000000000317350 perf_event_output_forward+0x40/0xb0
[c0002ac1cc69b220] c000000000306138 __perf_event_overflow+0x88/0x1a0
[c0002ac1cc69b270] c00000000010cf70 record_and_restart+0x230/0x750
[c0002ac1cc69b620] c00000000010d69c perf_event_interrupt+0x20c/0x510
[c0002ac1cc69b730] c000000000027d9c performance_monitor_exception+0x4c/0x60
[c0002ac1cc69b750] c00000000000b2f8 performance_monitor_common_virt+0x1b8/0x1c0
--- Exception: f00 (Performance Monitor) at c0000000000cb5b0 pSeries_lpar_hpte_insert+0x0/0x160
[link register ] c0000000000846f0 __hash_page_64K+0x210/0x540
[c0002ac1cc69ba50] 0000000000000000 (unreliable)
[c0002ac1cc69bb00] c000000000073ae0 update_mmu_cache+0x390/0x3a0
[c0002ac1cc69bb70] c00000000037f024 wp_page_copy+0x364/0xce0
[c0002ac1cc69bc20] c00000000038272c do_wp_page+0xdc/0xa60
[c0002ac1cc69bc70] c0000000003857bc handle_mm_fault+0xb9c/0x1b60
[c0002ac1cc69bd50] c00000000006c434 __do_page_fault+0x314/0xc90
[c0002ac1cc69be20] c00000000000c5c8 handle_page_fault+0x10/0x2c
--- Exception: 300 (Data Access) at 00007fff8c861fe8
SP (7ffff6b19660) is in userspace
Fixes: 2f92447f9f96 ("powerpc/book3s64/hash: Use the pte_t address from the caller")
Reported-by: Athira Rajeev <[email protected]>
Reported-by: Anton Blanchard <[email protected]>
Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Drop the repeated word "for".
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Drop the repeated word "the".
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Drop the repeated word "a".
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Drop the repeated word "in".
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Drop the repeated word "the".
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Drop the repeated words "file" and "the".
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Drop the repeated word "use".
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Drop the repeated word "per".
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Drop the repeated word "below".
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Align it with other architectures and none of the callers has
been interested its return
Signed-off-by: Li RongQing <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
In note_page(), the pg_state is updated the same way in two places.
Add note_page_update_state() to do it.
Also include the display of boundary markers there as it is missing
"no level" leg, leading to a mismatch when the first two markers
are at the same address and the first displayed area uses that
address.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/a284a809f01c705bbaab303b06fda216f147a99a.1593429426.git.christophe.leroy@csgroup.eu
|
|
st->last_pa is always updated in note_page() so it can
be done outside the if/elseif/else block.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/610d6b1a60ad0bedef865a90153c1110cfaa507e.1593429426.git.christophe.leroy@csgroup.eu
|
|
When STRICT_KERNEL_RWX is set, we want to set NX bit on vmalloc
segments. But modules require exec.
Use a dedicated segment for modules. There is not much space
above kernel, and we don't waste vmalloc space to do alignment.
Therefore, we take the segment before PAGE_OFFSET for modules.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/eb8faba9148b6cf17c696ba776b4e8ee2f6313bf.1593428200.git.christophe.leroy@csgroup.eu
|
|
Kernel space starts at TASK_SIZE. Select kernel page table
when address is over TASK_SIZE.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/893425e32cd0a003539573b2d115e0ffa98bc26c.1593428200.git.christophe.leroy@csgroup.eu
|
|
User space stops at TASK_SIZE. At the moment, kernel space starts
at PAGE_OFFSET.
In order to use space between TASK_SIZE and PAGE_OFFSET for modules,
make TASK_SIZE the limit between user and kernel space.
Note that fault.c already considers TASK_SIZE as the boundary between
user and kernel space.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/b38b52cd8dabbb56fbd6f9219d6f3cdccbb43b44.1593428200.git.christophe.leroy@csgroup.eu
|
|
Instead of leaving NX unset on all segments above the start
of vmalloc space, only leave NX unset on segments used for
modules.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/7172c0f5253419315e434a1816ee3d6ed6505bc0.1593428200.git.christophe.leroy@csgroup.eu
|
|
In order to allow allocation of modules outside of vmalloc space,
use MODULES_VADDR and MODULES_END when MODULES_VADDR is defined.
Redefine module_alloc() when MODULES_VADDR defined.
Unmap corresponding KASAN shadow memory.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/7ecf5fff1eef67d450e73fc412b6ec3818483d75.1593428200.git.christophe.leroy@csgroup.eu
|
|
Use is_vmalloc_or_module_addr() instead of is_vmalloc_addr()
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/7d884db0e5a6f521331639d8c0f13e520d5a4fef.1593428200.git.christophe.leroy@csgroup.eu
|
|
The sparse tool complains as follows:
arch/powerpc/platforms/pseries/papr_scm.c:97:1: warning:
symbol 'papr_nd_regions' was not declared. Should it be static?
arch/powerpc/platforms/pseries/papr_scm.c:98:1: warning:
symbol 'papr_ndr_lock' was not declared. Should it be static?
Those variables are not used outside of papr_scm.c, so this
commit marks them static.
Fixes: 85343a8da2d9 ("powerpc/papr/scm: Add bad memory ranges to nvdimm bad ranges")
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Wei Yongjun <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Clang's objdump emits slightly different output from GNU's objdump,
causing a list of warnings to be emitted during relocatable builds.
E.g., clang's objdump emits this:
c000000000000004: 2c 00 00 48 b 0xc000000000000030
...
c000000000005c6c: 10 00 82 40 bf 2, 0xc000000000005c7c
while GNU objdump emits:
c000000000000004: 2c 00 00 48 b c000000000000030 <__start+0x30>
...
c000000000005c6c: 10 00 82 40 bne c000000000005c7c <masked_interrupt+0x3c>
Adjust llvm-objdump's output to remove the extraneous '0x' and convert
'bf' and 'bt' to 'bne' and 'beq' resp. to more closely match GNU
objdump's output.
Note that clang's objdump doesn't yet output the relocation symbols on
PPC.
Signed-off-by: Bill Wendling <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/191c67db31264b69cf6b566fd69851beb3dd0abb.1595630874.git.morbo@google.com
|
|
This implements smp_cond_load_relaxed() with the slowpath busy loop
using the preferred SMT priority pattern.
Signed-off-by: Nicholas Piggin <[email protected]>
Acked-by: Waiman Long <[email protected]>
[mpe: Make it 64-bit only to fix build errors on 32-bit]
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
This brings the behaviour of the uncontended fast path back to roughly
equivalent to simple spinlocks -- a single atomic op with lock hint.
Signed-off-by: Nicholas Piggin <[email protected]>
Acked-by: Waiman Long <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
This implements the generic paravirt qspinlocks using H_PROD and
H_CONFER to kick and wait.
This uses an un-directed yield to any CPU rather than the directed
yield to a pre-empted lock holder that paravirtualised simple
spinlocks use, that requires no kick hcall. This is something that
could be investigated and improved in future.
Performance results can be found in the commit which added queued
spinlocks.
Signed-off-by: Nicholas Piggin <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Waiman Long <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
These have shown significantly improved performance and fairness when
spinlock contention is moderate to high on very large systems.
With this series including subsequent patches, on a 16 socket 1536
thread POWER9, a stress test such as same-file open/close from all
CPUs gets big speedups, 11620op/s aggregate with simple spinlocks vs
384158op/s (33x faster), where the difference in throughput between
the fastest and slowest thread goes from 7x to 1.4x.
Thanks to the fast path being identical in terms of atomics and
barriers (after a subsequent optimisation patch), single threaded
performance is not changed (no measurable difference).
On smaller systems, performance and fairness seems to be generally
improved. Using dbench on tmpfs as a test (that starts to run into
kernel spinlock contention), a 2-socket OpenPOWER POWER9 system was
tested with bare metal and KVM guest configurations. Results can be
found here:
https://github.com/linuxppc/issues/issues/305#issuecomment-663487453
Observations are:
- Queued spinlocks are equal when contention is insignificant, as
expected and as measured with microbenchmarks.
- When there is contention, on bare metal queued spinlocks have better
throughput and max latency at all points.
- When virtualised, queued spinlocks are slightly worse approaching
peak throughput, but significantly better throughput and max latency
at all points beyond peak, until queued spinlock maximum latency
rises when clients are 2x vCPUs.
The regressions haven't been analysed very well yet, there are a lot
of things that can be tuned, particularly the paravirtualised locking,
but the numbers already look like a good net win even on relatively
small systems.
Signed-off-by: Nicholas Piggin <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Waiman Long <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
To prepare for queued spinlocks. This is a simple rename except to
update preprocessor guard name and a file reference.
Signed-off-by: Nicholas Piggin <[email protected]>
Acked-by: Waiman Long <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
These functions will be used by the queued spinlock implementation,
and may be useful elsewhere too, so move them out of spinlock.h.
Signed-off-by: Nicholas Piggin <[email protected]>
Acked-by: Waiman Long <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
MAX_NUMNODES is a theoretical maximum number of nodes thats is
supported by the kernel. Device tree properties exposes the number of
possible nodes on the current platform. The kernel would detected this
and would use it for most of its resource allocations. If the platform
now increases the nodes to over what was already exposed, then it may
lead to inconsistencies. Hence limit it to the already exposed nodes.
Suggested-by: Nathan Lynch <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Initialize Monitor Mode Control Register 3 (MMCR3)
SPR which is new in power10. For PowerISA v3.1, BHRB disable
is controlled via Monitor Mode Control Register A (MMCRA) bit,
namely "BHRB Recording Disable (BHRBRD)". This patch also initializes
MMCRA BHRBRD to disable BHRB feature at boot for power10.
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Athira Rajeev <[email protected]>
Reviewed-by: Jordan Niethe <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Previously iov->vfs_expanded was used for two purposes.
1) To work out how much we need to multiple the per-VF BAR size to figure
out the total space required for the IOV BAR.
2) To indicate that IOV is not usable with this device (vfs_expanded == 0).
We don't really need the field for either since the multiple in 1) is
always the number PEs supported by the PHB. Similarly, we don't really need
it in 2) either since the IOV data field will be NULL if we can't use IOV
with the device.
Signed-off-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Using single PE BARs to map an SR-IOV BAR is really a choice about what
strategy to use when mapping a BAR. It doesn't make much sense for this to
be a global setting since a device might have one large BAR which needs to
be mapped with single PE windows and another smaller BAR that can be mapped
with a regular segmented window. Make the segmented vs single decision a
per-BAR setting and clean up the logic that decides which mode to use.
Signed-off-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Split up the logic so that we have one branch that handles setting up a
segmented window and another that handles setting up single PE windows for
each VF.
Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
I want to refactor the loop this code is currently inside of. Hoist it on
out.
Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Remove the IODA2 PHB checks. We already assume IODA2 in several places so
there's not much point in wrapping most of the setup and teardown process
in an if block.
Signed-off-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Currently the iov->pe_num_map[] does one of two things depending on
whether single PE mode is being used or not. When it is, this contains an
array which maps a vf_index to the corresponding PE number. When single PE
mode is not being used this contains a scalar which is the base PE for the
set of enabled VFs (for for VFn is base + n).
The array was necessary because when calling pnv_ioda_alloc_pe() there is
no guarantee that the allocated PEs would be contigious. We can now
allocate contigious blocks of PEs so this is no longer an issue. This
allows us to drop the if (single_mode) {} .. else {} block scattered
through the SR-IOV code which is a nice clean up.
This also fixes a bug in pnv_pci_sriov_disable() which is the non-atomic
bitmap_clear() to manipulate the PE allocation map. Other users of the map
assume it will be accessed with atomic ops.
Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Rework the PE allocation logic to allow allocating blocks of PEs rather
than individually. We'll use this to allocate contigious blocks of PEs for
the SR-IOVs.
This patch also adds code to pnv_ioda_alloc_pe() and pnv_ioda_reserve_pe() to
use the existing, but unused, phb->pe_alloc_mutex. Currently these functions
use atomic bit ops to release a currently allocated PE number. However,
the pnv_ioda_alloc_pe() wants to have exclusive access to the bit map while
scanning for hole large enough to accomodate the allocation size.
Signed-off-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The sequence required to use the single PE BAR mode is kinda janky and
requires a little explanation. The API was designed with P7-IOC style
windows where the setup process is something like:
1. Configure the window start / end address
2. Enable the window
3. Map the segments of each window to the PE
For Single PE BARs the process is:
1. Set the PE for segment zero on a disabled window
2. Set the range
3. Enable the window
Move the OPAL calls into their own helper functions where the quirks can be
contained.
Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
No need for the multi-dimensional arrays, just use a bitmap.
Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
This prevents SR-IOV being used by making the SR-IOV BAR resources
unallocatable. Rename it to reflect what it actually does.
Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
SR-IOV support on PowerNV is a byzantine maze of hooks. I have no idea
how anyone is supposed to know how it works except through a lot of
stuffering. Write up some docs about the overall story to help out
the next sucker^Wperson who needs to tinker with it.
Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
pci-ioda.c is getting a bit unwieldly due to the amount of stuff jammed in
there. The SR-IOV support can be extracted easily enough and is mostly
standalone, so move it into a separate file.
This patch also moves the PowerNV SR-IOV specific fields from pci_dn and
moves them into a platform specific structure. I'm not sure how they ended
up in there in the first place, but leaking platform specifics into common
code has proven to be a terrible idea so far so lets stop doing that.
Signed-off-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
We pre-configure the m64 window for IODA1 as a 1-1 segment-PE mapping,
similar to PHB3. Currently the actual mapping of segments occurs in
pnv_ioda_pick_m64_pe(), but we can move it into pnv_ioda1_init_m64() and
drop the IODA1 specific code paths in the PE setup / teardown.
Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
There's an optimisation in the PE setup which skips performing DMA
setup for a PE if we only have bridges in a PE. The assumption being
that only "real" devices will DMA to system memory, which is probably
fair. However, if we start off with only bridge devices in a PE then
add a non-bridge device the new device won't be able to use DMA because
we never configured it.
Fix this (admittedly pretty weird) edge case by tracking whether we've done
the DMA setup for the PE or not. If a non-bridge device is added to the PE
(via rescan or hotplug, or whatever) we can set up DMA on demand.
This also means the only remaining user of the old "DMA Weight" code is
the IODA1 DMA setup code that it was originally added for, which is good.
Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Currently we have these two functions:
pnv_pci_ioda2_release_dma_pe(), and
pnv_pci_ioda2_release_pe_dma()
The first is used when tearing down VF PEs and the other is used for normal
devices. There's very little difference between the two though. The latter
(non-VF) will skip a call to pnv_pci_ioda2_unset_window() unless
CONFIG_IOMMU_API=y is set. There's no real point in doing this so fold the
two together.
Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Add a helper to go from a pci_bus structure to the pnv_phb that hosts
that bus. There's a lot of instances of the following pattern:
struct pci_controller *hose = pci_bus_to_host(pdev->bus);
struct pnv_phb *phb = hose->private_data;
Without any other uses of the pci_controller inside the function. This
is hard to read since it requires you to memorise the contents of the
private data fields and kind of error prone since it involves blindly
assigning a void pointer. Add a helper to make it more concise and
explicit.
Signed-off-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The EEH core has a concept of a "PE tree" to support PowerNV. The PE tree
follows the PCI bus structures because a reset asserted on an upstream
bridge will be propagated to the downstream bridges. On pseries there's a
1-1 correspondence between what the guest sees are a PHB and a PE so the
"tree" is really just a single node.
Current the EEH core is reponsible for setting up this PE tree which it
does by traversing the pci_dn tree. The structure of the pci_dn tree
matches the bus tree on PowerNV which leads to the PE tree being "correct"
this setup method doesn't make a whole lot of sense and it's actively
confusing for the pseries case where it doesn't really do anything.
We want to remove the dependence on pci_dn anyway so this patch move
choosing where to insert a new PE into the platform code rather than
being part of the generic EEH code. For PowerNV this simplifies the
tree building logic and removes the use of pci_dn. For pseries we
keep the existing logic. I'm not really convinced it does anything
due to the 1-1 PE-to-PHB correspondence so every device under that
PHB should be in the same PE, but I'd rather not remove it entirely
until we've had a chance to look at it more deeply.
Signed-off-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|