aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-04-14powerpc/rtas: improve ppc_rtas_rmo_buf_show documentationNathan Lynch1-1/+10
Add kerneldoc for ppc_rtas_rmo_buf_show(), the callback for /proc/powerpc/rtas/rmo_buffer, explaining its expected use. Signed-off-by: Nathan Lynch <[email protected]> Reviewed-by: Alexey Kardashevskiy <[email protected]> Reviewed-by: Andrew Donnellan <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-14powerpc/eeh: Fix EEH handling for hugepages in ioremap space.Mahesh Salgaonkar1-7/+4
During the EEH MMIO error checking, the current implementation fails to map the (virtual) MMIO address back to the pci device on radix with hugepage mappings for I/O. This results into failure to dispatch EEH event with no recovery even when EEH capability has been enabled on the device. eeh_check_failure(token) # token = virtual MMIO address addr = eeh_token_to_phys(token); edev = eeh_addr_cache_get_dev(addr); if (!edev) return 0; eeh_dev_check_failure(edev); <= Dispatch the EEH event In case of hugepage mappings, eeh_token_to_phys() has a bug in virt -> phys translation that results in wrong physical address, which is then passed to eeh_addr_cache_get_dev() to match it against cached pci I/O address ranges to get to a PCI device. Hence, it fails to find a match and the EEH event never gets dispatched leaving the device in failed state. The commit 33439620680be ("powerpc/eeh: Handle hugepages in ioremap space") introduced following logic to translate virt to phys for hugepage mappings: eeh_token_to_phys(): + pa = pte_pfn(*ptep); + + /* On radix we can do hugepage mappings for io, so handle that */ + if (hugepage_shift) { + pa <<= hugepage_shift; <= This is wrong + pa |= token & ((1ul << hugepage_shift) - 1); + } This patch fixes the virt -> phys translation in eeh_token_to_phys() function. $ cat /sys/kernel/debug/powerpc/eeh_address_cache mem addr range [0x0000040080000000-0x00000400807fffff]: 0030:01:00.1 mem addr range [0x0000040080800000-0x0000040080ffffff]: 0030:01:00.1 mem addr range [0x0000040081000000-0x00000400817fffff]: 0030:01:00.0 mem addr range [0x0000040081800000-0x0000040081ffffff]: 0030:01:00.0 mem addr range [0x0000040082000000-0x000004008207ffff]: 0030:01:00.1 mem addr range [0x0000040082080000-0x00000400820fffff]: 0030:01:00.0 mem addr range [0x0000040082100000-0x000004008210ffff]: 0030:01:00.1 mem addr range [0x0000040082110000-0x000004008211ffff]: 0030:01:00.0 Above is the list of cached io address ranges of pci 0030:01:00.<fn>. Before this patch: Tracing 'arg1' of function eeh_addr_cache_get_dev() during error injection clearly shows that 'addr=' contains wrong physical address: kworker/u16:0-7 [001] .... 108.883775: eeh_addr_cache_get_dev: (eeh_addr_cache_get_dev+0xc/0xf0) addr=0x80103000a510 dmesg shows no EEH recovery messages: [ 108.563768] bnx2x: [bnx2x_timer:5801(eth2)]MFW seems hanged: drv_pulse (0x9ae) != mcp_pulse (0x7fff) [ 108.563788] bnx2x: [bnx2x_hw_stats_update:870(eth2)]NIG timer max (4294967295) [ 108.883788] bnx2x: [bnx2x_acquire_hw_lock:2013(eth1)]lock_status 0xffffffff resource_bit 0x1 [ 108.884407] bnx2x 0030:01:00.0 eth1: MDC/MDIO access timeout [ 108.884976] bnx2x 0030:01:00.0 eth1: MDC/MDIO access timeout <..> After this patch: eeh_addr_cache_get_dev() trace shows correct physical address: <idle>-0 [001] ..s. 1043.123828: eeh_addr_cache_get_dev: (eeh_addr_cache_get_dev+0xc/0xf0) addr=0x40080bc7cd8 dmesg logs shows EEH recovery getting triggerred: [ 964.323980] bnx2x: [bnx2x_timer:5801(eth2)]MFW seems hanged: drv_pulse (0x746f) != mcp_pulse (0x7fff) [ 964.323991] EEH: Recovering PHB#30-PE#10000 [ 964.324002] EEH: PE location: N/A, PHB location: N/A [ 964.324006] EEH: Frozen PHB#30-PE#10000 detected <..> Fixes: 33439620680b ("powerpc/eeh: Handle hugepages in ioremap space") Cc: [email protected] # v5.3+ Reported-by: Dominic DeMarco <[email protected]> Signed-off-by: Mahesh Salgaonkar <[email protected]> Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/161821396263.48361.2796709239866588652.stgit@jupiter
2021-04-14powerpc/xive: Modernize XIVE-IPI domain with an 'alloc' handlerCédric Le Goater1-8/+19
Instead of calling irq_create_mapping() to map the IPI for a node, introduce an 'alloc' handler. This is usually an extension to support hierarchy irq_domains which is not exactly the case for XIVE-IPI domain. However, we can now use the irq_domain_alloc_irqs() routine which allocates the IRQ descriptor on the specified node, even better for cache performance on multi node machines. Signed-off-by: Cédric Le Goater <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-14powerpc/xive: Map one IPI interrupt per nodeCédric Le Goater2-15/+47
ipistorm [*] can be used to benchmark the raw interrupt rate of an interrupt controller by measuring the number of IPIs a system can sustain. When applied to the XIVE interrupt controller of POWER9 and POWER10 systems, a significant drop of the interrupt rate can be observed when crossing the second node boundary. This is due to the fact that a single IPI interrupt is used for all CPUs of the system. The structure is shared and the cache line updates impact greatly the traffic between nodes and the overall IPI performance. As a workaround, the impact can be reduced by deactivating the IRQ lockup detector ("noirqdebug") which does a lot of accounting in the Linux IRQ descriptor structure and is responsible for most of the performance penalty. As a fix, this proposal allocates an IPI interrupt per node, to be shared by all CPUs of that node. It solves the scaling issue, the IRQ lockup detector still has an impact but the XIVE interrupt rate scales linearly. It also improves the "noirqdebug" case as showed in the tables below. * P9 DD2.2 - 2s * 64 threads "noirqdebug" Mint/s Mint/s chips cpus IPI/sys IPI/chip IPI/chip IPI/sys -------------------------------------------------------------- 1 0-15 4.984023 4.875405 4.996536 5.048892 0-31 10.879164 10.544040 10.757632 11.037859 0-47 15.345301 14.688764 14.926520 15.310053 0-63 17.064907 17.066812 17.613416 17.874511 2 0-79 11.768764 21.650749 22.689120 22.566508 0-95 10.616812 26.878789 28.434703 28.320324 0-111 10.151693 31.397803 31.771773 32.388122 0-127 9.948502 33.139336 34.875716 35.224548 * P10 DD1 - 4s (not homogeneous) 352 threads "noirqdebug" Mint/s Mint/s chips cpus IPI/sys IPI/chip IPI/chip IPI/sys -------------------------------------------------------------- 1 0-15 2.409402 2.364108 2.383303 2.395091 0-31 6.028325 6.046075 6.089999 6.073750 0-47 8.655178 8.644531 8.712830 8.724702 0-63 11.629652 11.735953 12.088203 12.055979 0-79 14.392321 14.729959 14.986701 14.973073 0-95 12.604158 13.004034 17.528748 17.568095 2 0-111 9.767753 13.719831 19.968606 20.024218 0-127 6.744566 16.418854 22.898066 22.995110 0-143 6.005699 19.174421 25.425622 25.417541 0-159 5.649719 21.938836 27.952662 28.059603 0-175 5.441410 24.109484 31.133915 31.127996 3 0-191 5.318341 24.405322 33.999221 33.775354 0-207 5.191382 26.449769 36.050161 35.867307 0-223 5.102790 29.356943 39.544135 39.508169 0-239 5.035295 31.933051 42.135075 42.071975 0-255 4.969209 34.477367 44.655395 44.757074 4 0-271 4.907652 35.887016 47.080545 47.318537 0-287 4.839581 38.076137 50.464307 50.636219 0-303 4.786031 40.881319 53.478684 53.310759 0-319 4.743750 43.448424 56.388102 55.973969 0-335 4.709936 45.623532 59.400930 58.926857 0-351 4.681413 45.646151 62.035804 61.830057 [*] https://github.com/antonblanchard/ipistorm Signed-off-by: Cédric Le Goater <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-14powerpc/xive: Fix xmon command "dxi"Cédric Le Goater1-4/+10
When under xmon, the "dxi" command dumps the state of the XIVE interrupts. If an interrupt number is specified, only the state of the associated XIVE interrupt is dumped. This form of the command lacks an irq_data parameter which is nevertheless used by xmon_xive_get_irq_config(), leading to an xmon crash. Fix that by doing a lookup in the system IRQ mapping to query the IRQ descriptor data. Invalid interrupt numbers, or not belonging to the XIVE IRQ domain, OPAL event interrupt number for instance, should be caught by the previous query done at the firmware level. Fixes: 97ef27507793 ("powerpc/xive: Fix xmon support on the PowerNV platform") Reported-by: kernel test robot <[email protected]> Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Cédric Le Goater <[email protected]> Tested-by: Greg Kurz <[email protected]> Reviewed-by: Greg Kurz <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-14powerpc/xive: Simplify the dump of XIVE interrupts under xmonCédric Le Goater3-26/+17
Move the xmon routine under XIVE subsystem and rework the loop on the interrupts taking into account the xive_irq_domain to filter out IPIs. Signed-off-by: Cédric Le Goater <[email protected]> Reviewed-by: Greg Kurz <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-14powerpc/xive: Drop check on irq_data in xive_core_debug_show()Cédric Le Goater1-11/+10
When looping on IRQ descriptor, irq_data is always valid. Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state") Reported-by: kernel test robot <[email protected]> Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Cédric Le Goater <[email protected]> Reviewed-by: Greg Kurz <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-14powerpc/xive: Simplify xive_core_debug_show()Cédric Le Goater1-14/+4
Now that the IPI interrupt has its own domain, the checks on the HW interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a check on the domain. Signed-off-by: Cédric Le Goater <[email protected]> Reviewed-by: Greg Kurz <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-14powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQCédric Le Goater1-2/+1
The IPI interrupt has its own domain now. Testing the HW interrupt number is not needed anymore. Signed-off-by: Cédric Le Goater <[email protected]> Reviewed-by: Greg Kurz <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-14powerpc/xive: Introduce an IPI interrupt domainCédric Le Goater1-33/+46
The IPI interrupt is a special case of the XIVE IRQ domain. When mapping and unmapping the interrupts in the Linux interrupt number space, the HW interrupt number 0 (XIVE_IPI_HW_IRQ) is checked to distinguish the IPI interrupt from other interrupts of the system. Simplify the XIVE interrupt domain by introducing a specific domain for the IPI. Signed-off-by: Cédric Le Goater <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-14powerpc/smp: Make some symbols staticYu Kuai1-3/+3
The sparse tool complains as follows: arch/powerpc/kernel/smp.c:86:1: warning: symbol '__pcpu_scope_cpu_coregroup_map' was not declared. Should it be static? arch/powerpc/kernel/smp.c:125:1: warning: symbol '__pcpu_scope_thread_group_l1_cache_map' was not declared. Should it be static? arch/powerpc/kernel/smp.c:132:1: warning: symbol '__pcpu_scope_thread_group_l2_cache_map' was not declared. Should it be static? These symbols are not used outside of smp.c, so this commit marks them static. Reported-by: Hulk Robot <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-14macintosh/via-pmu: Make some symbols staticYu Kuai1-2/+2
The sparse tool complains as follows: drivers/macintosh/via-pmu.c:183:5: warning: symbol 'pmu_cur_battery' was not declared. Should it be static? drivers/macintosh/via-pmu.c:190:5: warning: symbol '__fake_sleep' was not declared. Should it be static? These symbols are not used outside of via-pmu.c, so this commit marks them static. Reported-by: Hulk Robot <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-14windfarm: make symbol 'wf_thread' staticYu Kuai1-1/+1
The sparse tool complains as follows: drivers/macintosh/windfarm_core.c:59:20: warning: symbol 'wf_thread' was not declared. Should it be static? This symbol is not used outside of windfarm_core.c, so this commit marks it static. Reported-by: Hulk Robot <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-14macintosh/windfarm: Make symbol 'pm121_sys_state' staticYu Kuai1-1/+1
The sparse tool complains as follows: drivers/macintosh/windfarm_pm121.c:436:24: warning: symbol 'pm121_sys_state' was not declared. Should it be static? This symbol is not used outside of windfarm_pm121.c, so this commit marks it static. Reported-by: Hulk Robot <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-14powerpc/mce: Make symbol 'mce_ue_event_work' staticLi Huafei1-1/+1
The sparse tool complains as follows: arch/powerpc/kernel/mce.c:43:1: warning: symbol 'mce_ue_event_work' was not declared. Should it be static? This symbol is not used outside of mce.c, so this commit marks it static. Signed-off-by: Li Huafei <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-14powerpc/security: Make symbol 'stf_barrier' staticLi Huafei1-1/+1
The sparse tool complains as follows: arch/powerpc/kernel/security.c:253:6: warning: symbol 'stf_barrier' was not declared. Should it be static? This symbol is not used outside of security.c, so this commit marks it static. Signed-off-by: Li Huafei <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-14powerpc/32s: Define a MODULE area below kernel text all the timeChristophe Leroy3-10/+1
On book3s/32, the segment below kernel text is used for module allocation when CONFIG_STRICT_KERNEL_RWX is defined. In order to benefit from the powerpc specific module_alloc() function which allocate modules with 32 Mbytes from end of kernel text, use that segment below PAGE_OFFSET at all time. Signed-off-by: Christophe Leroy <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/a46dcdd39a9e80b012d86c294c4e5cd8d31665f3.1617283827.git.christophe.leroy@csgroup.eu
2021-04-14powerpc/8xx: Define a MODULE area below kernel textChristophe Leroy1-0/+3
On the 8xx, TASK_SIZE is 0x80000000. The space between TASK_SIZE and PAGE_OFFSET is not used. In order to benefit from the powerpc specific module_alloc() function which allocate modules with 32 Mbytes from end of kernel text, define MODULES_VADDR and MODULES_END. Set a 256Mb area just below PAGE_OFFSET, like book3s/32. Signed-off-by: Christophe Leroy <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/a225606d5b3a8bc53fe612ad52c855c60b0a0a58.1617283827.git.christophe.leroy@csgroup.eu
2021-04-14powerpc/modules: Load modules closer to kernel textChristophe Leroy1-3/+20
On book3s/32, when STRICT_KERNEL_RWX is selected, modules are allocated on the segment just before kernel text, ie on the 0xb0000000-0xbfffffff when PAGE_OFFSET is 0xc0000000. On the 8xx, TASK_SIZE is 0x80000000. The space between TASK_SIZE and PAGE_OFFSET is not used and could be used for modules. The idea comes from ARM architecture. Having modules just below PAGE_OFFSET offers an opportunity to minimise the distance between kernel text and modules and avoid trampolines in modules to access kernel functions or other module functions. When MODULES_VADDR is defined, powerpc has it's own module_alloc() function. In that function, first try to allocate the module above the limit defined by '_etext - 32M'. Then if the allocation fails, fallback to the entire MODULES area. DEBUG logs in module_32.c without the patch: [ 1572.588822] module_32: Applying ADD relocate section 13 to 12 [ 1572.588891] module_32: Doing plt for call to 0xc00671a4 at 0xcae04024 [ 1572.588964] module_32: Initialized plt for 0xc00671a4 at cae04000 [ 1572.589037] module_32: REL24 value = CAE04000. location = CAE04024 [ 1572.589110] module_32: Location before: 48000001. [ 1572.589171] module_32: Location after: 4BFFFFDD. [ 1572.589231] module_32: ie. jump to 03FFFFDC+CAE04024 = CEE04000 [ 1572.589317] module_32: Applying ADD relocate section 15 to 14 [ 1572.589386] module_32: Doing plt for call to 0xc00671a4 at 0xcadfc018 [ 1572.589457] module_32: Initialized plt for 0xc00671a4 at cadfc000 [ 1572.589529] module_32: REL24 value = CADFC000. location = CADFC018 [ 1572.589601] module_32: Location before: 48000000. [ 1572.589661] module_32: Location after: 4BFFFFE8. [ 1572.589723] module_32: ie. jump to 03FFFFE8+CADFC018 = CEDFC000 With the patch: [ 279.404671] module_32: Applying ADD relocate section 13 to 12 [ 279.404741] module_32: REL24 value = C00671B4. location = BF808024 [ 279.404814] module_32: Location before: 48000001. [ 279.404874] module_32: Location after: 4885F191. [ 279.404933] module_32: ie. jump to 0085F190+BF808024 = C00671B4 [ 279.405016] module_32: Applying ADD relocate section 15 to 14 [ 279.405085] module_32: REL24 value = C00671B4. location = BF800018 [ 279.405156] module_32: Location before: 48000000. [ 279.405215] module_32: Location after: 4886719C. [ 279.405275] module_32: ie. jump to 0086719C+BF800018 = C00671B4 We see that with the patch, no plt entries are set. Signed-off-by: Christophe Leroy <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/0c3d5cb8a4dfdf6ca1b8aeb385c01470d6628d55.1617283827.git.christophe.leroy@csgroup.eu
2021-04-14powerpc/mm: Add cond_resched() while removing hpte mappingsVaibhav Jain1-1/+12
While removing large number of mappings from hash page tables for large memory systems as soft-lockup is reported because of the time spent inside htap_remove_mapping() like one below: watchdog: BUG: soft lockup - CPU#8 stuck for 23s! <snip> NIP plpar_hcall+0x38/0x58 LR pSeries_lpar_hpte_invalidate+0x68/0xb0 Call Trace: 0x1fffffffffff000 (unreliable) pSeries_lpar_hpte_removebolted+0x9c/0x230 hash__remove_section_mapping+0xec/0x1c0 remove_section_mapping+0x28/0x3c arch_remove_memory+0xfc/0x150 devm_memremap_pages_release+0x180/0x2f0 devm_action_release+0x30/0x50 release_nodes+0x28c/0x300 device_release_driver_internal+0x16c/0x280 unbind_store+0x124/0x170 drv_attr_store+0x44/0x60 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1b0/0x290 __vfs_write+0x3c/0x70 vfs_write+0xd4/0x270 ksys_write+0xdc/0x130 system_call+0x5c/0x70 Fix this by adding a cond_resched() to the loop in htap_remove_mapping() that issues hcall to remove hpte mapping. The call to cond_resched() is issued every HZ jiffies which should prevent the soft-lockup from being reported. Suggested-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Vaibhav Jain <[email protected]> Reviewed-by: Christophe Leroy <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-14powerpc/papr_scm: Implement support for H_SCM_FLUSH hcallShivaprasad G Bhat3-1/+55
Add support for ND_REGION_ASYNC capability if the device tree indicates 'ibm,hcall-flush-required' property in the NVDIMM node. Flush is done by issuing H_SCM_FLUSH hcall to the hypervisor. If the flush request failed, the hypervisor is expected to to reflect the problem in the subsequent nvdimm H_SCM_HEALTH call. This patch prevents mmap of namespaces with MAP_SYNC flag if the nvdimm requires an explicit flush[1]. References: [1] https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c Signed-off-by: Shivaprasad G Bhat <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> [mpe: Use unsigned long / long instead of uint64_t/int64_t] Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/161703936121.36.7260632399582101498.stgit@e1fbed493c87
2021-04-12powerpc/signal32: Fix build failure with CONFIG_SPEChristophe Leroy1-1/+1
Add missing fault exit label in unsafe_copy_from_user() in order to avoid following build failure with CONFIG_SPE CC arch/powerpc/kernel/signal_32.o arch/powerpc/kernel/signal_32.c: In function 'restore_user_regs': arch/powerpc/kernel/signal_32.c:565:36: error: macro "unsafe_copy_from_user" requires 4 arguments, but only 3 given 565 | ELF_NEVRREG * sizeof(u32)); | ^ In file included from ./include/linux/uaccess.h:11, from ./include/linux/sched/task.h:11, from ./include/linux/sched/signal.h:9, from ./include/linux/rcuwait.h:6, from ./include/linux/percpu-rwsem.h:7, from ./include/linux/fs.h:33, from ./include/linux/huge_mm.h:8, from ./include/linux/mm.h:707, from arch/powerpc/kernel/signal_32.c:17: ./arch/powerpc/include/asm/uaccess.h:428: note: macro "unsafe_copy_from_user" defined here 428 | #define unsafe_copy_from_user(d, s, l, e) \ | arch/powerpc/kernel/signal_32.c:564:3: error: 'unsafe_copy_from_user' undeclared (first use in this function); did you mean 'raw_copy_from_user'? 564 | unsafe_copy_from_user(current->thread.evr, &sr->mc_vregs, | ^~~~~~~~~~~~~~~~~~~~~ | raw_copy_from_user arch/powerpc/kernel/signal_32.c:564:3: note: each undeclared identifier is reported only once for each function it appears in make[3]: *** [arch/powerpc/kernel/signal_32.o] Error 1 Fixes: 627b72bee84d ("powerpc/signal32: Convert restore_[tm]_user_regs() to user access block") Reported-by: kernel test robot <[email protected]> Reported-by: Guenter Roeck <[email protected]> Signed-off-by: Christophe Leroy <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/aad2cb1801a3cc99bc27081022925b9fc18a0dfb.1618159169.git.christophe.leroy@csgroup.eu
2021-04-12KVM: PPC: Book3S HV: Ensure MSR[HV] is always clear in guest MSRNicholas Piggin2-4/+4
Rather than clear the HV bit from the MSR at guest entry, make it clear that the hypervisor does not allow the guest to set the bit. The HV clear is kept in guest entry for now, but a future patch will warn if it is set. Signed-off-by: Nicholas Piggin <[email protected]> Acked-by: Paul Mackerras <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-12KVM: PPC: Book3S HV: Ensure MSR[ME] is always set in guest MSRNicholas Piggin2-1/+6
Rather than add the ME bit to the MSR at guest entry, make it clear that the hypervisor does not allow the guest to clear the bit. The ME set is kept in guest entry for now, but a future patch will warn if it's not present. Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Daniel Axtens <[email protected]> Reviewed-by: Fabiano Rosas <[email protected]> Acked-by: Paul Mackerras <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-12powerpc/64s: remove KVM SKIP test from instruction breakpoint handlerNicholas Piggin1-2/+7
The code being executed in KVM_GUEST_MODE_SKIP is hypervisor code with MSR[IR]=0, so the faults of concern are the d-side ones caused by access to guest context by the hypervisor. Instruction breakpoint interrupts are not a concern here. It's unlikely any good would come of causing breaks in this code, but skipping the instruction that caused it won't help matters (e.g., skip the mtmsr that sets MSR[DR]=0 or clears KVM_GUEST_MODE_SKIP). [Paul notes: "the 0x1300 interrupt was dropped from the architecture a long time ago and is not generated by P7, P8, P9 or P10." So add a comment about this in the handler code while we're here. ] Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Daniel Axtens <[email protected]> Reviewed-by: Fabiano Rosas <[email protected]> Acked-by: Paul Mackerras <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-12powerpc/64s: Remove KVM handler support from CBE_RAS interruptsNicholas Piggin1-6/+0
Cell does not support KVM. Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Fabiano Rosas <[email protected]> Acked-by: Paul Mackerras <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-12KVM: PPC: Book3S HV: Fix CONFIG_SPAPR_TCE_IOMMU=n default hcallsNicholas Piggin1-0/+2
This config option causes the warning in init_default_hcalls to fire because the TCE handlers are in the default hcall list but not implemented. Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Daniel Axtens <[email protected]> Acked-by: Paul Mackerras <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-12KVM: PPC: Book3S HV: remove unused kvmppc_h_protect argumentNicholas Piggin2-4/+2
The va argument is not used in the function or set by its asm caller, so remove it to be safe. Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Daniel Axtens <[email protected]> Acked-by: Paul Mackerras <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-12KVM: PPC: Book3S HV: Remove redundant mtspr PSPBNicholas Piggin1-1/+0
This SPR is set to 0 twice when exiting the guest. Suggested-by: Fabiano Rosas <[email protected]> Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Daniel Axtens <[email protected]> Acked-by: Paul Mackerras <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-12KVM: PPC: Book3S HV: Prevent radix guests setting LPCR[TC]Nicholas Piggin1-0/+4
Prevent radix guests setting LPCR[TC]. This bit only applies to hash partitions. Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Alexey Kardashevskiy <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-12KVM: PPC: Book3S HV: Disallow LPCR[AIL] to be set to 1 or 2Nicholas Piggin1-1/+6
These are already disallowed by H_SET_MODE from the guest, also disallow these by updating LPCR directly. AIL modes can affect the host interrupt behaviour while the guest LPCR value is set, so filter it here too. Suggested-by: Fabiano Rosas <[email protected]> Signed-off-by: Nicholas Piggin <[email protected]> Acked-by: Paul Mackerras <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-12KVM: PPC: Book3S HV: Add a function to filter guest LPCR bitsNicholas Piggin3-19/+59
Guest LPCR depends on hardware type, and future changes will add restrictions based on errata and guest MMU mode. Move this logic to a common function and use it for the cases where the guest wants to update its LPCR (or the LPCR of a nested guest). This also adds a warning in other places that set or update LPCR if we try to set something that would have been disallowed by the filter, as a sanity check. Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Fabiano Rosas <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-12KVM: PPC: Book3S HV: Nested move LPCR sanitising to sanitise_hv_regsNicholas Piggin1-6/+21
This will get a bit more complicated in future patches. Move it into the helper function. This change allows the L1 hypervisor to determine some of the LPCR bits that the L0 is using to run it, which could be a privilege violation (LPCR is HV-privileged), although the same problem exists now for HFSCR for example. Discussion of the HV privilege issue is ongoing and can be resolved with a later change. Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Fabiano Rosas <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-12KVM: PPC: Book3S HV P9: Restore host CTRL SPR after guest exitNicholas Piggin1-0/+3
The host CTRL (runlatch) value is not restored after guest exit. The host CTRL should always be 1 except in CPU idle code, so this can result in the host running with runlatch clear, and potentially switching to a different vCPU which then runs with runlatch clear as well. This has little effect on P9 machines, CTRL is only responsible for some PMU counter logic in the host and so other than corner cases of software relying on that, or explicitly reading the runlatch value (Linux does not appear to be affected but it's possible non-Linux guests could be), there should be no execution correctness problem, though it could be used as a covert channel between guests. There may be microcontrollers, firmware or monitoring tools that sample the runlatch value out-of-band, however since the register is writable by guests, these values would (should) not be relied upon for correct operation of the host, so suboptimal performance or incorrect reporting should be the worst problem. Fixes: 95a6432ce9038 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-08powerpc/32: Remove powerpc specific definition of 'ptrdiff_t'Christophe Leroy1-5/+0
For unknown reason, old commit d27dfd388715 ("Import pre2.0.8") changed 'ptrdiff_t' from 'int' to 'long'. GCC expects it as 'int' really, and this leads to the following warning when building KFENCE: CC mm/kfence/report.o In file included from ./include/linux/printk.h:7, from ./include/linux/kernel.h:16, from mm/kfence/report.c:10: mm/kfence/report.c: In function 'kfence_report_error': ./include/linux/kern_levels.h:5:18: warning: format '%td' expects argument of type 'ptrdiff_t', but argument 6 has type 'long int' [-Wformat=] 5 | #define KERN_SOH "\001" /* ASCII Start Of Header */ | ^~~~~~ ./include/linux/kern_levels.h:11:18: note: in expansion of macro 'KERN_SOH' 11 | #define KERN_ERR KERN_SOH "3" /* error conditions */ | ^~~~~~~~ ./include/linux/printk.h:343:9: note: in expansion of macro 'KERN_ERR' 343 | printk(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__) | ^~~~~~~~ mm/kfence/report.c:213:3: note: in expansion of macro 'pr_err' 213 | pr_err("Out-of-bounds %s at 0x%p (%luB %s of kfence-#%td):\n", | ^~~~~~ <asm-generic/uapi/posix-types.h> defines it as 'int', and defines 'size_t' and 'ssize_t' exactly as powerpc do, so remove the powerpc specific definitions and fallback on generic ones. Signed-off-by: Christophe Leroy <[email protected]> Acked-by: Segher Boessenkool <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/e43d133bf52fa19e577f64f3a3a38cedc570377d.1617616601.git.christophe.leroy@csgroup.eu
2021-04-08powerpc: iommu: fix build when neither PCI or IBMVIO is setRandy Dunlap1-0/+1
When neither CONFIG_PCI nor CONFIG_IBMVIO is set/enabled, iommu.c has a build error. The fault injection code is not useful in that kernel config, so make the FAIL_IOMMU option depend on PCI || IBMVIO. Prevents this build error (warning escalated to error): ../arch/powerpc/kernel/iommu.c:178:30: error: 'fail_iommu_bus_notifier' defined but not used [-Werror=unused-variable] 178 | static struct notifier_block fail_iommu_bus_notifier = { Fixes: d6b9a81b2a45 ("powerpc: IOMMU fault injection") Reported-by: kernel test robot <[email protected]> Suggested-by: Michael Ellerman <[email protected]> Signed-off-by: Randy Dunlap <[email protected]> Acked-by: Randy Dunlap <[email protected]> # build-tested Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-08powerpc/pseries: remove unneeded semicolonYang Li1-1/+1
Eliminate the following coccicheck warning: ./arch/powerpc/platforms/pseries/lpar.c:1633:2-3: Unneeded semicolon Reported-by: Abaci Robot <[email protected]> Signed-off-by: Yang Li <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-08powerpc/64s: power4 nap fixup in CNicholas Piggin5-45/+35
There is no need for this to be in asm, use the new intrrupt entry wrapper. Signed-off-by: Nicholas Piggin <[email protected]> Tested-by: Andreas Schwab <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-08powerpc/perf: Fix PMU constraint check for EBB eventsAthira Rajeev1-2/+2
The power PMU group constraints includes check for EBB events to make sure all events in a group must agree on EBB. This will prevent scheduling EBB and non-EBB events together. But in the existing check, settings for constraint mask and value is interchanged. Patch fixes the same. Before the patch, PMU selftest "cpu_event_pinned_vs_ebb_test" fails with below in dmesg logs. This happens because EBB event gets enabled along with a non-EBB cpu event. [35600.453346] cpu_event_pinne[41326]: illegal instruction (4) at 10004a18 nip 10004a18 lr 100049f8 code 1 in cpu_event_pinned_vs_ebb_test[10000000+10000] Test results after the patch: $ ./pmu/ebb/cpu_event_pinned_vs_ebb_test test: cpu_event_pinned_vs_ebb tags: git_version:v5.12-rc5-93-gf28c3125acd3-dirty Binding to cpu 8 EBB Handler is at 0x100050c8 read error on event 0x7fffe6bd4040! PM_RUN_INST_CMPL: result 9872 running/enabled 37930432 success: cpu_event_pinned_vs_ebb This bug was hidden by other logic until commit 1908dc911792 (perf: Tweak perf_event_attr::exclusive semantics). Fixes: 4df489991182 ("powerpc/perf: Add power8 EBB support") Reported-by: Thadeu Lima de Souza Cascardo <[email protected]> Signed-off-by: Athira Rajeev <[email protected]> [mpe: Mention commit 1908dc911792] Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-08selftests/powerpc: Suggest memtrace instead of /dev/mem for ci memoryJordan Niethe1-10/+1
The suggested alternative for getting cache-inhibited memory with 'mem=' and /dev/mem is pretty hacky. Also, PAPR guests do not allow system memory to be mapped cache-inhibited so despite /dev/mem being available this will not work which can cause confusion. Instead recommend using the memtrace buffers. memtrace is only available on powernv so there will not be any chance of trying to do this in a guest. Signed-off-by: Jordan Niethe <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-08powerpc/powernv/memtrace: Allow mmaping trace buffersJordan Niethe1-1/+17
Let the memory removed from the linear mapping to be used for the trace buffers be mmaped. This is a useful way of providing cache-inhibited memory for the alignment_handler selftest. Signed-off-by: Jordan Niethe <[email protected]> [mpe: make memtrace_mmap() static as noticed by [email protected]] Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-08powerpc/kexec: Don't use .machine ppc64 in trampoline_64.SMichael Ellerman1-1/+0
As best as I can tell the ".machine" directive in trampoline_64.S is no longer, or never was, necessary. It was added in commit 0d97631392c2 ("powerpc: Add purgatory for kexec_file_load() implementation."), which created the file based on the kexec-tools purgatory. It may be/have-been necessary in the kexec-tools version, but we have a completely different build system, and we already pass the desired CPU flags, eg: gcc ... -m64 -Wl,-a64 -mabi=elfv2 -Wa,-maltivec -Wa,-mpower4 -Wa,-many ... arch/powerpc/purgatory/trampoline_64.S So drop the ".machine" directive and rely on the assembler flags. Reported-by: Daniel Axtens <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Reviewed-by: Segher Boessenkool <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-08powerpc/64: Move security code into security.cMichael Ellerman2-264/+261
When the original spectre/meltdown mitigations were merged we put them in setup_64.c for lack of a better place. Since then we created security.c for some of the other mitigation related code. But it should all be in there. This sort of code movement can cause trouble for backports, but hopefully this code is relatively stable these days (famous last words). Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-08powerpc/mm/64s: Allow STRICT_KERNEL_RWX againMichael Ellerman1-1/+1
We have now fixed the known bugs in STRICT_KERNEL_RWX for Book3S 64-bit Hash and Radix MMUs, see preceding commits, so allow the option to be selected again. Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-08powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPARMichael Ellerman1-1/+104
When we enabled STRICT_KERNEL_RWX we received some reports of boot failures when using the Hash MMU and running under phyp. The crashes are intermittent, and often exhibit as a completely unresponsive system, or possibly an oops. One example, which was caught in xmon: [ 14.068327][ T1] devtmpfs: mounted [ 14.069302][ T1] Freeing unused kernel memory: 5568K [ 14.142060][ T347] BUG: Unable to handle kernel instruction fetch [ 14.142063][ T1] Run /sbin/init as init process [ 14.142074][ T347] Faulting instruction address: 0xc000000000004400 cpu 0x2: Vector: 400 (Instruction Access) at [c00000000c7475e0] pc: c000000000004400: exc_virt_0x4400_instruction_access+0x0/0x80 lr: c0000000001862d4: update_rq_clock+0x44/0x110 sp: c00000000c747880 msr: 8000000040001031 current = 0xc00000000c60d380 paca = 0xc00000001ec9de80 irqmask: 0x03 irq_happened: 0x01 pid = 347, comm = kworker/2:1 ... enter ? for help [c00000000c747880] c0000000001862d4 update_rq_clock+0x44/0x110 (unreliable) [c00000000c7478f0] c000000000198794 update_blocked_averages+0xb4/0x6d0 [c00000000c7479f0] c000000000198e40 update_nohz_stats+0x90/0xd0 [c00000000c747a20] c0000000001a13b4 _nohz_idle_balance+0x164/0x390 [c00000000c747b10] c0000000001a1af8 newidle_balance+0x478/0x610 [c00000000c747be0] c0000000001a1d48 pick_next_task_fair+0x58/0x480 [c00000000c747c40] c000000000eaab5c __schedule+0x12c/0x950 [c00000000c747cd0] c000000000eab3e8 schedule+0x68/0x120 [c00000000c747d00] c00000000016b730 worker_thread+0x130/0x640 [c00000000c747da0] c000000000174d50 kthread+0x1a0/0x1b0 [c00000000c747e10] c00000000000e0f0 ret_from_kernel_thread+0x5c/0x6c This shows that CPU 2, which was idle, woke up and then appears to randomly take an instruction fault on a completely valid area of kernel text. The cause turns out to be the call to hash__mark_rodata_ro(), late in boot. Due to the way we layout text and rodata, that function actually changes the permissions for all of text and rodata to read-only plus execute. To do the permission change we use a hypervisor call, H_PROTECT. On phyp that appears to be implemented by briefly removing the mapping of the kernel text, before putting it back with the updated permissions. If any other CPU is executing during that window, it will see spurious faults on the kernel text and/or data, leading to crashes. To fix it we use stop machine to collect all other CPUs, and then have them drop into real mode (MMU off), while we change the mapping. That way they are unaffected by the mapping temporarily disappearing. We don't see this bug on KVM because KVM always use VPM=1, where faults are directed to the hypervisor, and the fault will be serialised vs the h_protect() by HPTE_V_HVLOCK. Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-08powerpc/mm/64s/hash: Factor out change_memory_range()Michael Ellerman1-8/+15
Pull the loop calling hpte_updateboltedpp() out of hash__change_memory_range() into a helper function. We need it to be a separate function for the next patch. Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-08powerpc/64s: Use htab_convert_pte_flags() in hash__mark_rodata_ro()Michael Ellerman1-2/+4
In hash__mark_rodata_ro() we pass the raw PP_RXXX value to hash__change_memory_range(). That has the effect of setting the key to zero, because PP_RXXX contains no key value. Fix it by using htab_convert_pte_flags(), which knows how to convert a pgprot into a pp value, including the key. Fixes: d94b827e89dc ("powerpc/book3s64/kuap: Use Key 3 for kernel mapping with hash translation") Signed-off-by: Michael Ellerman <[email protected]> Reviewed-by: Daniel Axtens <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-08powerpc/pseries: Add key to flags in pSeries_lpar_hpte_updateboltedpp()Michael Ellerman1-1/+3
The flags argument to plpar_pte_protect() (aka. H_PROTECT), includes the key in bits 9-13, but currently we always set those bits to zero. In the past that hasn't been a problem because we always used key 0 for the kernel, and updateboltedpp() is only used for kernel mappings. However since commit d94b827e89dc ("powerpc/book3s64/kuap: Use Key 3 for kernel mapping with hash translation") we are now inadvertently changing the key (to zero) when we call plpar_pte_protect(). That hasn't broken anything because updateboltedpp() is only used for STRICT_KERNEL_RWX, which is currently disabled on 64s due to other bugs. But we want to fix that, so first we need to pass the key correctly to plpar_pte_protect(). We can't pass our newpp value directly in, we have to convert it into the form expected by the hcall. The hcall we're using here is H_PROTECT, which is specified in section 14.5.4.1.6 of LoPAPR v1.1. It takes a `flags` parameter, and the description for flags says: * flags: AVPN, pp0, pp1, pp2, key0-key4, n, and for the CMO option: CMO Option flags as defined in Table 189‚ If you then go to the start of the parent section, 14.5.4.1, on page 405, it says: Register Linkage (For hcall() tokens 0x04 - 0x18) * On Call * R3 function call token * R4 flags (see Table 178‚ “Page Frame Table Access flags field definition‚” on page 401) Then you have to go to section 14.5.3, and on page 394 there is a list of hcalls and their tokens (table 176), and there you can see that H_PROTECT == 0x18. Finally you can look at table 178, on page 401, where it specifies the layout of the bits for the key: Bit Function ----------------- 50-54 | key0-key4 Those are big-endian bit numbers, converting to normal bit numbers you get bits 9-13, or 0x3e00. In the kernel we have: #define HPTE_R_KEY_HI ASM_CONST(0x3000000000000000) #define HPTE_R_KEY_LO ASM_CONST(0x0000000000000e00) So the LO bits of newpp are already in the right place, and the HI bits need to be shifted down by 48. Fixes: d94b827e89dc ("powerpc/book3s64/kuap: Use Key 3 for kernel mapping with hash translation") Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-08powerpc/mm/64s: Add _PAGE_KERNEL_ROXMichael Ellerman1-0/+1
In the past we had a fallback definition for _PAGE_KERNEL_ROX, but we removed that in commit d82fd29c5a8c ("powerpc/mm: Distribute platform specific PAGE and PMD flags and definitions") and added definitions for each MMU family. However we missed adding a definition for 64s, which was not really a bug because it's currently not used. But we'd like to use PAGE_KERNEL_ROX in a future patch so add a definition now. Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-04-08selftests/powerpc: Test for spurious kernel memory faults on radixJordan Niethe2-0/+50
Previously when mapping kernel memory on radix, no ptesync was included which would periodically lead to unhandled spurious faults. Mapping kernel memory is used when code patching with Strict RWX enabled. As suggested by Chris Riedl, turning ftrace on and off does a large amount of code patching so is a convenient way to see this kind of fault. Add a selftest to try and trigger this kind of a spurious fault. It tests for 30 seconds which is usually long enough for the issue to show up. Signed-off-by: Jordan Niethe <[email protected]> [mpe: Rename it to better reflect what it does, rather than the symptom] Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]