2018-09-24powerpc/pseries: Fix uninitialized timer reset on migrationMichael Bringmann1-1/+2
After migration of a powerpc LPAR, the kernel executes code to update the system state to reflect new platform characteristics. Such changes include modifications to device tree properties provided to the system by PHYP. Property notifications received by the post_mobility_fixup() code are passed along to the kernel in general through a call to of_update_property(), which in turn passes such events back to all modules through entries like the '.notifier_call' function within the NUMA module. When the NUMA module updates its state, it resets its event timer. If this occurs after a previous call to stop_topology_update() or on a system without VPHN enabled, the code runs into an uninitialized timer structure and crashes. This patch adds a safety check along this path toward the problem code.

An example crash log is as follows.

  ibmvscsi 30000081: Re-enabling adapter!
  ------------[ cut here ]------------
  kernel BUG at kernel/time/timer.c:958!
  Oops: Exception in kernel mode, sig: 5 [#1]
  LE SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag inet_diag lockd unix_diag af_packet_diag netlink_diag grace fscache sunrpc xts vmx_crypto pseries_rng sg binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
  CPU: 11 PID: 3067 Comm: drmgr Not tainted 4.17.0+ #179
  ...
  NIP mod_timer+0x4c/0x400
  LR reset_topology_timer+0x40/0x60
  Call Trace:
    0xc0000003f9407830 (unreliable)
    reset_topology_timer+0x40/0x60
    dt_update_callback+0x100/0x120
    notifier_call_chain+0x90/0x100
    __blocking_notifier_call_chain+0x60/0x90
    of_property_notify+0x90/0xd0
    of_update_property+0x104/0x150
    update_dt_property+0xdc/0x1f0
    pseries_devicetree_update+0x2d0/0x510
    post_mobility_fixup+0x7c/0xf0
    migration_store+0xa4/0xc0
    kobj_attr_store+0x30/0x60
    sysfs_kf_write+0x64/0xa0
    kernfs_fop_write+0x16c/0x240
    __vfs_write+0x40/0x200
    vfs_write+0xc8/0x240
    ksys_write+0x5c/0x100
    system_call+0x58/0x6c

Fixes: 5d88aa85c00b ("powerpc/pseries: Update CPU maps when device tree is updated")
Cc: [email protected] # v3.10+
Signed-off-by: Michael Bringmann <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
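A minimal sketch of the kind of guard described above; the flag and timer names here are assumptions for illustration, not the identifiers used by the actual patch:

    /* Illustrative only -- names are assumed, not taken from the patch. */
    static void reset_topology_timer(void)
    {
        /*
         * If topology updates were never started (no VPHN) or have been
         * stopped, the timer was never set up, so mod_timer() would
         * operate on an uninitialized timer and hit the BUG in
         * kernel/time/timer.c.
         */
        if (!topology_update_enabled)        /* assumed flag */
            return;

        mod_timer(&topology_timer, jiffies + topology_timer_secs * HZ);
    }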
2018-09-20powerpc/pkeys: Fix reading of ibm,processor-storage-keys propertyThiago Jung Bauermann1-1/+1
scan_pkey_feature() uses of_property_read_u32_array() to read the ibm,processor-storage-keys property and calls be32_to_cpu() on the value it gets. The problem is that of_property_read_u32_array() already returns the value converted to the CPU byte order. The value of pkeys_total ends up more or less sane because there's a min() call in pkey_initialize() which reduces pkeys_total to 32. So in practice the kernel ignores the fact that the hypervisor reserved one key for itself (the device tree advertises 31 keys in my test VM). This is wrong, but the effect in practice is that when a process tries to allocate the 32nd key, it gets an -EINVAL error instead of -ENOSPC, which would indicate that there aren't any keys available.

Fixes: cf43d3b26452 ("powerpc: Enable pkey subsystem")
Cc: [email protected] # v4.16+
Signed-off-by: Thiago Jung Bauermann <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
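A rough sketch of the pattern being fixed; the variable names are illustrative, not quoted from the driver:

    struct device_node *node;   /* the relevant device tree node (assumed) */
    u32 vals[2];
    u32 pkeys_total;

    if (of_property_read_u32_array(node, "ibm,processor-storage-keys",
                                   vals, 2))
        return 0;

    /*
     * of_property_read_u32_array() already returns CPU-endian values,
     * so a further be32_to_cpu() double-swaps them on little endian.
     */
    /* buggy:   pkeys_total = be32_to_cpu(vals[0]); */
    pkeys_total = vals[0];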
2018-09-20powerpc: fix csum_ipv6_magic() on little endian platformsChristophe Leroy1-0/+3
On little endian platforms, csum_ipv6_magic() keeps len and proto in CPU byte order. This generates bad results, leading to ICMPv6 packets from other hosts being dropped by powerpc64le platforms. In order to fix this, len and proto should be converted to network byte order, i.e. big endian byte order. However checksumming 0x12345678 and 0x56341278 provides the exact same result, so it is enough to rotate the sum of len and proto by 1 byte. PPC32 only supports big endian, so the fix is only needed for PPC64.

Fixes: e9c4943a107b ("powerpc: Implement csum_ipv6_magic in assembly")
Reported-by: Jianlin Shi <[email protected]>
Reported-by: Xin Long <[email protected]>
Cc: <[email protected]> # 4.18+
Signed-off-by: Christophe Leroy <[email protected]>
Tested-by: Xin Long <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
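The byte-order property relied on here can be checked with a small standalone program. This is an illustrative sketch (not kernel code) that folds both 32-bit values into a 16-bit ones'-complement sum the way the Internet checksum does:

    #include <stdio.h>
    #include <stdint.h>

    /* Fold a 32-bit value into a 16-bit ones'-complement sum. */
    static uint16_t fold16(uint32_t v)
    {
        v = (v >> 16) + (v & 0xffff);
        v += v >> 16;               /* add back any carry */
        return (uint16_t)v;
    }

    int main(void)
    {
        /*
         * Both orderings place the same bytes in the same (high/low)
         * positions of the 16-bit words, so the folded sums match:
         * prints "68ac 68ac".
         */
        printf("%04x %04x\n", fold16(0x12345678), fold16(0x56341278));
        return 0;
    }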
2018-09-20powerpc/powernv/ioda2: Reduce upper limit for DMA window size (again)Alexey Kardashevskiy1-1/+1
mpe: This was fixed originally in commit d3d4ffaae439 ("powerpc/powernv/ioda2: Reduce upper limit for DMA window size"), but contrary to what the merge commit says was inadvertently lost by me in commit ce57c6610cc2 ("Merge branch 'topic/ppc-kvm' into next") which brought in changes that moved the code to a new file. So reapply it to the new file.

Original commit message follows:

We use PHB in mode1 which uses bit 59 to select a correct DMA window. However there is mode2 which uses bits 59:55 and allows up to 32 DMA windows per PE. Even though the documentation does not clearly specify that, it seems that the actual hardware does not support bits 59:55 even in mode1; in other words we can create a window as big as 1<<58 but DMA simply won't work. This reduces the upper limit from 59 to 55 bits to let the userspace know about the hardware limits.

Fixes: ce57c6610cc2 ("Merge branch 'topic/ppc-kvm' into next")
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/fadump: re-register firmware-assisted dump if already registeredHari Bathini1-2/+2
Firmware-Assisted Dump (FADump) needs to be registered again after any memory hot add/remove operation to update the crash memory ranges. But currently, the kernel returns '-EEXIST' if we try to register without unregistering it first. This could expose the system to racing issues while unregistering and registering FADump from userspace during udev events. Spare userspace of this and let it be taken care of in kernel space for a simpler interface.

With this change, running 'echo 1 > /sys/kernel/fadump_registered' will result in re-registering (unregistering and registering) FADump, if it was already registered.

Signed-off-by: Hari Bathini <[email protected]>
Acked-by: Mahesh Salgaonkar <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/pseries: Remove VLA from lparcfg_write()Suraj Jitindar Singh1-3/+2
In lparcfg_write() we hard code kbuf_sz and then use this as the variable length of kbuf, creating a variable length array. Since we're hard coding the length anyway, just define the array using this as the length and remove the need for kbuf_sz, thus removing the variable length array.

Signed-off-by: Suraj Jitindar Singh <[email protected]>
Reviewed-by: Joel Stanley <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
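A minimal before/after sketch of the pattern; the size shown is illustrative, not the value actually used by lparcfg_write():

    #include <stddef.h>

    /* before: the length comes from a variable, so kbuf is a VLA */
    static void write_before(void)
    {
        size_t kbuf_sz = 128;       /* assumed size */
        char kbuf[kbuf_sz];
        (void)kbuf;
    }

    /* after: the length is a compile-time constant, no VLA */
    static void write_after(void)
    {
        char kbuf[128];
        (void)kbuf;
    }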
2018-09-19powerpc/prom: Remove VLA in prom_check_platform_support()Suraj Jitindar Singh1-2/+5
In prom_check_platform_support() we retrieve and parse the "ibm,arch-vec-5-platform-support" property of the chosen node. Currently we use a variable length array; to avoid this, use an array of constant length 8. This property is used to indicate the supported options of vector 5 bytes 23-26 of the ibm,architecture.vec node. Each of these options is a pair of bytes, thus for 4 options we have a max length of 8 bytes.

Signed-off-by: Suraj Jitindar Singh <[email protected]>
Reviewed-by: Joel Stanley <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc: Fix duplicate const clang warning in user access codeAnton Blanchard1-3/+3
This re-applies commit b91c1e3e7a6f ("powerpc: Fix duplicate const clang warning in user access code") (Jun 2015) which was undone in commits:

  f2ca80905929 ("powerpc/sparse: Constify the address pointer in __get_user_nosleep()") (Feb 2017)
  d466f6c5cac1 ("powerpc/sparse: Constify the address pointer in __get_user_nocheck()") (Feb 2017)
  f84ed59a612d ("powerpc/sparse: Constify the address pointer in __get_user_check()") (Feb 2017)

We see a large number of duplicate const errors in the user access code when building with llvm/clang:

  include/linux/pagemap.h:576:8: warning: duplicate 'const' declaration specifier [-Wduplicate-decl-specifier]
          ret = __get_user(c, uaddr);

The problem is we are doing const __typeof__(*(ptr)), which will hit the warning if ptr is marked const. Removing const does not seem to have any effect on GCC code generation.

Signed-off-by: Anton Blanchard <[email protected]>
Signed-off-by: Joel Stanley <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
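A small, self-contained illustration of how the duplicate 'const' arises; the macro and variable names are assumptions for the example, not the kernel's uaccess macros:

    /*
     * __typeof__ of a const-qualified object carries the const with it,
     * so prefixing it with another const duplicates the qualifier.
     */
    #define GET_VAL(ptr) ({                             \
            const __typeof__(*(ptr)) __v = *(ptr);      \
            __v;                                        \
    })

    const char *uaddr = "x";

    char get_one(void)
    {
            /*
             * __typeof__(*(uaddr)) is 'const char', so __v is declared
             * 'const const char' -- clang warns with
             * -Wduplicate-decl-specifier.
             */
            return GET_VAL(uaddr);
    }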
2018-09-19powerpc/boot: Ensure _zimage_start is a weak symbolJoel Stanley1-1/+3
When building with clang crt0's _zimage_start is not marked weak, which breaks the build when linking the kernel image:

  $ objdump -t arch/powerpc/boot/crt0.o |grep _zimage_start$
  0000000000000058 g .text 0000000000000000 _zimage_start

  ld: arch/powerpc/boot/wrapper.a(crt0.o): in function '_zimage_start':
  (.text+0x58): multiple definition of '_zimage_start'; arch/powerpc/boot/pseries-head.o:(.text+0x0): first defined here

Clang requires the .weak directive to appear after the symbol is declared. The binutils manual says:

  This directive sets the weak attribute on the comma separated list of symbol names. If the symbols do not already exist, they will be created.

So it appears this is different with clang. The only reference I could see for this was an OpenBSD mailing list post[1].

Changing it to be after the declaration fixes building with Clang, and still works with GCC.

  $ objdump -t arch/powerpc/boot/crt0.o |grep _zimage_start$
  0000000000000058 w .text 0000000000000000 _zimage_start

Reported to clang as https://bugs.llvm.org/show_bug.cgi?id=38921

[1] https://groups.google.com/forum/#!topic/fa.openbsd.tech/PAgKKen2YCY

Signed-off-by: Joel Stanley <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/configs: Update skiroot defconfigJoel Stanley1-46/+108
Disable new features from recent releases, and clean out some other unused options:

  - Enable EXPERT, so we can disable some things
  - Disable non-powerpc BPF decoders
  - Disable TASKSTATS
  - Disable unused syscalls
  - Set more things to be modules
  - Turn off unused network vendors
  - PPC_OF_BOOT_TRAMPOLINE and FB_OF are unused on powernv
  - Drop unused Radeon and Matrox GPU drivers
  - IPV6 support landed in petitboot
  - Bringup related command line powersave=off dropped, switch to quiet

Set CONFIG_I2C_CHARDEV=y as the module is not loaded automatically, and without this i2cget etc. will fail in the skiroot environment.

This defconfig gets us build coverage of KERNEL_XZ, which was broken in the 4.19 merge window for powerpc.

Signed-off-by: Joel Stanley <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/pseries: Disable CPU hotplug across migrationsNathan Fontenot1-0/+2
When performing partition migrations all present CPUs must be online as all present CPUs must make the H_JOIN call as part of the migration process. Once all present CPUs make the H_JOIN call, one CPU is returned to make the rtas call to perform the migration to the destination system. During testing of migration and changing the SMT state we have found instances where CPUs are offlined, as part of the SMT state change, before they make the H_JOIN call. This results in a hung system where every CPU is either in H_JOIN or offline. To prevent this, this patch disables CPU hotplug during the migration process.

Signed-off-by: Nathan Fontenot <[email protected]>
Reviewed-by: Tyrel Datwyler <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
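A rough sketch of the shape of the change; the migration call site is paraphrased, only cpu_hotplug_disable()/cpu_hotplug_enable() are real kernel helpers here:

    cpu_hotplug_disable();

    /*
     * All present CPUs make the H_JOIN call and one CPU performs the
     * RTAS call that migrates the partition; no CPU can be offlined
     * in the meantime.
     */
    join_and_migrate_partition();   /* assumed placeholder */

    cpu_hotplug_enable();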
2018-09-19powerpc/pseries: Remove unneeded uses of dlpar work queueNathan Fontenot4-43/+19
There are three instances in which dlpar hotplug events are invoked: handling a hotplug interrupt (in a kvm guest), handling a dlpar request through sysfs, and updating LMB affinity when handling a PRRN event. Only in the case of handling a hotplug interrupt do we have to put the work on a workqueue; the other cases can handle the dlpar request directly. This patch exports the handle_dlpar_errorlog() function so that dlpar hotplug events can be handled directly and updates the two instances mentioned above to use the direct invocation.

Signed-off-by: Nathan Fontenot <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/pseries: Remove prrn_work workqueueNathan Fontenot1-14/+3
When a PRRN event is received we are already running in a worker thread. Instead of spawning off another worker thread on the prrn_work workqueue to handle the PRRN event we can just call the PRRN handler routine directly. With this update we can also pass the scope variable for the PRRN event directly to the handler instead of it being a global variable. This patch fixes the following oops message we are seeing in PRRN testing:

  Oops: Bad kernel stack pointer, sig: 6 [#1]
  SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache binfmt_misc reiserfs vfat fat rpadlpar_io(X) rpaphp(X) tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag af_packet xfs libcrc32c dm_service_time ibmveth(X) ses enclosure scsi_transport_sas rtc_generic btrfs xor raid6_pq sd_mod ibmvscsi(X) scsi_transport_srp ipr(X) libata sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
  Supported: Yes, External
  CPU: 7 PID: 18967 Comm: kworker/u96:0 Tainted: G X 4.4.126-94.22-default #1
  Workqueue: pseries hotplug workque pseries_hp_work_fn
  task: c000000775367790 ti: c00000001ebd4000 task.ti: c00000070d140000
  NIP: 0000000000000000 LR: 000000001fb3d050 CTR: 0000000000000000
  REGS: c00000001ebd7d40 TRAP: 0700 Tainted: G X (4.4.126-94.22-default)
  CFAR: 000000001fb3d084
  NIP [0000000000000000] (null)
  LR [000000001fb3d050] 0x1fb3d050
  Call Trace:
  Instruction dump:
  XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
  XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 60000000 60000000 60000000 60000000
  ---[ end trace aa5627b04a7d9d6b ]---

  NMI watchdog: BUG: soft lockup - CPU#27 stuck for 23s! [kworker/27:0:13903]
  Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache binfmt_misc reiserfs vfat fat rpadlpar_io(X) rpaphp(X) tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag af_packet xfs libcrc32c dm_service_time ibmveth(X) ses enclosure scsi_transport_sas rtc_generic btrfs xor raid6_pq sd_mod ibmvscsi(X) scsi_transport_srp ipr(X) libata sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
  Supported: Yes, External
  CPU: 27 PID: 13903 Comm: kworker/27:0 Tainted: G D X 4.4.126-94.22-default #1
  Workqueue: events prrn_work_fn
  task: c000000747cfa390 ti: c00000074712c000 task.ti: c00000074712c000
  NIP: c0000000008002a8 LR: c000000000090770 CTR: 000000000032e088
  REGS: c00000074712f7b0 TRAP: 0901 Tainted: G D X (4.4.126-94.22-default)
  MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22482044 XER: 20040000
  CFAR: c0000000008002c4 SOFTE: 1
  NIP [c0000000008002a8] _raw_spin_lock+0x68/0xd0
  LR [c000000000090770] mobility_rtas_call+0x50/0x100
  Call Trace:
  [c00000074712fa60] [c000000000090770] mobility_rtas_call+0x50/0x100
  [c00000074712faf0] [c000000000090b08] pseries_devicetree_update+0xf8/0x530
  [c00000074712fc20] [c000000000031ba4] prrn_work_fn+0x34/0x50
  [c00000074712fc40] [c0000000000e0390] process_one_work+0x1a0/0x4e0
  [c00000074712fcd0] [c0000000000e0870] worker_thread+0x1a0/0x610
  [c00000074712fd80] [c0000000000e8b18] kthread+0x128/0x150
  [c00000074712fe30] [c0000000000096f8] ret_from_kernel_thread+0x5c/0x64

Signed-off-by: John Allen <[email protected]>
Signed-off-by: Haren Myneni <[email protected]>
Signed-off-by: Nathan Fontenot <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR requestNathan Fontenot2-39/+21
The updates to powerpc numa and memory hotplug code now use the in-kernel LMB array instead of the device tree. This change allows the pseries memory DLPAR code to only update the device tree once after successfully handling a DLPAR request.

Prior to the in-kernel LMB array, the numa code looked up the affinity for memory being added in the device tree; the code now looks this up in the LMB array. This change means the memory hotplug code can just update the affinity for an LMB in the LMB array instead of updating the device tree.

This also provides a savings in kernel memory. When updating the device tree old properties are never freed since there is no usecount on properties. This behavior leads to a new copy of the property being allocated every time an LMB is added or removed (i.e. a request to add 100 LMBs creates 100 new copies of the property). With this update only a single new property is created when a DLPAR request completes successfully.

Signed-off-by: Nathan Fontenot <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc: avoid -mno-sched-epilog on GCC 4.9 and newerNicholas Piggin1-2/+6
Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Joel Stanley <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc: consolidate -mno-sched-epilog into FTRACE flagsNicholas Piggin5-13/+13
Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Joel Stanley <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc: remove old GCC version checksNicholas Piggin1-29/+2
GCC 4.6 is the minimum supported now. Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Joel Stanley <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/64s/hash: Add a SLB preload cacheNicholas Piggin5-39/+142
When switching processes, currently all user SLBEs are cleared, and a few (exec_base, pc, and stack) are preloaded. In trivial testing with small apps, this tends to miss the heap and low 256MB segments, and it will also miss commonly accessed segments on large memory workloads.

Add a simple round-robin preload cache that just inserts the last SLB miss into the head of the cache and preloads those at context switch time. Every 256 context switches, the oldest entry is removed from the cache to shrink the cache and require fewer slbmte if they are unused.

Much more could go into this, including into the SLB entry reclaim side to track some LRU information etc, which would require a study of large memory workloads. But this is a simple thing we can do now that is an obvious win for common workloads.

With the full series, process switching speed on the context_switch benchmark on POWER9/hash (with kernel speculation security measures disabled) increases from 140K/s to 178K/s (27%). POWER8 does not change much (within 1%), it's unclear why it does not see a big gain like POWER9.

Booting to busybox init with 256MB segments has SLB misses go down from 945 to 69, and with 1T segments 900 to 21. These could almost all be eliminated by preloading a bit more carefully with ELF binary loading.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
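A simplified sketch of the round-robin preload cache idea described above; the structure, sizes, and helper names are illustrative assumptions, not the kernel's actual layout:

    #define PRELOAD_NR 16                   /* assumed cache size */

    struct slb_preload_cache {
        unsigned char head;                 /* next insert position */
        unsigned char nr;                   /* number of valid entries */
        unsigned long esid[PRELOAD_NR];     /* recently missed segments */
    };

    static void slb_insert(unsigned long esid);    /* assumed helper */

    /* On an SLB miss: remember the missed segment at the head. */
    static void preload_add(struct slb_preload_cache *c, unsigned long esid)
    {
        c->esid[c->head] = esid;
        c->head = (c->head + 1) % PRELOAD_NR;
        if (c->nr < PRELOAD_NR)
            c->nr++;
    }

    /* At context switch: preload the cached segments for the new task. */
    static void preload_at_switch(struct slb_preload_cache *c)
    {
        unsigned char i;

        for (i = 0; i < c->nr; i++)
            slb_insert(c->esid[i]);
    }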
2018-09-19powerpc/64s/hash: provide arch_setup_exec hooks for hash slice setupNicholas Piggin6-0/+37
This will be used by the SLB code in the next patch, but for now this sets the slb_addr_limit to the correct size for 32-bit tasks. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/64s: xmon do not dump hash fields when using radix modeNicholas Piggin1-19/+21
Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/64s/hash: SLB allocation status bitmapsNicholas Piggin4-17/+59
Add 32-entry bitmaps to track the allocation status of the first 32 SLB entries, and whether they are user or kernel entries. These are used to allocate free SLB entries first, before resorting to the round robin allocator. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
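A minimal sketch of how such bitmaps can drive allocation; the field and helper names are assumptions for illustration, not the kernel's actual code:

    /* bit n set => SLB slot n is allocated / holds a kernel entry */
    static unsigned int slb_used_bitmap;
    static unsigned int slb_kern_bitmap;

    static int round_robin_victim(void);    /* assumed fallback allocator */

    static int alloc_slb_slot(int kernel)
    {
        /* Prefer a slot that has never been allocated... */
        if (slb_used_bitmap != 0xffffffffu) {
            int slot = __builtin_ctz(~slb_used_bitmap);

            slb_used_bitmap |= 1u << slot;
            if (kernel)
                slb_kern_bitmap |= 1u << slot;
            return slot;
        }
        /* ...otherwise fall back to the round-robin victim. */
        return round_robin_victim();
    }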
2018-09-19powerpc/64s/hash: remove user SLB data from the pacaNicholas Piggin8-103/+40
User SLB mapping data is copied into the PACA from the mm->context so it can be accessed by the SLB miss handlers. After the C conversion, SLB miss handlers now run with relocation on, and user SLB misses are able to take recursive kernel SLB misses, so the user SLB mapping data can be removed from the paca and accessed directly.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/64s/hash: convert SLB miss handlers to CNicholas Piggin6-624/+196
This patch moves SLB miss handlers completely to C, using the standard exception handler macros to set up the stack and branch to C. This can be done because the segment containing the kernel stack is always bolted, so accessing it with relocation on will not cause an SLB exception.

Arbitrary kernel memory may not be accessed when handling kernel space SLB misses, so care should be taken there. However user SLB misses can access any kernel memory, which can be used to move some fields out of the paca (in later patches).

User SLB misses could quite easily reconcile IRQs and set up a first class kernel environment and exit via ret_from_except, however that doesn't seem to be necessary at the moment, so we only do that if a bad fault is encountered.

[ Credit to Aneesh for bug fixes, error checks, and improvements to bad address handling, etc ]

Signed-off-by: Nicholas Piggin <[email protected]>

Since RFC:
- Added MSR[RI] handling
- Fixed up a register loss bug exposed by irq tracing (Aneesh)
- Reject misses outside the defined kernel regions (Aneesh)
- Added several more sanity checks and error handling (Aneesh), we may look at consolidating these tests and tightening up the code but for a first pass we decided it's better to check carefully.

Since v1:
- Fixed SLB cache corruption (Aneesh)
- Fixed untidy SLBE allocation "leak" in get_vsid error case
- Now survives some stress testing on real hardware

Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/64s/hash: Use POWER9 SLBIA IH=3 variant in switch_slbNicholas Piggin2-40/+56
POWER9 introduces SLBIA IH=3, which invalidates all SLB entries and associated lookaside information that have a class value of 1, which Linux assigns to user addresses. This matches what switch_slb wants, and allows a simple fast implementation that avoids the slb_cache complexity. As a side-effect, the POWER5 < DD2.1 SLB invalidation workaround is also avoided on POWER9. Process context switching rate is improved about 2.2% for a small process that hits the slb cache which is the best case for the current code. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/64s/hash: Use POWER6 SLBIA IH=1 variant in switch_slbNicholas Piggin1-15/+23
The SLBIA IH=1 hint will remove all non-zero SLBEs, but only invalidate ERAT entries associated with a class value of 1, for processors that support the hint (e.g., POWER6 and newer), which Linux assigns to user addresses. This prevents kernel ERAT entries from being invalidated when context switching (if the thread faulted in more than 8 user SLBEs).

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/64s/hash: remove the vmalloc segment from the bolted SLBNicholas Piggin2-19/+6
Remove the vmalloc segment from bolted SLBEs. This is not required to be bolted, and seems like it was added to help pre-load the SLB on context switch. However there are now other segments like the vmemmap segment and non-zero node memory that often take misses after a context switch, so it is better to solve this in a more general way.

A subsequent change will track free SLB entries and use those rather than round-robin overwriting valid entries, which makes it far less likely for kernel SLBEs to be evicted after they are installed.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/64s/hash: move POWER5 < DD2.1 slbie workaround where it is neededNicholas Piggin1-7/+7
The POWER5 < DD2.1 issue is that slbie needs to be issued more than once. It came in with this change:

  [email protected], 2004-04-29 07:12:31-07:00, [email protected]
    [PATCH] POWER5 erratum workaround

    Early POWER5 revisions (<DD2.1) have a problem requiring slbie instructions to be repeated under some circumstances. The patch below adds a workaround (patch made by Anton Blanchard).

(aka. 3e4520f7605243abf66a7ccd3d2e49e48e8c0483 in the full history tree)

The extra slbie in switch_slb is done even for the case where slbia is called (slb_flush_and_rebolt). I don't believe that is required because there are other slb_flush_and_rebolt callers which do not issue the workaround slbie, which would be broken if it was required. It also seems to be fine inside the isync with the first slbie, as it is in the kernel stack switch code. So move this workaround to where it is required. This is not much of an optimisation because this is the fast path, but it makes the code more understandable and neater.

Signed-off-by: Nicholas Piggin <[email protected]>
[mpe: Retain slbie_data initialisation to avoid compiler warning]
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/64s/hash: avoid the POWER5 < DD2.1 slb invalidate workaround on POWER8/9Nicholas Piggin2-3/+7
I only have POWER8/9 to test, so just remove it for those. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/64s/hash: Fix stab_rr off by one initializationNicholas Piggin1-1/+1
This causes SLB allocation to start 1 beyond the start of the SLB. There is no real problem because after it wraps it starts behaving properly, it's just surprising to see when looking at SLB traces.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
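A generic illustration of how a pre-incremented round-robin counter ends up one slot off when its initial value is already the first usable slot; this is not the kernel's allocator, just a sketch of the off-by-one:

    static unsigned int rr;     /* the round-robin counter ("stab_rr") */

    static unsigned int next_slot(unsigned int first, unsigned int nr)
    {
        rr = rr + 1;
        if (rr >= nr)
            rr = first;
        return rr;
    }

    /*
     * If rr is initialised to 'first', the first allocation returns
     * first + 1; initialising it to first - 1 (or nr - 1) makes the
     * first allocation land on 'first' itself.
     */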
2018-09-19powernv/pseries: consolidate code for mce early handling.Mahesh Salgaonkar1-127/+28
Now that other platforms also implement a real mode MCE handler, let's consolidate the code by sharing the existing powernv machine check early code. Rename machine_check_powernv_early to machine_check_common_early and reuse the code.

Signed-off-by: Mahesh Salgaonkar <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/pseries: Dump the SLB contents on SLB MCE errors.Mahesh Salgaonkar5-1/+112
If we get a machine check exception due to SLB errors then dump the current SLB contents, which will be very helpful in debugging the root cause of SLB errors. Introduce an exclusive buffer per cpu to hold faulty SLB entries. The real mode mce handler saves the old SLB contents into this buffer, accessible through the paca, and prints it out later in virtual mode.

With this patch the console will log SLB contents like below on SLB MCE errors:

  [  507.297236] SLB contents of cpu 0x1
  [  507.297237] Last SLB entry inserted at slot 16
  [  507.297238] 00 c000000008000000 400ea1b217000500
  [  507.297239]   1T ESID= c00000 VSID= ea1b217 LLP:100
  [  507.297240] 01 d000000008000000 400d43642f000510
  [  507.297242]   1T ESID= d00000 VSID= d43642f LLP:110
  [  507.297243] 11 f000000008000000 400a86c85f000500
  [  507.297244]   1T ESID= f00000 VSID= a86c85f LLP:100
  [  507.297245] 12 00007f0008000000 4008119624000d90
  [  507.297246]   1T ESID= 7f VSID= 8119624 LLP:110
  [  507.297247] 13 0000000018000000 00092885f5150d90
  [  507.297247]   256M ESID= 1 VSID= 92885f5150 LLP:110
  [  507.297248] 14 0000010008000000 4009e7cb50000d90
  [  507.297249]   1T ESID= 1 VSID= 9e7cb50 LLP:110
  [  507.297250] 15 d000000008000000 400d43642f000510
  [  507.297251]   1T ESID= d00000 VSID= d43642f LLP:110
  [  507.297252] 16 d000000008000000 400d43642f000510
  [  507.297253]   1T ESID= d00000 VSID= d43642f LLP:110
  [  507.297253] ----------------------------------
  [  507.297254] SLB cache ptr value = 3
  [  507.297254] Valid SLB cache entries:
  [  507.297255] 00 EA[0-35]= 7f000
  [  507.297256] 01 EA[0-35]= 1
  [  507.297257] 02 EA[0-35]= 1000
  [  507.297257] Rest of SLB cache entries:
  [  507.297258] 03 EA[0-35]= 7f000
  [  507.297258] 04 EA[0-35]= 1
  [  507.297259] 05 EA[0-35]= 1000
  [  507.297260] 06 EA[0-35]= 12
  [  507.297260] 07 EA[0-35]= 7f000

Suggested-by: Aneesh Kumar K.V <[email protected]>
Suggested-by: Michael Ellerman <[email protected]>
Signed-off-by: Mahesh Salgaonkar <[email protected]>
Reviewed-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/pseries: Display machine check error details.Mahesh Salgaonkar2-0/+138
Extract the MCE error details from the RTAS extended log and display them on the console.

With this patch you should now see mce logs like below:

  [  142.371818] Severe Machine check interrupt [Recovered]
  [  142.371822]   NIP [d00000000ca301b8]: init_module+0x1b8/0x338 [bork_kernel]
  [  142.371822]   Initiator: CPU
  [  142.371823]   Error type: SLB [Multihit]
  [  142.371824]     Effective address: d00000000ca70000

Signed-off-by: Mahesh Salgaonkar <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/pseries: Flush SLB contents on SLB MCE errors.Mahesh Salgaonkar10-6/+213
On pseries, as of today the system crashes if we get a machine check exception due to SLB errors. These are soft errors and can be fixed by flushing the SLBs so the kernel can continue to function instead of crashing the system. We do this in real mode before turning on MMU. Otherwise we would run into nested machine checks. This patch now fetches the rtas error log in real mode and flushes the SLBs on SLB/ERAT errors.

Signed-off-by: Mahesh Salgaonkar <[email protected]>
Signed-off-by: Michal Suchanek <[email protected]>
Reviewed-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/pseries: Define MCE error event section.Mahesh Salgaonkar2-0/+103
On pseries, the machine check error details are part of the RTAS extended event log passed under the Machine check exception section. This patch adds the definition of the rtas MCE event section and related helper functions.

Signed-off-by: Mahesh Salgaonkar <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19selftests/powerpc: Do not fail with rescheduleBreno Leitao2-3/+15
There are cases where the test is not expecting to have the transaction aborted, but the test process might have been rescheduled, either at the OS level or by KVM (if it is running on a KVM guest machine). The process reschedule will cause a treclaim/recheckpoint which will doom the transaction, aborting it as soon as the process is rescheduled back to the CPU. This might cause the test to fail, but this is not a failure in essence. If that is the case, TEXASR[FC] is indicated with either TM_CAUSE_RESCHEDULE or TM_CAUSE_KVM_RESCHEDULE for KVM interruptions. In this scenario, ignore these two failures and avoid having the whole test return failure.

Signed-off-by: Breno Leitao <[email protected]>
Reviewed-by: Gustavo Romero <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
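A sketch of the kind of check involved; the constant names/values are assumed from the powerpc TM uapi header, and the exact helpers used by the selftests differ:

    #include <stdint.h>

    /* Failure cause constants (assumed from the powerpc uapi tm.h). */
    #define TM_CAUSE_RESCHED        0xde
    #define TM_CAUSE_KVM_RESCHED    0xe0

    /*
     * The TEXASR failure code lives in the most significant byte of the
     * 64-bit register (extraction shown here is an assumption for
     * illustration).
     */
    static int aborted_by_reschedule(uint64_t texasr)
    {
        uint8_t cause = texasr >> 56;

        return cause == TM_CAUSE_RESCHED || cause == TM_CAUSE_KVM_RESCHED;
    }

    /*
     * In the test: if the transaction failed only because of a
     * reschedule, ignore it rather than report the test as failed.
     */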
2018-09-19powerpc/iommu: Avoid dereference before pointer checkBreno Leitao1-1/+1
The tbl pointer is being dereferenced by IOMMU_PAGE_SIZE prior to the check that it is not NULL. Just move the dereference to after the check, where there is a guarantee that 'tbl' will not be NULL.

Signed-off-by: Breno Leitao <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
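The general shape of the fix, sketched with placeholder code rather than the literal lines from the driver:

    /* before: tbl is dereferenced before the NULL check */
    unsigned long mask = IOMMU_PAGE_SIZE(tbl) - 1;  /* dereferences tbl */
    if (!tbl)
        return 0;

    /* after: check first, dereference only when tbl is known valid */
    if (!tbl)
        return 0;
    mask = IOMMU_PAGE_SIZE(tbl) - 1;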
2018-09-19powerpc/xive: Use xive_cpu->chip_id instead of looking it up againBreno Leitao1-10/+1
Function xive_native_get_ipi() might use chip_id without it being initialized, if the CPU node is not found, as reported by smatch:

  error: uninitialized symbol 'chip_id'

As suggested by Cédric, we can use xc->chip_id instead of consulting the device tree for chip id, which is safe since xive_prepare_cpu() should have initialized ->chip_id by the time xive_native_get_ipi() is called.

Signed-off-by: Breno Leitao <[email protected]>
Reviewed-by: Cédric Le Goater <[email protected]>
[mpe: Tweak change log]
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19ocxl: Fix access to the AFU Descriptor DataChristophe Lombard1-1/+3
The AFU Information DVSEC capability is a means to extract common, general information about all of the AFUs associated with a Function independent of the specific functionality that each AFU provides. Writing the AFU Index field allows access to the descriptor data for each AFU.

With the current code, we are not able to access this specific data when the index >= 1 because we are writing to the wrong location. All requests for the data of each AFU point to those of AFU 0, which could have impacts when using a card with more than one AFU per function.

This patch fixes the access to the AFU Descriptor Data indexed by the AFU Info Index field.

Fixes: 5ef3166e8a32 ("ocxl: Driver code for 'generic' opencapi devices")
Cc: stable <[email protected]> # 4.16
Signed-off-by: Christophe Lombard <[email protected]>
Acked-by: Frederic Barrat <[email protected]>
Acked-by: Andrew Donnellan <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-19powerpc/memtrace: Remove memory in chunksRashmica Gupta1-5/+16
When hot-removing memory release_mem_region_adjustable() splits iomem resources if they are not the exact size of the memory being hot-deleted. Adding this memory back to the kernel adds a new resource.

E.g. a node has memory 0x0 - 0xfffffffff. Hot-removing 1GB from 0xf40000000 results in the single resource 0x0-0xfffffffff being split into two resources: 0x0-0xf3fffffff and 0xf80000000-0xfffffffff. When we hot-add the memory back we now have three resources: 0x0-0xf3fffffff, 0xf40000000-0xf7fffffff, and 0xf80000000-0xfffffffff.

This is an issue if we try to remove some memory that overlaps resources. E.g. when trying to remove 2GB at address 0xf40000000, release_mem_region_adjustable() fails as it expects the chunk of memory to be within the boundaries of a single resource. We then get the warning: "Unable to release resource" and attempting to use memtrace again gives us this error: "bash: echo: write error: Resource temporarily unavailable"

This patch makes memtrace remove memory in chunks that are always the same size from an address that is always equal to end_of_memory - n*size, for some n. So hot-removing and hot-adding memory of different sizes will now not attempt to remove memory that spans multiple resources.

Signed-off-by: Rashmica Gupta <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
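A simplified sketch of the chunked removal described above; the names are placeholders, and the real memtrace code also handles alignment and per-node behaviour:

    /*
     * Remove 'size' bytes at a time, always at end_of_memory - n*size,
     * so repeated remove/add cycles never straddle resource boundaries.
     */
    unsigned long base;

    for (base = memory_end - size; base >= memory_start; base -= size) {
        if (try_remove_memory_chunk(base, size))    /* assumed helper */
            return base;                            /* removable chunk found */
    }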
2018-09-18powerpc: Avoid code patching freed init sectionsMichael Neuling3-0/+9
This stops us from doing code patching in init sections after they've been freed.

In this chain:

  kvm_guest_init() ->
    kvm_use_magic_page() ->
      fault_in_pages_readable() ->
        __get_user() ->
          __get_user_nocheck() ->
            barrier_nospec();

We have a code patching location at barrier_nospec() and kvm_guest_init() is an init function. This whole chain gets inlined, so when we free the init section (hence kvm_guest_init()), this code goes away and hence should no longer be patched.

We've seen this as userspace memory corruption when using a memory checker while doing partition migration testing on powervm (this starts the code patching post migration via /sys/kernel/mobility/migration). In theory, it could also happen when using /sys/kernel/debug/powerpc/barrier_nospec.

Cc: [email protected] # 4.13+
Signed-off-by: Michael Neuling <[email protected]>
Reviewed-by: Nicholas Piggin <[email protected]>
Reviewed-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
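A hedged sketch of the guard this implies in the patching path; the flag and helper names are assumptions, not quoted from the patch:

    /*
     * Once the init sections have been freed, any patch site that lived
     * in them no longer exists, so silently skip it instead of writing
     * to freed (and possibly reused) memory.
     */
    static bool init_mem_is_free;   /* set when free_initmem() runs (assumed) */

    int patch_instruction(unsigned int *addr, unsigned int instr)
    {
        if (init_mem_is_free && init_section_contains(addr, 4))
            return 0;   /* nothing left to patch */

        return do_patch_instruction(addr, instr);   /* assumed helper */
    }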
2018-09-17powerpc/64: Remove static branch hints from memset()Anton Blanchard1-2/+2
Static branch hints override dynamic branch prediction on recent POWER CPUs. We should only use them when we are overwhelmingly sure of the direction. Signed-off-by: Anton Blanchard <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-09-17powerpc/pseries/mm: call H_BLOCK_REMOVELaurent Dufour2-8/+207
This hypervisor call allows removing up to 8 PTEs with only one call to tlbie. The virtual pages must be all within the same naturally aligned 8 pages virtual address block and have the same page and segment size encodings.

Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Laurent Dufour <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-17powerpc/pseries/mm: factorize PTE slot computationLaurent Dufour1-7/+20
This part of the code will also be called when dealing with H_BLOCK_REMOVE.

Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Laurent Dufour <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-17powerpc/pseries/mm: Introducing FW_FEATURE_BLOCK_REMOVELaurent Dufour2-1/+3
This feature tells if the hcall H_BLOCK_REMOVE is available. Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Laurent Dufour <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-09-17powerpc/tm: Fix HTM documentationBreno Leitao2-11/+15
This patch simply fixes part of the documentation on the HTM code.

This fixes references to old fields that were renamed in commit 000ec280e3dd ("powerpc: tm: Rename transct_(*) to ck(\1)_state")

It also documents better the flow after commit eb5c3f1c8647 ("powerpc: Always save/restore checkpointed regs during treclaim/trecheckpoint"), where tm_recheckpoint can recheckpoint what is in ck{fp,vr}_state blindly.

Signed-off-by: Breno Leitao <[email protected]>
Acked-by: Michael Neuling <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-17powerpc/selftests: Wait all threads to joinBreno Leitao1-10/+17
Test tm-tmspr might exit before all threads stop executing, because it just waits for the very last thread to join before proceeding/exiting. This patch makes sure that all threads that were created will join before proceeding/exiting. This patch also guarantees that the number of threads created is equal to thread_num.

Signed-off-by: Breno Leitao <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
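A minimal sketch of the create-all/join-all pattern the fix moves to; the thread count and worker function here are placeholders, not the test's real ones:

    #include <pthread.h>

    #define THREAD_NUM 10               /* assumed count */

    void *worker(void *arg);            /* the test's per-thread body */

    void run_threads(void)
    {
        pthread_t threads[THREAD_NUM];
        int i;

        /* create exactly THREAD_NUM threads... */
        for (i = 0; i < THREAD_NUM; i++)
            pthread_create(&threads[i], NULL, worker, NULL);

        /* ...and wait for every one of them, not just the last */
        for (i = 0; i < THREAD_NUM; i++)
            pthread_join(threads[i], NULL);
    }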
2018-09-17powerpc/powernv: Don't select the cpufreq governorsJoel Stanley3-5/+6
Deciding which governors should be built into the kernel can be left to users to configure.

Fixes: 81f359027a3a ("cpufreq: powernv: Select CPUFreq related Kconfig options for powernv")
Signed-off-by: Joel Stanley <[email protected]>
[mpe: Update powernv/ppc64 defconfigs to enable them by default]
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-17KVM: PPC: Book3S HV: Fix guest r11 corruption with POWER9 TM workaroundsMichael Neuling1-2/+2
When we come into the softpatch handler (0x1500), we use r11 to store the HSRR0 for later use by the denorm handler. We also use the softpatch handler for the TM workarounds for POWER9. Unfortunately, in kvmppc_interrupt_hv we later store r11 out to the vcpu assuming it's still what we got from userspace. This causes r11 to be corrupted in the VCPU and hence when we restore the guest, we get a corrupted r11. We've seen this when running TM tests inside guests on P9. This fixes the problem by only touching r11 in the denorm case.

Fixes: 4bb3c7a020 ("KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9")
Cc: <[email protected]> # 4.17+
Tested-by: Suraj Jitindar Singh <[email protected]>
Reviewed-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Neuling <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2018-09-14powerpc/vdso: Correct call frame informationAlan Modra4-0/+4
Call Frame Information is used by gdb for back-traces and inserting breakpoints on function return for the "finish" command. This failed when inside __kernel_clock_gettime. More concerning than difficulty debugging is that CFI is also used by stack frame unwinding code to implement exceptions. If you have an app that needs to handle asynchronous exceptions for some reason, and you are unlucky enough to get one inside the VDSO time functions, your app will crash.

What's wrong: There is control flow in __kernel_clock_gettime that reaches label 99 without saving lr in r12. CFI info however is interpreted by the unwinder without reference to control flow: It's a simple matter of "Execute all the CFI opcodes up to the current address". That means the unwinder thinks r12 contains the return address at label 99. Disabuse it of that notion by resetting CFI for the return address at label 99.

Note that the ".cfi_restore lr" could have gone anywhere from the "mtlr r12" a few instructions earlier to the instruction at label 99. I put the CFI as late as possible, because in general that's best practice (and if possible grouped with other CFI in order to reduce the number of CFI opcodes executed when unwinding). Using r12 as the return address is perfectly fine after the "mtlr r12" since r12 on that code path still contains the return address.

__get_datapage also has a CFI error. That function temporarily saves lr in r0, and reflects that fact with ".cfi_register lr,r0". A later use of r0 means the CFI at that point isn't correct, as r0 no longer contains the return address. Fix that too.

Signed-off-by: Alan Modra <[email protected]>
Tested-by: Reza Arbab <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
2018-09-14powerpc/tm: Fix HFSCR bit for no suspend caseMichael Neuling1-6/+12
Currently on P9N DD2.1 we end up taking infinite TM facility unavailable exceptions on the first TM usage by userspace. In the special case of TM no suspend (P9N DD2.1), Linux is told TM is off via CPU dt-ftrs but told to (partially) use it via OPAL_REINIT_CPUS_TM_SUSPEND_DISABLED. So HFSCR[TM] will be off from dt-ftrs but we need to turn it on for the no suspend case. This patch fixes this by enabling HFSCR TM in this case. Cc: [email protected] # 4.15+ Signed-off-by: Michael Neuling <[email protected]> Signed-off-by: Paul Mackerras <[email protected]>
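A rough sketch of what enabling the TM bit in HFSCR looks like; the condition is an assumption and the actual patch's placement differs, only the SPR accessors and bit name are real kernel definitions:

    /*
     * In the TM-no-suspend case, HFSCR[TM] must be set by the kernel
     * even though the TM feature bit from dt-ftrs is off, otherwise the
     * first userspace TM use loops on facility unavailable exceptions.
     */
    if (tm_suspend_disabled)    /* assumed condition */
        mtspr(SPRN_HFSCR, mfspr(SPRN_HFSCR) | HFSCR_TM);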