Age | Commit message | Author | Files | Lines
2017-06-28 | powerpc/fadump: avoid holes in boot memory area when fadump is registered | Hari Bathini | 3 | -0/+20
To register fadump, boot memory area - the size of low memory chunk that is required for a kernel to boot successfully when booted with restricted memory, is assumed to have no holes. But this memory area is currently not protected from hot-remove operations. So, fadump could fail to re-register after a memory hot-remove operation, if memory is removed from boot memory area. To avoid this, ensure that memory from boot memory area is not hot-removed when fadump is registered. Signed-off-by: Hari Bathini <[email protected]> Reviewed-by: Mahesh J Salgaonkar <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-28 | powerpc/fadump: avoid duplicates in crash memory ranges | Hari Bathini | 1 | -2/+13
fadump sets up crash memory ranges to be used for creating PT_LOAD program headers in elfcore header. Memory chunk RMA_START through boot memory area size is added as the first memory range because firmware, at the time of crash, moves this memory chunk to different location specified during fadump registration making it necessary to create a separate program header for it with the correct offset. This memory chunk is skipped while setting up the remaining memory ranges. But currently, there is possibility that some of this memory may have duplicate entries like when it is hot-removed and added again. Ensure that no two memory ranges represent the same memory.

When 5 lmbs are hot-removed and then hot-plugged before registering fadump, here is how the program headers in /proc/vmcore exported by fadump look like without this change:

  Program Headers:
    Type  Offset              VirtAddr            PhysAddr            FileSiz             MemSiz              Flags  Align
    NOTE  0x0000000000010000  0x0000000000000000  0x0000000000000000  0x0000000000001894  0x0000000000001894         0
    LOAD  0x0000000000021020  0xc000000000000000  0x0000000000000000  0x0000000040000000  0x0000000040000000  RWE    0
    LOAD  0x0000000040031020  0xc000000000000000  0x0000000000000000  0x0000000010000000  0x0000000010000000  RWE    0
    LOAD  0x0000000050040000  0xc000000010000000  0x0000000010000000  0x0000000050000000  0x0000000050000000  RWE    0
    LOAD  0x00000000a0040000  0xc000000060000000  0x0000000060000000  0x000000019ffe0000  0x000000019ffe0000  RWE    0

and with this change:

  Program Headers:
    Type  Offset              VirtAddr            PhysAddr            FileSiz             MemSiz              Flags  Align
    NOTE  0x0000000000010000  0x0000000000000000  0x0000000000000000  0x0000000000001894  0x0000000000001894         0
    LOAD  0x0000000000021020  0xc000000000000000  0x0000000000000000  0x0000000040000000  0x0000000040000000  RWE    0
    LOAD  0x0000000040030000  0xc000000040000000  0x0000000040000000  0x0000000020000000  0x0000000020000000  RWE    0
    LOAD  0x0000000060030000  0xc000000060000000  0x0000000060000000  0x000000019ffe0000  0x000000019ffe0000  RWE    0

Signed-off-by: Hari Bathini <[email protected]>
Reviewed-by: Mahesh J Salgaonkar <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
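The range setup described in this commit can be illustrated with a small sketch. It is not the kernel's code: the function name `crash_memory_ranges` and the `(base, size)` tuple layout are hypothetical, and the sketch assumes every block starts at or after RMA_START. It only shows the idea of adding the boot memory chunk first and then clipping every later block so that no range describes the same memory twice.

```python
def crash_memory_ranges(blocks, boot_mem_size, rma_start=0):
    """Return (base, size) ranges with the boot memory chunk listed once.

    blocks: iterable of (base, size) memory blocks (hypothetical layout).
    """
    boot_end = rma_start + boot_mem_size
    # The boot memory chunk always goes first, as its own range.
    ranges = [(rma_start, boot_mem_size)]
    for base, size in blocks:
        end = base + size
        if end <= boot_end:
            # Block lies wholly inside the boot memory area: already covered.
            continue
        # Drop any part of the block that overlaps the boot memory area.
        start = max(base, boot_end)
        ranges.append((start, end - start))
    return ranges


# Memory blocks taken from the "without this change" program headers.
blocks = [
    (0x00000000, 0x10000000),
    (0x10000000, 0x50000000),
    (0x60000000, 0x19ffe0000),
]
for base, size in crash_memory_ranges(blocks, boot_mem_size=0x40000000):
    print(f"{base:#018x} {size:#018x}")
```

Fed with the physical addresses and sizes from the first listing, the sketch produces the three LOAD ranges of the second listing.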
2017-06-28 | powerpc/perf: Fix branch event code for power9 | Madhavan Srinivasan | 2 | -2/+10
Correct "branch" event code of Power9 is "r4d05e". Replace the current "branch" event code with "r4d05e" and add a hack to use "r10012" as event code for Power9 DD1. Fixes: d89f473ff6f8 ("powerpc/perf: Fix PM_BRU_CMPL event code for power9") Reported-by: Anton Blanchard <[email protected]> Signed-off-by: Madhavan Srinivasan <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-28 | powerpc/xive: Silence message about VP block allocation | Benjamin Herrenschmidt | 1 | -2/+2
There is no reason for that message to be pr_info(), it will be printed every time we start a KVM guest. Signed-off-by: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-27 | powerpc/64s: Invalidate ERAT on powersave wakeup for POWER9 | Benjamin Herrenschmidt | 2 | -3/+12
On POWER9 the ERAT may be incorrect on wakeup from some stop states that lose state. This causes random segvs and illegal instructions when these stop states are enabled. This patch invalidates the ERAT on wakeup on POWER9 to prevent this from causing a problem. Signed-off-by: Michael Neuling <[email protected]> Signed-off-by: Benjamin Herrenschmidt <[email protected]> Reviewed-by: Nicholas Piggin <[email protected]> [mpe: Merge comment change with upstream changes] Signed-off-by: Michael Ellerman <[email protected]>
2017-06-27 | powerpc: Only do ERAT invalidate on radix context switch on P9 DD1 | Benjamin Herrenschmidt | 1 | -5/+10
From: Michael Neuling <[email protected]> On P9 (Nimbus) DD2 and later, in radix mode, the move to the PID register will implicitly invalidate the user space ERAT entries and leave the kernel ones alone. Thus the only thing needed is an isync() to synchronize this with subsequent uaccesses. Signed-off-by: Michael Neuling <[email protected]> Signed-off-by: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-27 | powerpc/powernv/pci: Enable 64-bit devices to access >4GB DMA space | Russell Currey | 1 | -2/+91
On PHB3/POWER8 systems, devices can select between two different sections of address space, TVE#0 and TVE#1. TVE#0 is intended for 32bit devices that aren't capable of addressing more than 4GB. Selecting TVE#1 instead, with the capability of addressing over 4GB, is performed by setting bit 59 of a PCI address. However, some devices aren't capable of addressing at least 59 bits, but still want more than 4GB of DMA space. In order to enable this, reconfigure TVE#0 to be suitable for 64-bit devices by allocating memory past the initial 4GB that is inaccessible by 64-bit DMAs. This bypass mode is only enabled if a device requests 4GB or more of DMA address space, if the system has PHB3 (POWER8 systems), and if the device does not share a PE with any devices from different vendors. Signed-off-by: Russell Currey <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
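The three conditions for enabling this bypass mode can be written down as a small predicate. This is a sketch under assumptions, not the kernel implementation: the `Device` class, its `dma_mask` field and both function names are hypothetical, and "requests 4GB or more of DMA address space" is interpreted here as the device's DMA mask having any bit set above bit 31.

```python
from dataclasses import dataclass


@dataclass
class Device:
    vendor_id: int
    dma_mask: int  # highest DMA address the device asks to generate (assumed field)


def pe_has_single_vendor(pe_devices):
    """True if every device in the PE reports the same vendor ID."""
    return len({dev.vendor_id for dev in pe_devices}) <= 1


def can_use_tve0_bypass(dev, pe_devices, is_phb3):
    """Check the three conditions listed in the commit message."""
    # Assumed interpretation: the mask reaches past the first 4GB.
    wants_4gb_or_more = (dev.dma_mask >> 32) != 0
    return wants_4gb_or_more and is_phb3 and pe_has_single_vendor(pe_devices)


gpu = Device(vendor_id=0x10DE, dma_mask=(1 << 40) - 1)
print(can_use_tve0_bypass(gpu, [gpu], is_phb3=True))
```

The single-vendor check matters because reconfiguring TVE#0 affects every device in the PE, so the change is only made when all of them can be expected to behave alike.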
2017-06-27 | powerpc/powernv/pci: Add helper to check if a PE has a single vendor | Russell Currey | 1 | -0/+25
Add a helper that determines if all the devices contained in a given PE are all from the same vendor or not. This can be useful in determining if it's okay to make PE-wide changes that may be suitable for some devices but not for others. This is used later in the series. Signed-off-by: Russell Currey <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-27 | powerpc/powernv/pci: Add support for PHB4 diagnostics | Russell Currey | 2 | -2/+178
As with P7IOC and PHB3, add kernel-side support for decoding and printing diagnostic data for PHB4. Signed-off-by: Russell Currey <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-27 | powerpc/powernv/pci: Dynamically allocate PHB diag data | Russell Currey | 4 | -18/+29
Diagnostic data for PHBs currently works by allocating a fixed-size buffer. This is simple, but either wastes memory (though only a few kilobytes) or in the case of PHB4 isn't enough to fit the whole data blob. For machines that don't describe the diagnostic data size in the device tree, use the hardcoded buffer size as before. For those that do, only allocate exactly what's needed. In the special case of P7IOC (which has two types of diag data), the larger should be specified in the device tree. Signed-off-by: Russell Currey <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-27 | powerpc/powernv/pci: Reduce spam when dumping PEST | Russell Currey | 2 | -20/+34
Dumping the PE State Tables (PEST) can be highly verbose if a number of PEs are affected, especially in the case where the whole PHB is frozen and 512 lines get printed. Check for duplicates when dumping the PEST to reduce useless output.

For example:

  PE[0f8] A/B: 9700002600000000 80000080d00000f8
  PE[0f9] A/B: 8000000000000000 0000000000000000
  PE[..0fe] A/B: as above
  PE[0ff] A/B: 8440002b00000000 0000000000000000

instead of:

  PE[0f8] A/B: 9700002600000000 80000080d00000f8
  PE[0f9] A/B: 8000000000000000 0000000000000000
  PE[0fa] A/B: 8000000000000000 0000000000000000
  PE[0fb] A/B: 8000000000000000 0000000000000000
  PE[0fc] A/B: 8000000000000000 0000000000000000
  PE[0fd] A/B: 8000000000000000 0000000000000000
  PE[0fe] A/B: 8000000000000000 0000000000000000
  PE[0ff] A/B: 8440002b00000000 0000000000000000

and you can imagine how much worse it can get for 512 PEs.

Signed-off-by: Russell Currey <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
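The duplicate suppression can be sketched as a single pass that remembers the last entry printed in full. The function name `dump_pest` and the `(pe, pest_a, pest_b)` tuple layout are hypothetical, and the sketch formats every entry it is given, without any filtering the kernel may apply before printing.

```python
def dump_pest(entries):
    """Format PEST entries, collapsing runs of identical A/B values.

    entries: list of (pe_number, pest_a, pest_b) in ascending PE order.
    """
    lines = []
    prev = None      # (a, b) of the last entry printed in full
    last_dup = None  # PE number of the most recent suppressed duplicate
    for pe, a, b in entries:
        if prev == (a, b):
            last_dup = pe
            continue
        if last_dup is not None:
            # Close the run of duplicates before printing a new value.
            lines.append(f"PE[..{last_dup:03x}] A/B: as above")
            last_dup = None
        lines.append(f"PE[{pe:03x}] A/B: {a:016x} {b:016x}")
        prev = (a, b)
    if last_dup is not None:
        lines.append(f"PE[..{last_dup:03x}] A/B: as above")
    return lines


entries = [(0xF8, 0x9700002600000000, 0x80000080D00000F8)]
entries += [(pe, 0x8000000000000000, 0) for pe in range(0xF9, 0xFF)]
entries += [(0xFF, 0x8440002B00000000, 0)]
print("\n".join(dump_pest(entries)))
```

With the eight entries from the commit message, the sketch returns the four-line form of the dump.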
2017-06-27 | powerpc/tm: Fix comment | Michael Neuling | 1 | -2/+2
Update to real function name. Signed-off-by: Michael Neuling <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-27 | powerpc: Fix asm offsets to point to actual FP and VMX regs | Michael Neuling | 1 | -4/+4
The asm code assumes the FP regs are at the start of fp_state. While this is true now, it may not always be the case and there is nothing enforcing it. This fixes the asm-offsets to point to the actual FP registers inside the fp_state. Similarly for VMX. Signed-off-by: Michael Neuling <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-27 | powerpc: Fix /proc/cpuinfo revision for POWER9 DD2 | Michael Neuling | 1 | -0/+4
The P9 PVR bits 12-15 don't indicate a revision but instead different chip configurations. From BookIV we have:

  Bits  Configuration
  0   : Scale out 12 cores
  1   : Scale out 24 cores
  2   : Scale up 12 cores
  3   : Scale up 24 cores

DD1 doesn't use this but DD2 does. Linux will mostly use the "Scale out 24 core" configuration (ie. SMT4 not SMT8) which results in a PVR of 0x004e1200. The reported revision in /proc/cpuinfo is hence reported incorrectly as "18.0". This patch fixes this to mask off only the relevant bits for the major revision (ie. bits 8-11) for POWER9.

Signed-off-by: Michael Neuling <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
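The arithmetic behind the wrong "18.0" can be checked with a short sketch. The function names are hypothetical and the field layout (minor revision in bits 0-7, major in bits 8-11, configuration in bits 12-15) is taken from the description above; the exact kernel code is not reproduced here.

```python
def revision_generic(pvr):
    """Old behaviour: treat bits 8-15 as the major revision."""
    major = (pvr >> 8) & 0xFF
    minor = pvr & 0xFF
    return f"{major}.{minor}"


def revision_power9(pvr):
    """Fixed behaviour: only bits 8-11 carry the major revision."""
    major = (pvr >> 8) & 0xF
    minor = pvr & 0xFF
    return f"{major}.{minor}"


def power9_configuration(pvr):
    """Bits 12-15 select the chip configuration, not a revision."""
    return (pvr >> 12) & 0xF


pvr = 0x004E1200
print(revision_generic(pvr))      # 0x12 read as the major revision
print(revision_power9(pvr))       # major revision with config bits masked off
print(power9_configuration(pvr))  # 1, "Scale out 24 cores" in the table above
```

For the PVR 0x004e1200, bits 8-15 hold 0x12 (decimal 18), which is where the bogus "18.0" comes from; masking to bits 8-11 leaves a major revision of 2.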
2017-06-26 | powerpc/32: Avoid miscompilation w/GCC 4.6.3 - don't inline copy_to/from_user() | Michael Ellerman | 1 | -7/+1
Larry Finger reported that his Powerbook G4 was no longer booting with v4.12-rc, userspace was up but giving weird errors such as:

  udevd[64]: starting version 175
  udevd[64]: Unable to receive ctrl message: Bad address.
  modprobe: chdir(4.12-rc1): No such file or directory

He bisected the problem to commit 3448890c32c3 ("powerpc: get rid of zeroing, switch to RAW_COPY_USER").

Al identified that the problem is actually a miscompilation by GCC 4.6.3, which is exposed by the above commit.

Al also pointed out that inlining copy_to/from_user() is probably of little or no benefit, which is correct. Using Anton's copy_to_user benchmark, with a pathological single byte copy, we see a small increase in performance by *removing* inlining:

Before (inlined):

  # time ./copy_to_user -w -l 1 -i 10000000
  ( x 3 )
  real 0m22.063s
  real 0m22.059s
  real 0m22.076s

After:

  # time ./copy_to_user -w -l 1 -i 10000000
  ( x 3 )
  real 0m21.325s
  real 0m21.299s
  real 0m21.364s

So as a small performance improvement and to avoid the miscompilation, drop inlining copy_to/from_user() on 32-bit.

Fixes: 3448890c32c3 ("powerpc: get rid of zeroing, switch to RAW_COPY_USER")
Reported-by: Larry Finger <[email protected]>
Suggested-by: Al Viro <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
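To put a number on the "small increase in performance", the three runs on each side can be averaged. This is plain arithmetic on the timings quoted in the commit message; the variable names are only illustrative.

```python
from statistics import mean

# "real" times in seconds, copied from the commit message.
before_inlined = [22.063, 22.059, 22.076]
after_out_of_line = [21.325, 21.299, 21.364]

before_avg = mean(before_inlined)
after_avg = mean(after_out_of_line)
improvement_pct = (before_avg - after_avg) / before_avg * 100

print(f"before: {before_avg:.3f}s  after: {after_avg:.3f}s  "
      f"improvement: {improvement_pct:.1f}%")
```

The averages are roughly 22.066 s and 21.329 s, so dropping the inlining saves a little over 3% on this single-byte-copy benchmark.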
2017-06-23 | powerpc/mm: Trace tlbie(l) instructions | Balbir Singh | 7 | -4/+67
Add a trace point for tlbie(l) (Translation Lookaside Buffer Invalidate Entry (Local)) instructions. The tlbie instruction has changed over the years, so not all versions accept the same operands. Use the ISA v3 field operands because they are the most verbose, we may change them in future. Example output: qemu-system-ppc-5371 [016] 1412.369519: tlbie: tlbie with lpid 0, local 1, rb=67bd8900174c11c1, rs=0, ric=0 prs=0 r=0 Signed-off-by: Balbir Singh <[email protected]> [mpe: Add some missing trace_tlbie()s, reword change log] Signed-off-by: Michael Ellerman <[email protected]>
2017-06-23 | cxl: Fixes for Coherent Accelerator Interface Architecture 2.0 | Christophe Lombard | 6 | -47/+57
A previous set of patches, "cxl: Add support for Coherent Accelerator Interface Architecture 2.0", introduced new support for CAPI cards. Those patches were tested in a simulation environment and many of them were also tested on real hardware.

This patch brings new fixes after a series of tests carried out on new equipment:

- Add POWER9 definition.
- Re-enable any masked interrupts when the AFU is not activated after resetting the AFU.
- Remove the api cxl_is_psl8/9 which is no longer useful.
- Do not dump CAPI1 registers.
- Rewrite cxl_is_page_fault() function.
- Do not register slb callback on P9.

Fixes: f24be42aab37 ("cxl: Add psl9 specific code")
Signed-off-by: Christophe Lombard <[email protected]>
Acked-by: Frederic Barrat <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2017-06-23 | powerpc/64: Initialise thread_info for emergency stacks | Nicholas Piggin | 1 | -3/+28
Emergency stacks have their thread_info mostly uninitialised, which in particular means garbage preempt_count values. Emergency stack code runs with interrupts disabled entirely, and is used very rarely, so this has been unnoticed so far. It was found by a proposed new powerpc watchdog that takes a soft-NMI directly from the masked_interrupt handler and using the emergency stack. That crashed at BUG_ON(in_nmi()) in nmi_enter(). preempt_count()s were found to be garbage. To fix this, zero the entire THREAD_SIZE allocation, and initialize the thread_info. Cc: [email protected] Reported-by: Abdul Haleem <[email protected]> Signed-off-by: Nicholas Piggin <[email protected]> [mpe: Move it all into setup_64.c, use a function not a macro. Fix crashes on Cell by setting preempt_count to 0 not HARDIRQ_OFFSET] Signed-off-by: Michael Ellerman <[email protected]>
2017-06-22 | powerpc/powernv/npu-dma: Add explicit flush when sending an ATSD | Alistair Popple | 1 | -29/+65
NPU2 requires an extra explicit flush to an active GPU PID when sending address translation shoot downs (ATSDs) to reliably flush the GPU TLB. This patch adds just such a flush at the end of each sequence of ATSDs. We can safely use PID 0 which is always reserved and active on the GPU. PID 0 is only used for init_mm which will never be a user mm on the GPU. To enforce this we add a check in pnv_npu2_init_context() just in case someone tries to use PID 0 on the GPU. Signed-off-by: Alistair Popple <[email protected]> [mpe: Use true/false for bool literals] Signed-off-by: Michael Ellerman <[email protected]>
2017-06-22 | powerpc: Convert VDSO update function to use new update_vsyscall interface | Paul Mackerras | 2 | -17/+53
This converts the powerpc VDSO time update function to use the new interface introduced in commit 576094b7f0aa ("time: Introduce new GENERIC_TIME_VSYSCALL", 2012-09-11). Where the old interface gave us the time as of the last update in seconds and whole nanoseconds, with the new interface we get the nanoseconds part effectively in a binary fixed-point format with tk->tkr_mono.shift bits to the right of the binary point. With the old interface, the fractional nanoseconds got truncated, meaning that the value returned by the VDSO clock_gettime function would have about 1ns of jitter in it compared to the value computed by the generic timekeeping code in the kernel. The powerpc VDSO time functions (clock_gettime and gettimeofday) already work in units of 2^-32 seconds, or 0.23283 ns, because that makes it simple to split the result into seconds and fractional seconds, and represent the fractional seconds in either microseconds or nanoseconds. This is good enough accuracy for now, so this patch avoids changing how the VDSO works or the interface in the VDSO data page. This patch converts the powerpc update_vsyscall_old to be called update_vsyscall and use the new interface. We convert the fractional second to units of 2^-32 seconds without truncating to whole nanoseconds. (There is still a conversion to whole nanoseconds for any legacy users of the vdso_data/systemcfg stamp_xtime field.) In addition, this improves the accuracy of the computation of tb_to_xs for those systems with high-frequency timebase clocks (>= 268.5 MHz) by doing the right shift in two parts, one before the multiplication and one after, rather than doing the right shift before the multiplication. (We can't do all of the right shift after the multiplication unless we use 128-bit arithmetic.) Signed-off-by: Paul Mackerras <[email protected]> Acked-by: John Stultz <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
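The effect of keeping the fractional nanoseconds can be shown with integer arithmetic. This is a sketch, not the kernel's update_vsyscall code: both function names are hypothetical, `xtime_nsec` stands for a nanosecond value in fixed point with `shift` bits to the right of the binary point, and the result is a fraction of a second in units of 2^-32 seconds.

```python
NSEC_PER_SEC = 1_000_000_000


def frac_sec_from_fixed_point(xtime_nsec, shift):
    """Convert (nanoseconds << shift) directly to units of 2^-32 seconds."""
    return (xtime_nsec << 32) // (NSEC_PER_SEC << shift)


def frac_sec_from_whole_ns(xtime_nsec, shift):
    """Old-style conversion: truncate to whole nanoseconds first."""
    whole_ns = xtime_nsec >> shift
    return (whole_ns << 32) // NSEC_PER_SEC


# One unit of 2^-32 seconds expressed in nanoseconds.
print(round(NSEC_PER_SEC / 2**32, 5))  # 0.23283

# 500000000.5 ns with 8 fractional bits (0.5 ns == 128 / 256).
shift = 8
xtime_nsec = (500_000_000 << shift) + 128
print(frac_sec_from_fixed_point(xtime_nsec, shift))
print(frac_sec_from_whole_ns(xtime_nsec, shift))
```

In this example the half nanosecond that truncation discards is worth two units of 2^-32 seconds, which is the kind of sub-nanosecond difference the new interface preserves.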
2017-06-21 | powerpc/time: Fix tracing in time.c | Santosh Sivaraj | 2 | -5/+3
Since trace_clock is in a different file and already marked with notrace, enable tracing in time.c by removing it from the disabled list in Makefile. Also annotate clocksource read functions and sched_clock with notrace. Testing: Timer and ftrace selftests run with different trace clocks. Acked-by: Naveen N. Rao <[email protected]> Signed-off-by: Santosh Sivaraj <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-21 | powerpc/64s: Rename slb_allocate_realmode() to slb_allocate() | Michael Ellerman | 3 | -13/+5
As for slb_miss_realmode(), rename slb_allocate_realmode() to avoid confusion over whether it runs in real or virtual mode - it runs in both. Signed-off-by: Michael Ellerman <[email protected]> Reviewed-by: Nicholas Piggin <[email protected]>
2017-06-21 | powerpc/64s: Rename slb_miss_realmode() to slb_miss_common() | Michael Ellerman | 1 | -6/+9
slb_miss_realmode() doesn't always run in real mode, which is what the name implies. So rename it to avoid confusing people. Signed-off-by: Michael Ellerman <[email protected]> Reviewed-by: Nicholas Piggin <[email protected]>
2017-06-21 | powerpc/64s: Use BRANCH_TO_COMMON() for slb_miss_realmode | Michael Ellerman | 1 | -38/+4
All the callers of slb_miss_realmode currently open code the #ifndef CONFIG_RELOCATABLE check and the branch via CTR in the RELOCATABLE case. We have a macro to do this, BRANCH_TO_COMMON(), so use it. Signed-off-by: Michael Ellerman <[email protected]> Reviewed-by: Nicholas Piggin <[email protected]>
2017-06-20 | powerpc/64s/paca: EX_CTR is not used with RELOCATABLE=n, remove it | Nicholas Piggin | 1 | -1/+4
Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-20 | powerpc/64s/paca: EX_R3 can be merged with EX_DAR | Nicholas Piggin | 1 | -5/+11
EX_R3 is used only for a small section of the bad stack handler. Merge it with EX_DAR. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-20 | powerpc/64s/paca: EX_LR can be merged with EX_DAR | Nicholas Piggin | 1 | -5/+12
EX_LR is used only for a small section of the SLB miss handler. Merge it with EX_DAR. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-20 | powerpc/64s/paca: EX_SRR0 is unused, remove it | Nicholas Piggin | 1 | -11/+10
Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-20 | powerpc/64s: Add EX_SIZE definition for paca exception save areas | Nicholas Piggin | 3 | -4/+14
Rather than open-coding it 4 times. Signed-off-by: Nicholas Piggin <[email protected]> [mpe: Move __ASSEMBLY__ guards into head-64.h where they're really needed] Signed-off-by: Michael Ellerman <[email protected]>
2017-06-20 | powerpc/64s: Avoid r3 save/restore in SLB miss handler | Nicholas Piggin | 1 | -15/+26
The SLB miss handler uses r3 for the faulting address but r12 is mostly able to be freed up to save r3 in. It just requires SRR1 be reloaded again on error. It would be more conventional to use r12 for SRR1 (and use r11 to save r3), but slb_allocate_realmode clobbers r11 and not r12. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-20 | powerpc/64s: SLB miss already has CTR saved for relocatable kernel | Nicholas Piggin | 1 | -8/+1
The EXCEPTION_PROLOG_1 used by SLB miss already saves CTR when the kernel is built with CONFIG_RELOCATABLE. So it does not have to be saved and reloaded when branching to slb_miss_realmode. It can be restored from the PACA as usual. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-20 | powerpc/64s: Avoid saving faulting address into EX_DAR in SLB miss | Nicholas Piggin | 1 | -5/+8
The EX_DAR save area is only used in exceptional cases. With r3 no longer clobbered by slb_allocate_realmode, saving faulting address to EX_DAR can be deferred to those cases. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-20 | powerpc/64s: Preserve r3 in slb_allocate_realmode() | Nicholas Piggin | 1 | -10/+14
One fewer registers clobbered by this function means the SLB miss handler can save one fewer. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-19 | powerpc/64s/idle: Run latch switch is done with MSR[EE]=0 | Nicholas Piggin | 1 | -6/+6
In the idle sleep/wake code we know that MSR[EE] is clear, so we can avoid 2 x mfmsr and 2 x mtmsr by calling the double-underscore versions of the run latch routines which assume interrupts are already disabled. Acked-by: Vaidyanathan Srinivasan <[email protected]> Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-19 | powerpc/64s/idle: Predict HMI wakeup as unlikely | Nicholas Piggin | 1 | -1/+1
In a busy system, idle wakeups can be expected from IPIs and device interrupts. Reviewed-by: Gautham R. Shenoy <[email protected]> Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-19 | powerpc/64s/idle: Avoid SRR usage in idle sleep/wake paths | Nicholas Piggin | 3 | -32/+38
Idle code now always runs at the 0xc... effective address whether in real or virtual mode. This means rfid can be ditched, along with a lot of SRR manipulations. In the wakeup path, carry SRR1 around in r12. Use mtmsrd to change MSR states as required. This also balances the return prediction for the idle call, by doing blr rather than rfid to return to the idle caller. On POWER9, 2-process context switch on different cores, with snooze disabled, increases performance by 2%. Signed-off-by: Nicholas Piggin <[email protected]> [mpe: Incorporate v2 fixes from Nick] Signed-off-by: Michael Ellerman <[email protected]>
2017-06-19 | powerpc/64s/idle: Branch to handler with virtual mode offset | Nicholas Piggin | 2 | -2/+17
Have the system reset idle wakeup handlers branched to in real mode with the 0xc... kernel address applied. This allows simplifications of avoiding rfid when switching to virtual mode in the wakeup handler. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-19 | powerpc/64s: Don't unbalance the return branch predictor in __replay_interrupt() | Nicholas Piggin | 1 | -1/+7
The __replay_interrupt() code is branched to with bl, but the caller is returned to directly with rfid from the interrupt. Instead, rfid to a stub that returns to the caller with blr, which should keep the return branch predictor balanced. Reviewed-by: Gautham R. Shenoy <[email protected]> Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-19 | powerpc/64s: msgclr when handling doorbell exceptions from system reset | Nicholas Piggin | 4 | -2/+38
msgsnd doorbell exceptions are cleared when the doorbell interrupt is taken. However if a doorbell exception causes a system reset interrupt wake from power saving state, the message is not cleared. Processing the doorbell from the system reset interrupt requires msgclr to avoid taking the exception again.

Testing this plus the previous wakeup direct patch gives:

                                 original  wakeup direct  msgclr
  Different threads, same core:  315k/s    264k/s         345k/s
  Different cores:               235k/s    242k/s         242k/s

Net speedup is +10% for same core, and +3% for different core.

Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
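The quoted net speedups follow directly from the benchmark figures in this commit message. A short sketch of the arithmetic, with a hypothetical helper name:

```python
def speedup_pct(original, new):
    """Relative change from original to new, in percent."""
    return (new / original - 1) * 100


# Context switches per second (in thousands), from the commit message.
same_core = {"original": 315, "wakeup direct": 264, "msgclr": 345}
diff_core = {"original": 235, "wakeup direct": 242, "msgclr": 242}

print(f"same core:      {speedup_pct(same_core['original'], same_core['msgclr']):+.1f}%")
print(f"different core: {speedup_pct(diff_core['original'], diff_core['msgclr']):+.1f}%")
print(f"wakeup direct alone, same core: "
      f"{speedup_pct(same_core['original'], same_core['wakeup direct']):+.1f}%")
```

The same-core figure works out to about +9.5% and the different-core figure to about +3.0%, consistent with the rounded +10% and +3%. The intermediate "wakeup direct" column on its own is about 16% slower for the same-core case, which is the regression that msgclr removes.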
2017-06-19 | powerpc/64s/idle: Process interrupts from system reset wakeup | Nicholas Piggin | 3 | -2/+38
When the CPU wakes from low power state, it begins at the system reset interrupt with the exception that caused the wakeup encoded in SRR1.

Today, powernv idle wakeup ignores the wakeup reason (except a special case for HMI), and the regular interrupt corresponding to the exception will fire after the idle wakeup exits.

Change this to replay the interrupt from the idle wakeup before interrupts are hard-enabled.

Test on POWER8 of context_switch selftests benchmark with polling idle disabled (e.g., always nap, giving cross-CPU IPIs) gives the following results:

                                 original  wakeup direct
  Different threads, same core:  315k/s    264k/s
  Different cores:               235k/s    242k/s

There is a slowdown for doorbell IPI (same core) case because system reset wakeup does not clear the message and the doorbell interrupt fires again needlessly.

Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2017-06-19 | powerpc/powernv: Simplify lazy IRQ handling in CPU offline | Nicholas Piggin | 2 | -23/+29
Rather than concern ourselves with any soft-mask logic in the CPU hotplug handler, just hard disable interrupts. This ensures there are no lazy-irqs pending, which means we can call directly to idle instruction in order to sleep. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-19 | powerpc/64s/idle: Move soft interrupt mask logic into C code | Nicholas Piggin | 9 | -89/+128
This simplifies the asm and fixes irq-off tracing over sleep instructions. Also move powersave_nap check for POWER8 into C code, and move PSSCR register value calculation for POWER9 into C. Reviewed-by: Gautham R. Shenoy <[email protected]> Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-16 | powerpc/perf: Fix oops when kthread execs user process | Ravi Bangoria | 1 | -1/+2
When a kthread calls call_usermodehelper() the steps are:

  1. allocate current->mm
  2. load_elf_binary()
  3. populate current->thread.regs

While doing this, interrupts are not disabled. If there is a perf interrupt in the middle of this process (i.e. step 1 has completed but not yet reached to step 3) and if perf tries to read userspace regs, kernel oops with following log:

  Unable to handle kernel paging request for data at address 0x00000000
  Faulting instruction address: 0xc0000000000da0fc
  ...
  Call Trace:
    perf_output_sample_regs+0x6c/0xd0
    perf_output_sample+0x4e4/0x830
    perf_event_output_forward+0x64/0x90
    __perf_event_overflow+0x8c/0x1e0
    record_and_restart+0x220/0x5c0
    perf_event_interrupt+0x2d8/0x4d0
    performance_monitor_exception+0x54/0x70
    performance_monitor_common+0x158/0x160
    --- interrupt: f01 at avtab_search_node+0x150/0x1a0
        LR = avtab_search_node+0x100/0x1a0
    ...
    load_elf_binary+0x6e8/0x15a0
    search_binary_handler+0xe8/0x290
    do_execveat_common.isra.14+0x5f4/0x840
    call_usermodehelper_exec_async+0x170/0x210
    ret_from_kernel_thread+0x5c/0x7c

Fix it by setting abi to PERF_SAMPLE_REGS_ABI_NONE when userspace pt_regs are not set.

Fixes: ed4a4ef85cf5 ("powerpc/perf: Add support for sampling interrupt register state")
Cc: [email protected] # v4.7+
Signed-off-by: Ravi Bangoria <[email protected]>
Acked-by: Naveen N. Rao <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
2017-06-16 | powerpc/64s: Handle data breakpoints in Radix mode | Naveen N. Rao | 1 | -4/+7
On Power9, trying to use data breakpoints throws the splat shown below. This is because the check for a data breakpoint in DSISR is in do_hash_page(), which is not called when in Radix mode.

  Unable to handle kernel paging request for data at address 0xc000000000e19218
  Faulting instruction address: 0xc0000000001155e8
  cpu 0x0: Vector: 300 (Data Access) at [c0000000ef1e7b20]
      pc: c0000000001155e8: find_pid_ns+0x48/0xe0
      lr: c000000000116ac4: find_task_by_vpid+0x44/0x90
      sp: c0000000ef1e7da0
     msr: 9000000000009033
     dar: c000000000e19218
   dsisr: 400000

Move the check to handle_page_fault() so as to catch data breakpoints in both Hash and Radix MMU modes.

We have to change the check in do_hash_page() against 0xa410 to use 0xa450, so as to include the value of (DSISR_DABRMATCH << 16).

There are two sites that call handle_page_fault() when in Radix, both already pass DSISR in r4.

Fixes: caca285e5ab4 ("powerpc/mm/radix: Use STD_MMU_64 to properly isolate hash related code")
Cc: [email protected] # v4.7+
Reported-by: Shriya R. Kulkarni <[email protected]>
Signed-off-by: Naveen N. Rao <[email protected]>
[mpe: Fix the fall-through case on hash, we need to reload DSISR]
Signed-off-by: Michael Ellerman <[email protected]>
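The change from 0xa410 to 0xa450 can be checked with a few lines of bit arithmetic. The sketch rests on two assumptions that the commit message does not spell out: that DSISR_DABRMATCH is 0x00400000 (which matches the "dsisr: 400000" value in the splat), and that the 16-bit constants are compared against the upper halfword of DSISR. The function name is hypothetical.

```python
DSISR_DABRMATCH = 0x00400000  # assumed value; matches "dsisr: 400000" in the splat

OLD_MASK_HI = 0xA410
# Add the data breakpoint bit, expressed in upper-halfword terms.
NEW_MASK_HI = OLD_MASK_HI | (DSISR_DABRMATCH >> 16)


def matches_upper_halfword(dsisr, mask_hi):
    """True if any bit of mask_hi is set in the upper 16 bits of dsisr."""
    return bool(dsisr & (mask_hi << 16))


print(hex(NEW_MASK_HI))                             # 0xa450
print(matches_upper_halfword(0x400000, OLD_MASK_HI))  # False: breakpoint bit not covered
print(matches_upper_halfword(0x400000, NEW_MASK_HI))  # True: breakpoint bit now covered
```

Under those assumptions, the DSISR value from the splat is not caught by the old mask but is caught by the new one.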
2017-06-16 | powerpc/kprobes: Skip livepatch_handler() for jprobes | Naveen N. Rao | 3 | -5/+41
ftrace_caller() depends on a modified regs->nip to detect if a certain function has been livepatched. However, with KPROBES_ON_FTRACE, it is possible for regs->nip to have been modified by the kprobes pre_handler (jprobes, for instance). In this case, we do not want to invoke the livepatch_handler so as not to consume the livepatch stack. To distinguish between the two (kprobes and livepatch), we check if there is an active kprobe on the current function. If there is, then we know for sure that it must have modified the NIP as we don't support livepatching a kprobe'd function. In this case, we simply skip the livepatch_handler and branch to the new NIP. Otherwise, the livepatch_handler is invoked. Fixes: ead514d5fb30 ("powerpc/kprobes: Add support for KPROBES_ON_FTRACE") Signed-off-by: Naveen N. Rao <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-16 | powerpc/ftrace: Pass the correct stack pointer for DYNAMIC_FTRACE_WITH_REGS | Naveen N. Rao | 1 | -8/+12
For DYNAMIC_FTRACE_WITH_REGS, we should be passing-in the original set of registers in pt_regs, to capture the state _before_ ftrace_caller. However, we are instead passing the stack pointer *after* allocating a stack frame in ftrace_caller. Fix this by saving the proper value of r1 in pt_regs. Also, use SAVE_10GPRS() to simplify the code. Fixes: 153086644fd1 ("powerpc/ftrace: Add support for -mprofile-kernel ftrace ABI") Cc: [email protected] # v4.6+ Signed-off-by: Naveen N. Rao <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-16 | powerpc/kprobes: Pause function_graph tracing during jprobes handling | Naveen N. Rao | 1 | -0/+11
This fixes a crash when function_graph and jprobes are used together. This is essentially commit 237d28db036e ("ftrace/jprobes/x86: Fix conflict between jprobes and function graph tracing"), but for powerpc. Jprobes breaks function_graph tracing since the jprobe hook needs to use jprobe_return(), which never returns back to the hook, but instead to the original jprobe'd function. The solution is to momentarily pause function_graph tracing before invoking the jprobe hook and re-enable it when returning back to the original jprobe'd function. Fixes: 6794c78243bf ("powerpc64: port of the function graph tracer") Cc: [email protected] # v2.6.30+ Signed-off-by: Naveen N. Rao <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Acked-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-16 | powerpc/debug: Add missing warn flag to WARN_ON's non-builtin path | Alexey Kardashevskiy | 1 | -1/+1
When trapped on WARN_ON(), report_bug() is expected to return BUG_TRAP_TYPE_WARN so the caller will increment NIP by 4 and continue. The __builtin_constant_p() path of the PPC's WARN_ON() calls (indirectly) __WARN_FLAGS() which has BUGFLAG_WARNING set, however the other branch does not which makes report_bug() report a bug rather than a warning. Fixes: f26dee15103f ("debug: Avoid setting BUGFLAG_WARNING twice") Signed-off-by: Alexey Kardashevskiy <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-15 | powerpc/xive: Fix offset for store EOI MMIOs | Benjamin Herrenschmidt | 3 | -8/+10
Architecturally we should apply a 0x400 offset for these. Not doing it will break future HW implementations. The offset of 0 is supposed to remain for "triggers" though not all sources support both trigger and store EOI, and in P9 specifically, some sources will treat 0 as a store EOI. But future chips will not. So this makes us use the properly architected offset which should work always. Fixes: 243e25112d06 ("powerpc/xive: Native exploitation of the XIVE interrupt controller") Signed-off-by: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-06-15 | drivers/watchdog/Kconfig: Update CONFIG_WATCHDOG_RTAS dependencies | Murilo Opsfelder Araujo | 1 | -1/+1
drivers/watchdog/wdrtas.c uses symbols defined in arch/powerpc/kernel/rtas.c, which are exported iff CONFIG_PPC_RTAS is selected.

Building wdrtas.c without setting CONFIG_PPC_RTAS throws the following errors:

  ERROR: ".rtas_token" [drivers/watchdog/wdrtas.ko] undefined!
  ERROR: "rtas_data_buf" [drivers/watchdog/wdrtas.ko] undefined!
  ERROR: "rtas_data_buf_lock" [drivers/watchdog/wdrtas.ko] undefined!
  ERROR: ".rtas_get_sensor" [drivers/watchdog/wdrtas.ko] undefined!
  ERROR: ".rtas_call" [drivers/watchdog/wdrtas.ko] undefined!

This was identified during a randconfig build where CONFIG_WATCHDOG_RTAS=m and CONFIG_PPC_RTAS was not set. Logs are here:

  http://kisskb.ellerman.id.au/kisskb/buildresult/12982152/

This patch fixes the issue by updating CONFIG_WATCHDOG_RTAS to depend on just CONFIG_PPC_RTAS, removing COMPILE_TEST entirely.

Signed-off-by: Murilo Opsfelder Araujo <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>