aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2017-11-13powerpc/64s/hash: Allow MAP_FIXED allocations to cross 128TB boundaryNicholas Piggin1-1/+1
While mapping hints with a length that cross 128TB are disallowed, MAP_FIXED allocations that cross 128TB are allowed. These are failing on hash (on radix they succeed). Add an additional case for fixed mappings to expand the addr_limit when crossing 128TB. Fixes: f4ea6dcb08ea ("powerpc/mm: Enable mappings above 128TB") Cc: [email protected] # v4.12+ Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-13powerpc/64s/hash: Fix fork() with 512TB process address spaceNicholas Piggin1-4/+4
Hash unconditionally resets the addr_limit to default (128TB) when the mm context is initialised. If a process has > 128TB mappings when it forks, the child will not get the 512TB addr_limit, so accesses to valid > 128TB mappings will fail in the child. Fix this by only resetting the addr_limit to default if it was 0. Non zero indicates it was duplicated from the parent (0 means exec()). Fixes: f4ea6dcb08ea ("powerpc/mm: Enable mappings above 128TB") Cc: [email protected] # v4.12+ Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-13powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocationNicholas Piggin1-26/+24
When allocating VA space with a hint that crosses 128TB, the SLB addr_limit variable is not expanded if addr is not > 128TB, but the slice allocation looks at task_size, which is 512TB. This results in slice_check_fit() incorrectly succeeding because the slice_count truncates off bit 128 of the requested mask, so the comparison to the available mask succeeds. Fix this by using mm->context.addr_limit instead of mm->task_size for testing allocation limits. This causes such allocations to fail. Fixes: f4ea6dcb08ea ("powerpc/mm: Enable mappings above 128TB") Cc: [email protected] # v4.12+ Reported-by: Florian Weimer <[email protected]> Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-13powerpc/64s/hash: Fix 512T hint detection to use >= 128TMichael Ellerman1-2/+2
Currently userspace is able to request mmap() search between 128T-512T by specifying a hint address that is greater than 128T. But that means a hint of 128T exactly will return an address below 128T, which is confusing and wrong. So fix the logic to check the hint is greater than *or equal* to 128T. Fixes: f4ea6dcb08ea ("powerpc/mm: Enable mappings above 128TB") Cc: [email protected] # v4.12+ Suggested-by: Aneesh Kumar K.V <[email protected]> Suggested-by: Nicholas Piggin <[email protected]> [mpe: Split out of Nick's bigger patch] Signed-off-by: Michael Ellerman <[email protected]>
2017-11-13powerpc: Fix DABR match on hash based systemsBenjamin Herrenschmidt2-2/+2
Commit 398a719d34a1 ("powerpc/mm: Update bits used to skip hash_page") mistakenly dropped the DSISR_DABRMATCH bit from the mask of bit tested to skip trying to hash a page. As a result, the DABR matches would no longer be detected. This adds it back. We open code it in the 2 places where it matters rather than fold it into DSISR_BAD_FAULT_32S/64S because this isn't technically a bad fault and while we would never hit it with the current code, I prefer if page_fault_is_bad() didn't trigger on these. Fixes: 398a719d34a1 ("powerpc/mm: Update bits used to skip hash_page") Cc: [email protected] # v4.14 Tested-by: Pedro Miraglia Franco de Carvalho <[email protected]> Signed-off-by: Benjamin Herrenschmidt <[email protected]>
2017-11-13powerpc/signal: Properly handle return value from uprobe_deny_signal()Naveen N. Rao1-1/+1
When a uprobe is installed on an instruction that we currently do not emulate, we copy the instruction into a xol buffer and single step that instruction. If that instruction generates a fault, we abort the single stepping before invoking the signal handler. Once the signal handler is done, the uprobe trap is hit again since the instruction is retried and the process repeats. We use uprobe_deny_signal() to detect if the xol instruction triggered a signal. If so, we clear TIF_SIGPENDING and set TIF_UPROBE so that the signal is not handled until after the single stepping is aborted. In this case, uprobe_deny_signal() returns true and get_signal() ends up returning 0. However, in do_signal(), we are not looking at the return value, but depending on ksig.sig for further action, all with an uninitialized ksig that is not touched in this scenario. Fix the same by initializing ksig.sig to 0. Fixes: 129b69df9c90 ("powerpc: Use get_signal() signal_setup_done()") Cc: [email protected] # v3.17+ Reported-by: Anton Blanchard <[email protected]> Signed-off-by: Naveen N. Rao <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-13powerpc/fadump: use kstrtoint to handle sysfs storeMichal Suchanek1-4/+13
Currently sysfs store handlers in fadump use if buf[0] == 'char'. This means input "100foo" is interpreted as '1' and "01" as '0'. Change to kstrtoint so leading zeroes and the like is handled in expected way. Signed-off-by: Michal Suchanek <[email protected]> Acked-by: Hari Bathini <[email protected]> Signed-off-by: Michal Suchanek <a class="moz-txt-link-rfc2396E" href="mailto:[email protected]">&lt;[email protected]&gt;</a></pre> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-13powerpc/lib: Implement UACCESS_FLUSHCACHE APIOliver O'Halloran4-0/+41
Implement the architecture specific portitions of the UACCESS_FLUSHCACHE API. This provides functions for the copy_user_flushcache iterator that ensure that when the copy is finished the destination buffer contains a copy of the original and that the destination buffer is clean in the processor caches. Signed-off-by: Oliver O'Halloran <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-13powerpc/lib: Implement PMEM APIOliver O'Halloran3-1/+36
Implement the architecture specific cache maintence functions that make up the "PMEM API". Currently the writeback and invalidate functions are the same since the function of the DCBST (data cache block store) instruction is typically interpreted as "writeback to the point of coherency" rather than to memory. As a result implementing the API requires a full cache flush rather than just a cache write back. This will probably change in the not-too-distant future. Signed-off-by: Oliver O'Halloran <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-13powerpc/powernv/npu: Don't explicitly flush nmmu tlbAlistair Popple2-5/+26
The nest mmu required an explicit flush as a tlbi would not flush it in the same way as the core. However an alternate firmware fix exists which should eliminate the need for this flush, so instead add a device-tree property (ibm,nmmu-flush) on the NVLink2 PHB to enable it only if required. Signed-off-by: Alistair Popple <[email protected]> Reviewed-by: Frederic Barrat <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-13powerpc/powernv/npu: Use flush_all_mm() instead of flush_tlb_mm()Alistair Popple1-1/+1
With the optimisations introduced by commit a46cc7a908 ("powerpc/mm/radix: Improve TLB/PWC flushes"), flush_tlb_mm() no longer flushes the page walk cache with radix. Switch to using flush_all_mm() to ensure the pwc and tlb are properly flushed on the nmmu. Signed-off-by: Alistair Popple <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-13powerpc/powernv/idle: Round up latency and residency valuesVaidyanathan Srinivasan1-2/+2
On PowerNV platforms, firmware provides exit latency and target residency for each of the idle states in nano seconds. Cpuidle framework expects the values in micro seconds. Round up to nearest micro seconds to avoid errors in cases where the values are defined as fractional micro seconds. Default idle state of 'snooze' has exit latency of zero. If other states have fractional micro second exit latency, they would get rounded down to zero micro second and make cpuidle framework choose deeper idle state when snooze loop is the right choice. Reported-by: Anton Blanchard <[email protected]> Signed-off-by: Vaidyanathan Srinivasan <[email protected]> Reviewed-by: Gautham R. Shenoy <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/kprobes: refactor kprobe_lookup_name for safer string operationsNaveen N. Rao1-27/+20
Use safer string manipulation functions when dealing with a user-provided string in kprobe_lookup_name(). Reported-by: David Laight <[email protected]> Signed-off-by: Naveen N. Rao <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/kprobes: Blacklist emulate_update_regs() from kprobesNaveen N. Rao1-0/+1
Commit 3cdfcbfd32b9d ("powerpc: Change analyse_instr so it doesn't modify *regs") introduced emulate_update_regs() to perform part of what emulate_step() was doing earlier. However, this function was not added to the kprobes blacklist. Add it so as to prevent it from being probed. Signed-off-by: Naveen N. Rao <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/kprobes: Do not disable interrupts for optprobes and kprobes_on_ftraceNaveen N. Rao2-18/+2
Per Documentation/kprobes.txt, we don't necessarily need to disable interrupts before invoking the kprobe handlers. Masami submitted similar changes for x86 via commit a19b2e3d783964 ("kprobes/x86: Remove IRQ disabling from ftrace-based/optimized kprobes"). Do the same for powerpc. Signed-off-by: Naveen N. Rao <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/kprobes: Disable preemption before invoking probe handler for optprobesNaveen N. Rao1-2/+3
Per Documentation/kprobes.txt, probe handlers need to be invoked with preemption disabled. Update optimized_callback() to do so. Also move get_kprobe_ctlblk() invocation post preemption disable, since it accesses pre-cpu data. This was not an issue so far since optprobes wasn't selected if CONFIG_PREEMPT was enabled. Commit a30b85df7d599f ("kprobes: Use synchronize_rcu_tasks() for optprobe with CONFIG_PREEMPT=y") changes this. Signed-off-by: Naveen N. Rao <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/64s: ppc_save_regs is now needed for all 64s buildsStephen Rothwell1-1/+1
Commit 78adf6c214f0 ("powerpc/64s: Implement system reset idle wakeup reason"), added a call to ppc_save_regs() in the book3s code. ppc_save_regs() is only built if XMON and/or KEXEC_CORE are enabled, which is usually the case, however if they're not enabled then the build breaks. Fix it by making the Makefile check also build ppc_save_regs.o if CONFIG_PPC_BOOK3S is enabled. Fixes: 78adf6c214f0 ("powerpc/64s: Implement system reset idle wakeup reason") Signed-off-by: Stephen Rothwell <[email protected]> [mpe: Write change log] Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/mm/radix: Fix crashes on Power9 DD1 with radix MMU and STRICT_RWXBalbir Singh1-0/+10
When using the radix MMU on Power9 DD1, to work around a hardware problem, radix__pte_update() is required to do a two stage update of the PTE. First we write a zero value into the PTE, then we flush the TLB, and then we write the new PTE value. In the normal case that works OK, but it does not work if we're updating the PTE that maps the code we're executing, because the mapping is removed by the TLB flush and we can no longer execute from it. Unfortunately the STRICT_RWX code needs to do exactly that. The exact symptoms when we hit this case vary, sometimes we print an oops and then get stuck after that, but I've also seen a machine just get stuck continually page faulting with no oops printed. The variance is presumably due to the exact layout of the text and the page size used for the mappings. In all cases we are unable to boot to a shell. There are possible solutions such as creating a second mapping of the TLB flush code, executing from that, and then jumping back to the original. However we don't want to add that level of complexity for a DD1 work around. So just detect that we're running on Power9 DD1 and refrain from changing the permissions, effectively disabling STRICT_RWX on Power9 DD1. Fixes: 7614ff3272a1 ("powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix") Cc: [email protected] # v4.13+ Reported-by: Andrew Jeffery <[email protected]> [Changelog as suggested by Michael Ellerman <[email protected]>] Signed-off-by: Balbir Singh <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12crypto/nx: Do not initialize workmem allocationHaren Myneni1-1/+1
We are using percpu send window on P9 NX (powerNV) instead of opening / closing per each crypto session. Means txwin is removed from workmem. So we do not need to initialize workmem for each request. Signed-off-by: Haren Myneni <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12crypto/nx: Use percpu send window for NX requestsHaren Myneni1-81/+68
For P9 NX, the send window is opened for each crypto session and closed upon free. But VAS supports 64K windows per chip for all coprocessors including in user space support. So there is a possibility of not getting the window for kernel requests. This patch reserves windows for each coprocessor type (NX842) and are available forever for kernel requests, Opens each window for each CPU on the corresponding chip during driver initialization. So then use the percpu txwin for NX requests depends on the CPU on which the process is executing. Signed-off-by: Haren Myneni <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/vas: Add support for user receive windowSukadev Bhattiprolu1-7/+49
Add support for user space receive window (for the Fast thread-wakeup coprocessor type) Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/vas: Define vas_win_id()Sukadev Bhattiprolu3-0/+42
Define an interface to return a system-wide unique id for a given VAS window. The vas_win_id() will be used in a follow-on patch to generate an unique handle for a user space receive window. Applications can use this handle to pair send and receive windows for fast thread-wakeup. The hardware refers to this system-wide unique id as a Partition Send Window ID which is expected to be used during fault handling. Hence the "pswid" in the function names. Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/vas: Define vas_win_paste_addr()Sukadev Bhattiprolu2-0/+17
Define an interface that the NX drivers can use to find the physical paste address of a send window. This interface is expected to be used with the mmap() operation of the NX driver's device. i.e the user space process can use driver's mmap() operation to map the send window's paste address into their address space and then use copy and paste instructions to submit the CRBs to the NX engine. Note that kernel drivers will use vas_paste_crb() directly and don't need this interface. Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc: Define set_thread_uses_vas()Sukadev Bhattiprolu3-10/+35
A CP_ABORT instruction is required in processes that have mapped a VAS "paste address" with the intention of using COPY/PASTE instructions. But since CP_ABORT is expensive, we want to restrict it to only processes that use/intend to use COPY/PASTE. Define an interface, set_thread_uses_vas(), that VAS can use to indicate that the current process opened a send window. During context switch, issue CP_ABORT only for processes that have the flag set. Thanks for input from Nick Piggin, Michael Ellerman. Signed-off-by: Sukadev Bhattiprolu <[email protected]> [mpe: Fix to not use new_thread after _switch() returns] Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc: Add support for setting SPRN_TIDRSukadev Bhattiprolu3-0/+120
We need the SPRN_TIDR to be set for use with fast thread-wakeup (core- to-core wakeup) and also with CAPI. Each thread in a process needs to have a unique id within the process. But for now, we assign globally unique thread ids to all threads in the system. Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: Philippe Bergheaud <[email protected]> Signed-off-by: Christophe Lombard <[email protected]> [mpe: Simplify tidr clearing on fork() and ctx switch code] Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/vas: Export HVWC to debugfsSukadev Bhattiprolu5-9/+257
Export the VAS Window context information to debugfs. We need to hold a mutex when closing the window to prevent a race with the debugfs read(). Rather than introduce a per-instance mutex, we use the global vas_mutex for now, since it is not heavily contended. The window->cop field is only relevant to a receive window so we were not setting it for a send window (which is is paired to a receive window anyway). But to simplify reporting in debugfs, set the 'cop' field for the send window also. Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/vas, nx-842: Define and use chip_to_vas_id()Sukadev Bhattiprolu3-15/+23
Define a helper, chip_to_vas_id() to map a given chip id to corresponding vas id. Normally, callers of vas_rx_win_open() and vas_tx_win_open() want the VAS window to be on the same chip where the calling thread is executing. These callers can pass in -1 for the VAS id. This interface will be useful if a thread running on one chip wants to open a window on another chip (like the NX-842 driver does during start up). Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/vas: Create cpu to vas id mappingSukadev Bhattiprolu1-1/+13
Create a cpu to vasid mapping so callers can specify -1 instead of trying to find a VAS id. Changelog[v2] [Michael Ellerman] Use per-cpu variables to simplify code. Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/vas: poll for return of window creditsSukadev Bhattiprolu1-0/+45
Normally, the NX driver waits for the CRBs to be processed before closing the window. But it is better to ensure that the credits are returned before the window gets reassigned later. Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/vas: Save configured window creditsSukadev Bhattiprolu2-2/+5
Save the configured max window credits for a window in the vas_window structure. We will need this when polling for return of window credits. Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/vas: Reduce polling interval for busy stateSukadev Bhattiprolu1-4/+6
A VAS window is normally in "busy" state for only a short duration. Reduce the time we wait for the window to go to "not-busy" state to speed-up vas_win_close() a bit. Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/vas: Use helper to unpin/close windowSukadev Bhattiprolu1-7/+15
Use a helper to have the hardware unpin and mark a window closed. Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/vas: Drop poll_window_cast_out().Sukadev Bhattiprolu1-17/+17
Polling for window cast out is listed in the spec, but turns out that it is not strictly necessary and slows down window close. Making it a stub for now. Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/vas: Cleanup some debug codeSukadev Bhattiprolu2-45/+17
Clean up vas.h and the debug code around ifdef vas_debug. Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/vas: Validate window creditsSukadev Bhattiprolu2-2/+8
NX-842, the only user of VAS, sets the window credits to default values but VAS should check the credits against the possible max values. The VAS_WCREDS_MIN is not needed and can be dropped. Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-12powerpc/vas: init missing fields from [rt]xattrSukadev Bhattiprolu1-0/+6
Initialize a few missing window context fields from the window attributes specified by the caller. These fields are currently set to their default values by the caller (NX-842), but would be good to apply them anyway. Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-10powerpc/64: Set DSCR default initially from SPRNicholas Piggin3-0/+26
Take the DSCR value set by firmware as the dscr_default value, rather than zero. POWER9 recommends DSCR default to a non-zero value. Signed-off-by: From: Nicholas Piggin <[email protected]> [mpe: Make record_spr_defaults() __init] Signed-off-by: Michael Ellerman <[email protected]>
2017-11-10powerpc/powernv: Avoid waiting for secondary hold spinloop with OPALNicholas Piggin2-6/+20
OPAL boot does not insert secondaries at 0x60 to wait at the secondary hold spinloop. Instead they are started later, and inserted at generic_secondary_smp_init(), which is after the secondary hold spinloop. Avoid waiting on this spinloop when booting with OPAL firmware. This wait always times out that case. This saves 100ms boot time on powernv, and 10s of seconds of real time when booting on the simulator in SMP. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-10powerpc/64s/radix: Improve TLB flushing for page table freeingNicholas Piggin1-29/+61
Unmaps that free page tables always flush the entire PID, which is sub-optimal. Provide TLB range flushing with an additional PWC flush that can be use for va range invalidations with PWC flush. Time to munmap N pages of memory including last level page table teardown (after mmap, touch), local invalidate: N 1 2 4 8 16 32 64 vanilla 3.2us 3.3us 3.4us 3.6us 4.1us 5.2us 7.2us patched 1.4us 1.5us 1.7us 1.9us 2.6us 3.7us 6.2us Global invalidate: N 1 2 4 8 16 32 64 vanilla 2.2us 2.3us 2.4us 2.6us 3.2us 4.1us 6.2us patched 2.1us 2.5us 3.4us 5.2us 8.7us 15.7us 6.2us Local invalidates get much better across the board. Global ones have the same issue where multiple tlbies for va flush do get slower than the single tlbie to invalidate the PID. None of this test captures the TLB benefits of avoiding killing everything. Global gets worse, but it is brought in to line with global invalidate for munmap()s that do not free page tables. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-10powerpc/64s/radix: Introduce local single page ceiling for TLB range flushNicholas Piggin1-4/+19
The single page flush ceiling is the cut-off point at which we switch from invalidating individual pages, to invalidating the entire process address space in response to a range flush. Introduce a local variant of this heuristic because local and global tlbie have significantly different properties: - Local tlbiel requires 128 instructions to invalidate a PID, global tlbie only 1 instruction. - Global tlbie instructions are expensive broadcast operations. The local ceiling has been made much higher, 2x the number of instructions required to invalidate the entire PID (i.e., 256 pages). Time to mprotect N pages of memory (after mmap, touch), local invalidate: N 32 34 64 128 256 512 vanilla 7.4us 9.0us 14.6us 26.4us 50.2us 98.3us patched 7.4us 7.8us 13.8us 26.4us 51.9us 98.3us The behaviour of both is identical at N=32 and N=512. Between there, the vanilla kernel does a PID invalidate and the patched kernel does a va range invalidate. At N=128, these require the same number of tlbiel instructions, so the patched version can be sen to be cheaper when < 128, and more expensive when > 128. However this does not well capture the cost of invalidated TLB. The additional cost at 256 pages does not seem prohibitive. It may be the case that increasing the limit further would continue to be beneficial to avoid invalidating all of the process's TLB entries. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-10powerpc/64s/radix: Optimize flush_tlb_rangeNicholas Piggin1-35/+103
Currently for radix, flush_tlb_range flushes the entire PID, because the Linux mm code does not tell us about page size here for THP vs regular pages. This is quite sub-optimal for small mremap / mprotect / change_protection. So implement va range flushes with two flush passes, one for each page size (regular and THP). The second flush has an order of matnitude fewer tlbie instructions than the first, so it is a relatively small additional cost. There is still room for improvement here with some changes to generic APIs, particularly if there are mostly THP pages to be invalidated, the small page flushes could be reduced. Time to mprotect 1 page of memory (after mmap, touch): vanilla 2.9us 1.8us patched 1.2us 1.6us Time to mprotect 30 pages of memory (after mmap, touch): vanilla 8.2us 7.2us patched 6.9us 17.9us Time to mprotect 34 pages of memory (after mmap, touch): vanilla 9.1us 8.0us patched 9.0us 8.0us 34 pages is the point at which the invalidation switches from va to entire PID, which tlbie can do in a single instruction. This is why in the case of 30 pages, the new code runs slower for this test. This is a deliberate tradeoff already present in the unmap and THP promotion code, the idea is that the benefit from avoiding flushing entire TLB for this PID on all threads in the system. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-10powerpc/64s/radix: Implement _tlbie(l)_va_range flush functionsNicholas Piggin1-30/+39
Move the barriers and range iteration down into the _tlbie* level, which improves readability. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-10powerpc/64s/radix: Optimize TLB range flush barriersNicholas Piggin1-9/+32
Short range flushes issue a sequences of tlbie(l) instructions for individual effective addresses. These do not all require individual barrier sequences, only one covering all tlbie(l) instructions. Commit f7327e0ba3 ("powerpc/mm/radix: Remove unnecessary ptesync") made a similar optimization for tlbiel for PID flushing. For tlbie, the ISA says: The tlbsync instruction provides an ordering function for the effects of all tlbie instructions executed by the thread executing the tlbsync instruction, with respect to the memory barrier created by a subsequent ptesync instruction executed by the same thread. Time to munmap 30 pages of memory (after mmap, touch): local global vanilla 10.9us 22.3us patched 3.4us 14.4us Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-10Merge branch 'fixes' into nextMichael Ellerman15-60/+125
We have some dependencies & conflicts between patches in fixes and things to go in next, both in the radix TLB flush code and the IMC PMU driver. So merge fixes into next.
2017-11-09selftests/powerpc: Check FP/VEC on exception in TMGustavo Romero4-1/+379
Add a self test to check if FP/VEC/VSX registers are sane (restored correctly) after a FP/VEC/VSX unavailable exception is caught during a transaction. This test checks all possibilities in a thread regarding the combination of MSR.[FP|VEC] states in a thread and for each scenario raises a FP/VEC/VSX unavailable exception in transactional state, verifying if vs0 and vs32 registers, which are representatives of FP/VEC/VSX reg sets, are not corrupted. Signed-off-by: Gustavo Romero <[email protected]> Signed-off-by: Breno Leitao <[email protected]> Signed-off-by: Cyril Bur <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-08powerpc/xmon: Support dumping software pagetablesBalbir Singh1-0/+116
It would be nice to be able to dump page tables in a particular context. eg: dumping vmalloc space: 0:mon> dv 0xd00037fffff00000 pgd @ 0xc0000000017c0000 pgdp @ 0xc0000000017c00d8 = 0x00000000f10b1000 pudp @ 0xc0000000f10b13f8 = 0x00000000f10d0000 pmdp @ 0xc0000000f10d1ff8 = 0x00000000f1102000 ptep @ 0xc0000000f1102780 = 0xc0000000f1ba018e Maps physical address = 0x00000000f1ba0000 Flags = Accessed Dirty Read Write This patch does not replicate the complex code of dump_pagetable and has no support for bolted linear mapping, thats why I've it's called dump virtual page table support. The format of the PTE can be expanded even further to add more useful information about the flags in the PTE if required. Signed-off-by: Balbir Singh <[email protected]> [mpe: Bike shed the output format, show the pgdir, fix build failures] Signed-off-by: Michael Ellerman <[email protected]>
2017-11-07powerpc/mm/hash: Remove stale comment.Michal Suchanek1-4/+0
In commit e6f81a92015b ("powerpc/mm/hash: Support 68 bit VA") the masking is folded into ASM_VSID_SCRAMBLE but the comment about masking is removed only from the firt use of ASM_VSID_SCRAMBLE. Signed-off-by: Michal Suchanek <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-07powerpc/powernv/ioda: Remove explicit max window size checkAlexey Kardashevskiy1-1/+1
DMA windows can only have a size of power of two on IODA2 hardware and using memory_hotplug_max() to determine the upper limit won't work correcly if it returns not power of two value. This removes the check as the platform code does this check in pnv_pci_ioda2_setup_default_config() anyway; the other client is VFIO and that thing checks against locked_vm limit which prevents the userspace from locking too much memory. It is expected to impact DPDK on machines with non-power-of-two RAM size, mostly. KVM guests are less likely to be affected as usually guests get less than half of hosts RAM. Signed-off-by: Alexey Kardashevskiy <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-07powerpc/powernv/cpufreq: Fix the frequency read by /proc/cpuinfoShriya1-1/+1
The call to /proc/cpuinfo in turn calls cpufreq_quick_get() which returns the last frequency requested by the kernel, but may not reflect the actual frequency the processor is running at. This patch makes a call to cpufreq_get() instead which returns the current frequency reported by the hardware. Fixes: fb5153d05a7d ("powerpc: powernv: Implement ppc_md.get_proc_freq()") Signed-off-by: Shriya <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2017-11-06powerpc/64s/idle: avoid POWER9 DD1 and DD2.0 PMU workaround on DD2.1Nicholas Piggin1-12/+19
DD2.1 does not have to save MMCR0 for all state-loss idle states, only after deep idle states (like other PMU registers). Reviewed-by: Vaidyanathan Srinivasan <[email protected]> Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>