Age | Commit message (Collapse) | Author | Files | Lines |
|
The Broadcast Assist Unit (BAU) TLB shootdown handler is being rewritten
to become the UV BAU APIC driver. It is designed to speed up sending
IPIs to selective CPUs within the system. Remove the current TLB
shutdown handler (tlb_uv.c) file and a couple of kernel hooks in the
interim.
Signed-off-by: Mike Travis <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Dimitri Sivanich <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
This allows moving the leave_mm() call into generic code before
rcu_idle_enter(). Gets rid of more trace_*_rcuidle() users.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Tested-by: Marco Elver <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
"Misc changes:
- Unexport various PAT primitives
- Unexport per-CPU tlbstate and uninline TLB helpers"
* tag 'x86-mm-2020-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/tlb/uv: Add a forward declaration for struct flush_tlb_info
x86/cpu: Export native_write_cr4() only when CONFIG_LKTDM=m
x86/tlb: Restrict access to tlbstate
xen/privcmd: Remove unneeded asm/tlb.h include
x86/tlb: Move PCID helpers where they are used
x86/tlb: Uninline nmi_uaccess_okay()
x86/tlb: Move cr4_set_bits_and_update_boot() to the usage site
x86/tlb: Move paravirt_tlb_remove_table() to the usage site
x86/tlb: Move __flush_tlb_all() out of line
x86/tlb: Move flush_tlb_others() out of line
x86/tlb: Move __flush_tlb_one_kernel() out of line
x86/tlb: Move __flush_tlb_one_user() out of line
x86/tlb: Move __flush_tlb_global() out of line
x86/tlb: Move __flush_tlb() out of line
x86/alternatives: Move temporary_mm helpers into C
x86/cr4: Sanitize CR4.PCE update
x86/cpu: Uninline CR4 accessors
x86/tlb: Uninline __get_current_cr3_fast()
x86/mm: Use pgprotval_t in protval_4k_2_large() and protval_large_2_4k()
x86/mm: Unexport __cachemode2pte_tbl
...
|
|
Remove fault handling on vmalloc areas, as the vmalloc code now takes
care of synchronizing changes to all page-tables in the system.
Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H . Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Aside of the fact that they are used only in the TLB code, especially
having the comment close to the actual implementation makes a lot of
sense.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Alexandre Chartre <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
cpu_tlbstate is exported because various TLB-related functions need
access to it, but cpu_tlbstate is sensitive information which should
only be accessed by well-contained kernel functions and not be directly
exposed to modules.
nmi_access_ok() is the last inline function which requires access to
cpu_tlbstate. Move it into the TLB code.
No functional change.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Alexandre Chartre <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Reduce the number of required exports to one and make flush_tlb_global()
static to the TLB code.
flush_tlb_local() cannot be confined to the TLB code as the MTRR
handling requires a PGE-less flush.
Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
cpu_tlbstate is exported because various TLB-related functions need
access to it, but cpu_tlbstate is sensitive information which should
only be accessed by well-contained kernel functions and not be directly
exposed to modules.
As a last step, move __flush_tlb_others() out of line and hide the
native function. The latter can be static when CONFIG_PARAVIRT is
disabled.
No functional change.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Alexandre Chartre <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
cpu_tlbstate is exported because various TLB-related functions need
access to it, but cpu_tlbstate is sensitive information which should
only be accessed by well-contained kernel functions and not be directly
exposed to modules.
As a fourth step, move __flush_tlb_one_kernel() out of line and hide
the native function. The latter can be static when CONFIG_PARAVIRT is
disabled.
Consolidate the name space while at it and remove the pointless extra
wrapper in the paravirt code.
No functional change.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Alexandre Chartre <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
cpu_tlbstate is exported because various TLB-related functions need access
to it, but cpu_tlbstate is sensitive information which should only be
accessed by well-contained kernel functions and not be directly exposed to
modules.
As a third step, move _flush_tlb_one_user() out of line and hide the
native function. The latter can be static when CONFIG_PARAVIRT is
disabled.
Consolidate the name space while at it and remove the pointless extra
wrapper in the paravirt code.
No functional change.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Alexandre Chartre <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
cpu_tlbstate is exported because various TLB-related functions need
access to it, but cpu_tlbstate is sensitive information which should
only be accessed by well-contained kernel functions and not be directly
exposed to modules.
As a second step, move __flush_tlb_global() out of line and hide the
native function. The latter can be static when CONFIG_PARAVIRT is
disabled.
Consolidate the namespace while at it and remove the pointless extra
wrapper in the paravirt code.
No functional change.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Alexandre Chartre <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
cpu_tlbstate is exported because various TLB-related functions need
access to it, but cpu_tlbstate is sensitive information which should
only be accessed by well-contained kernel functions and not be directly
exposed to modules.
As a first step, move __flush_tlb() out of line and hide the native
function. The latter can be static when CONFIG_PARAVIRT is disabled.
Consolidate the namespace while at it and remove the pointless extra
wrapper in the paravirt code.
No functional change.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Alexandre Chartre <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
load_mm_cr4_irqsoff() is really a strange name for a function which has
only one purpose: Update the CR4.PCE bit depending on the perf state.
Rename it to update_cr4_pce_mm(), move it into the tlb code and provide a
function which can be invoked by the perf smp function calls.
Another step to remove exposure of cpu_tlbstate.
No functional change.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Alexandre Chartre <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
cpu_tlbstate is exported because various TLB-related functions need
access to it, but cpu_tlbstate is sensitive information which should
only be accessed by well-contained kernel functions and not be directly
exposed to modules.
In preparation for unexporting cpu_tlbstate move __get_current_cr3_fast()
into the x86 TLB management code.
No functional change.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Alexandre Chartre <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The allocation mask is no longer used by on_each_cpu_cond() and
on_each_cpu_cond_mask() and can be removed.
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
load_mm_cr4() is always called with interrupts disabled from:
- switch_mm_irqs_off()
- refresh_pce(), which is a on_each_cpu() callback
Thus, disabling interrupts in cr4_set/clear_bits() is redundant.
Implement cr4_set/clear_bits_irqsoff() helpers, rename load_mm_cr4() to
load_mm_cr4_irqsoff() and use the new helpers. The new helpers do not need
a lockdep assert as __cr4_set() has one already.
The renaming in combination with the checks in __cr4_set() ensure that any
changes in the boundary conditions at the call sites will be detected.
[ tglx: Massaged change log ]
Signed-off-by: Jan Kiszka <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
We need to pick up post-rc1 changes to various document files so they don't
get lost in Mauro's massive RST conversion push.
|
|
Mostly due to x86 and acpi conversion, several documentation
links are still pointing to the old file. Fix them.
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Reviewed-by: Wolfram Sang <[email protected]>
Reviewed-by: Sven Van Asbroeck <[email protected]>
Reviewed-by: Bhupesh Sharma <[email protected]>
Acked-by: Mark Brown <[email protected]>
Signed-off-by: Jonathan Corbet <[email protected]>
|
|
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Move flush_tlb_info variables off the stack. This allows to align
flush_tlb_info to cache-line and avoid potentially unnecessary cache
line movements. It also allows to have a fixed virtual-to-physical
translation of the variables, which reduces TLB misses.
Use per-CPU struct for flush_tlb_mm_range() and
flush_tlb_kernel_range(). Add debug assertions to ensure there are
no nested TLB flushes that might overwrite the per-CPU data. For
arch_tlbbatch_flush() use a const struct.
Results when running a microbenchmarks that performs 10^6 MADV_DONTEED
operations and touching a page, in which 3 additional threads run a
busy-wait loop (5 runs, PTI and retpolines are turned off):
base off-stack
---- ---------
avg (usec/op) 1.629 1.570 (-3%)
stddev 0.014 0.009
Signed-off-by: Nadav Amit <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Revert the following commit:
515ab7c41306: ("x86/mm: Align TLB invalidation info")
I found out (the hard way) that under some .config options (notably L1_CACHE_SHIFT=7)
and compiler combinations this on-stack alignment leads to a 320 byte
stack usage, which then triggers a KASAN stack warning elsewhere.
Using 320 bytes of stack space for a 40 byte structure is ludicrous and
clearly not right.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Nadav Amit <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: 515ab7c41306 ("x86/mm: Align TLB invalidation info")
Link: http://lkml.kernel.org/r/[email protected]
[ Minor changelog edits. ]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The "cpu" local variable became unused after
a2055abe9c67 ("x86/mm: Pass flush_tlb_info to flush_tlb_others() etc").
Remove it.
Signed-off-by: Shaokun Zhang <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: x86-ml <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Instead of punting and doing tlb_flush_all(), do the same as
flush_tlb_kernel_range() does and use single page invalidations.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The IBPB speculation barrier is issued from switch_mm() when the kernel
switches to a user space task with a different mm than the user space task
which ran last on the same CPU.
An additional optimization is to avoid IBPB when the incoming task can be
ptraced by the outgoing task. This optimization only works when switching
directly between two user space tasks. When switching from a kernel task to
a user space task the optimization fails because the previous task cannot
be accessed anymore. So for quite some scenarios the optimization is just
adding overhead.
The upcoming conditional IBPB support will issue IBPB only for user space
tasks which have the TIF_SPEC_IB bit set. This requires to handle the
following cases:
1) Switch from a user space task (potential attacker) which has
TIF_SPEC_IB set to a user space task (potential victim) which has
TIF_SPEC_IB not set.
2) Switch from a user space task (potential attacker) which has
TIF_SPEC_IB not set to a user space task (potential victim) which has
TIF_SPEC_IB set.
This needs to be optimized for the case where the IBPB can be avoided when
only kernel threads ran in between user space tasks which belong to the
same process.
The current check whether two tasks belong to the same context is using the
tasks context id. While correct, it's simpler to use the mm pointer because
it allows to mangle the TIF_SPEC_IB bit into it. The context id based
mechanism requires extra storage, which creates worse code.
When a task is scheduled out its TIF_SPEC_IB bit is mangled as bit 0 into
the per CPU storage which is used to track the last user space mm which was
running on a CPU. This bit can be used together with the TIF_SPEC_IB bit of
the incoming task to make the decision whether IBPB needs to be issued or
not to cover the two cases above.
As conditional IBPB is going to be the default, remove the dubious ptrace
check for the IBPB always case and simply issue IBPB always when the
process changes.
Move the storage to a different place in the struct as the original one
created a hole.
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Casey Schaufler <[email protected]>
Cc: Asit Mallick <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Jon Masters <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Dave Stewart <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 pti updates from Ingo Molnar:
"The main changes:
- Make the IBPB barrier more strict and add STIBP support (Jiri
Kosina)
- Micro-optimize and clean up the entry code (Andy Lutomirski)
- ... plus misc other fixes"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation: Propagate information about RSB filling mitigation to sysfs
x86/speculation: Enable cross-hyperthread spectre v2 STIBP mitigation
x86/speculation: Apply IBPB more strictly to avoid cross-process data leak
x86/speculation: Add RETPOLINE_AMD support to the inline asm CALL_NOSPEC variant
x86/CPU: Fix unused variable warning when !CONFIG_IA32_EMULATION
x86/pti/64: Remove the SYSCALL64 entry trampoline
x86/entry/64: Use the TSS sp2 slot for SYSCALL/SYSRET scratch space
x86/entry/64: Document idtentry
|
|
Lazy TLB mode can result in an idle CPU being woken up by a TLB flush,
when all it really needs to do is reload %CR3 at the next context switch,
assuming no page table pages got freed.
Memory ordering is used to prevent race conditions between switch_mm_irqs_off,
which checks whether .tlb_gen changed, and the TLB invalidation code, which
increments .tlb_gen whenever page table entries get invalidated.
The atomic increment in inc_mm_tlb_gen is its own barrier; the context
switch code adds an explicit barrier between reading tlbstate.is_lazy and
next->context.tlb_gen.
CPUs in lazy TLB mode remain part of the mm_cpumask(mm), both because
that allows TLB flush IPIs to be sent at page table freeing time, and
because the cache line bouncing on the mm_cpumask(mm) was responsible
for about half the CPU use in switch_mm_irqs_off().
We can change native_flush_tlb_others() without touching other
(paravirt) implementations of flush_tlb_others() because we'll be
flushing less. The existing implementations flush more and are
therefore still correct.
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Tested-by: Song Liu <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
Pass the information on to native_flush_tlb_others.
No functional changes.
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
Add an argument to flush_tlb_mm_range to indicate whether page tables
are about to be freed after this TLB flush. This allows for an
optimization of flush_tlb_mm_range to skip CPUs in lazy TLB mode.
No functional changes.
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
Move some code that will be needed for the lazy -> !lazy state
transition when a lazy TLB CPU has gotten out of date.
No functional changes, since the if (real_prev == next) branch
always returns.
(cherry picked from commit 61d0beb5796ab11f7f3bf38cb2eccc6579aaa70b)
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
On most workloads, the number of context switches far exceeds the
number of TLB flushes sent. Optimizing the context switches, by always
using lazy TLB mode, speeds up those workloads.
This patch results in about a 1% reduction in CPU use on a two socket
Broadwell system running a memcache like workload.
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Tested-by: Song Liu <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
(cherry picked from commit 95b0e6357d3e4e05349668940d7ff8f3b7e7e11e)
Acked-by: Dave Hansen <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
Use the new tlb_get_unmap_shift() to determine the stride of the
INVLPG loop.
Cc: Nick Piggin <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
|
|
Currently, IBPB is only issued in cases when switching into a non-dumpable
process, the rationale being to protect such 'important and security
sensitive' processess (such as GPG) from data leaking into a different
userspace process via spectre v2.
This is however completely insufficient to provide proper userspace-to-userpace
spectrev2 protection, as any process can poison branch buffers before being
scheduled out, and the newly scheduled process immediately becomes spectrev2
victim.
In order to minimize the performance impact (for usecases that do require
spectrev2 protection), issue the barrier only in cases when switching between
processess where the victim can't be ptraced by the potential attacker (as in
such cases, the attacker doesn't have to bother with branch buffers at all).
[ tglx: Split up PTRACE_MODE_NOACCESS_CHK into PTRACE_MODE_SCHED and
PTRACE_MODE_IBPB to be able to do ptrace() context tracking reasonably
fine-grained ]
Fixes: 18bf3c3ea8 ("x86/speculation: Use Indirect Branch Prediction Barrier in context switch")
Originally-by: Tim Chen <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "WoodhouseDavid" <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: "SchauflerCasey" <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
A NMI can hit in the middle of context switching or in the middle of
switch_mm_irqs_off(). In either case, CR3 might not match current->mm,
which could cause copy_from_user_nmi() and friends to read the wrong
memory.
Fix it by adding a new nmi_uaccess_okay() helper and checking it in
copy_from_user_nmi() and in __copy_from_user_nmi()'s callers.
Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/dd956eba16646fd0b15c3c0741269dfd84452dac.1535557289.git.luto@kernel.org
|
|
Revert commits:
95b0e6357d3e x86/mm/tlb: Always use lazy TLB mode
64482aafe55f x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
ac0315896970 x86/mm/tlb: Make lazy TLB mode lazier
61d0beb5796a x86/mm/tlb: Restructure switch_mm_irqs_off()
2ff6ddf19c0e x86/mm/tlb: Leave lazy TLB mode at page table free time
In order to simplify the TLB invalidate fixes for x86 and unify the
parts that need backporting. We'll try again later.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fixes the following sparse warning:
arch/x86/mm/tlb.c:38:6: warning: symbol 'clear_asid_other' was not declared. Should it be static?
Signed-off-by: zhong jiang <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Song Liu noticed switch_mm_irqs_off() taking a lot of CPU time in recent
kernels,using 1.8% of a 48 CPU system during a netperf to localhost run.
Digging into the profile, we noticed that cpumask_clear_cpu and
cpumask_set_cpu together take about half of the CPU time taken by
switch_mm_irqs_off().
However, the CPUs running netperf end up switching back and forth
between netperf and the idle task, which does not require changes
to the mm_cpumask. Furthermore, the init_mm cpumask ends up being
the most heavily contended one in the system.
Simply skipping changes to mm_cpumask(&init_mm) reduces overhead.
Reported-and-tested-by: Song Liu <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Now that CPUs in lazy TLB mode no longer receive TLB shootdown IPIs, except
at page table freeing time, and idle CPUs will no longer get shootdown IPIs
for things like mprotect and madvise, we can always use lazy TLB mode.
Tested-by: Song Liu <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
CPUs in !is_lazy have either received TLB flush IPIs earlier on during
the munmap (when the user memory was unmapped), or have context switched
and reloaded during that stage of the munmap.
Page table free TLB flushes only need to be sent to CPUs in lazy TLB
mode, which TLB contents might not yet be up to date yet.
Tested-by: Song Liu <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Lazy TLB mode can result in an idle CPU being woken up by a TLB flush,
when all it really needs to do is reload %CR3 at the next context switch,
assuming no page table pages got freed.
Memory ordering is used to prevent race conditions between switch_mm_irqs_off,
which checks whether .tlb_gen changed, and the TLB invalidation code, which
increments .tlb_gen whenever page table entries get invalidated.
The atomic increment in inc_mm_tlb_gen is its own barrier; the context
switch code adds an explicit barrier between reading tlbstate.is_lazy and
next->context.tlb_gen.
Unlike the 2016 version of this patch, CPUs with cpu_tlbstate.is_lazy set
are not removed from the mm_cpumask(mm), since that would prevent the TLB
flush IPIs at page table free time from being sent to all the CPUs
that need them.
This patch reduces total CPU use in the system by about 1-2% for a
memcache workload on two socket systems, and by about 1% for a heavily
multi-process netperf between two systems.
Tested-by: Song Liu <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Move some code that will be needed for the lazy -> !lazy state
transition when a lazy TLB CPU has gotten out of date.
No functional changes, since the if (real_prev == next) branch
always returns.
Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Andy discovered that speculative memory accesses while in lazy
TLB mode can crash a system, when a CPU tries to dereference a
speculative access using memory contents that used to be valid
page table memory, but have since been reused for something else
and point into la-la land.
The latter problem can be prevented in two ways. The first is to
always send a TLB shootdown IPI to CPUs in lazy TLB mode, while
the second one is to only send the TLB shootdown at page table
freeing time.
The second should result in fewer IPIs, since operationgs like
mprotect and madvise are very common with some workloads, but
do not involve page table freeing. Also, on munmap, batching
of page table freeing covers much larger ranges of virtual
memory than the batching of unmapped user pages.
Tested-by: Song Liu <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
pgtable_l5_enabled is defined using cpu_feature_enabled() but we refer
to it as a variable. This is misleading.
Make pgtable_l5_enabled() a function.
We cannot literally define it as a function due to circular dependencies
between header files. Function-alike macros is close enough.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
This patch converts the of CONFIG_X86_5LEVEL check to runtime checks for
p4d folding.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 PTI and Spectre related fixes and updates from Ingo Molnar:
"Here's the latest set of Spectre and PTI related fixes and updates:
Spectre:
- Add entry code register clearing to reduce the Spectre attack
surface
- Update the Spectre microcode blacklist
- Inline the KVM Spectre helpers to get close to v4.14 performance
again.
- Fix indirect_branch_prediction_barrier()
- Fix/improve Spectre related kernel messages
- Fix array_index_nospec_mask() asm constraint
- KVM: fix two MSR handling bugs
PTI:
- Fix a paranoid entry PTI CR3 handling bug
- Fix comments
objtool:
- Fix paranoid_entry() frame pointer warning
- Annotate WARN()-related UD2 as reachable
- Various fixes
- Add Add Peter Zijlstra as objtool co-maintainer
Misc:
- Various x86 entry code self-test fixes
- Improve/simplify entry code stack frame generation and handling
after recent heavy-handed PTI and Spectre changes. (There's two
more WIP improvements expected here.)
- Type fix for cache entries
There's also some low risk non-fix changes I've included in this
branch to reduce backporting conflicts:
- rename a confusing x86_cpu field name
- de-obfuscate the naming of single-TLB flushing primitives"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
x86/entry/64: Fix CR3 restore in paranoid_exit()
x86/cpu: Change type of x86_cache_size variable to unsigned int
x86/spectre: Fix an error message
x86/cpu: Rename cpu_data.x86_mask to cpu_data.x86_stepping
selftests/x86/mpx: Fix incorrect bounds with old _sigfault
x86/mm: Rename flush_tlb_single() and flush_tlb_one() to __flush_tlb_one_[user|kernel]()
x86/speculation: Add <asm/msr-index.h> dependency
nospec: Move array_index_nospec() parameter checking into separate macro
x86/speculation: Fix up array_index_nospec_mask() asm constraint
x86/debug: Use UD2 for WARN()
x86/debug, objtool: Annotate WARN()-related UD2 as reachable
objtool: Fix segfault in ignore_unreachable_insn()
selftests/x86: Disable tests requiring 32-bit support on pure 64-bit systems
selftests/x86: Do not rely on "int $0x80" in single_step_syscall.c
selftests/x86: Do not rely on "int $0x80" in test_mremap_vdso.c
selftests/x86: Fix build bug caused by the 5lvl test which has been moved to the VM directory
selftests/x86/pkeys: Remove unused functions
selftests/x86: Clean up and document sscanf() usage
selftests/x86: Fix vDSO selftest segfault for vsyscall=none
x86/entry/64: Remove the unused 'icebp' macro
...
|
|
__flush_tlb_one_[user|kernel]()
flush_tlb_single() and flush_tlb_one() sound almost identical, but
they really mean "flush one user translation" and "flush one kernel
translation". Rename them to flush_tlb_one_user() and
flush_tlb_one_kernel() to make the semantics more obvious.
[ I was looking at some PTI-related code, and the flush-one-address code
is unnecessarily hard to understand because the names of the helpers are
uninformative. This came up during PTI review, but no one got around to
doing it. ]
Signed-off-by: Andy Lutomirski <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Eduardo Valentin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Linux-MM <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Link: http://lkml.kernel.org/r/3303b02e3c3d049dc5235d5651e0ae6d29a34354.1517414378.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The TLB invalidation info is allocated on the stack, which might cause
it to be unaligned. Since this information may be transferred to
different cores for TLB shootdown, this may cause an additional cache
line to become shared. While the overhead is likely to be small, the
fix is simple.
We do not use __cacheline_aligned() since it also defines the section,
which is inappropriate for stack variables.
Signed-off-by: Nadav Amit <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Conflicts:
arch/arm64/kernel/entry.S
arch/x86/Kconfig
include/linux/sched/mm.h
kernel/fork.c
Signed-off-by: Ingo Molnar <[email protected]>
|
|
There are two places where core serialization is needed by membarrier:
1) When returning from the membarrier IPI,
2) After scheduler updates curr to a thread with a different mm, before
going back to user-space, since the curr->mm is used by membarrier to
check whether it needs to send an IPI to that CPU.
x86-32 uses IRET as return from interrupt, and both IRET and SYSEXIT to go
back to user-space. The IRET instruction is core serializing, but not
SYSEXIT.
x86-64 uses IRET as return from interrupt, which takes care of the IPI.
However, it can return to user-space through either SYSRETL (compat
code), SYSRETQ, or IRET. Given that SYSRET{L,Q} is not core serializing,
we rely instead on write_cr3() performed by switch_mm() to provide core
serialization after changing the current mm, and deal with the special
case of kthread -> uthread (temporarily keeping current mm into
active_mm) by adding a sync_core() in that specific case.
Use the new sync_core_before_usermode() to guarantee this.
Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrea Parri <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Dave Watson <[email protected]>
Cc: David Sehr <[email protected]>
Cc: Greg Hackmann <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Maged Michael <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Document the membarrier requirement on having a full memory barrier in
__schedule() after coming from user-space, before storing to rq->curr.
It is provided by smp_mb__after_spinlock() in __schedule().
Document that membarrier requires a full barrier on transition from
kernel thread to userspace thread. We currently have an implicit barrier
from atomic_dec_and_test() in mmdrop() that ensures this.
The x86 switch_mm_irqs_off() full barrier is currently provided by many
cpumask update operations as well as write_cr3(). Document that
write_cr3() provides this barrier.
Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrea Parri <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Dave Watson <[email protected]>
Cc: David Sehr <[email protected]>
Cc: Greg Hackmann <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Maged Michael <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|