Age | Commit message (Collapse) | Author | Files | Lines |
|
We want all of the syscall entries to run with interrupts off so that
we can efficiently run context tracking before enabling interrupts.
This will regress int $0x80 performance on 32-bit kernels by a
couple of cycles. This shouldn't matter much -- int $0x80 is not a
fast path.
This effectively reverts:
657c1eea0019 ("x86/entry/32: Fix entry_INT80_32() to expect interrupts to be on")
... and fixes the same issue differently.
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frédéric Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/59b4f90c9ebfccd8c937305dbbbca680bc74b905.1457558566.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Leonid Shatz noticed that the SDM interpretation of the following
recent commit:
394db20ca240741 ("x86/fpu: Disable AVX when eagerfpu is off")
... is incorrect and that the original behavior of the FPU code was correct.
Because AVX is not stated in CR0 TS bit description, it was mistakenly
believed to be not supported for lazy context switch. This turns out
to be false:
Intel Software Developer's Manual Vol. 3A, Sec. 2.5 Control Registers:
'TS Task Switched bit (bit 3 of CR0) -- Allows the saving of the x87 FPU/
MMX/SSE/SSE2/SSE3/SSSE3/SSE4 context on a task switch to be delayed until
an x87 FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instruction is actually executed
by the new task.'
Intel Software Developer's Manual Vol. 2A, Sec. 2.4 Instruction Exception
Specification:
'AVX instructions refer to exceptions by classes that include #NM
"Device Not Available" exception for lazy context switch.'
So revert the commit.
Reported-by: Leonid Shatz <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi V. Shankar <[email protected]>
Cc: Sai Praneeth Prakhya <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The first instruction of the SYSENTER entry runs on its own tiny
stack. That stack can be used if a #DB or NMI is delivered before
the SYSENTER prologue switches to a real stack.
We have code in place to prevent us from overflowing the tiny stack.
For added paranoia, add a canary to the stack and check it in
do_debug() -- that way, if something goes wrong with the #DB logic,
we'll eventually notice.
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/6ff9a806f39098b166dc2c41c1db744df5272f29.1457578375.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Right after SYSENTER, we can get a #DB or NMI. On x86_32, there's no IST,
so the exception handler is invoked on the temporary SYSENTER stack.
Because the SYSENTER stack is very small, we have a fixup to switch
off the stack quickly when this happens. The old fixup had several issues:
1. It checked the interrupt frame's CS and EIP. This wasn't
obviously correct on Xen or if vm86 mode was in use [1].
2. In the NMI handler, it did some frightening digging into the
stack frame. I'm not convinced this digging was correct.
3. The fixup didn't switch stacks and then switch back. Instead, it
synthesized a brand new stack frame that would redirect the IRET
back to the SYSENTER code. That frame was highly questionable.
For one thing, if NMI nested inside #DB, we would effectively
abort the #DB prologue, which was probably safe but was
frightening. For another, the code used PUSHFL to write the
FLAGS portion of the frame, which was simply bogus -- by the time
PUSHFL was called, at least TF, NT, VM, and all of the arithmetic
flags were clobbered.
Simplify this considerably. Instead of looking at the saved frame
to see where we came from, check the hardware ESP register against
the SYSENTER stack directly. Malicious user code cannot spoof the
kernel ESP register, and by moving the check after SAVE_ALL, we can
use normal PER_CPU accesses to find all the relevant addresses.
With this patch applied, the improved syscall_nt_32 test finally
passes on 32-bit kernels.
[1] It isn't obviously correct, but it is nonetheless safe from vm86
shenanigans as far as I can tell. A user can't point EIP at
entry_SYSENTER_32 while in vm86 mode because entry_SYSENTER_32,
like all kernel addresses, is greater than 0xffff and would thus
violate the CS segment limit.
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/b2cdbc037031c07ecf2c40a96069318aec0e7971.1457578375.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Due to a blatant design error, SYSENTER doesn't clear TF (single-step).
As a result, if a user does SYSENTER with TF set, we will single-step
through the kernel until something clears TF. There is absolutely
nothing we can do to prevent this short of turning off SYSENTER [1].
Simplify the handling considerably with two changes:
1. We already sanitize EFLAGS in SYSENTER to clear NT and AC. We can
add TF to that list of flags to sanitize with no overhead whatsoever.
2. Teach do_debug() to ignore single-step traps in the SYSENTER prologue.
That's all we need to do.
Don't get too excited -- our handling is still buggy on 32-bit
kernels. There's nothing wrong with the SYSENTER code itself, but
the #DB prologue has a clever fixup for traps on the very first
instruction of entry_SYSENTER_32, and the fixup doesn't work quite
correctly. The next two patches will fix that.
[1] We could probably prevent it by forcing BTF on at all times and
making sure we clear TF before any branches in the SYSENTER
code. Needless to say, this is a bad idea.
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/a30d2ea06fe4b621fe6a9ef911b02c0f38feb6f2.1457578375.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Leaving any bits set in DR6 on return from a debug exception is
asking for trouble. Prevent it by writing zero right away and
clarify the comment.
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/3857676e1be8fb27db4b89bbb1e2052b7f435ff4.1457578375.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The SDM says that debug exceptions clear BTF, and we need to keep
TIF_BLOCKSTEP in sync with BTF. Clear it unconditionally and improve
the comment.
I suspect that the fact that kmemcheck could cause TIF_BLOCKSTEP not
to be cleared was just an oversight.
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/fa86e55d196e6dde5b38839595bde2a292c52fdc.1457578375.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
|
|
After fixing FPU option parsing, we now parse the 'no387' boot option
too early: no387 clears X86_FEATURE_FPU before it's even probed, so
the boot CPU promptly re-enables it.
I suspect it gets even more confused on SMP.
Fix the probing code to leave X86_FEATURE_FPU off if it's been
disabled by setup_clear_cpu_cap().
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Quentin Casasnovas <[email protected]>
Cc: Sai Praneeth Prakhya <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: yu-cheng yu <[email protected]>
Fixes: 4f81cbafcce2 ("x86/fpu: Fix early FPU command-line parsing")
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Several cases of overlapping changes, as well as one instance
(vxlan) of a bug fix in 'net' overlapping with code movement
in 'net-next'.
Signed-off-by: David S. Miller <[email protected]>
|
|
Make use of the EXTABLE_FAULT exception table entries to write
a kernel copy routine that doesn't crash the system if it
encounters a machine check. Prime use case for this is to copy
from large arrays of non-volatile memory used as storage.
We have to use an unrolled copy loop for now because current
hardware implementations treat a machine check in "rep mov"
as fatal. When that is fixed we can simplify.
Return type is a "bool". True means that we copied OK, false means
that it didn't.
Signed-off-by: Tony Luck <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Link: http://lkml.kernel.org/r/a44e1055efc2d2a9473307b22c91caa437aa3f8b.1456439214.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
It no longer has any users.
Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Luis R. Rodriguez <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
x86_64 has very clean espfix handling on paravirt: espfix64 is set
up in native_iret, so paravirt systems that override iret bypass
espfix64 automatically. This is robust and straightforward.
x86_32 is messier. espfix is set up before the IRET paravirt patch
point, so it can't be directly conditionalized on whether we use
native_iret. We also can't easily move it into native_iret without
regressing performance due to a bizarre consideration. Specifically,
on 64-bit kernels, the logic is:
if (regs->ss & 0x4)
setup_espfix;
On 32-bit kernels, the logic is:
if ((regs->ss & 0x4) && (regs->cs & 0x3) == 3 &&
(regs->flags & X86_EFLAGS_VM) == 0)
setup_espfix;
The performance of setup_espfix itself is essentially irrelevant, but
the comparison happens on every IRET so its performance matters. On
x86_64, there's no need for any registers except flags to implement
the comparison, so we fold the whole thing into native_iret. On
x86_32, we don't do that because we need a free register to
implement the comparison efficiently. We therefore do espfix setup
before restoring registers on x86_32.
This patch gets rid of the explicit paravirt_enabled check by
introducing X86_BUG_ESPFIX on 32-bit systems and using an ALTERNATIVE
to skip espfix on paravirt systems where iret != native_iret. This is
also messy, but it's at least in line with other things we do.
This improves espfix performance by removing a branch, but no one
cares. More importantly, it removes a paravirt_enabled user, which is
good because paravirt_enabled is ill-defined and is going away.
Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Luis R. Rodriguez <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
ignore_nmis is used in two distinct places:
1. modified through {stop,restart}_nmi by alternative_instructions
2. read by do_nmi to determine if default_do_nmi should be called or not
thus the access pattern conforms to __read_mostly and do_nmi() is a fastpath.
Signed-off-by: Kostenzer Felix <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
_flat_send_IPI_mask: 157 bytes, 3 callsites
text data bss dec hex filename
96183823 20860520 36122624 153166967 9212477 vmlinux1_before
96183699 20860520 36122624 153166843 92123fb vmlinux
Signed-off-by: Denys Vlasenko <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Daniel J Blueman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
__default_send_IPI_shortcut: 49 bytes, 2 callsites
__default_send_IPI_dest_field: 108 bytes, 7 callsites
text data bss dec hex filename
96184086 20860488 36122624 153167198 921255e vmlinux_before
96183823 20860520 36122624 153166967 9212477 vmlinux
Signed-off-by: Denys Vlasenko <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Daniel J Blueman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
In an attempt to aid in understanding of what the threshold_block
structure holds, provide comments to describe the members here. Also,
trim comments around threshold_restart_bank() and update copyright info.
No functional change is introduced.
Signed-off-by: Aravind Gopalakrishnan <[email protected]>
[ Shorten comments. ]
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: linux-edac <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
In upcoming processors, the BLKPTR field is no longer used to indicate
the MSR number of the additional register. Insted, it simply indicates
the prescence of additional MSRs.
Fix the logic here to gather MSR address from MSR_AMD64_SMCA_MCx_MISC()
for newer processors and fall back to existing logic for older
processors.
[ Drop nextaddr_out label; style cleanups. ]
Signed-off-by: Aravind Gopalakrishnan <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: linux-edac <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
For Scalable MCA enabled processors, errors are listed per IP block. And
since it is not required for an IP to map to a particular bank, we need
to use HWID and McaType values from the MCx_IPID register to figure out
which IP a given bank represents.
We also have a new bit (TCC) in the MCx_STATUS register to indicate Task
context is corrupt.
Add logic here to decode errors from all known IP blocks for Fam17h
Model 00-0fh and to print TCC errors.
[ Minor fixups. ]
Signed-off-by: Aravind Gopalakrishnan <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: linux-edac <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
It is 0 because for !0 values we would have exited already.
Signed-off-by: Borislav Petkov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Turn them into proper sentences. Add comments to microcode_sanity_check() to
explain what it does.
Signed-off-by: Borislav Petkov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Merge the two consecutive "if (ext_table_size)". No functional change.
Signed-off-by: Borislav Petkov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
sizeof(u32) is perfectly clear as it is.
Signed-off-by: Borislav Petkov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Microcode checksum verification should be done using unsigned 32-bit
values otherwise the calculation overflow results in undefined
behaviour.
This is also nicely documented in the SDM, section "Microcode Update
Checksum":
"To check for a corrupt microcode update, software must perform a
unsigned DWORD (32-bit) checksum of the microcode update. Even though
some fields are signed, the checksum procedure treats all DWORDs as
unsigned. Microcode updates with a header version equal to 00000001H
must sum all DWORDs that comprise the microcode update. A valid
checksum check will yield a value of 00000000H."
but for some reason the code has been using ints from the very
beginning.
In practice, this bug possibly manifested itself only when doing the
microcode data checksum - apparently, currently shipped Intel microcode
doesn't have an extended signature table for which we do checksum
verification too.
UBSAN: Undefined behaviour in arch/x86/kernel/cpu/microcode/intel_lib.c:105:12
signed integer overflow:
-1500151068 + -2125470173 cannot be represented in type 'int'
CPU: 0 PID: 0 Comm: swapper Not tainted 4.5.0-rc5+ #495
...
Call Trace:
dump_stack
? inotify_ioctl
ubsan_epilogue
handle_overflow
__ubsan_handle_add_overflow
microcode_sanity_check
get_matching_model_microcode.isra.2.constprop.8
? early_idt_handler_common
? strlcpy
? find_cpio_data
load_ucode_intel_bsp
load_ucode_bsp
? load_ucode_bsp
x86_64_start_kernel
[ Expand and massage commit message. ]
Signed-off-by: Chris Bainbridge <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
On modern Intel systems TSC is derived from the new Always Running Timer
(ART). ART can be captured simultaneous to the capture of
audio and network device clocks, allowing a correlation between timebases
to be constructed. Upon capture, the driver converts the captured ART
value to the appropriate system clock using the correlated clocksource
mechanism.
On systems that support ART a new CPUID leaf (0x15) returns parameters
“m” and “n” such that:
TSC_value = (ART_value * m) / n + k [n >= 1]
[k is an offset that can adjusted by a privileged agent. The
IA32_TSC_ADJUST MSR is an example of an interface to adjust k.
See 17.14.4 of the Intel SDM for more details]
Cc: Prarit Bhargava <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Christopher S. Hall <[email protected]>
[jstultz: Tweaked to fix build issue, also reworked math for
64bit division on 32bit systems, as well as !CONFIG_CPU_FREQ build
fixes]
Signed-off-by: John Stultz <[email protected]>
|
|
Pause/unpause graph tracing around do_suspend_lowlevel as it has
inconsistent call/return info after it jumps to the wakeup vector.
The graph trace buffer will otherwise become misaligned and
may eventually crash and hang on suspend.
To reproduce the issue and test the fix:
Run a function_graph trace over suspend/resume and set the graph
function to suspend_devices_and_enter. This consistently hangs the
system without this fix.
Signed-off-by: Todd Brandt <[email protected]>
Cc: All applicable <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
Let the non boot cpus call into idle with the corresponding hotplug state, so
the hotplug core can handle the further bringup. That's a first step to
convert the boot side of the hotplugged cpus to do all the synchronization
with the other side through the state machine. For now it'll only start the
hotplug thread and kick the full bringup of the cpu.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: Rik van Riel <[email protected]>
Cc: Rafael Wysocki <[email protected]>
Cc: "Srivatsa S. Bhat" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Sebastian Siewior <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul Turner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
For per package oriented services we must be able to rely on the number of CPU
packages to be within bounds. Create a tracking facility, which
- calculates the number of possible packages depending on nr_cpu_ids after boot
- makes sure that the package id is within the number of possible packages. If
the apic id is outside we map it to a logical package id if there is enough
space available.
Provide interfaces for drivers to query the mapping and do translations from
physcial to logical ids.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Harish Chegondi <[email protected]>
Cc: Jacob Pan <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Luis R. Rodriguez <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
objtool reports the following warning for kretprobe_trampoline():
arch/x86/kernel/kprobes/core.o: warning: objtool: kretprobe_trampoline()+0x20: call without frame pointer save/setup
kretprobes are a special case where the stack is intentionally wrong.
The return address isn't known at the beginning of the trampoline, so
the stack frame can't be set up properly before it calls
trampoline_handler().
Because kretprobe handlers don't sleep, the frame pointer doesn't *have*
to be accurate in the trampoline. So it's ok to tell objtool to ignore
it. This results in no actual changes to the generated code.
Signed-off-by: Josh Poimboeuf <[email protected]>
Cc: Ananth N Mavinakayanahalli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Anil S Keshavamurthy <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Bernd Petrovitsch <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Chris J Arges <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Michal Marek <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Pedro Alves <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/7eaf37de52456ff822ffc86b928edb5d48a40ef1.1456719558.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Add a new macro, STACK_FRAME_NON_STANDARD(), which is used to denote a
function which does something unusual related to its stack frame. Use
of the macro prevents objtool from emitting a false positive warning.
Signed-off-by: Josh Poimboeuf <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Bernd Petrovitsch <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Chris J Arges <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michal Marek <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Pedro Alves <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/34487a17b23dba43c50941599d47054a9584b219.1456719558.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Code which runs outside the kernel's normal mode of operation often does
unusual things which can cause a static analysis tool like objtool to
emit false positive warnings:
- boot image
- vdso image
- relocation
- realmode
- efi
- head
- purgatory
- modpost
Set OBJECT_FILES_NON_STANDARD for their related files and directories,
which will tell objtool to skip checking them. It's ok to skip them
because they don't affect runtime stack traces.
Also skip the following code which does the right thing with respect to
frame pointers, but is too "special" to be validated by a tool:
- entry
- mcount
Also skip the test_nx module because it modifies its exception handling
table at runtime, which objtool can't understand. Fortunately it's
just a test module so it doesn't matter much.
Currently objtool is the only user of OBJECT_FILES_NON_STANDARD, but it
might eventually be useful for other tools.
Signed-off-by: Josh Poimboeuf <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Bernd Petrovitsch <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Chris J Arges <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michal Marek <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Pedro Alves <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/366c080e3844e8a5b6a0327dc7e8c2b90ca3baeb.1456719558.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <[email protected]>
|
|
table format
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Adam Buchbinder <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The kretprobe_trampoline_holder() wrapper around kretprobe_trampoline()
isn't used anywhere and adds some unnecessary frame pointer instructions
which never execute. Instead, just make kretprobe_trampoline() a proper
ELF function.
Signed-off-by: Josh Poimboeuf <[email protected]>
Acked-by: Masami Hiramatsu <[email protected]>
Cc: Ananth N Mavinakayanahalli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Anil S Keshavamurthy <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Bernd Petrovitsch <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Chris J Arges <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michal Marek <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Pedro Alves <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/92d921b102fb865a7c254cfde9e4a0a72b9a781e.1453405861.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <[email protected]>
|
|
do_suspend_lowlevel() is a callable non-leaf function which doesn't
honor CONFIG_FRAME_POINTER, which can result in bad stack traces.
Create a stack frame for it when CONFIG_FRAME_POINTER is enabled.
Signed-off-by: Josh Poimboeuf <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Acked-by: Pavel Machek <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Bernd Petrovitsch <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Chris J Arges <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michal Marek <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Pedro Alves <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/7383d87dd40a460e0d757a0793498b9d06a7ee0d.1453405861.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <[email protected]>
|
|
vide() is a callable function, but is missing the ELF function type,
which confuses tools like stacktool.
Properly annotate it to be a callable function. The generated code is
unchanged.
Signed-off-by: Josh Poimboeuf <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Bernd Petrovitsch <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Chris J Arges <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michal Marek <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Pedro Alves <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/a324095f5c9390ff39b15b4562ea1bbeda1a8282.1453405861.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Conflicts:
drivers/net/phy/bcm7xxx.c
drivers/net/phy/marvell.c
drivers/net/vxlan.c
All three conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller <[email protected]>
|
|
This removes the CONFIG_DEBUG_RODATA option and makes it always enabled.
This simplifies the code and also makes it clearer that read-only mapped
memory is just as fundamental a security feature in kernel-space as it is
in user-space.
Suggested-by: Ingo Molnar <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: David Brown <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Emese Revfy <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mathias Krause <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: PaX Team <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: linux-arch <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
It's simpler to look at the topology mask than iterating over all online cpus
to find a cpu on the same package.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
|
|
. avoid walking the stack when there is no room left in the buffer
. generalize get_perf_callchain() to be called from bpf helper
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Protection keys provide new page-based protection in hardware.
But, they have an interesting attribute: they only affect data
accesses and never affect instruction fetches. That means that
if we set up some memory which is set as "access-disabled" via
protection keys, we can still execute from it.
This patch uses protection keys to set up mappings to do just that.
If a user calls:
mmap(..., PROT_EXEC);
or
mprotect(ptr, sz, PROT_EXEC);
(note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
notice this, and set a special protection key on the memory. It
also sets the appropriate bits in the Protection Keys User Rights
(PKRU) register so that the memory becomes unreadable and
unwritable.
I haven't found any userspace that does this today. With this
facility in place, we expect userspace to move to use it
eventually. Userspace _could_ start doing this today. Any
PROT_EXEC calls get converted to PROT_READ inside the kernel, and
would transparently be upgraded to "true" PROT_EXEC with this
code. IOW, userspace never has to do any PROT_EXEC runtime
detection.
This feature provides enhanced protection against leaking
executable memory contents. This helps thwart attacks which are
attempting to find ROP gadgets on the fly.
But, the security provided by this approach is not comprehensive.
The PKRU register which controls access permissions is a normal
user register writable from unprivileged userspace. An attacker
who can execute the 'wrpkru' instruction can easily disable the
protection provided by this feature.
The protection key that is used for execute-only support is
permanently dedicated at compile time. This is fine for now
because there is currently no API to set a protection key other
than this one.
Despite there being a constant PKRU value across the entire
system, we do not set it unless this feature is in use in a
process. That is to preserve the PKRU XSAVE 'init state',
which can lead to faster context switches.
PKRU *is* a user register and the kernel is modifying it. That
means that code doing:
pkru = rdpkru()
pkru |= 0x100;
mmap(..., PROT_EXEC);
wrpkru(pkru);
could lose the bits in PKRU that enforce execute-only
permissions. To avoid this, we suggest avoiding ever calling
mmap() or mprotect() when the PKRU value is expected to be
unstable.
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Chen Gang <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Piotr Kwapulinski <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Stephen Smalley <[email protected]>
Cc: Vladimir Murzin <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The Protection Key Rights for User memory (PKRU) is a 32-bit
user-accessible register. It contains two bits for each
protection key: one to write-disable (WD) access to memory
covered by the key and another to access-disable (AD).
Userspace can read/write the register with the RDPKRU and WRPKRU
instructions. But, the register is saved and restored with the
XSAVE family of instructions, which means we have to treat it
like a floating point register.
The kernel needs to write to the register if it wants to
implement execute-only memory or if it implements a system call
to change PKRU.
To do this, we need to create a 'pkru_state' buffer, read the old
contents in to it, modify it, and then tell the FPU code that
there is modified data in there so it can (possibly) move the
buffer back in to the registers.
This uses the fpu__xfeature_set_state() function that we defined
in the previous patch.
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
We want to modify the Protection Key rights inside the kernel, so
we need to change PKRU's contents. But, if we do a plain
'wrpkru', when we return to userspace we might do an XRSTOR and
wipe out the kernel's 'wrpkru'. So, we need to go after PKRU in
the xsave buffer.
We do this by:
1. Ensuring that we have the XSAVE registers (fpregs) in the
kernel FPU buffer (fpstate)
2. Looking up the location of a given state in the buffer
3. Filling in the stat
4. Ensuring that the hardware knows that state is present there
(basically that the 'init optimization' is not in place).
5. Copying the newly-modified state back to the registers if
necessary.
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Quentin Casasnovas <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|