aboutsummaryrefslogtreecommitdiff
path: root/arch/x86/entry/common.c
AgeCommit message (Collapse)AuthorFilesLines
2020-11-04x86/entry: Move nmi entry/exit into common codeThomas Gleixner1-34/+0
Lockdep state handling on NMI enter and exit is nothing specific to X86. It's not any different on other architectures. Also the extra state type is not necessary, irqentry_state_t can carry the necessary information as well. Move it to common code and extend irqentry_state_t to carry lockdep state. [ Ira: Make exit_rcu and lockdep a union as they are mutually exclusive between the IRQ and NMI exceptions, and add kernel documentation for struct irqentry_state_t ] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Ira Weiny <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-22x86/irq: Make run_on_irqstack_cond() typesafeThomas Gleixner1-1/+1
Sami reported that run_on_irqstack_cond() requires the caller to cast functions to mismatching types, which trips indirect call Control-Flow Integrity (CFI) in Clang. Instead of disabling CFI on that function, provide proper helpers for the three call variants. The actual ASM code stays the same as that is out of reach. [ bp: Fix __run_on_irqstack() prototype to match. ] Fixes: 931b94145981 ("x86/entry: Provide helpers for executing on the irqstack") Reported-by: Nathan Chancellor <[email protected]> Reported-by: Sami Tolvanen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Tested-by: Sami Tolvanen <[email protected]> Cc: <[email protected]> Link: https://github.com/ClangBuiltLinux/linux/issues/1052 Link: https://lkml.kernel.org/r/[email protected]
2020-09-04x86/entry: Unbreak 32bit fast syscallThomas Gleixner1-9/+20
Andy reported that the syscall treacing for 32bit fast syscall fails: # ./tools/testing/selftests/x86/ptrace_syscall_32 ... [RUN] SYSEMU [FAIL] Initial args are wrong (nr=224, args=10 11 12 13 14 4289172732) ... [RUN] SYSCALL [FAIL] Initial args are wrong (nr=29, args=0 0 0 0 0 4289172732) The eason is that the conversion to generic entry code moved the retrieval of the sixth argument (EBP) after the point where the syscall entry work runs, i.e. ptrace, seccomp, audit... Unbreak it by providing a split up version of syscall_enter_from_user_mode(). - syscall_enter_from_user_mode_prepare() establishes state and enables interrupts - syscall_enter_from_user_mode_work() runs the entry work Replace the call to syscall_enter_from_user_mode() in the 32bit fast syscall C-entry with the split functions and stick the EBP retrieval between them. Fixes: 27d6b4d14f5c ("x86/entry: Use generic syscall entry function") Reported-by: Andy Lutomirski <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-07-26Merge branch 'locking/nmi' into x86/entryIngo Molnar1-0/+34
Resolve conflicts with ongoing lockdep work that fixed the NMI entry code. Conflicts: arch/x86/entry/common.c arch/x86/include/asm/idtentry.h Signed-off-by: Ingo Molnar <[email protected]>
2020-07-24x86/entry: Cleanup idtentry_enter/exitThomas Gleixner1-3/+3
Remove the temporary defines and fixup all references. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Kees Cook <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-24x86/entry: Use generic interrupt entry/exit codeThomas Gleixner1-166/+1
Replace the x86 code with the generic variant. Use temporary defines for idtentry_* which will be cleaned up in the next step. Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-24x86/entry: Use generic syscall exit functionalityThomas Gleixner1-216/+5
Replace the x86 variant with the generic version. Provide the relevant architecture specific helper functions and defines. Use a temporary define for idtentry_exit_user which will be cleaned up seperately. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Kees Cook <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-24x86/entry: Use generic syscall entry functionThomas Gleixner1-173/+8
Replace the syscall entry work handling with the generic version. Provide the necessary helper inlines to handle the real architecture specific parts, e.g. ptrace. Use a temporary define for idtentry_enter_user which will be cleaned up seperately. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Kees Cook <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-24x86/entry: Move user return notifier out of loopThomas Gleixner1-4/+4
Guests and user space share certain MSRs. KVM sets these MSRs to guest values once and does not set them back to user space values on every VM exit to spare the costly MSR operations. User return notifiers ensure that these MSRs are set back to the correct values before returning to user space in exit_to_usermode_loop(). There is no reason to evaluate the TIF flag indicating that user return notifiers need to be invoked in the loop. The important point is that they are invoked before returning to user space. Move the invocation out of the loop into the section which does the last preperatory steps before returning to user space. That section is not preemptible and runs with interrupts disabled until the actual return. Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-24x86/entry: Consolidate 32/64 bit syscall entryThomas Gleixner1-52/+41
64bit and 32bit entry code have the same open coded syscall entry handling after the bitwidth specific bits. Move it to a helper function and share the code. Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-24x86/entry: Consolidate check_user_regs()Thomas Gleixner1-15/+9
The user register sanity check is sprinkled all over the place. Move it into enter_from_user_mode(). Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Kees Cook <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-10x86/entry: Fix NMI vs IRQ state trackingPeter Zijlstra1-4/+38
While the nmi_enter() users did trace_hardirqs_{off_prepare,on_finish}() there was no matching lockdep_hardirqs_*() calls to complete the picture. Introduce idtentry_{enter,exit}_nmi() to enable proper IRQ state tracking across the NMIs. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Ingo Molnar <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-10Merge branch 'x86/urgent' into x86/entry to pick up upstream fixes.Thomas Gleixner1-2/+2
2020-07-09x86/entry/common: Make prepare_exit_to_usermode() staticThomas Gleixner1-1/+1
No users outside this file anymore. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-09x86/entry: Mark check_user_regs() noinstrThomas Gleixner1-1/+1
It's called from the non-instrumentable section. Fixes: c9c26150e61d ("x86/entry: Assert that syscalls are on the right stack") Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-06x86/entry: Rename idtentry_enter/exit_cond_rcu() to idtentry_enter/exit()Andy Lutomirski1-22/+28
They were originally called _cond_rcu because they were special versions with conditional RCU handling. Now they're the standard entry and exit path, so the _cond_rcu part is just confusing. Drop it. Also change the signature to make them more extensible and more foolproof. No functional change -- it's pure refactoring. Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/247fc67685263e0b673e1d7f808182d28ff80359.1593795633.git.luto@kernel.org
2020-07-04x86/entry, selftests: Further improve user entry sanity checksAndy Lutomirski1-0/+19
Chasing down a Xen bug caused me to realize that the new entry sanity checks are still fairly weak. Add some more checks. Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/881de09e786ab93ce56ee4a2437ba2c308afe7a9.1593795633.git.luto@kernel.org
2020-07-01x86/entry: Move SYSENTER's regs->sp and regs->flags fixups into CAndy Lutomirski1-0/+12
The SYSENTER asm (32-bit and compat) contains fixups for regs->sp and regs->flags. Move the fixups into C and fix some comments while at it. This is a valid cleanup all by itself, and it also simplifies the subsequent patch that will fix Xen PV SYSENTER. Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/fe62bef67eda7fac75b8f3dbafccf571dc4ece6b.1593191971.git.luto@kernel.org
2020-07-01x86/entry: Assert that syscalls are on the right stackAndy Lutomirski1-3/+15
Now that the entry stack is a full page, it's too easy to regress the system call entry code and end up on the wrong stack without noticing. Assert that all system calls (SYSCALL64, SYSCALL32, SYSENTER, and INT80) are on the right stack and have pt_regs in the right place. Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/52059e42bb0ab8551153d012d68f7be18d72ff8e.1593191971.git.luto@kernel.org
2020-06-12x86/entry: Force rcu_irq_enter() when in idle taskThomas Gleixner1-7/+28
The idea of conditionally calling into rcu_irq_enter() only when RCU is not watching turned out to be not completely thought through. Paul noticed occasional premature end of grace periods in RCU torture testing. Bisection led to the commit which made the invocation of rcu_irq_enter() conditional on !rcu_is_watching(). It turned out that this conditional breaks RCU assumptions about the idle task when the scheduler tick happens to be a nested interrupt. Nested interrupts can happen when the first interrupt invokes softirq processing on return which enables interrupts. If that nested tick interrupt does not invoke rcu_irq_enter() then the RCU's irq-nesting checks will believe that this interrupt came directly from idle, which will cause RCU to report a quiescent state. Because this interrupt instead came from a softirq handler which might have been executing an RCU read-side critical section, this can cause the grace period to end prematurely. Change the condition from !rcu_is_watching() to is_idle_task(current) which enforces that interrupts in the idle task unconditionally invoke rcu_irq_enter() independent of the RCU state. This is also correct vs. user mode entries in NOHZ full scenarios because user mode entries bring RCU out of EQS and force the RCU irq nesting state accounting to nested. As only the first interrupt can enter from user mode a nested tick interrupt will enter from kernel mode and as the nesting state accounting is forced to nesting it will not do anything stupid even if rcu_irq_enter() has not been invoked. Fixes: 3eeec3858488 ("x86/entry: Provide idtentry_entry/exit_cond_rcu()") Reported-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: "Paul E. McKenney" <[email protected]> Reviewed-by: "Paul E. McKenney" <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Acked-by: Frederic Weisbecker <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-06-11x86/entry: Rename trace_hardirqs_off_prepare()Peter Zijlstra1-3/+3
The typical pattern for trace_hardirqs_off_prepare() is: ENTRY lockdep_hardirqs_off(); // because hardware ... do entry magic instrumentation_begin(); trace_hardirqs_off_prepare(); ... do actual work trace_hardirqs_on_prepare(); lockdep_hardirqs_on_prepare(); instrumentation_end(); ... do exit magic lockdep_hardirqs_on(); which shows that it's named wrong, rename it to trace_hardirqs_off_finish(), as it concludes the hardirq_off transition. Also, given that the above is the only correct order, make the traditional all-in-one trace_hardirqs_off() follow suit. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-06-11x86/entry: Make enter_from_user_mode() staticThomas Gleixner1-1/+1
The ASM users are gone. All callers are local. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-06-11x86/entry: Switch XEN/PV hypercall entry to IDTENTRYThomas Gleixner1-0/+78
Convert the XEN/PV hypercall to IDTENTRY: - Emit the ASM stub with DECLARE_IDTENTRY - Remove the ASM idtentry in 64-bit - Remove the open coded ASM entry code in 32-bit - Remove the old prototypes The handler stubs need to stay in ASM code as they need corner case handling and adjustment of the stack pointer. Provide a new C function which invokes the entry/exit handling and calls into the XEN handler on the interrupt stack if required. The exit code is slightly different from the regular idtentry_exit() on non-preemptible kernels. If the hypercall is preemptible and need_resched() is set then XEN provides a preempt hypercall scheduling function. Move this functionality into the entry code so it can use the existing idtentry functionality. [ mingo: Build fixes. ] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Acked-by: Juergen Gross <[email protected]> Tested-by: Juergen Gross <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-06-11x86/entry: Split out idtentry_exit_cond_resched()Thomas Gleixner1-15/+15
The XEN PV hypercall requires the ability of conditional rescheduling when preemption is disabled because some hypercalls take ages. Split out the rescheduling code from idtentry_exit_cond_rcu() so it can be reused for that. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-06-11x86/entry: Clean up idtentry_enter/exit() leftoversThomas Gleixner1-38/+29
Now that everything is converted to conditional RCU handling remove idtentry_enter/exit() and tidy up the conditional functions. This does not remove rcu_irq_exit_preempt(), to avoid conflicts with the RCU tree. Will be removed once all of this hits Linus's tree. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-06-11x86/entry: Provide idtentry_enter/exit_user()Thomas Gleixner1-0/+31
As there are exceptions which already handle entry from user mode and from kernel mode separately, providing explicit user entry/exit handling callbacks makes sense and makes the code easier to understand. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-06-11x86/entry: Provide idtentry_entry/exit_cond_rcu()Thomas Gleixner1-15/+64
After a lengthy discussion [1] it turned out that RCU does not need a full rcu_irq_enter/exit() when RCU is already watching. All it needs if NOHZ_FULL is active is to check whether the tick needs to be restarted. This allows to avoid a separate variant for the pagefault handler which cannot invoke rcu_irq_enter() on a kernel pagefault which might sleep. The cond_rcu argument is only temporary and will be removed once the existing users of idtentry_enter/exit() have been cleaned up. After that the code can be significantly simplified. [ mingo: Simplified the control flow ] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: "Paul E. McKenney" <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: [1] https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected]
2020-06-11x86/entry/common: Provide idtentry_enter/exit()Thomas Gleixner1-0/+99
Provide functions which handle the low level entry and exit similar to enter/exit from user mode. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-06-11x86/entry: Move irq flags tracing to prepare_exit_to_usermode()Thomas Gleixner1-1/+18
This is another step towards more C-code and less convoluted ASM. Similar to the entry path, invoke the tracer before context tracking which might turn off RCU and invoke lockdep as the last step before going back to user space. Annotate the code sections in exit_to_user_mode() accordingly so objtool won't complain about the tracer invocation. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-06-11x86/entry: Move irq tracing on syscall entry to C-codeThomas Gleixner1-2/+19
Now that the C entry points are safe, move the irq flags tracing code into the entry helper: - Invoke lockdep before calling into context tracking - Use the safe trace_hardirqs_on_prepare() trace function after context tracking established state and RCU is watching. enter_from_user_mode() is also still invoked from the exception/interrupt entry code which still contains the ASM irq flags tracing. So this is just a redundant and harmless invocation of tracing / lockdep until these are removed as well. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-06-11x86/entry/common: Protect against instrumentationThomas Gleixner1-44/+89
Mark the various syscall entries with noinstr to protect them against instrumentation and add the noinstrumentation_begin()/end() annotations to mark the parts of the functions which are safe to call out into instrumentable code. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-06-11x86/entry: Mark enter_from_user_mode() noinstrThomas Gleixner1-1/+1
Both the callers in the low level ASM code and __context_tracking_exit() which is invoked from enter_from_user_mode() via user_exit_irqoff() are marked NOKPROBE. Allowing enter_from_user_mode() to be probed is inconsistent at best. Aside of that while function tracing per se is safe the function trace entry/exit points can be used via BPF as well which is not safe to use before context tracking has reached CONTEXT_KERNEL and adjusted RCU. Mark it noinstr which moves it into the instrumentation protected text section and includes notrace. Note, this needs further fixups in context tracking to ensure that the full call chain is protected. Will be addressed in follow up changes. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-31Merge branch 'x86-cleanups-for-linus' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "This topic tree contains more commits than usual: - most of it are uaccess cleanups/reorganization by Al - there's a bunch of prototype declaration (--Wmissing-prototypes) cleanups - misc other cleanups all around the map" * 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) x86/mm/set_memory: Fix -Wmissing-prototypes warnings x86/efi: Add a prototype for efi_arch_mem_reserve() x86/mm: Mark setup_emu2phys_nid() static x86/jump_label: Move 'inline' keyword placement x86/platform/uv: Add a missing prototype for uv_bau_message_interrupt() kill uaccess_try() x86: unsafe_put-style macro for sigmask x86: x32_setup_rt_frame(): consolidate uaccess areas x86: __setup_rt_frame(): consolidate uaccess areas x86: __setup_frame(): consolidate uaccess areas x86: setup_sigcontext(): list user_access_{begin,end}() into callers x86: get rid of put_user_try in __setup_rt_frame() (both 32bit and 64bit) x86: ia32_setup_rt_frame(): consolidate uaccess areas x86: ia32_setup_frame(): consolidate uaccess areas x86: ia32_setup_sigcontext(): lift user_access_{begin,end}() into the callers x86/alternatives: Mark text_poke_loc_init() static x86/cpu: Fix a -Wmissing-prototypes warning for init_ia32_feat_ctl() x86/mm: Drop pud_mknotpresent() x86: Replace setup_irq() by request_irq() x86/configs: Slightly reduce defconfigs ...
2020-03-21x86/entry/32: Enable pt_regs based syscallsBrian Gerst1-15/+0
Enable pt_regs based syscalls for 32-bit. This makes the 32-bit native kernel consistent with the 64-bit kernel, and improves the syscall interface by not needing to push all 6 potential arguments onto the stack. Signed-off-by: Brian Gerst <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Dominik Brodowski <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-21x86/entry/64: Move sys_ni_syscall stub to common.cBrian Gerst1-0/+7
so it can be available to multiple syscall tables. Also directly return -ENOSYS instead of bouncing to the generic sys_ni_syscall(). Signed-off-by: Brian Gerst <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-02-17x86/syscalls: Add prototypes for C syscall callbacksBenjamin Thiel1-0/+1
.. in order to fix a couple of -Wmissing-prototypes warnings. No functional change. [ bp: Massage commit message and drop newlines. ] Signed-off-by: Benjamin Thiel <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2019-11-16x86/ioperm: Move TSS bitmap update to exit to user workThomas Gleixner1-0/+4
There is no point to update the TSS bitmap for tasks which use I/O bitmaps on every context switch. It's enough to update it right before exiting to user space. That reduces the context switch bitmap handling to invalidating the io bitmap base offset in the TSS when the outgoing task has TIF_IO_BITMAP set. The invaldiation is done on purpose when a task with an IO bitmap switches out to prevent any possible leakage of an activated IO bitmap. It also removes the requirement to update the tasks bitmap atomically in ioperm(). Signed-off-by: Thomas Gleixner <[email protected]>
2019-07-22x86/syscalls: Split the x32 syscalls into their own tableAndy Lutomirski1-6/+7
For unfortunate historical reasons, the x32 syscalls and the x86_64 syscalls are not all numbered the same. As an example, ioctl() is nr 16 on x86_64 but 514 on x32. This has potentially nasty consequences, since it means that there are two valid RAX values to do ioctl(2) and two invalid RAX values. The valid values are 16 (i.e. ioctl(2) using the x86_64 ABI) and (514 | 0x40000000) (i.e. ioctl(2) using the x32 ABI). The invalid values are 514 and (16 | 0x40000000). 514 will enter the "COMPAT_SYSCALL_DEFINE3(ioctl, ...)" entry point with in_compat_syscall() and in_x32_syscall() returning false, whereas (16 | 0x40000000) will enter the native entry point with in_compat_syscall() and in_x32_syscall() returning true. Both are bogus, and both will exercise code paths in the kernel and in any running seccomp filters that really ought to be unreachable. Splitting out the x32 syscalls into their own tables, allows both bogus invocations to return -ENOSYS. I've checked glibc, musl, and Bionic, and all of them appear to call syscalls with their correct numbers, so this change should have no effect on them. There is an added benefit going forward: new syscalls that need special handling on x32 can share the same number on x32 and x86_64. This means that the special syscall range 512-547 can be treated as a legacy wart instead of something that may need to be extended in the future. Also add a selftest to verify the new behavior. Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/208024256b764312598f014ebfb0a42472c19354.1562185330.git.luto@kernel.org
2019-07-08Merge tag 'arm64-upstream' of ↵Linus Torvalds1-11/+6
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: - arm64 support for syscall emulation via PTRACE_SYSEMU{,_SINGLESTEP} - Wire up VM_FLUSH_RESET_PERMS for arm64, allowing the core code to manage the permissions of executable vmalloc regions more strictly - Slight performance improvement by keeping softirqs enabled while touching the FPSIMD/SVE state (kernel_neon_begin/end) - Expose a couple of ARMv8.5 features to user (HWCAP): CondM (new XAFLAG and AXFLAG instructions for floating point comparison flags manipulation) and FRINT (rounding floating point numbers to integers) - Re-instate ARM64_PSEUDO_NMI support which was previously marked as BROKEN due to some bugs (now fixed) - Improve parking of stopped CPUs and implement an arm64-specific panic_smp_self_stop() to avoid warning on not being able to stop secondary CPUs during panic - perf: enable the ARM Statistical Profiling Extensions (SPE) on ACPI platforms - perf: DDR performance monitor support for iMX8QXP - cache_line_size() can now be set from DT or ACPI/PPTT if provided to cope with a system cache info not exposed via the CPUID registers - Avoid warning on hardware cache line size greater than ARCH_DMA_MINALIGN if the system is fully coherent - arm64 do_page_fault() and hugetlb cleanups - Refactor set_pte_at() to avoid redundant READ_ONCE(*ptep) - Ignore ACPI 5.1 FADTs reported as 5.0 (infer from the 'arm_boot_flags' introduced in 5.1) - CONFIG_RANDOMIZE_BASE now enabled in defconfig - Allow the selection of ARM64_MODULE_PLTS, currently only done via RANDOMIZE_BASE (and an erratum workaround), allowing modules to spill over into the vmalloc area - Make ZONE_DMA32 configurable * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (54 commits) perf: arm_spe: Enable ACPI/Platform automatic module loading arm_pmu: acpi: spe: Add initial MADT/SPE probing ACPI/PPTT: Add function to return ACPI 6.3 Identical tokens ACPI/PPTT: Modify node flag detection to find last IDENTICAL x86/entry: Simplify _TIF_SYSCALL_EMU handling arm64: rename dump_instr as dump_kernel_instr arm64/mm: Drop [PTE|PMD]_TYPE_FAULT arm64: Implement panic_smp_self_stop() arm64: Improve parking of stopped CPUs arm64: Expose FRINT capabilities to userspace arm64: Expose ARMv8.5 CondM capability to userspace arm64: defconfig: enable CONFIG_RANDOMIZE_BASE arm64: ARM64_MODULES_PLTS must depend on MODULES arm64: bpf: do not allocate executable memory arm64/kprobes: set VM_FLUSH_RESET_PERMS on kprobe instruction pages arm64/mm: wire up CONFIG_ARCH_HAS_SET_DIRECT_MAP arm64: module: create module allocations without exec permissions arm64: Allow user selection of ARM64_MODULE_PLTS acpi/arm64: ignore 5.1 FADTs that are reported as 5.0 arm64: Allow selecting Pseudo-NMI again ...
2019-06-27x86/entry: Simplify _TIF_SYSCALL_EMU handlingSudeep Holla1-11/+6
The usage of emulated and _TIF_SYSCALL_EMU flags in syscall_trace_enter is more complicated than required. Cc: Andy Lutomirski <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Signed-off-by: Sudeep Holla <[email protected]> Signed-off-by: Catalin Marinas <[email protected]>
2019-06-05treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 257Thomas Gleixner1-1/+1
Based on 1 normalized pattern(s): gpl v2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 19 file(s). Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Allison Randal <[email protected]> Reviewed-by: Richard Fontana <[email protected]> Reviewed-by: Steve Winslow <[email protected]> Reviewed-by: Kate Stewart <[email protected]> Reviewed-by: Alexios Zavras <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2019-05-14Merge branch 'x86-mds-for-linus' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 MDS mitigations from Thomas Gleixner: "Microarchitectural Data Sampling (MDS) is a hardware vulnerability which allows unprivileged speculative access to data which is available in various CPU internal buffers. This new set of misfeatures has the following CVEs assigned: CVE-2018-12126 MSBDS Microarchitectural Store Buffer Data Sampling CVE-2018-12130 MFBDS Microarchitectural Fill Buffer Data Sampling CVE-2018-12127 MLPDS Microarchitectural Load Port Data Sampling CVE-2019-11091 MDSUM Microarchitectural Data Sampling Uncacheable Memory MDS attacks target microarchitectural buffers which speculatively forward data under certain conditions. Disclosure gadgets can expose this data via cache side channels. Contrary to other speculation based vulnerabilities the MDS vulnerability does not allow the attacker to control the memory target address. As a consequence the attacks are purely sampling based, but as demonstrated with the TLBleed attack samples can be postprocessed successfully. The mitigation is to flush the microarchitectural buffers on return to user space and before entering a VM. It's bolted on the VERW instruction and requires a microcode update. As some of the attacks exploit data structures shared between hyperthreads, full protection requires to disable hyperthreading. The kernel does not do that by default to avoid breaking unattended updates. The mitigation set comes with documentation for administrators and a deeper technical view" * 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/speculation/mds: Fix documentation typo Documentation: Correct the possible MDS sysfs values x86/mds: Add MDSUM variant to the MDS documentation x86/speculation/mds: Add 'mitigations=' support for MDS x86/speculation/mds: Print SMT vulnerable on MSBDS with mitigations off x86/speculation/mds: Fix comment x86/speculation/mds: Add SMT warning message x86/speculation: Move arch_smt_update() call to after mitigation decisions x86/speculation/mds: Add mds=full,nosmt cmdline option Documentation: Add MDS vulnerability documentation Documentation: Move L1TF to separate directory x86/speculation/mds: Add mitigation mode VMWERV x86/speculation/mds: Add sysfs reporting for MDS x86/speculation/mds: Add mitigation control for MDS x86/speculation/mds: Conditionally clear CPU buffers on idle entry x86/kvm/vmx: Add MDS protection when L1D Flush is not active x86/speculation/mds: Clear CPU buffers on exit to user x86/speculation/mds: Add mds_clear_cpu_buffers() x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests x86/speculation/mds: Add BUG_MSBDS_ONLY ...
2019-04-12x86/fpu: Defer FPU state load until return to userspaceRik van Riel1-1/+9
Defer loading of FPU state until return to userspace. This gives the kernel the potential to skip loading FPU state for tasks that stay in kernel mode, or for tasks that end up with repeated invocations of kernel_fpu_begin() & kernel_fpu_end(). The fpregs_lock/unlock() section ensures that the registers remain unchanged. Otherwise a context switch or a bottom half could save the registers to its FPU context and the processor's FPU registers would became random if modified at the same time. KVM swaps the host/guest registers on entry/exit path. This flow has been kept as is. First it ensures that the registers are loaded and then saves the current (host) state before it loads the guest's registers. The swap is done at the very end with disabled interrupts so it should not change anymore before theg guest is entered. The read/save version seems to be cheaper compared to memcpy() in a micro benchmark. Each thread gets TIF_NEED_FPU_LOAD set as part of fork() / fpu__copy(). For kernel threads, this flag gets never cleared which avoids saving / restoring the FPU state for kernel threads and during in-kernel usage of the FPU registers. [ bp: Correct and update commit message and fix checkpatch warnings. s/register/registers/ where it is used in plural. minor comment corrections. remove unused trace_x86_fpu_activate_state() TP. ] Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Aubrey Li <[email protected]> Cc: Babu Moger <[email protected]> Cc: "Chang S. Bae" <[email protected]> Cc: Dmitry Safonov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: "Jason A. Donenfeld" <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: kvm ML <[email protected]> Cc: Nicolai Stange <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: "Radim Krčmář" <[email protected]> Cc: Tim Chen <[email protected]> Cc: Waiman Long <[email protected]> Cc: x86-ml <[email protected]> Cc: Yi Wang <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2019-03-06x86/speculation/mds: Clear CPU buffers on exit to userThomas Gleixner1-0/+3
Add a static key which controls the invocation of the CPU buffer clear mechanism on exit to user space and add the call into prepare_exit_to_usermode() and do_nmi() right before actually returning. Add documentation which kernel to user space transition this covers and explain why some corner cases are not mitigated. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Reviewed-by: Jon Masters <[email protected]> Tested-by: Jon Masters <[email protected]>
2018-12-03x86: Fix various typos in commentsIngo Molnar1-1/+1
Go over arch/x86/ and fix common typos in comments, and a typo in an actual function argument name. No change in functionality intended. Cc: Thomas Gleixner <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-06-22rseq: Avoid infinite recursion when delivering SIGSEGVWill Deacon1-1/+1
When delivering a signal to a task that is using rseq, we call into __rseq_handle_notify_resume() so that the registers pushed in the sigframe are updated to reflect the state of the restartable sequence (for example, ensuring that the signal returns to the abort handler if necessary). However, if the rseq management fails due to an unrecoverable fault when accessing userspace or certain combinations of RSEQ_CS_* flags, then we will attempt to deliver a SIGSEGV. This has the potential for infinite recursion if the rseq code continuously fails on signal delivery. Avoid this problem by using force_sigsegv() instead of force_sig(), which is explicitly designed to reset the SEGV handler to SIG_DFL in the case of a recursive fault. In doing so, remove rseq_signal_deliver() from the internal rseq API and have an optional struct ksignal * parameter to rseq_handle_notify_resume() instead. Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Mathieu Desnoyers <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2018-06-06x86: Add support for restartable sequencesMathieu Desnoyers1-0/+3
Call the rseq_handle_notify_resume() function on return to userspace if TIF_NOTIFY_RESUME thread flag is set. Perform fixup on the pre-signal frame when a signal is delivered on top of a restartable sequence critical section. Check that system calls are not invoked from within rseq critical sections by invoking rseq_signal() from syscall_return_slowpath(). With CONFIG_DEBUG_RSEQ, such behavior results in termination of the process with SIGSEGV. Signed-off-by: Mathieu Desnoyers <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dave Watson <[email protected]> Cc: Will Deacon <[email protected]> Cc: Andi Kleen <[email protected]> Cc: "H . Peter Anvin" <[email protected]> Cc: Chris Lameter <[email protected]> Cc: Russell King <[email protected]> Cc: Andrew Hunter <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: "Paul E . McKenney" <[email protected]> Cc: Paul Turner <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Ben Maurer <[email protected]> Cc: [email protected] Cc: Andy Lutomirski <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2018-04-05syscalls/x86: Unconditionally enable 'struct pt_regs' based syscalls on x86_64Dominik Brodowski1-8/+2
Removing CONFIG_SYSCALL_PTREGS from arch/x86/Kconfig and simply selecting ARCH_HAS_SYSCALL_WRAPPER unconditionally on x86-64 allows us to simplify several codepaths. Signed-off-by: Dominik Brodowski <[email protected]> Acked-by: Linus Torvalds <[email protected]> Cc: Al Viro <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-04-05syscalls/x86: Use 'struct pt_regs' based syscall calling for IA32_EMULATION ↵Dominik Brodowski1-0/+4
and x32 Extend ARCH_HAS_SYSCALL_WRAPPER for i386 emulation and for x32 on 64-bit x86. For x32, all we need to do is to create an additional stub for each compat syscall which decodes the parameters in x86-64 ordering, e.g.: asmlinkage long __compat_sys_x32_xyzzy(struct pt_regs *regs) { return c_SyS_xyzzy(regs->di, regs->si, regs->dx); } For i386 emulation, we need to teach compat_sys_*() to take struct pt_regs as its only argument, e.g.: asmlinkage long __compat_sys_ia32_xyzzy(struct pt_regs *regs) { return c_SyS_xyzzy(regs->bx, regs->cx, regs->dx); } In addition, we need to create additional stubs for common syscalls (that is, for syscalls which have the same parameters on 32-bit and 64-bit), e.g.: asmlinkage long __sys_ia32_xyzzy(struct pt_regs *regs) { return c_sys_xyzzy(regs->bx, regs->cx, regs->dx); } This approach avoids leaking random user-provided register content down the call chain. This patch is based on an original proof-of-concept | From: Linus Torvalds <[email protected]> | Signed-off-by: Linus Torvalds <[email protected]> and was split up and heavily modified by me, in particular to base it on ARCH_HAS_SYSCALL_WRAPPER. Signed-off-by: Dominik Brodowski <[email protected]> Acked-by: Linus Torvalds <[email protected]> Cc: Al Viro <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-04-05syscalls/x86: Use 'struct pt_regs' based syscall calling convention for ↵Dominik Brodowski1-0/+4
64-bit syscalls Let's make use of ARCH_HAS_SYSCALL_WRAPPER=y on pure 64-bit x86-64 systems: Each syscall defines a stub which takes struct pt_regs as its only argument. It decodes just those parameters it needs, e.g: asmlinkage long sys_xyzzy(const struct pt_regs *regs) { return SyS_xyzzy(regs->di, regs->si, regs->dx); } This approach avoids leaking random user-provided register content down the call chain. For example, for sys_recv() which is a 4-parameter syscall, the assembly now is (in slightly reordered fashion): <sys_recv>: callq <__fentry__> /* decode regs->di, ->si, ->dx and ->r10 */ mov 0x70(%rdi),%rdi mov 0x68(%rdi),%rsi mov 0x60(%rdi),%rdx mov 0x38(%rdi),%rcx [ SyS_recv() is automatically inlined by the compiler, as it is not [yet] used anywhere else ] /* clear %r9 and %r8, the 5th and 6th args */ xor %r9d,%r9d xor %r8d,%r8d /* do the actual work */ callq __sys_recvfrom /* cleanup and return */ cltq retq The only valid place in an x86-64 kernel which rightfully calls a syscall function on its own -- vsyscall -- needs to be modified to pass struct pt_regs onwards as well. To keep the syscall table generation working independent of SYSCALL_PTREGS being enabled, the stubs are named the same as the "original" syscall stubs, i.e. sys_*(). This patch is based on an original proof-of-concept | From: Linus Torvalds <[email protected]> | Signed-off-by: Linus Torvalds <[email protected]> and was split up and heavily modified by me, in particular to base it on ARCH_HAS_SYSCALL_WRAPPER, to limit it to 64-bit-only for the time being, and to update the vsyscall to the new calling convention. Signed-off-by: Dominik Brodowski <[email protected]> Acked-by: Linus Torvalds <[email protected]> Cc: Al Viro <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>