From c0b0ae8a871bc2ebbe1ff9c9871efcf88994ffec Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Sat, 12 May 2018 21:58:21 -0400 Subject: audit: use inline function to set audit context Recognizing that the audit context is an internal audit value, use an access function to set the audit context pointer for the task rather than reaching directly into the task struct to set it. Signed-off-by: Richard Guy Briggs [PM: merge fuzz in audit.h] Signed-off-by: Paul Moore --- kernel/fork.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel/fork.c') diff --git a/kernel/fork.c b/kernel/fork.c index 242c8c93d285..cd18448b025a 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1713,7 +1713,7 @@ static __latent_entropy struct task_struct *copy_process( p->start_time = ktime_get_ns(); p->real_start_time = ktime_get_boot_ns(); p->io_context = NULL; - p->audit_context = NULL; + audit_set_context(p, NULL); cgroup_fork(p); #ifdef CONFIG_NUMA p->mempolicy = mpol_dup(p->mempolicy); -- cgit From d7822b1e24f2df5df98c76f0e94a5416349ff759 Mon Sep 17 00:00:00 2001 From: Mathieu Desnoyers Date: Sat, 2 Jun 2018 08:43:54 -0400 Subject: rseq: Introduce restartable sequences system call Expose a new system call allowing each thread to register one userspace memory area to be used as an ABI between kernel and user-space for two purposes: user-space restartable sequences and quick access to read the current CPU number value from user-space. * Restartable sequences (per-cpu atomics) Restartables sequences allow user-space to perform update operations on per-cpu data without requiring heavy-weight atomic operations. The restartable critical sections (percpu atomics) work has been started by Paul Turner and Andrew Hunter. It lets the kernel handle restart of critical sections. [1] [2] The re-implementation proposed here brings a few simplifications to the ABI which facilitates porting to other architectures and speeds up the user-space fast path. Here are benchmarks of various rseq use-cases. Test hardware: arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading The following benchmarks were all performed on a single thread. * Per-CPU statistic counter increment getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 344.0 31.4 11.0 x86-64: 15.3 2.0 7.7 * LTTng-UST: write event 32-bit header, 32-bit payload into tracer per-cpu buffer getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 2502.0 2250.0 1.1 x86-64: 117.4 98.0 1.2 * liburcu percpu: lock-unlock pair, dereference, read/compare word getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 751.0 128.5 5.8 x86-64: 53.4 28.6 1.9 * jemalloc memory allocator adapted to use rseq Using rseq with per-cpu memory pools in jemalloc at Facebook (based on rseq 2016 implementation): The production workload response-time has 1-2% gain avg. latency, and the P99 overall latency drops by 2-3%. * Reading the current CPU number Speeding up reading the current CPU number on which the caller thread is running is done by keeping the current CPU number up do date within the cpu_id field of the memory area registered by the thread. This is done by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the current thread. Upon return to user-space, a notify-resume handler updates the current CPU value within the registered user-space memory area. User-space can then read the current CPU number directly from memory. Keeping the current cpu id in a memory area shared between kernel and user-space is an improvement over current mechanisms available to read the current CPU number, which has the following benefits over alternative approaches: - 35x speedup on ARM vs system call through glibc - 20x speedup on x86 compared to calling glibc, which calls vdso executing a "lsl" instruction, - 14x speedup on x86 compared to inlined "lsl" instruction, - Unlike vdso approaches, this cpu_id value can be read from an inline assembly, which makes it a useful building block for restartable sequences. - The approach of reading the cpu id through memory mapping shared between kernel and user-space is portable (e.g. ARM), which is not the case for the lsl-based x86 vdso. On x86, yet another possible approach would be to use the gs segment selector to point to user-space per-cpu data. This approach performs similarly to the cpu id cache, but it has two disadvantages: it is not portable, and it is incompatible with existing applications already using the gs segment selector for other purposes. Benchmarking various approaches for reading the current CPU number: ARMv7 Processor rev 4 (v7l) Machine model: Cubietruck - Baseline (empty loop): 8.4 ns - Read CPU from rseq cpu_id: 16.7 ns - Read CPU from rseq cpu_id (lazy register): 19.8 ns - glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns - getcpu system call: 234.9 ns x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz: - Baseline (empty loop): 0.8 ns - Read CPU from rseq cpu_id: 0.8 ns - Read CPU from rseq cpu_id (lazy register): 0.8 ns - Read using gs segment selector: 0.8 ns - "lsl" inline assembly: 13.0 ns - glibc 2.19-0ubuntu6 getcpu: 16.6 ns - getcpu system call: 53.9 ns - Speed (benchmark taken on v8 of patchset) Running 10 runs of hackbench -l 100000 seems to indicate, contrary to expectations, that enabling CONFIG_RSEQ slightly accelerates the scheduler: Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1 kernel parameter), with a Linux v4.6 defconfig+localyesconfig, restartable sequences series applied. * CONFIG_RSEQ=n avg.: 41.37 s std.dev.: 0.36 s * CONFIG_RSEQ=y avg.: 40.46 s std.dev.: 0.33 s - Size On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is 567 bytes, and the data size increase of vmlinux is 5696 bytes. [1] https://lwn.net/Articles/650333/ [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf Signed-off-by: Mathieu Desnoyers Signed-off-by: Thomas Gleixner Acked-by: Peter Zijlstra (Intel) Cc: Joel Fernandes Cc: Catalin Marinas Cc: Dave Watson Cc: Will Deacon Cc: Andi Kleen Cc: "H . Peter Anvin" Cc: Chris Lameter Cc: Russell King Cc: Andrew Hunter Cc: Michael Kerrisk Cc: "Paul E . McKenney" Cc: Paul Turner Cc: Boqun Feng Cc: Josh Triplett Cc: Steven Rostedt Cc: Ben Maurer Cc: Alexander Viro Cc: linux-api@vger.kernel.org Cc: Andy Lutomirski Cc: Andrew Morton Cc: Linus Torvalds Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com --- MAINTAINERS | 11 ++ arch/Kconfig | 7 + fs/exec.c | 1 + include/linux/sched.h | 134 +++++++++++++++++ include/linux/syscalls.h | 4 +- include/trace/events/rseq.h | 57 +++++++ include/uapi/linux/rseq.h | 133 +++++++++++++++++ init/Kconfig | 23 +++ kernel/Makefile | 1 + kernel/fork.c | 2 + kernel/rseq.c | 357 ++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/core.c | 2 + kernel/sys_ni.c | 3 + 13 files changed, 734 insertions(+), 1 deletion(-) create mode 100644 include/trace/events/rseq.h create mode 100644 include/uapi/linux/rseq.h create mode 100644 kernel/rseq.c (limited to 'kernel/fork.c') diff --git a/MAINTAINERS b/MAINTAINERS index aa635837a6af..a384243d911b 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -11976,6 +11976,17 @@ F: include/dt-bindings/reset/ F: include/linux/reset.h F: include/linux/reset-controller.h +RESTARTABLE SEQUENCES SUPPORT +M: Mathieu Desnoyers +M: Peter Zijlstra +M: "Paul E. McKenney" +M: Boqun Feng +L: linux-kernel@vger.kernel.org +S: Supported +F: kernel/rseq.c +F: include/uapi/linux/rseq.h +F: include/trace/events/rseq.h + RFKILL M: Johannes Berg L: linux-wireless@vger.kernel.org diff --git a/arch/Kconfig b/arch/Kconfig index b695a3e3e922..095ba99968c1 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -272,6 +272,13 @@ config HAVE_REGS_AND_STACK_ACCESS_API declared in asm/ptrace.h For example the kprobes-based event tracer needs this API. +config HAVE_RSEQ + bool + depends on HAVE_REGS_AND_STACK_ACCESS_API + help + This symbol should be selected by an architecture if it + supports an implementation of restartable sequences. + config HAVE_CLK bool help diff --git a/fs/exec.c b/fs/exec.c index 183059c427b9..2c3911612b22 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1822,6 +1822,7 @@ static int do_execveat_common(int fd, struct filename *filename, current->fs->in_exec = 0; current->in_execve = 0; membarrier_execve(current); + rseq_execve(current); acct_update_integrals(current); task_numa_free(current); free_bprm(bprm); diff --git a/include/linux/sched.h b/include/linux/sched.h index 14e4f9c12337..3aa4fcb74e76 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -27,6 +27,7 @@ #include #include #include +#include /* task_struct member predeclarations (sorted alphabetically): */ struct audit_context; @@ -1047,6 +1048,17 @@ struct task_struct { unsigned long numa_pages_migrated; #endif /* CONFIG_NUMA_BALANCING */ +#ifdef CONFIG_RSEQ + struct rseq __user *rseq; + u32 rseq_len; + u32 rseq_sig; + /* + * RmW on rseq_event_mask must be performed atomically + * with respect to preemption. + */ + unsigned long rseq_event_mask; +#endif + struct tlbflush_unmap_batch tlb_ubc; struct rcu_head rcu; @@ -1757,4 +1769,126 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask); #define TASK_SIZE_OF(tsk) TASK_SIZE #endif +#ifdef CONFIG_RSEQ + +/* + * Map the event mask on the user-space ABI enum rseq_cs_flags + * for direct mask checks. + */ +enum rseq_event_mask_bits { + RSEQ_EVENT_PREEMPT_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT, + RSEQ_EVENT_SIGNAL_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT, + RSEQ_EVENT_MIGRATE_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT, +}; + +enum rseq_event_mask { + RSEQ_EVENT_PREEMPT = (1U << RSEQ_EVENT_PREEMPT_BIT), + RSEQ_EVENT_SIGNAL = (1U << RSEQ_EVENT_SIGNAL_BIT), + RSEQ_EVENT_MIGRATE = (1U << RSEQ_EVENT_MIGRATE_BIT), +}; + +static inline void rseq_set_notify_resume(struct task_struct *t) +{ + if (t->rseq) + set_tsk_thread_flag(t, TIF_NOTIFY_RESUME); +} + +void __rseq_handle_notify_resume(struct pt_regs *regs); + +static inline void rseq_handle_notify_resume(struct pt_regs *regs) +{ + if (current->rseq) + __rseq_handle_notify_resume(regs); +} + +static inline void rseq_signal_deliver(struct pt_regs *regs) +{ + preempt_disable(); + __set_bit(RSEQ_EVENT_SIGNAL_BIT, ¤t->rseq_event_mask); + preempt_enable(); + rseq_handle_notify_resume(regs); +} + +/* rseq_preempt() requires preemption to be disabled. */ +static inline void rseq_preempt(struct task_struct *t) +{ + __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); + rseq_set_notify_resume(t); +} + +/* rseq_migrate() requires preemption to be disabled. */ +static inline void rseq_migrate(struct task_struct *t) +{ + __set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask); + rseq_set_notify_resume(t); +} + +/* + * If parent process has a registered restartable sequences area, the + * child inherits. Only applies when forking a process, not a thread. In + * case a parent fork() in the middle of a restartable sequence, set the + * resume notifier to force the child to retry. + */ +static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) +{ + if (clone_flags & CLONE_THREAD) { + t->rseq = NULL; + t->rseq_len = 0; + t->rseq_sig = 0; + t->rseq_event_mask = 0; + } else { + t->rseq = current->rseq; + t->rseq_len = current->rseq_len; + t->rseq_sig = current->rseq_sig; + t->rseq_event_mask = current->rseq_event_mask; + rseq_preempt(t); + } +} + +static inline void rseq_execve(struct task_struct *t) +{ + t->rseq = NULL; + t->rseq_len = 0; + t->rseq_sig = 0; + t->rseq_event_mask = 0; +} + +#else + +static inline void rseq_set_notify_resume(struct task_struct *t) +{ +} +static inline void rseq_handle_notify_resume(struct pt_regs *regs) +{ +} +static inline void rseq_signal_deliver(struct pt_regs *regs) +{ +} +static inline void rseq_preempt(struct task_struct *t) +{ +} +static inline void rseq_migrate(struct task_struct *t) +{ +} +static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) +{ +} +static inline void rseq_execve(struct task_struct *t) +{ +} + +#endif + +#ifdef CONFIG_DEBUG_RSEQ + +void rseq_syscall(struct pt_regs *regs); + +#else + +static inline void rseq_syscall(struct pt_regs *regs) +{ +} + +#endif + #endif diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 390e814fdc8d..73810808cdf2 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -66,6 +66,7 @@ struct old_linux_dirent; struct perf_event_attr; struct file_handle; struct sigaltstack; +struct rseq; union bpf_attr; #include @@ -897,7 +898,8 @@ asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val); asmlinkage long sys_pkey_free(int pkey); asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags, unsigned mask, struct statx __user *buffer); - +asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len, + int flags, uint32_t sig); /* * Architecture-specific system calls diff --git a/include/trace/events/rseq.h b/include/trace/events/rseq.h new file mode 100644 index 000000000000..a04a64bc1a00 --- /dev/null +++ b/include/trace/events/rseq.h @@ -0,0 +1,57 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM rseq + +#if !defined(_TRACE_RSEQ_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_RSEQ_H + +#include +#include + +TRACE_EVENT(rseq_update, + + TP_PROTO(struct task_struct *t), + + TP_ARGS(t), + + TP_STRUCT__entry( + __field(s32, cpu_id) + ), + + TP_fast_assign( + __entry->cpu_id = raw_smp_processor_id(); + ), + + TP_printk("cpu_id=%d", __entry->cpu_id) +); + +TRACE_EVENT(rseq_ip_fixup, + + TP_PROTO(unsigned long regs_ip, unsigned long start_ip, + unsigned long post_commit_offset, unsigned long abort_ip), + + TP_ARGS(regs_ip, start_ip, post_commit_offset, abort_ip), + + TP_STRUCT__entry( + __field(unsigned long, regs_ip) + __field(unsigned long, start_ip) + __field(unsigned long, post_commit_offset) + __field(unsigned long, abort_ip) + ), + + TP_fast_assign( + __entry->regs_ip = regs_ip; + __entry->start_ip = start_ip; + __entry->post_commit_offset = post_commit_offset; + __entry->abort_ip = abort_ip; + ), + + TP_printk("regs_ip=0x%lx start_ip=0x%lx post_commit_offset=%lu abort_ip=0x%lx", + __entry->regs_ip, __entry->start_ip, + __entry->post_commit_offset, __entry->abort_ip) +); + +#endif /* _TRACE_SOCK_H */ + +/* This part must be outside protection */ +#include diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h new file mode 100644 index 000000000000..d620fa43756c --- /dev/null +++ b/include/uapi/linux/rseq.h @@ -0,0 +1,133 @@ +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */ +#ifndef _UAPI_LINUX_RSEQ_H +#define _UAPI_LINUX_RSEQ_H + +/* + * linux/rseq.h + * + * Restartable sequences system call API + * + * Copyright (c) 2015-2018 Mathieu Desnoyers + */ + +#ifdef __KERNEL__ +# include +#else +# include +#endif + +#include + +enum rseq_cpu_id_state { + RSEQ_CPU_ID_UNINITIALIZED = -1, + RSEQ_CPU_ID_REGISTRATION_FAILED = -2, +}; + +enum rseq_flags { + RSEQ_FLAG_UNREGISTER = (1 << 0), +}; + +enum rseq_cs_flags_bit { + RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0, + RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1, + RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2, +}; + +enum rseq_cs_flags { + RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT = + (1U << RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT), + RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL = + (1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT), + RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE = + (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), +}; + +/* + * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always + * contained within a single cache-line. It is usually declared as + * link-time constant data. + */ +struct rseq_cs { + /* Version of this structure. */ + __u32 version; + /* enum rseq_cs_flags */ + __u32 flags; + LINUX_FIELD_u32_u64(start_ip); + /* Offset from start_ip. */ + LINUX_FIELD_u32_u64(post_commit_offset); + LINUX_FIELD_u32_u64(abort_ip); +} __attribute__((aligned(4 * sizeof(__u64)))); + +/* + * struct rseq is aligned on 4 * 8 bytes to ensure it is always + * contained within a single cache-line. + * + * A single struct rseq per thread is allowed. + */ +struct rseq { + /* + * Restartable sequences cpu_id_start field. Updated by the + * kernel, and read by user-space with single-copy atomicity + * semantics. Aligned on 32-bit. Always contains a value in the + * range of possible CPUs, although the value may not be the + * actual current CPU (e.g. if rseq is not initialized). This + * CPU number value should always be compared against the value + * of the cpu_id field before performing a rseq commit or + * returning a value read from a data structure indexed using + * the cpu_id_start value. + */ + __u32 cpu_id_start; + /* + * Restartable sequences cpu_id field. Updated by the kernel, + * and read by user-space with single-copy atomicity semantics. + * Aligned on 32-bit. Values RSEQ_CPU_ID_UNINITIALIZED and + * RSEQ_CPU_ID_REGISTRATION_FAILED have a special semantic: the + * former means "rseq uninitialized", and latter means "rseq + * initialization failed". This value is meant to be read within + * rseq critical sections and compared with the cpu_id_start + * value previously read, before performing the commit instruction, + * or read and compared with the cpu_id_start value before returning + * a value loaded from a data structure indexed using the + * cpu_id_start value. + */ + __u32 cpu_id; + /* + * Restartable sequences rseq_cs field. + * + * Contains NULL when no critical section is active for the current + * thread, or holds a pointer to the currently active struct rseq_cs. + * + * Updated by user-space, which sets the address of the currently + * active rseq_cs at the beginning of assembly instruction sequence + * block, and set to NULL by the kernel when it restarts an assembly + * instruction sequence block, as well as when the kernel detects that + * it is preempting or delivering a signal outside of the range + * targeted by the rseq_cs. Also needs to be set to NULL by user-space + * before reclaiming memory that contains the targeted struct rseq_cs. + * + * Read and set by the kernel with single-copy atomicity semantics. + * Set by user-space with single-copy atomicity semantics. Aligned + * on 64-bit. + */ + LINUX_FIELD_u32_u64(rseq_cs); + /* + * - RSEQ_DISABLE flag: + * + * Fallback fast-track flag for single-stepping. + * Set by user-space if lack of progress is detected. + * Cleared by user-space after rseq finish. + * Read by the kernel. + * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT + * Inhibit instruction sequence block restart and event + * counter increment on preemption for this thread. + * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL + * Inhibit instruction sequence block restart and event + * counter increment on signal delivery for this thread. + * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE + * Inhibit instruction sequence block restart and event + * counter increment on migration for this thread. + */ + __u32 flags; +} __attribute__((aligned(4 * sizeof(__u64)))); + +#endif /* _UAPI_LINUX_RSEQ_H */ diff --git a/init/Kconfig b/init/Kconfig index 18b151f0ddc1..33ec06fddaaa 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1417,6 +1417,29 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS config ARCH_HAS_MEMBARRIER_SYNC_CORE bool +config RSEQ + bool "Enable rseq() system call" if EXPERT + default y + depends on HAVE_RSEQ + select MEMBARRIER + help + Enable the restartable sequences system call. It provides a + user-space cache for the current CPU number value, which + speeds up getting the current CPU number from user-space, + as well as an ABI to speed up user-space operations on + per-CPU data. + + If unsure, say Y. + +config DEBUG_RSEQ + default n + bool "Enabled debugging of rseq() system call" if EXPERT + depends on RSEQ && DEBUG_KERNEL + help + Enable extra debugging checks for the rseq system call. + + If unsure, say N. + config EMBEDDED bool "Embedded system" option allnoconfig_y diff --git a/kernel/Makefile b/kernel/Makefile index f85ae5dfa474..7085c841c413 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -113,6 +113,7 @@ obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o obj-$(CONFIG_TORTURE_TEST) += torture.o obj-$(CONFIG_HAS_IOMEM) += memremap.o +obj-$(CONFIG_RSEQ) += rseq.o $(obj)/configs.o: $(obj)/config_data.h diff --git a/kernel/fork.c b/kernel/fork.c index a5d21c42acfc..70992bfeba81 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1899,6 +1899,8 @@ static __latent_entropy struct task_struct *copy_process( */ copy_seccomp(p); + rseq_fork(p, clone_flags); + /* * Process group and session signals need to be delivered to just the * parent before the fork or both the parent and the child after the diff --git a/kernel/rseq.c b/kernel/rseq.c new file mode 100644 index 000000000000..ae306f90c514 --- /dev/null +++ b/kernel/rseq.c @@ -0,0 +1,357 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * Restartable sequences system call + * + * Copyright (C) 2015, Google, Inc., + * Paul Turner and Andrew Hunter + * Copyright (C) 2015-2018, EfficiOS Inc., + * Mathieu Desnoyers + */ + +#include +#include +#include +#include +#include +#include + +#define CREATE_TRACE_POINTS +#include + +#define RSEQ_CS_PREEMPT_MIGRATE_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE | \ + RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT) + +/* + * + * Restartable sequences are a lightweight interface that allows + * user-level code to be executed atomically relative to scheduler + * preemption and signal delivery. Typically used for implementing + * per-cpu operations. + * + * It allows user-space to perform update operations on per-cpu data + * without requiring heavy-weight atomic operations. + * + * Detailed algorithm of rseq user-space assembly sequences: + * + * init(rseq_cs) + * cpu = TLS->rseq::cpu_id_start + * [1] TLS->rseq::rseq_cs = rseq_cs + * [start_ip] ---------------------------- + * [2] if (cpu != TLS->rseq::cpu_id) + * goto abort_ip; + * [3] + * [post_commit_ip] ---------------------------- + * + * The address of jump target abort_ip must be outside the critical + * region, i.e.: + * + * [abort_ip] < [start_ip] || [abort_ip] >= [post_commit_ip] + * + * Steps [2]-[3] (inclusive) need to be a sequence of instructions in + * userspace that can handle being interrupted between any of those + * instructions, and then resumed to the abort_ip. + * + * 1. Userspace stores the address of the struct rseq_cs assembly + * block descriptor into the rseq_cs field of the registered + * struct rseq TLS area. This update is performed through a single + * store within the inline assembly instruction sequence. + * [start_ip] + * + * 2. Userspace tests to check whether the current cpu_id field match + * the cpu number loaded before start_ip, branching to abort_ip + * in case of a mismatch. + * + * If the sequence is preempted or interrupted by a signal + * at or after start_ip and before post_commit_ip, then the kernel + * clears TLS->__rseq_abi::rseq_cs, and sets the user-space return + * ip to abort_ip before returning to user-space, so the preempted + * execution resumes at abort_ip. + * + * 3. Userspace critical section final instruction before + * post_commit_ip is the commit. The critical section is + * self-terminating. + * [post_commit_ip] + * + * 4. + * + * On failure at [2], or if interrupted by preempt or signal delivery + * between [1] and [3]: + * + * [abort_ip] + * F1. + */ + +static int rseq_update_cpu_id(struct task_struct *t) +{ + u32 cpu_id = raw_smp_processor_id(); + + if (__put_user(cpu_id, &t->rseq->cpu_id_start)) + return -EFAULT; + if (__put_user(cpu_id, &t->rseq->cpu_id)) + return -EFAULT; + trace_rseq_update(t); + return 0; +} + +static int rseq_reset_rseq_cpu_id(struct task_struct *t) +{ + u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED; + + /* + * Reset cpu_id_start to its initial state (0). + */ + if (__put_user(cpu_id_start, &t->rseq->cpu_id_start)) + return -EFAULT; + /* + * Reset cpu_id to RSEQ_CPU_ID_UNINITIALIZED, so any user coming + * in after unregistration can figure out that rseq needs to be + * registered again. + */ + if (__put_user(cpu_id, &t->rseq->cpu_id)) + return -EFAULT; + return 0; +} + +static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs) +{ + struct rseq_cs __user *urseq_cs; + unsigned long ptr; + u32 __user *usig; + u32 sig; + int ret; + + ret = __get_user(ptr, &t->rseq->rseq_cs); + if (ret) + return ret; + if (!ptr) { + memset(rseq_cs, 0, sizeof(*rseq_cs)); + return 0; + } + urseq_cs = (struct rseq_cs __user *)ptr; + if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs))) + return -EFAULT; + if (rseq_cs->version > 0) + return -EINVAL; + + /* Ensure that abort_ip is not in the critical section. */ + if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset) + return -EINVAL; + + usig = (u32 __user *)(rseq_cs->abort_ip - sizeof(u32)); + ret = get_user(sig, usig); + if (ret) + return ret; + + if (current->rseq_sig != sig) { + printk_ratelimited(KERN_WARNING + "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n", + sig, current->rseq_sig, current->pid, usig); + return -EPERM; + } + return 0; +} + +static int rseq_need_restart(struct task_struct *t, u32 cs_flags) +{ + u32 flags, event_mask; + int ret; + + /* Get thread flags. */ + ret = __get_user(flags, &t->rseq->flags); + if (ret) + return ret; + + /* Take critical section flags into account. */ + flags |= cs_flags; + + /* + * Restart on signal can only be inhibited when restart on + * preempt and restart on migrate are inhibited too. Otherwise, + * a preempted signal handler could fail to restart the prior + * execution context on sigreturn. + */ + if (unlikely((flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) && + (flags & RSEQ_CS_PREEMPT_MIGRATE_FLAGS) != + RSEQ_CS_PREEMPT_MIGRATE_FLAGS)) + return -EINVAL; + + /* + * Load and clear event mask atomically with respect to + * scheduler preemption. + */ + preempt_disable(); + event_mask = t->rseq_event_mask; + t->rseq_event_mask = 0; + preempt_enable(); + + return !!(event_mask & ~flags); +} + +static int clear_rseq_cs(struct task_struct *t) +{ + /* + * The rseq_cs field is set to NULL on preemption or signal + * delivery on top of rseq assembly block, as well as on top + * of code outside of the rseq assembly block. This performs + * a lazy clear of the rseq_cs field. + * + * Set rseq_cs to NULL with single-copy atomicity. + */ + return __put_user(0UL, &t->rseq->rseq_cs); +} + +/* + * Unsigned comparison will be true when ip >= start_ip, and when + * ip < start_ip + post_commit_offset. + */ +static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs) +{ + return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset; +} + +static int rseq_ip_fixup(struct pt_regs *regs) +{ + unsigned long ip = instruction_pointer(regs); + struct task_struct *t = current; + struct rseq_cs rseq_cs; + int ret; + + ret = rseq_get_rseq_cs(t, &rseq_cs); + if (ret) + return ret; + + /* + * Handle potentially not being within a critical section. + * If not nested over a rseq critical section, restart is useless. + * Clear the rseq_cs pointer and return. + */ + if (!in_rseq_cs(ip, &rseq_cs)) + return clear_rseq_cs(t); + ret = rseq_need_restart(t, rseq_cs.flags); + if (ret <= 0) + return ret; + ret = clear_rseq_cs(t); + if (ret) + return ret; + trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset, + rseq_cs.abort_ip); + instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip); + return 0; +} + +/* + * This resume handler must always be executed between any of: + * - preemption, + * - signal delivery, + * and return to user-space. + * + * This is how we can ensure that the entire rseq critical section, + * consisting of both the C part and the assembly instruction sequence, + * will issue the commit instruction only if executed atomically with + * respect to other threads scheduled on the same CPU, and with respect + * to signal handlers. + */ +void __rseq_handle_notify_resume(struct pt_regs *regs) +{ + struct task_struct *t = current; + int ret; + + if (unlikely(t->flags & PF_EXITING)) + return; + if (unlikely(!access_ok(VERIFY_WRITE, t->rseq, sizeof(*t->rseq)))) + goto error; + ret = rseq_ip_fixup(regs); + if (unlikely(ret < 0)) + goto error; + if (unlikely(rseq_update_cpu_id(t))) + goto error; + return; + +error: + force_sig(SIGSEGV, t); +} + +#ifdef CONFIG_DEBUG_RSEQ + +/* + * Terminate the process if a syscall is issued within a restartable + * sequence. + */ +void rseq_syscall(struct pt_regs *regs) +{ + unsigned long ip = instruction_pointer(regs); + struct task_struct *t = current; + struct rseq_cs rseq_cs; + + if (!t->rseq) + return; + if (!access_ok(VERIFY_READ, t->rseq, sizeof(*t->rseq)) || + rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs)) + force_sig(SIGSEGV, t); +} + +#endif + +/* + * sys_rseq - setup restartable sequences for caller thread. + */ +SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, + int, flags, u32, sig) +{ + int ret; + + if (flags & RSEQ_FLAG_UNREGISTER) { + /* Unregister rseq for current thread. */ + if (current->rseq != rseq || !current->rseq) + return -EINVAL; + if (current->rseq_len != rseq_len) + return -EINVAL; + if (current->rseq_sig != sig) + return -EPERM; + ret = rseq_reset_rseq_cpu_id(current); + if (ret) + return ret; + current->rseq = NULL; + current->rseq_len = 0; + current->rseq_sig = 0; + return 0; + } + + if (unlikely(flags)) + return -EINVAL; + + if (current->rseq) { + /* + * If rseq is already registered, check whether + * the provided address differs from the prior + * one. + */ + if (current->rseq != rseq || current->rseq_len != rseq_len) + return -EINVAL; + if (current->rseq_sig != sig) + return -EPERM; + /* Already registered. */ + return -EBUSY; + } + + /* + * If there was no rseq previously registered, + * ensure the provided rseq is properly aligned and valid. + */ + if (!IS_ALIGNED((unsigned long)rseq, __alignof__(*rseq)) || + rseq_len != sizeof(*rseq)) + return -EINVAL; + if (!access_ok(VERIFY_WRITE, rseq, rseq_len)) + return -EFAULT; + current->rseq = rseq; + current->rseq_len = rseq_len; + current->rseq_sig = sig; + /* + * If rseq was previously inactive, and has just been + * registered, ensure the cpu_id_start and cpu_id fields + * are updated before returning to user-space. + */ + rseq_set_notify_resume(current); + + return 0; +} diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e9866f86f304..a98d54cd5535 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1191,6 +1191,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu) if (p->sched_class->migrate_task_rq) p->sched_class->migrate_task_rq(p); p->se.nr_migrations++; + rseq_migrate(p); perf_event_task_migrate(p); } @@ -2634,6 +2635,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev, { sched_info_switch(rq, prev, next); perf_event_task_sched_out(prev, next); + rseq_preempt(prev); fire_sched_out_preempt_notifiers(prev, next); prepare_task(next); prepare_arch_switch(next); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 183169c2a75b..86f832d6ff6f 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -432,3 +432,6 @@ COND_SYSCALL(setresgid16); COND_SYSCALL(setresuid16); COND_SYSCALL(setreuid16); COND_SYSCALL(setuid16); + +/* restartable sequence */ +COND_SYSCALL(rseq); -- cgit From 88aa7cc688d48ddd84558b41d5905a0db9535c4b Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Thu, 7 Jun 2018 17:05:28 -0700 Subject: mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct mmap_sem is on the hot path of kernel, and it very contended, but it is abused too. It is used to protect arg_start|end and evn_start|end when reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make sense since those proc files just expect to read 4 values atomically and not related to VM, they could be set to arbitrary values by C/R. And, the mmap_sem contention may cause unexpected issue like below: INFO: task ps:14018 blocked for more than 120 seconds. Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ps D 0 14018 1 0x00000004 Call Trace: schedule+0x36/0x80 rwsem_down_read_failed+0xf0/0x150 call_rwsem_down_read_failed+0x18/0x30 down_read+0x20/0x40 proc_pid_cmdline_read+0xd9/0x4e0 __vfs_read+0x37/0x150 vfs_read+0x96/0x130 SyS_read+0x55/0xc0 entry_SYSCALL_64_fastpath+0x1a/0xc5 Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock for them to mitigate the abuse of mmap_sem. So, introduce a new spinlock in mm_struct to protect the concurrent access to arg_start|end, env_start|end and others, as well as replace write map_sem to read to protect the race condition between prctl and sys_brk which might break check_data_rlimit(), and makes prctl more friendly to other VM operations. This patch just eliminates the abuse of mmap_sem, but it can't resolve the above hung task warning completely since the later access_remote_vm() call needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the future. [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock] Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi Reviewed-by: Cyrill Gorcunov Acked-by: Michal Hocko Cc: Alexey Dobriyan Cc: Matthew Wilcox Cc: Mateusz Guzik Cc: Kirill Tkhai Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/base.c | 8 ++++---- include/linux/mm_types.h | 2 ++ kernel/fork.c | 1 + kernel/sys.c | 10 ++++++++-- mm/init-mm.c | 1 + 5 files changed, 16 insertions(+), 6 deletions(-) (limited to 'kernel/fork.c') diff --git a/fs/proc/base.c b/fs/proc/base.c index af128b374143..ef3f7df50023 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -239,12 +239,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf, goto out_mmput; } - down_read(&mm->mmap_sem); + spin_lock(&mm->arg_lock); arg_start = mm->arg_start; arg_end = mm->arg_end; env_start = mm->env_start; env_end = mm->env_end; - up_read(&mm->mmap_sem); + spin_unlock(&mm->arg_lock); BUG_ON(arg_start > arg_end); BUG_ON(env_start > env_end); @@ -927,10 +927,10 @@ static ssize_t environ_read(struct file *file, char __user *buf, if (!mmget_not_zero(mm)) goto free; - down_read(&mm->mmap_sem); + spin_lock(&mm->arg_lock); env_start = mm->env_start; env_end = mm->env_end; - up_read(&mm->mmap_sem); + spin_unlock(&mm->arg_lock); while (count > 0) { size_t this_len, max_len; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 21612347d311..49dd59ea90f7 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -413,6 +413,8 @@ struct mm_struct { unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */ unsigned long stack_vm; /* VM_STACK */ unsigned long def_flags; + + spinlock_t arg_lock; /* protect the below fields */ unsigned long start_code, end_code, start_data, end_data; unsigned long start_brk, brk, start_stack; unsigned long arg_start, arg_end, env_start, env_end; diff --git a/kernel/fork.c b/kernel/fork.c index 80b48a8fb47b..c6d1c1ce9ed7 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -899,6 +899,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, mm->pinned_vm = 0; memset(&mm->rss_stat, 0, sizeof(mm->rss_stat)); spin_lock_init(&mm->page_table_lock); + spin_lock_init(&mm->arg_lock); mm_init_cpumask(mm); mm_init_aio(mm); mm_init_owner(mm, p); diff --git a/kernel/sys.c b/kernel/sys.c index d1b2b8d934bb..38509dc1f77b 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2018,7 +2018,11 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data return error; } - down_write(&mm->mmap_sem); + /* + * arg_lock protects concurent updates but we still need mmap_sem for + * read to exclude races with sys_brk. + */ + down_read(&mm->mmap_sem); /* * We don't validate if these members are pointing to @@ -2032,6 +2036,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data * to any problem in kernel itself */ + spin_lock(&mm->arg_lock); mm->start_code = prctl_map.start_code; mm->end_code = prctl_map.end_code; mm->start_data = prctl_map.start_data; @@ -2043,6 +2048,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data mm->arg_end = prctl_map.arg_end; mm->env_start = prctl_map.env_start; mm->env_end = prctl_map.env_end; + spin_unlock(&mm->arg_lock); /* * Note this update of @saved_auxv is lockless thus @@ -2055,7 +2061,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data if (prctl_map.auxv_size) memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv)); - up_write(&mm->mmap_sem); + up_read(&mm->mmap_sem); return 0; } #endif /* CONFIG_CHECKPOINT_RESTORE */ diff --git a/mm/init-mm.c b/mm/init-mm.c index f94d5d15ebc0..f0179c9c04c2 100644 --- a/mm/init-mm.c +++ b/mm/init-mm.c @@ -22,6 +22,7 @@ struct mm_struct init_mm = { .mm_count = ATOMIC_INIT(1), .mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem), .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock), + .arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock), .mmlist = LIST_HEAD_INIT(init_mm.mmlist), .user_ns = &init_user_ns, INIT_MM_CONTEXT(init_mm) -- cgit From 050e9baa9dc9fbd9ce2b27f0056990fc9e0a08a0 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Thu, 14 Jun 2018 12:21:18 +0900 Subject: Kbuild: rename CC_STACKPROTECTOR[_STRONG] config variables The changes to automatically test for working stack protector compiler support in the Kconfig files removed the special STACKPROTECTOR_AUTO option that picked the strongest stack protector that the compiler supported. That was all a nice cleanup - it makes no sense to have the AUTO case now that the Kconfig phase can just determine the compiler support directly. HOWEVER. It also meant that doing "make oldconfig" would now _disable_ the strong stackprotector if you had AUTO enabled, because in a legacy config file, the sane stack protector configuration would look like CONFIG_HAVE_CC_STACKPROTECTOR=y # CONFIG_CC_STACKPROTECTOR_NONE is not set # CONFIG_CC_STACKPROTECTOR_REGULAR is not set # CONFIG_CC_STACKPROTECTOR_STRONG is not set CONFIG_CC_STACKPROTECTOR_AUTO=y and when you ran this through "make oldconfig" with the Kbuild changes, it would ask you about the regular CONFIG_CC_STACKPROTECTOR (that had been renamed from CONFIG_CC_STACKPROTECTOR_REGULAR to just CONFIG_CC_STACKPROTECTOR), but it would think that the STRONG version used to be disabled (because it was really enabled by AUTO), and would disable it in the new config, resulting in: CONFIG_HAVE_CC_STACKPROTECTOR=y CONFIG_CC_HAS_STACKPROTECTOR_NONE=y CONFIG_CC_STACKPROTECTOR=y # CONFIG_CC_STACKPROTECTOR_STRONG is not set CONFIG_CC_HAS_SANE_STACKPROTECTOR=y That's dangerously subtle - people could suddenly find themselves with the weaker stack protector setup without even realizing. The solution here is to just rename not just the old RECULAR stack protector option, but also the strong one. This does that by just removing the CC_ prefix entirely for the user choices, because it really is not about the compiler support (the compiler support now instead automatially impacts _visibility_ of the options to users). This results in "make oldconfig" actually asking the user for their choice, so that we don't have any silent subtle security model changes. The end result would generally look like this: CONFIG_HAVE_CC_STACKPROTECTOR=y CONFIG_CC_HAS_STACKPROTECTOR_NONE=y CONFIG_STACKPROTECTOR=y CONFIG_STACKPROTECTOR_STRONG=y CONFIG_CC_HAS_SANE_STACKPROTECTOR=y where the "CC_" versions really are about internal compiler infrastructure, not the user selections. Acked-by: Masahiro Yamada Signed-off-by: Linus Torvalds --- Documentation/kbuild/kconfig-language.txt | 2 +- Documentation/security/self-protection.rst | 2 +- Makefile | 4 ++-- arch/Kconfig | 6 +++--- arch/arm/kernel/asm-offsets.c | 2 +- arch/arm/kernel/entry-armv.S | 4 ++-- arch/arm/kernel/process.c | 2 +- arch/arm64/kernel/process.c | 2 +- arch/mips/kernel/asm-offsets.c | 2 +- arch/mips/kernel/octeon_switch.S | 2 +- arch/mips/kernel/process.c | 2 +- arch/mips/kernel/r2300_switch.S | 2 +- arch/mips/kernel/r4k_switch.S | 2 +- arch/sh/kernel/process.c | 2 +- arch/sh/kernel/process_32.c | 2 +- arch/x86/entry/entry_32.S | 2 +- arch/x86/entry/entry_64.S | 2 +- arch/x86/include/asm/processor.h | 2 +- arch/x86/include/asm/segment.h | 2 +- arch/x86/include/asm/stackprotector.h | 6 +++--- arch/x86/kernel/asm-offsets.c | 2 +- arch/x86/kernel/asm-offsets_32.c | 2 +- arch/x86/kernel/asm-offsets_64.c | 2 +- arch/x86/kernel/cpu/common.c | 2 +- arch/x86/kernel/head_32.S | 2 +- arch/xtensa/kernel/asm-offsets.c | 2 +- arch/xtensa/kernel/entry.S | 2 +- arch/xtensa/kernel/process.c | 2 +- include/linux/sched.h | 2 +- include/linux/stackprotector.h | 2 +- kernel/configs/android-recommended.config | 2 +- kernel/fork.c | 2 +- kernel/panic.c | 2 +- 33 files changed, 39 insertions(+), 39 deletions(-) (limited to 'kernel/fork.c') diff --git a/Documentation/kbuild/kconfig-language.txt b/Documentation/kbuild/kconfig-language.txt index a4eb01843c04..3534a84d206c 100644 --- a/Documentation/kbuild/kconfig-language.txt +++ b/Documentation/kbuild/kconfig-language.txt @@ -480,7 +480,7 @@ There are several features that need compiler support. The recommended way to describe the dependency on the compiler feature is to use "depends on" followed by a test macro. -config CC_STACKPROTECTOR +config STACKPROTECTOR bool "Stack Protector buffer overflow detection" depends on $(cc-option,-fstack-protector) ... diff --git a/Documentation/security/self-protection.rst b/Documentation/security/self-protection.rst index 0f53826c78b9..e1ca698e0006 100644 --- a/Documentation/security/self-protection.rst +++ b/Documentation/security/self-protection.rst @@ -156,7 +156,7 @@ The classic stack buffer overflow involves writing past the expected end of a variable stored on the stack, ultimately writing a controlled value to the stack frame's stored return address. The most widely used defense is the presence of a stack canary between the stack variables and the -return address (``CONFIG_CC_STACKPROTECTOR``), which is verified just before +return address (``CONFIG_STACKPROTECTOR``), which is verified just before the function returns. Other defenses include things like shadow stacks. Stack depth overflow diff --git a/Makefile b/Makefile index 73f0bb2c7a98..8a26b5937241 100644 --- a/Makefile +++ b/Makefile @@ -687,8 +687,8 @@ KBUILD_CFLAGS += $(call cc-option,-Wframe-larger-than=${CONFIG_FRAME_WARN}) endif stackp-flags-$(CONFIG_CC_HAS_STACKPROTECTOR_NONE) := -fno-stack-protector -stackp-flags-$(CONFIG_CC_STACKPROTECTOR) := -fstack-protector -stackp-flags-$(CONFIG_CC_STACKPROTECTOR_STRONG) := -fstack-protector-strong +stackp-flags-$(CONFIG_STACKPROTECTOR) := -fstack-protector +stackp-flags-$(CONFIG_STACKPROTECTOR_STRONG) := -fstack-protector-strong KBUILD_CFLAGS += $(stackp-flags-y) diff --git a/arch/Kconfig b/arch/Kconfig index ebbb45096191..c302b3dd0058 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -558,7 +558,7 @@ config HAVE_CC_STACKPROTECTOR config CC_HAS_STACKPROTECTOR_NONE def_bool $(cc-option,-fno-stack-protector) -config CC_STACKPROTECTOR +config STACKPROTECTOR bool "Stack Protector buffer overflow detection" depends on HAVE_CC_STACKPROTECTOR depends on $(cc-option,-fstack-protector) @@ -582,9 +582,9 @@ config CC_STACKPROTECTOR about 3% of all kernel functions, which increases kernel code size by about 0.3%. -config CC_STACKPROTECTOR_STRONG +config STACKPROTECTOR_STRONG bool "Strong Stack Protector" - depends on CC_STACKPROTECTOR + depends on STACKPROTECTOR depends on $(cc-option,-fstack-protector-strong) default y help diff --git a/arch/arm/kernel/asm-offsets.c b/arch/arm/kernel/asm-offsets.c index 27c5381518d8..974d8d7d1bcd 100644 --- a/arch/arm/kernel/asm-offsets.c +++ b/arch/arm/kernel/asm-offsets.c @@ -61,7 +61,7 @@ int main(void) { DEFINE(TSK_ACTIVE_MM, offsetof(struct task_struct, active_mm)); -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR DEFINE(TSK_STACK_CANARY, offsetof(struct task_struct, stack_canary)); #endif BLANK(); diff --git a/arch/arm/kernel/entry-armv.S b/arch/arm/kernel/entry-armv.S index 1752033b0070..179a9f6bd1e3 100644 --- a/arch/arm/kernel/entry-armv.S +++ b/arch/arm/kernel/entry-armv.S @@ -791,7 +791,7 @@ ENTRY(__switch_to) ldr r6, [r2, #TI_CPU_DOMAIN] #endif switch_tls r1, r4, r5, r3, r7 -#if defined(CONFIG_CC_STACKPROTECTOR) && !defined(CONFIG_SMP) +#if defined(CONFIG_STACKPROTECTOR) && !defined(CONFIG_SMP) ldr r7, [r2, #TI_TASK] ldr r8, =__stack_chk_guard .if (TSK_STACK_CANARY > IMM12_MASK) @@ -807,7 +807,7 @@ ENTRY(__switch_to) ldr r0, =thread_notify_head mov r1, #THREAD_NOTIFY_SWITCH bl atomic_notifier_call_chain -#if defined(CONFIG_CC_STACKPROTECTOR) && !defined(CONFIG_SMP) +#if defined(CONFIG_STACKPROTECTOR) && !defined(CONFIG_SMP) str r7, [r8] #endif THUMB( mov ip, r4 ) diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c index 1523cb18b109..225d1c58d2de 100644 --- a/arch/arm/kernel/process.c +++ b/arch/arm/kernel/process.c @@ -39,7 +39,7 @@ #include #include -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR #include unsigned long __stack_chk_guard __read_mostly; EXPORT_SYMBOL(__stack_chk_guard); diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c index f08a2ed9db0d..e10bc363f533 100644 --- a/arch/arm64/kernel/process.c +++ b/arch/arm64/kernel/process.c @@ -59,7 +59,7 @@ #include #include -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR #include unsigned long __stack_chk_guard __read_mostly; EXPORT_SYMBOL(__stack_chk_guard); diff --git a/arch/mips/kernel/asm-offsets.c b/arch/mips/kernel/asm-offsets.c index c1cd41456d42..cbe4742d2fff 100644 --- a/arch/mips/kernel/asm-offsets.c +++ b/arch/mips/kernel/asm-offsets.c @@ -83,7 +83,7 @@ void output_task_defines(void) OFFSET(TASK_FLAGS, task_struct, flags); OFFSET(TASK_MM, task_struct, mm); OFFSET(TASK_PID, task_struct, pid); -#if defined(CONFIG_CC_STACKPROTECTOR) +#if defined(CONFIG_STACKPROTECTOR) OFFSET(TASK_STACK_CANARY, task_struct, stack_canary); #endif DEFINE(TASK_STRUCT_SIZE, sizeof(struct task_struct)); diff --git a/arch/mips/kernel/octeon_switch.S b/arch/mips/kernel/octeon_switch.S index e42113fe2762..896080b445c2 100644 --- a/arch/mips/kernel/octeon_switch.S +++ b/arch/mips/kernel/octeon_switch.S @@ -61,7 +61,7 @@ #endif 3: -#if defined(CONFIG_CC_STACKPROTECTOR) && !defined(CONFIG_SMP) +#if defined(CONFIG_STACKPROTECTOR) && !defined(CONFIG_SMP) PTR_LA t8, __stack_chk_guard LONG_L t9, TASK_STACK_CANARY(a1) LONG_S t9, 0(t8) diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c index 3775a8d694fb..8d85046adcc8 100644 --- a/arch/mips/kernel/process.c +++ b/arch/mips/kernel/process.c @@ -180,7 +180,7 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long usp, return 0; } -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR #include unsigned long __stack_chk_guard __read_mostly; EXPORT_SYMBOL(__stack_chk_guard); diff --git a/arch/mips/kernel/r2300_switch.S b/arch/mips/kernel/r2300_switch.S index 665897139f30..71b1aafae1bb 100644 --- a/arch/mips/kernel/r2300_switch.S +++ b/arch/mips/kernel/r2300_switch.S @@ -36,7 +36,7 @@ LEAF(resume) cpu_save_nonscratch a0 sw ra, THREAD_REG31(a0) -#if defined(CONFIG_CC_STACKPROTECTOR) && !defined(CONFIG_SMP) +#if defined(CONFIG_STACKPROTECTOR) && !defined(CONFIG_SMP) PTR_LA t8, __stack_chk_guard LONG_L t9, TASK_STACK_CANARY(a1) LONG_S t9, 0(t8) diff --git a/arch/mips/kernel/r4k_switch.S b/arch/mips/kernel/r4k_switch.S index 17cf9341c1cf..58232ae6cfae 100644 --- a/arch/mips/kernel/r4k_switch.S +++ b/arch/mips/kernel/r4k_switch.S @@ -31,7 +31,7 @@ cpu_save_nonscratch a0 LONG_S ra, THREAD_REG31(a0) -#if defined(CONFIG_CC_STACKPROTECTOR) && !defined(CONFIG_SMP) +#if defined(CONFIG_STACKPROTECTOR) && !defined(CONFIG_SMP) PTR_LA t8, __stack_chk_guard LONG_L t9, TASK_STACK_CANARY(a1) LONG_S t9, 0(t8) diff --git a/arch/sh/kernel/process.c b/arch/sh/kernel/process.c index 68b1a67533ce..4d1bfc848dd3 100644 --- a/arch/sh/kernel/process.c +++ b/arch/sh/kernel/process.c @@ -12,7 +12,7 @@ struct kmem_cache *task_xstate_cachep = NULL; unsigned int xstate_size; -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR unsigned long __stack_chk_guard __read_mostly; EXPORT_SYMBOL(__stack_chk_guard); #endif diff --git a/arch/sh/kernel/process_32.c b/arch/sh/kernel/process_32.c index 93522069cb15..27fddb56b3e1 100644 --- a/arch/sh/kernel/process_32.c +++ b/arch/sh/kernel/process_32.c @@ -177,7 +177,7 @@ __switch_to(struct task_struct *prev, struct task_struct *next) { struct thread_struct *next_t = &next->thread; -#if defined(CONFIG_CC_STACKPROTECTOR) && !defined(CONFIG_SMP) +#if defined(CONFIG_STACKPROTECTOR) && !defined(CONFIG_SMP) __stack_chk_guard = next->stack_canary; #endif diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index bef8e2b202a8..2582881d19ce 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -239,7 +239,7 @@ ENTRY(__switch_to_asm) movl %esp, TASK_threadsp(%eax) movl TASK_threadsp(%edx), %esp -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR movl TASK_stack_canary(%edx), %ebx movl %ebx, PER_CPU_VAR(stack_canary)+stack_canary_offset #endif diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 3166b9674429..73a522d53b53 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -357,7 +357,7 @@ ENTRY(__switch_to_asm) movq %rsp, TASK_threadsp(%rdi) movq TASK_threadsp(%rsi), %rsp -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR movq TASK_stack_canary(%rsi), %rbx movq %rbx, PER_CPU_VAR(irq_stack_union)+stack_canary_offset #endif diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index e28add6b791f..cfd29ee8c3da 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -412,7 +412,7 @@ extern asmlinkage void ignore_sysret(void); void save_fsgs_for_kvm(void); #endif #else /* X86_64 */ -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR /* * Make sure stack canary segment base is cached-aligned: * "For Intel Atom processors, avoid non zero segment base address diff --git a/arch/x86/include/asm/segment.h b/arch/x86/include/asm/segment.h index 8f09012b92e7..e293c122d0d5 100644 --- a/arch/x86/include/asm/segment.h +++ b/arch/x86/include/asm/segment.h @@ -146,7 +146,7 @@ # define __KERNEL_PERCPU 0 #endif -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR # define __KERNEL_STACK_CANARY (GDT_ENTRY_STACK_CANARY*8) #else # define __KERNEL_STACK_CANARY 0 diff --git a/arch/x86/include/asm/stackprotector.h b/arch/x86/include/asm/stackprotector.h index 371b3a4af000..8ec97a62c245 100644 --- a/arch/x86/include/asm/stackprotector.h +++ b/arch/x86/include/asm/stackprotector.h @@ -34,7 +34,7 @@ #ifndef _ASM_STACKPROTECTOR_H #define _ASM_STACKPROTECTOR_H 1 -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR #include #include @@ -105,7 +105,7 @@ static inline void load_stack_canary_segment(void) #endif } -#else /* CC_STACKPROTECTOR */ +#else /* STACKPROTECTOR */ #define GDT_STACK_CANARY_INIT @@ -121,5 +121,5 @@ static inline void load_stack_canary_segment(void) #endif } -#endif /* CC_STACKPROTECTOR */ +#endif /* STACKPROTECTOR */ #endif /* _ASM_STACKPROTECTOR_H */ diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c index 76417a9aab73..dcb008c320fe 100644 --- a/arch/x86/kernel/asm-offsets.c +++ b/arch/x86/kernel/asm-offsets.c @@ -32,7 +32,7 @@ void common(void) { BLANK(); OFFSET(TASK_threadsp, task_struct, thread.sp); -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR OFFSET(TASK_stack_canary, task_struct, stack_canary); #endif diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c index f91ba53e06c8..a4a3be399f4b 100644 --- a/arch/x86/kernel/asm-offsets_32.c +++ b/arch/x86/kernel/asm-offsets_32.c @@ -50,7 +50,7 @@ void foo(void) DEFINE(TSS_sysenter_sp0, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) - offsetofend(struct cpu_entry_area, entry_stack_page.stack)); -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR BLANK(); OFFSET(stack_canary_offset, stack_canary, canary); #endif diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c index bf51e51d808d..b2dcd161f514 100644 --- a/arch/x86/kernel/asm-offsets_64.c +++ b/arch/x86/kernel/asm-offsets_64.c @@ -69,7 +69,7 @@ int main(void) OFFSET(TSS_sp1, tss_struct, x86_tss.sp1); BLANK(); -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR DEFINE(stack_canary_offset, offsetof(union irq_stack_union, stack_canary)); BLANK(); #endif diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 910b47ee8078..0df7151cfef4 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1599,7 +1599,7 @@ DEFINE_PER_CPU(unsigned long, cpu_current_top_of_stack) = (unsigned long)&init_thread_union + THREAD_SIZE; EXPORT_PER_CPU_SYMBOL(cpu_current_top_of_stack); -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR DEFINE_PER_CPU_ALIGNED(struct stack_canary, stack_canary); #endif diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S index b59e4fb40fd9..abe6df15a8fb 100644 --- a/arch/x86/kernel/head_32.S +++ b/arch/x86/kernel/head_32.S @@ -375,7 +375,7 @@ ENDPROC(startup_32_smp) */ __INIT setup_once: -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR /* * Configure the stack canary. The linker can't handle this by * relocation. Manually set base address in stack canary diff --git a/arch/xtensa/kernel/asm-offsets.c b/arch/xtensa/kernel/asm-offsets.c index 022cf918ec20..67904f55f188 100644 --- a/arch/xtensa/kernel/asm-offsets.c +++ b/arch/xtensa/kernel/asm-offsets.c @@ -76,7 +76,7 @@ int main(void) DEFINE(TASK_PID, offsetof (struct task_struct, pid)); DEFINE(TASK_THREAD, offsetof (struct task_struct, thread)); DEFINE(TASK_THREAD_INFO, offsetof (struct task_struct, stack)); -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR DEFINE(TASK_STACK_CANARY, offsetof(struct task_struct, stack_canary)); #endif DEFINE(TASK_STRUCT_SIZE, sizeof (struct task_struct)); diff --git a/arch/xtensa/kernel/entry.S b/arch/xtensa/kernel/entry.S index 5caff0744f3c..9cbc380e9572 100644 --- a/arch/xtensa/kernel/entry.S +++ b/arch/xtensa/kernel/entry.S @@ -1971,7 +1971,7 @@ ENTRY(_switch_to) s32i a1, a2, THREAD_SP # save stack pointer #endif -#if defined(CONFIG_CC_STACKPROTECTOR) && !defined(CONFIG_SMP) +#if defined(CONFIG_STACKPROTECTOR) && !defined(CONFIG_SMP) movi a6, __stack_chk_guard l32i a8, a3, TASK_STACK_CANARY s32i a8, a6, 0 diff --git a/arch/xtensa/kernel/process.c b/arch/xtensa/kernel/process.c index 8dd0593fb2c4..483dcfb6e681 100644 --- a/arch/xtensa/kernel/process.c +++ b/arch/xtensa/kernel/process.c @@ -58,7 +58,7 @@ void (*pm_power_off)(void) = NULL; EXPORT_SYMBOL(pm_power_off); -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR #include unsigned long __stack_chk_guard __read_mostly; EXPORT_SYMBOL(__stack_chk_guard); diff --git a/include/linux/sched.h b/include/linux/sched.h index 16e4d984fe51..cfb7da88c217 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -742,7 +742,7 @@ struct task_struct { pid_t pid; pid_t tgid; -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR /* Canary value for the -fstack-protector GCC feature: */ unsigned long stack_canary; #endif diff --git a/include/linux/stackprotector.h b/include/linux/stackprotector.h index 03696c729fb4..6b792d080eee 100644 --- a/include/linux/stackprotector.h +++ b/include/linux/stackprotector.h @@ -6,7 +6,7 @@ #include #include -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR # include #else static inline void boot_init_stack_canary(void) diff --git a/kernel/configs/android-recommended.config b/kernel/configs/android-recommended.config index 946fb92418f7..81e9af7dcec2 100644 --- a/kernel/configs/android-recommended.config +++ b/kernel/configs/android-recommended.config @@ -12,7 +12,7 @@ CONFIG_BLK_DEV_DM=y CONFIG_BLK_DEV_LOOP=y CONFIG_BLK_DEV_RAM=y CONFIG_BLK_DEV_RAM_SIZE=8192 -CONFIG_CC_STACKPROTECTOR_STRONG=y +CONFIG_STACKPROTECTOR_STRONG=y CONFIG_COMPACTION=y CONFIG_CPU_SW_DOMAIN_PAN=y CONFIG_DM_CRYPT=y diff --git a/kernel/fork.c b/kernel/fork.c index 08c6e5e217a0..92870be50bba 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -811,7 +811,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) clear_tsk_need_resched(tsk); set_task_stack_end_magic(tsk); -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR tsk->stack_canary = get_random_canary(); #endif diff --git a/kernel/panic.c b/kernel/panic.c index 42e487488554..8b2e002d52eb 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -623,7 +623,7 @@ static __init int register_warn_debugfs(void) device_initcall(register_warn_debugfs); #endif -#ifdef CONFIG_CC_STACKPROTECTOR +#ifdef CONFIG_STACKPROTECTOR /* * Called when gcc's -fstack-protector feature is used, and -- cgit From 655c79bb40a0870adcd0871057d01de11625882b Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Thu, 14 Jun 2018 15:26:34 -0700 Subject: mm: check for SIGKILL inside dup_mmap() loop As a theoretical problem, dup_mmap() of an mm_struct with 60000+ vmas can loop while potentially allocating memory, with mm->mmap_sem held for write by current thread. This is bad if current thread was selected as an OOM victim, for current thread will continue allocations using memory reserves while OOM reaper is unable to reclaim memory. As an actually observable problem, it is not difficult to make OOM reaper unable to reclaim memory if the OOM victim is blocked at i_mmap_lock_write() in this loop. Unfortunately, since nobody can explain whether it is safe to use killable wait there, let's check for SIGKILL before trying to allocate memory. Even without an OOM event, there is no point with continuing the loop from the beginning if current thread is killed. I tested with debug printk(). This patch should be safe because we already fail if security_vm_enough_memory_mm() or kmem_cache_alloc(GFP_KERNEL) fails and exit_mmap() handles it. ***** Aborting dup_mmap() due to SIGKILL ***** ***** Aborting dup_mmap() due to SIGKILL ***** ***** Aborting dup_mmap() due to SIGKILL ***** ***** Aborting dup_mmap() due to SIGKILL ***** ***** Aborting exit_mmap() due to NULL mmap ***** [akpm@linux-foundation.org: add comment] Link: http://lkml.kernel.org/r/201804071938.CDE04681.SOFVQJFtMHOOLF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa Cc: Alexander Viro Cc: Rik van Riel Cc: Michal Hocko Cc: Kirill A. Shutemov Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'kernel/fork.c') diff --git a/kernel/fork.c b/kernel/fork.c index 92870be50bba..9440d61b925c 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -440,6 +440,14 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, continue; } charge = 0; + /* + * Don't duplicate many vmas if we've been oom-killed (for + * example) + */ + if (fatal_signal_pending(current)) { + retval = -EINTR; + goto out; + } if (mpnt->vm_flags & VM_ACCOUNT) { unsigned long len = vma_pages(mpnt); -- cgit