aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2017-08-25sched/topology: Improve commentsPeter Zijlstra1-2/+3
Mike provided a better comment for destroy_sched_domain() ... Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2017-08-25sched/topology: Fix memory leak in __sdt_alloc()Shu Wang1-9/+7
Found this issue by kmemleak: the 'sg' and 'sgc' pointers from __sdt_alloc() might be leaked as each domain holds many groups' ref, but in destroy_sched_domain(), it only declined the first group ref. Onlining and offlining a CPU can trigger this leak, and cause OOM. Reproducer for my 6 CPUs machine: while true do echo 0 > /sys/devices/system/cpu/cpu5/online; echo 1 > /sys/devices/system/cpu/cpu5/online; done unreferenced object 0xffff88007d772a80 (size 64): comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s) hex dump (first 32 bytes): c0 22 77 7d 00 88 ff ff 02 00 00 00 01 00 00 00 ."w}............ 40 2a 77 7d 00 88 ff ff 00 00 00 00 00 00 00 00 @*w}............ backtrace: [<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0 [<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280 [<ffffffff810d94a8>] build_sched_domains+0x1e8/0xf20 [<ffffffff810da674>] partition_sched_domains+0x304/0x360 [<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40 [<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0 [<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400 [<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0 [<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0 [<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160 [<ffffffff810af5d9>] kthread+0x109/0x140 [<ffffffff81770e45>] ret_from_fork+0x25/0x30 [<ffffffffffffffff>] 0xffffffffffffffff unreferenced object 0xffff88007d772a40 (size 64): comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s) hex dump (first 32 bytes): 03 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 ................ 00 04 00 00 00 00 00 00 4f 3c fc ff 00 00 00 00 ........O<...... backtrace: [<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0 [<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280 [<ffffffff810da16d>] build_sched_domains+0xead/0xf20 [<ffffffff810da674>] partition_sched_domains+0x304/0x360 [<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40 [<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0 [<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400 [<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0 [<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0 [<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160 [<ffffffff810af5d9>] kthread+0x109/0x140 [<ffffffff81770e45>] ret_from_fork+0x25/0x30 [<ffffffffffffffff>] 0xffffffffffffffff Reported-by: Chunyu Hu <[email protected]> Signed-off-by: Shu Wang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Chunyu Hu <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-08-25Merge branch 'linus' into sched/core, to pick up fixesIngo Molnar18-50/+219
Signed-off-by: Ingo Molnar <[email protected]>
2017-08-25locking/lockdep: Fix workqueue crossrelease annotationPeter Zijlstra2-21/+58
The new completion/crossrelease annotations interact unfavourable with the extant flush_work()/flush_workqueue() annotations. The problem is that when a single work class does: wait_for_completion(&C) and complete(&C) in different executions, we'll build dependencies like: lock_map_acquire(W) complete_acquire(C) and lock_map_acquire(W) complete_release(C) which results in the dependency chain: W->C->W, which lockdep thinks spells deadlock, even though there is no deadlock potential since works are ran concurrently. One possibility would be to change the work 'lock' to recursive-read, but that would mean hitting a lockdep limitation on recursive locks. Also, unconditinoally switching to recursive-read here would fail to detect the actual deadlock on single-threaded workqueues, which do have a problem with this. For now, forcefully disregard these locks for crossrelease. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-08-25workqueue/lockdep: 'Fix' flush_work() annotationPeter Zijlstra1-9/+11
The flush_work() annotation as introduced by commit: e159489baa71 ("workqueue: relax lockdep annotation on flush_work()") hits on the lockdep problem with recursive read locks. The situation as described is: Work W1: Work W2: Task: ARR(Q) ARR(Q) flush_workqueue(Q) A(W1) A(W2) A(Q) flush_work(W2) R(Q) A(W2) R(W2) if (special) A(Q) else ARR(Q) R(Q) where: A - acquire, ARR - acquire-read-recursive, R - release. Where under 'special' conditions we want to trigger a lock recursion deadlock, but otherwise allow the flush_work(). The allowing is done by using recursive read locks (ARR), but lockdep is broken for recursive stuff. However, there appears to be no need to acquire the lock if we're not 'special', so if we remove the 'else' clause things become much simpler and no longer need the recursion thing at all. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-08-25Merge branch 'linus' into locking/core, to pick up fixesIngo Molnar16-48/+217
Signed-off-by: Ingo Molnar <[email protected]>
2017-08-25tracing, perf: Adjust code layout in get_recursion_context()Jesper Dangaard Brouer1-1/+1
In an XDP redirect applications using tracepoint xdp:xdp_redirect to diagnose TX overrun, I noticed perf_swevent_get_recursion_context() was consuming 2% CPU. This was reduced to 1.85% with this simple change. Looking at the annotated asm code, it was clear that the unlikely case in_nmi() test was chosen (by the compiler) as the most likely event/branch. This small adjustment makes the compiler (GCC version 7.1.1 20170622 (Red Hat 7.1.1-3)) put in_nmi() as an unlikely branch. Signed-off-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/150342256382.16595.986861478681783732.stgit@firesoul Signed-off-by: Ingo Molnar <[email protected]>
2017-08-25perf/core: Don't report zero PIDs for exiting tasksOleg Nesterov1-9/+14
The exiting/dead task has no PIDs and in this case perf_event_pid/tid() return zero, change them to return -1 to distinguish this case from idle threads. Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Arnaldo Carvalho <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-08-25perf/aux: Ensure aux_wakeup represents most recent wakeup indexWill Deacon2-3/+3
The aux_watermark member of struct ring_buffer represents the period (in terms of bytes) at which wakeup events should be generated when data is written to the aux buffer in non-snapshot mode. On hardware that cannot generate an interrupt when the aux_head reaches an arbitrary wakeup index (such as ARM SPE), the aux_head sampled from handle->head in perf_aux_output_{skip,end} may in fact be past the wakeup index. This can lead to wakeup slowly falling behind the head. For example, consider the case where hardware can only generate an interrupt on a page-boundary and the aux buffer is initialised as follows: // Buffer size is 2 * PAGE_SIZE rb->aux_head = rb->aux_wakeup = 0 rb->aux_watermark = PAGE_SIZE / 2 following the first perf_aux_output_begin call, the handle is initialised with: handle->head = 0 handle->size = 2 * PAGE_SIZE handle->wakeup = PAGE_SIZE / 2 and the hardware will be programmed to generate an interrupt at PAGE_SIZE. When the interrupt is raised, the hardware head will be at PAGE_SIZE, so calling perf_aux_output_end(handle, PAGE_SIZE) puts the ring buffer into the following state: rb->aux_head = PAGE_SIZE rb->aux_wakeup = PAGE_SIZE / 2 rb->aux_watermark = PAGE_SIZE / 2 and then the next call to perf_aux_output_begin will result in: handle->head = handle->wakeup = PAGE_SIZE for which the semantics are unclear and, for a smaller aux_watermark (e.g. PAGE_SIZE / 4), then the wakeup would in fact be behind head at this point. This patch fixes the problem by rounding down the aux_head (as sampled from the handle) to the nearest aux_watermark boundary when updating rb->aux_wakeup, therefore taking into account any overruns by the hardware. Reported-by: Mark Rutland <[email protected]> Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Alexander Shishkin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-08-25perf/aux: Make aux_{head,wakeup} ring_buffer members longWill Deacon2-19/+16
The aux_head and aux_wakeup members of struct ring_buffer are defined using the local_t type, despite the fact that they are only accessed via the perf_aux_output_*() functions, which cannot race with each other for a given ring buffer. This patch changes the type of the members to long, so we can avoid using the local_*() API where it isn't needed. Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-08-25Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar7-36/+57
Signed-off-by: Ingo Molnar <[email protected]>
2017-08-25perf/core: Fix group {cpu,task} validationMark Rutland1-20/+19
Regardless of which events form a group, it does not make sense for the events to target different tasks and/or CPUs, as this leaves the group inconsistent and impossible to schedule. The core perf code assumes that these are consistent across (successfully intialised) groups. Core perf code only verifies this when moving SW events into a HW context. Thus, we can violate this requirement for pure SW groups and pure HW groups, unless the relevant PMU driver happens to perform this verification itself. These mismatched groups subsequently wreak havoc elsewhere. For example, we handle watchpoints as SW events, and reserve watchpoint HW on a per-CPU basis at pmu::event_init() time to ensure that any event that is initialised is guaranteed to have a slot at pmu::add() time. However, the core code only checks the group leader's cpu filter (via event_filter_match()), and can thus install follower events onto CPUs violating thier (mismatched) CPU filters, potentially installing them into a CPU without sufficient reserved slots. This can be triggered with the below test case, resulting in warnings from arch backends. #define _GNU_SOURCE #include <linux/hw_breakpoint.h> #include <linux/perf_event.h> #include <sched.h> #include <stdio.h> #include <sys/prctl.h> #include <sys/syscall.h> #include <unistd.h> static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu, int group_fd, unsigned long flags) { return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags); } char watched_char; struct perf_event_attr wp_attr = { .type = PERF_TYPE_BREAKPOINT, .bp_type = HW_BREAKPOINT_RW, .bp_addr = (unsigned long)&watched_char, .bp_len = 1, .size = sizeof(wp_attr), }; int main(int argc, char *argv[]) { int leader, ret; cpu_set_t cpus; /* * Force use of CPU0 to ensure our CPU0-bound events get scheduled. */ CPU_ZERO(&cpus); CPU_SET(0, &cpus); ret = sched_setaffinity(0, sizeof(cpus), &cpus); if (ret) { printf("Unable to set cpu affinity\n"); return 1; } /* open leader event, bound to this task, CPU0 only */ leader = perf_event_open(&wp_attr, 0, 0, -1, 0); if (leader < 0) { printf("Couldn't open leader: %d\n", leader); return 1; } /* * Open a follower event that is bound to the same task, but a * different CPU. This means that the group should never be possible to * schedule. */ ret = perf_event_open(&wp_attr, 0, 1, leader, 0); if (ret < 0) { printf("Couldn't open mismatched follower: %d\n", ret); return 1; } else { printf("Opened leader/follower with mismastched CPUs\n"); } /* * Open as many independent events as we can, all bound to the same * task, CPU0 only. */ do { ret = perf_event_open(&wp_attr, 0, 0, -1, 0); } while (ret >= 0); /* * Force enable/disble all events to trigger the erronoeous * installation of the follower event. */ printf("Opened all events. Toggling..\n"); for (;;) { prctl(PR_TASK_PERF_EVENTS_DISABLE, 0, 0, 0, 0); prctl(PR_TASK_PERF_EVENTS_ENABLE, 0, 0, 0, 0); } return 0; } Fix this by validating this requirement regardless of whether we're moving events. Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Zhou Chengming <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-08-24strparser: initialize all callbacksEric Biggers1-5/+5
commit bbb03029a899 ("strparser: Generalize strparser") added more function pointers to 'struct strp_callbacks'; however, kcm_attach() was not updated to initialize them. This could cause the ->lock() and/or ->unlock() function pointers to be set to garbage values, causing a crash in strp_work(). Fix the bug by moving the callback structs into static memory, so unspecified members are zeroed. Also constify them while we're at it. This bug was found by syzkaller, which encountered the following splat: IP: 0x55 PGD 3b1ca067 P4D 3b1ca067 PUD 3b12f067 PMD 0 Oops: 0010 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 2 PID: 1194 Comm: kworker/u8:1 Not tainted 4.13.0-rc4-next-20170811 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Workqueue: kstrp strp_work task: ffff88006bb0e480 task.stack: ffff88006bb10000 RIP: 0010:0x55 RSP: 0018:ffff88006bb17540 EFLAGS: 00010246 RAX: dffffc0000000000 RBX: ffff88006ce4bd60 RCX: 0000000000000000 RDX: 1ffff1000d9c97bd RSI: 0000000000000000 RDI: ffff88006ce4bc48 RBP: ffff88006bb17558 R08: ffffffff81467ab2 R09: 0000000000000000 R10: ffff88006bb17438 R11: ffff88006bb17940 R12: ffff88006ce4bc48 R13: ffff88003c683018 R14: ffff88006bb17980 R15: ffff88003c683000 FS: 0000000000000000(0000) GS:ffff88006de00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000055 CR3: 000000003c145000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: process_one_work+0xbf3/0x1bc0 kernel/workqueue.c:2098 worker_thread+0x223/0x1860 kernel/workqueue.c:2233 kthread+0x35e/0x430 kernel/kthread.c:231 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431 Code: Bad RIP value. RIP: 0x55 RSP: ffff88006bb17540 CR2: 0000000000000055 ---[ end trace f0e4920047069cee ]--- Here is a C reproducer (requires CONFIG_BPF_SYSCALL=y and CONFIG_AF_KCM=y): #include <linux/bpf.h> #include <linux/kcm.h> #include <linux/types.h> #include <stdint.h> #include <sys/ioctl.h> #include <sys/socket.h> #include <sys/syscall.h> #include <unistd.h> static const struct bpf_insn bpf_insns[3] = { { .code = 0xb7 }, /* BPF_MOV64_IMM(0, 0) */ { .code = 0x95 }, /* BPF_EXIT_INSN() */ }; static const union bpf_attr bpf_attr = { .prog_type = 1, .insn_cnt = 2, .insns = (uintptr_t)&bpf_insns, .license = (uintptr_t)"", }; int main(void) { int bpf_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &bpf_attr, sizeof(bpf_attr)); int inet_fd = socket(AF_INET, SOCK_STREAM, 0); int kcm_fd = socket(AF_KCM, SOCK_DGRAM, 0); ioctl(kcm_fd, SIOCKCMATTACH, &(struct kcm_attach) { .fd = inet_fd, .bpf_fd = bpf_fd }); } Fixes: bbb03029a899 ("strparser: Generalize strparser") Cc: Dmitry Vyukov <[email protected]> Cc: Tom Herbert <[email protected]> Signed-off-by: Eric Biggers <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-24Merge tag 'trace-v4.13-rc3' of ↵Linus Torvalds6-16/+38
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Various bug fixes: - Two small memory leaks in error paths. - A missed return error code on an error path. - A fix to check the tracing ring buffer CPU when it doesn't exist (caused by setting maxcpus on the command line that is less than the actual number of CPUs, and then onlining them manually). - A fix to have the reset of boot tracers called by lateinit_sync() instead of just lateinit(). As some of the tracers register via lateinit(), and if the clear happens before the tracer is registered, it will never start even though it was told to via the kernel command line" * tag 'trace-v4.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix freeing of filter in create_filter() when set_str is false tracing: Fix kmemleak in tracing_map_array_free() ftrace: Check for null ret_stack on profile function graph entry function ring-buffer: Have ring_buffer_alloc_read_page() return error on offline CPU tracing: Missing error code in tracer_alloc_buffers() tracing: Call clear_boot_tracer() at lateinit_sync
2017-08-24cpuset: Fix incorrect memory_pressure control file mappingWaiman Long1-0/+1
The memory_pressure control file was incorrectly set up without a private value (0, by default). As a result, this control file was treated like memory_migrate on read. By adding back the FILE_MEMORY_PRESSURE private value, the correct memory pressure value will be returned. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Fixes: 7dbdb199d3bf ("cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE") Cc: [email protected] # v4.4+
2017-08-24tracing: Fix freeing of filter in create_filter() when set_str is falseSteven Rostedt (VMware)1-0/+4
Performing the following task with kmemleak enabled: # cd /sys/kernel/tracing/events/irq/irq_handler_entry/ # echo 'enable_event:kmem:kmalloc:3 if irq >' > trigger # echo 'enable_event:kmem:kmalloc:3 if irq > 31' > trigger # echo scan > /sys/kernel/debug/kmemleak # cat /sys/kernel/debug/kmemleak unreferenced object 0xffff8800b9290308 (size 32): comm "bash", pid 1114, jiffies 4294848451 (age 141.139s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff81cef5aa>] kmemleak_alloc+0x4a/0xa0 [<ffffffff81357938>] kmem_cache_alloc_trace+0x158/0x290 [<ffffffff81261c09>] create_filter_start.constprop.28+0x99/0x940 [<ffffffff812639c9>] create_filter+0xa9/0x160 [<ffffffff81263bdc>] create_event_filter+0xc/0x10 [<ffffffff812655e5>] set_trigger_filter+0xe5/0x210 [<ffffffff812660c4>] event_enable_trigger_func+0x324/0x490 [<ffffffff812652e2>] event_trigger_write+0x1a2/0x260 [<ffffffff8138cf87>] __vfs_write+0xd7/0x380 [<ffffffff8138f421>] vfs_write+0x101/0x260 [<ffffffff8139187b>] SyS_write+0xab/0x130 [<ffffffff81cfd501>] entry_SYSCALL_64_fastpath+0x1f/0xbe [<ffffffffffffffff>] 0xffffffffffffffff The function create_filter() is passed a 'filterp' pointer that gets allocated, and if "set_str" is true, it is up to the caller to free it, even on error. The problem is that the pointer is not freed by create_filter() when set_str is false. This is a bug, and it is not up to the caller to free the filter on error if it doesn't care about the string. Link: http://lkml.kernel.org/r/[email protected] Cc: [email protected] Fixes: 38b78eb85 ("tracing: Factorize filter creation") Reported-by: Chunyu Hu <[email protected]> Tested-by: Chunyu Hu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2017-08-24tracing: Fix kmemleak in tracing_map_array_free()Chunyu Hu1-4/+7
kmemleak reported the below leak when I was doing clear of the hist trigger. With this patch, the kmeamleak is gone. unreferenced object 0xffff94322b63d760 (size 32): comm "bash", pid 1522, jiffies 4403687962 (age 2442.311s) hex dump (first 32 bytes): 00 01 00 00 04 00 00 00 08 00 00 00 ff 00 00 00 ................ 10 00 00 00 00 00 00 00 80 a8 7a f2 31 94 ff ff ..........z.1... backtrace: [<ffffffff9e96c27a>] kmemleak_alloc+0x4a/0xa0 [<ffffffff9e424cba>] kmem_cache_alloc_trace+0xca/0x1d0 [<ffffffff9e377736>] tracing_map_array_alloc+0x26/0x140 [<ffffffff9e261be0>] kretprobe_trampoline+0x0/0x50 [<ffffffff9e38b935>] create_hist_data+0x535/0x750 [<ffffffff9e38bd47>] event_hist_trigger_func+0x1f7/0x420 [<ffffffff9e38893d>] event_trigger_write+0xfd/0x1a0 [<ffffffff9e44dfc7>] __vfs_write+0x37/0x170 [<ffffffff9e44f552>] vfs_write+0xb2/0x1b0 [<ffffffff9e450b85>] SyS_write+0x55/0xc0 [<ffffffff9e203857>] do_syscall_64+0x67/0x150 [<ffffffff9e977ce7>] return_from_SYSCALL_64+0x0/0x6a [<ffffffffffffffff>] 0xffffffffffffffff unreferenced object 0xffff9431f27aa880 (size 128): comm "bash", pid 1522, jiffies 4403687962 (age 2442.311s) hex dump (first 32 bytes): 00 00 8c 2a 32 94 ff ff 00 f0 8b 2a 32 94 ff ff ...*2......*2... 00 e0 8b 2a 32 94 ff ff 00 d0 8b 2a 32 94 ff ff ...*2......*2... backtrace: [<ffffffff9e96c27a>] kmemleak_alloc+0x4a/0xa0 [<ffffffff9e425348>] __kmalloc+0xe8/0x220 [<ffffffff9e3777c1>] tracing_map_array_alloc+0xb1/0x140 [<ffffffff9e261be0>] kretprobe_trampoline+0x0/0x50 [<ffffffff9e38b935>] create_hist_data+0x535/0x750 [<ffffffff9e38bd47>] event_hist_trigger_func+0x1f7/0x420 [<ffffffff9e38893d>] event_trigger_write+0xfd/0x1a0 [<ffffffff9e44dfc7>] __vfs_write+0x37/0x170 [<ffffffff9e44f552>] vfs_write+0xb2/0x1b0 [<ffffffff9e450b85>] SyS_write+0x55/0xc0 [<ffffffff9e203857>] do_syscall_64+0x67/0x150 [<ffffffff9e977ce7>] return_from_SYSCALL_64+0x0/0x6a [<ffffffffffffffff>] 0xffffffffffffffff Link: http://lkml.kernel.org/r/[email protected] Cc: [email protected] Fixes: 08d43a5fa063 ("tracing: Add lock-free tracing_map") Signed-off-by: Chunyu Hu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2017-08-24ftrace: Check for null ret_stack on profile function graph entry functionSteven Rostedt (VMware)1-0/+4
There's a small race when function graph shutsdown and the calling of the registered function graph entry callback. The callback must not reference the task's ret_stack without first checking that it is not NULL. Note, when a ret_stack is allocated for a task, it stays allocated until the task exits. The problem here, is that function_graph is shutdown, and a new task was created, which doesn't have its ret_stack allocated. But since some of the functions are still being traced, the callbacks can still be called. The normal function_graph code handles this, but starting with commit 8861dd303c ("ftrace: Access ret_stack->subtime only in the function profiler") the profiler code references the ret_stack on function entry, but doesn't check if it is NULL first. Link: https://bugzilla.kernel.org/show_bug.cgi?id=196611 Cc: [email protected] Fixes: 8861dd303c ("ftrace: Access ret_stack->subtime only in the function profiler") Reported-by: [email protected] Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2017-08-24timers: Fix excessive granularity of new timers after a nohz idleNicholas Piggin1-9/+41
When a timer base is idle, it is forwarded when a new timer is added to ensure that granularity does not become excessive. When not idle, the timer tick is expected to increment the base. However there are several problems: - If an existing timer is modified, the base is forwarded only after the index is calculated. - The base is not forwarded by add_timer_on. - There is a window after a timer is restarted from a nohz idle, after it is marked not-idle and before the timer tick on this CPU, where a timer may be added but the ancient base does not get forwarded. These result in excessive granularity (a 1 jiffy timeout can blow out to 100s of jiffies), which cause the rcu lockup detector to trigger, among other things. Fix this by keeping track of whether the timer base has been idle since it was last run or forwarded, and if so then forward it before adding a new timer. There is still a case where mod_timer optimises the case of a pending timer mod with the same expiry time, where the timer can see excessive granularity relative to the new, shorter interval. A comment is added, but it's not changed because it is an important fastpath for networking. This has been tested and found to fix the RCU softlockup messages. Testing was also done with tracing to measure requested versus achieved wakeup latencies for all non-deferrable timers in an idle system (with no lockup watchdogs running). Wakeup latency relative to absolute latency is calculated (note this suffers from round-up skew at low absolute times) and analysed: max avg std upstream 506.0 1.20 4.68 patched 2.0 1.08 0.15 The bug was noticed due to the lockup detector Kconfig changes dropping it out of people's .configs and resulting in larger base clk skew When the lockup detectors are enabled, no CPU can go idle for longer than 4 seconds, which limits the granularity errors. Sub-optimal timer behaviour is observable on a smaller scale in that case: max avg std upstream 9.0 1.05 0.19 patched 2.0 1.04 0.11 Fixes: Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible") Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Jonathan Cameron <[email protected]> Tested-by: David Miller <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Stephen Boyd <[email protected]> Cc: [email protected] Cc: [email protected] Cc: John Stultz <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected]
2017-08-24Merge branch 'linus' into perf/core, to pick up fixesIngo Molnar7-14/+102
Signed-off-by: Ingo Molnar <[email protected]>
2017-08-23bpf: netdev is never null in __dev_map_flushDaniel Borkmann1-5/+3
No need to test for it in fast-path, every dev in bpf_dtab_netdev is guaranteed to be non-NULL, otherwise dev_map_update_elem() will fail in the first place. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: John Fastabend <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-23bpf/verifier: document liveness analysisEdward Cree1-1/+27
The liveness tracking algorithm is quite subtle; add comments to explain it. Signed-off-by: Edward Cree <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-23bpf/verifier: remove varlen_map_value_access flagEdward Cree1-29/+12
The optimisation it does is broken when the 'new' register value has a variable offset and the 'old' was constant. I broke it with my pointer types unification (see Fixes tag below), before which the 'new' value would have type PTR_TO_MAP_VALUE_ADJ and would thus not compare equal; other changes in that patch mean that its original behaviour (ignore min/max values) cannot be restored. Tests on a sample set of cilium programs show no change in count of processed instructions. Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") Signed-off-by: Edward Cree <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-23bpf/verifier: when pruning a branch, ignore its write marksEdward Cree1-2/+7
The fact that writes occurred in reaching the continuation state does not screen off its reads from us, because we're not really its parent. So detect 'not really the parent' in do_propagate_liveness, and ignore write marks in that case. Fixes: dc503a8ad984 ("bpf/verifier: track liveness for pruning") Signed-off-by: Edward Cree <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-23block: replace bi_bdev with a gendisk pointer and partitions indexChristoph Hellwig2-4/+3
This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-08-23workqueue: Use TASK_IDLEPeter Zijlstra1-2/+2
Workqueues don't use signals, it (ab)uses TASK_INTERRUPTIBLE to avoid increasing the loadavg numbers. We've 'recently' introduced TASK_IDLE for this case: 80ed87c8a9ca ("sched/wait: Introduce TASK_NOLOAD and TASK_IDLE") use it. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2017-08-23genirq: Let irq_set_vcpu_affinity() iterate over hierarchyMarc Zyngier1-2/+12
When assigning an interrupt to a vcpu, it is not unlikely that the level of the hierarchy implementing irq_set_vcpu_affinity is not the top level (think a generic MSI domain on top of a virtualization aware interrupt controller). In such a case, let's iterate over the hierarchy until we find an irqchip implementing it. Reviewed-by: Thomas Gleixner <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2017-08-22bpf: minor cleanups for dev_mapDaniel Borkmann1-59/+41
Some minor code cleanups, while going over it I also noticed that we're accounting the bitmap only for one CPU currently, so fix that up as well. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-22ANDROID: binder: add hwbinder,vndbinder to BINDER_DEVICES.Martijn Coenen1-0/+1
These will be required going forward. Signed-off-by: Martijn Coenen <[email protected]> Cc: stable <[email protected]> # 4.11+ Signed-off-by: Greg Kroah-Hartman <[email protected]>
2017-08-22bpf: fix map value attribute for hash of mapsDaniel Borkmann1-13/+17
Currently, iproute2's BPF ELF loader works fine with array of maps when retrieving the fd from a pinned node and doing a selfcheck against the provided map attributes from the object file, but we fail to do the same for hash of maps and thus refuse to get the map from pinned node. Reason is that when allocating hash of maps, fd_htab_map_alloc() will set the value size to sizeof(void *), and any user space map creation requests are forced to set 4 bytes as value size. Thus, selfcheck will complain about exposed 8 bytes on 64 bit archs vs. 4 bytes from object file as value size. Contract is that fdinfo or BPF_MAP_GET_FD_BY_ID returns the value size used to create the map. Fix it by handling it the same way as we do for array of maps, which means that we leave value size at 4 bytes and in the allocation phase round up value size to 8 bytes. alloc_htab_elem() needs an adjustment in order to copy rounded up 8 bytes due to bpf_fd_htab_map_update_elem() calling into htab_map_update_elem() with the pointer of the map pointer as value. Unlike array of maps where we just xchg(), we're using the generic htab_map_update_elem() callback also used from helper calls, which published the key/value already on return, so we need to ensure to memcpy() the right size. Fixes: bcc6b1b7ebf8 ("bpf: Add hash of maps support") Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-22bpf: fix map value attribute for hash of mapsDaniel Borkmann1-13/+17
Currently, iproute2's BPF ELF loader works fine with array of maps when retrieving the fd from a pinned node and doing a selfcheck against the provided map attributes from the object file, but we fail to do the same for hash of maps and thus refuse to get the map from pinned node. Reason is that when allocating hash of maps, fd_htab_map_alloc() will set the value size to sizeof(void *), and any user space map creation requests are forced to set 4 bytes as value size. Thus, selfcheck will complain about exposed 8 bytes on 64 bit archs vs. 4 bytes from object file as value size. Contract is that fdinfo or BPF_MAP_GET_FD_BY_ID returns the value size used to create the map. Fix it by handling it the same way as we do for array of maps, which means that we leave value size at 4 bytes and in the allocation phase round up value size to 8 bytes. alloc_htab_elem() needs an adjustment in order to copy rounded up 8 bytes due to bpf_fd_htab_map_update_elem() calling into htab_map_update_elem() with the pointer of the map pointer as value. Unlike array of maps where we just xchg(), we're using the generic htab_map_update_elem() callback also used from helper calls, which published the key/value already on return, so we need to ensure to memcpy() the right size. Fixes: bcc6b1b7ebf8 ("bpf: Add hash of maps support") Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller9-28/+149
2017-08-21pids: make task_tgid_nr_ns() safeOleg Nesterov1-7/+4
This was reported many times, and this was even mentioned in commit 52ee2dfdd4f5 ("pids: refactor vnr/nr_ns helpers to make them safe") but somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is not safe because task->group_leader points to nowhere after the exiting task passes exit_notify(), rcu_read_lock() can not help. We really need to change __unhash_process() to nullify group_leader, parent, and real_parent, but this needs some cleanups. Until then we can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and fix the problem. Reported-by: Troy Kensinger <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-21Merge branch 'for-mingo' of ↵Ingo Molnar25-787/+544
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Removal of spin_unlock_wait() - SRCU updates - Torture-test updates - Documentation updates - Miscellaneous fixes - CPU-hotplug fixes - Miscellaneous non-RCU fixes Signed-off-by: Ingo Molnar <[email protected]>
2017-08-20bpf: fix double free from dev_map_notification()Daniel Borkmann1-7/+5
In the current code, dev_map_free() can still race with dev_map_notification(). In dev_map_free(), we remove dtab from the list of dtabs after we purged all entries from it. However, we don't do xchg() with NULL or the like, so the entry at that point is still pointing to the device. If a unregister notification comes in at the same time, we therefore risk a double-free, since the pointer is still present in the map, and then pushed again to __dev_map_entry_free(). All this is completely unnecessary. Just remove the dtab from the list right before the synchronize_rcu(), so all outstanding readers from the notifier list have finished by then, thus we don't need to deal with this corner case anymore and also wouldn't need to nullify dev entires. This is fine because we iterate over the map releasing all entries and therefore dev references anyway. Fixes: 4cc7b9544b9a ("bpf: devmap fix mutex in rcu critical section") Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-20Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds1-8/+39
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "Two fixes for the perf subsystem: - Fix an inconsistency of RDPMC mm struct tagging across exec() which causes RDPMC to fault. - Correct the timestamp mechanics across IOC_DISABLE/ENABLE which causes incorrect timestamps and total time calculations" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix time on IOC_ENABLE perf/x86: Fix RDPMC vs. mm_struct tracking
2017-08-20Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds2-4/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A pile of smallish changes all over the place: - Add a missing ISB in the GIC V1 driver - Remove an ACPI version check in the GIC V3 ITS driver - Add the missing irq_pm_shutdown function for BRCMSTB-L2 to avoid spurious wakeups - Remove the artifical limitation of ITS instances to the number of NUMA nodes which prevents utilizing the ITS hardware correctly - Prevent a infinite parsing loop in the GIC-V3 ITS/MSI code - Honour the force affinity argument in the GIC-V3 driver which is required to make perf work correctly - Correctly report allocation failures in GIC-V2/V3 to avoid using half allocated and initialized interrupts. - Fixup checks against nr_cpu_ids in the generic IPI code" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/ipi: Fixup checks against nr_cpu_ids genirq: Restore trigger settings in irq_modify_status() MAINTAINERS: Remove Jason Cooper's irqchip git tree irqchip/gic-v3-its-platform-msi: Fix msi-parent parsing loop irqchip/gic-v3-its: Allow GIC ITS number more than MAX_NUMNODES irqchip: brcmstb-l2: Define an irq_pm_shutdown function irqchip/gic: Ensure we have an ISB between ack and ->handle_irq irqchip/gic-v3-its: Remove ACPICA version check for ACPI NUMA irqchip/gic-v3: Honor forced affinity setting irqchip/gic-v3: Report failures in gic_irq_domain_alloc irqchip/gic-v2: Report failures in gic_irq_domain_alloc irqchip/atmel-aic: Remove root argument from ->fixup() prototype irqchip/atmel-aic: Fix unbalanced refcount in aic_common_rtc_irq_fixup() irqchip/atmel-aic: Fix unbalanced of_node_put() in aic_common_irq_fixup()
2017-08-20Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds2-0/+60
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull watchdog fix from Thomas Gleixner: "A fix for the hardlockup watchdog to prevent false positives with extreme Turbo-Modes which make the perf/NMI watchdog fire faster than the hrtimer which is used to verify. Slightly larger than the minimal fix, which just would increase the hrtimer frequency, but comes with extra overhead of more watchdog timer interrupts and thread wakeups for all users. With this change we restrict the overhead to the extreme Turbo-Mode systems" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kernel/watchdog: Prevent false positives with turbo modes
2017-08-20Merge branch 'fortglx/4.14/time' of ↵Thomas Gleixner2-3/+10
https://git.linaro.org/people/john.stultz/linux into timers/core Pull timekeepig updates from John Stultz - kselftest improvements - Use the proper timekeeper in the debug code - Prevent accessing an unavailable wakeup source in the alarmtimer sysfs interface.
2017-08-20genirq/ipi: Fixup checks against nr_cpu_idsAlexey Dobriyan1-2/+2
Valid CPU ids are [0, nr_cpu_ids-1] inclusive. Fixes: 3b8e29a82dd1 ("genirq: Implement ipi_send_mask/single()") Fixes: f9bce791ae2a ("genirq: Add a new function to get IPI reverse mapping") Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/20170819095751.GB27864@avx2
2017-08-19bpf: inline map in map lookup functions for array and htabDaniel Borkmann2-0/+43
Avoid two successive functions calls for the map in map lookup, first is the bpf_map_lookup_elem() helper call, and second the callback via map->ops->map_lookup_elem() to get to the map in map implementation. Implementation inlines array and htab flavor for map in map lookups. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-19bpf: make htab inlining more robust wrt assumptionsDaniel Borkmann1-1/+5
Commit 9015d2f59535 ("bpf: inline htab_map_lookup_elem()") was making the assumption that a direct call emission to the function __htab_map_lookup_elem() will always work out for JITs. This is currently true since all JITs we have are for 64 bit archs, but in case of 32 bit JITs like upcoming arm32, we get a NULL pointer dereference when executing the call to __htab_map_lookup_elem() since passed arguments are of a different size (due to pointer args) than what we do out of BPF. Guard and thus limit this for now for the current 64 bit JITs only. Reported-by: Shubham Bansal <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-19bpf: Allow selecting numa node during map creationMartin KaFai Lau7-21/+55
The current map creation API does not allow to provide the numa-node preference. The memory usually comes from where the map-creation-process is running. The performance is not ideal if the bpf_prog is known to always run in a numa node different from the map-creation-process. One of the use case is sharding on CPU to different LRU maps (i.e. an array of LRU maps). Here is the test result of map_perf_test on the INNER_LRU_HASH_PREALLOC test if we force the lru map used by CPU0 to be allocated from a remote numa node: [ The machine has 20 cores. CPU0-9 at node 0. CPU10-19 at node 1 ] ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000 5:inner_lru_hash_map_perf pre-alloc 1628380 events per sec 4:inner_lru_hash_map_perf pre-alloc 1626396 events per sec 3:inner_lru_hash_map_perf pre-alloc 1626144 events per sec 6:inner_lru_hash_map_perf pre-alloc 1621657 events per sec 2:inner_lru_hash_map_perf pre-alloc 1621534 events per sec 1:inner_lru_hash_map_perf pre-alloc 1620292 events per sec 7:inner_lru_hash_map_perf pre-alloc 1613305 events per sec 0:inner_lru_hash_map_perf pre-alloc 1239150 events per sec #<<< After specifying numa node: ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000 5:inner_lru_hash_map_perf pre-alloc 1629627 events per sec 3:inner_lru_hash_map_perf pre-alloc 1628057 events per sec 1:inner_lru_hash_map_perf pre-alloc 1623054 events per sec 6:inner_lru_hash_map_perf pre-alloc 1616033 events per sec 2:inner_lru_hash_map_perf pre-alloc 1614630 events per sec 4:inner_lru_hash_map_perf pre-alloc 1612651 events per sec 7:inner_lru_hash_map_perf pre-alloc 1609337 events per sec 0:inner_lru_hash_map_perf pre-alloc 1619340 events per sec #<<< This patch adds one field, numa_node, to the bpf_attr. Since numa node 0 is a valid node, a new flag BPF_F_NUMA_NODE is also added. The numa_node field is honored if and only if the BPF_F_NUMA_NODE flag is set. Numa node selection is not supported for percpu map. This patch does not change all the kmalloc. F.e. 'htab = kzalloc()' is not changed since the object is small enough to stay in the cache. Signed-off-by: Martin KaFai Lau <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-18bpf: Fix map-in-map checking in the verifierMartin KaFai Lau1-0/+1
In check_map_func_compatibility(), a 'break' has been accidentally removed for the BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS cases. This patch adds it back. Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Cc: John Fastabend <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: John Fastabend <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-18signal: don't remove SIGNAL_UNKILLABLE for traced tasks.Jamie Iles1-1/+5
When forcing a signal, SIGNAL_UNKILLABLE is removed to prevent recursive faults, but this is undesirable when tracing. For example, debugging an init process (whether global or namespace), hitting a breakpoint and SIGTRAP will force SIGTRAP and then remove SIGNAL_UNKILLABLE. Everything continues fine, but then once debugging has finished, the init process is left killable which is unlikely what the user expects, resulting in either an accidentally killed init or an init that stops reaping zombies. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jamie Iles <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-18kmod: fix wait on recursive loopLuis R. Rodriguez1-2/+23
Recursive loops with module loading were previously handled in kmod by restricting the number of modprobe calls to 50 and if that limit was breached request_module() would return an error and a user would see the following on their kernel dmesg: request_module: runaway loop modprobe binfmt-464c Starting init:/sbin/init exists but couldn't execute it (error -8) This issue could happen for instance when a 64-bit kernel boots a 32-bit userspace on some architectures and has no 32-bit binary format hanlders. This is visible, for instance, when a CONFIG_MODULES enabled 64-bit MIPS kernel boots a into o32 root filesystem and the binfmt handler for o32 binaries is not built-in. After commit 6d7964a722af ("kmod: throttle kmod thread limit") we now don't have any visible signs of an error and the kernel just waits for the loop to end somehow. Although this *particular* recursive loop could also be addressed by doing a sanity check on search_binary_handler() and disallowing a modular binfmt to be required for modprobe, a generic solution for any recursive kernel kmod issues is still needed. This should catch these loops. We can investigate each loop and address each one separately as they come in, this however puts a stop gap for them as before. Link: http://lkml.kernel.org/r/[email protected] Fixes: 6d7964a722af ("kmod: throttle kmod thread limit") Signed-off-by: Luis R. Rodriguez <[email protected]> Reported-by: Matt Redfearn <[email protected]> Tested-by: Matt Redfearn <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Colin Ian King <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Daniel Mentz <[email protected]> Cc: David Binderman <[email protected]> Cc: Dmitry Torokhov <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jessica Yu <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Kees Cook <[email protected]> Cc: Michal Marek <[email protected]> Cc: Miroslav Benes <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-18bpf: fix a return in sockmap_get_from_fd()Dan Carpenter1-2/+2
"map" is a valid pointer. We wanted to return "err" instead. Also let's return a zero literal at the end. Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Signed-off-by: Dan Carpenter <[email protected]> Acked-by: John Fastabend <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-18cpuset: Allow v2 behavior in v1 cgroupWaiman Long1-13/+20
Cpuset v2 has some useful behaviors that are not present in v1 because of backward compatibility concern. One of that is the restoration of the original cpu and memory node mask after a hot removal and addition event sequence. This patch makes the cpuset controller to check the CGRP_ROOT_CPUSET_V2_MODE flag and use the v2 behavior if it is set. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2017-08-18cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroupWaiman Long1-0/+6
A new mount option "cpuset_v2_mode" is added to the v1 cgroupfs filesystem to enable cpuset controller to use v2 behavior in a v1 cgroup. This mount option applies only to cpuset controller and have no effect on other controllers. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2017-08-18posix-cpu-timers: Use dedicated helper to access rlimit valuesKrzysztof Opasiak1-8/+6
Use rlimit() and rlimit_max() helper instead of manually writing whole chain from task to rlimit value Signed-off-by: Krzysztof Opasiak <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected]