aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2018-06-03Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds3-18/+35
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: - two patches addressing the problem that the scheduler allows under certain conditions user space tasks to be scheduled on CPUs which are not yet fully booted which causes a few subtle and hard to debug issue - add a missing runqueue clock update in the deadline scheduler which triggers a warning under certain circumstances - fix a silly typo in the scheduler header file * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/headers: Fix typo sched/deadline: Fix missing clock update sched/core: Require cpu_active() in select_task_rq(), for user tasks sched/core: Fix rules for running on online && !active CPUs
2018-06-03bpf/xdp: devmap can avoid calling ndo_xdp_flushJesper Dangaard Brouer1-13/+6
The XDP_REDIRECT map devmap can avoid using ndo_xdp_flush, by instead instructing ndo_xdp_xmit to flush via XDP_XMIT_FLUSH flag in appropriate places. Notice after this patch it is possible to remove ndo_xdp_flush completely, as this is the last user of ndo_xdp_flush. This is left for later patches, to keep driver changes separate. Signed-off-by: Jesper Dangaard Brouer <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-06-03xdp: add flags argument to ndo_xdp_xmit APIJesper Dangaard Brouer1-1/+1
This patch only change the API and reject any use of flags. This is an intermediate step that allows us to implement the flush flag operation later, for each individual driver in a separate patch. The plan is to implement flush operation via XDP_XMIT_FLUSH flag and then remove XDP_XMIT_FLAGS_NONE when done. Signed-off-by: Jesper Dangaard Brouer <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-06-03bpf: fix context access in tracing progs on 32 bit archsDaniel Borkmann2-3/+10
Wang reported that all the testcases for BPF_PROG_TYPE_PERF_EVENT program type in test_verifier report the following errors on x86_32: 172/p unpriv: spill/fill of different pointers ldx FAIL Unexpected error message! 0: (bf) r6 = r10 1: (07) r6 += -8 2: (15) if r1 == 0x0 goto pc+3 R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1 3: (bf) r2 = r10 4: (07) r2 += -76 5: (7b) *(u64 *)(r6 +0) = r2 6: (55) if r1 != 0x0 goto pc+1 R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 fp-8=fp 7: (7b) *(u64 *)(r6 +0) = r1 8: (79) r1 = *(u64 *)(r6 +0) 9: (79) r1 = *(u64 *)(r1 +68) invalid bpf_context access off=68 size=8 378/p check bpf_perf_event_data->sample_period byte load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (71) r0 = *(u8 *)(r1 +68) invalid bpf_context access off=68 size=1 379/p check bpf_perf_event_data->sample_period half load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (69) r0 = *(u16 *)(r1 +68) invalid bpf_context access off=68 size=2 380/p check bpf_perf_event_data->sample_period word load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (61) r0 = *(u32 *)(r1 +68) invalid bpf_context access off=68 size=4 381/p check bpf_perf_event_data->sample_period dword load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (79) r0 = *(u64 *)(r1 +68) invalid bpf_context access off=68 size=8 Reason is that struct pt_regs on x86_32 doesn't fully align to 8 byte boundary due to its size of 68 bytes. Therefore, bpf_ctx_narrow_access_ok() will then bail out saying that off & (size_default - 1) which is 68 & 7 doesn't cleanly align in the case of sample_period access from struct bpf_perf_event_data, hence verifier wrongly thinks we might be doing an unaligned access here though underlying arch can handle it just fine. Therefore adjust this down to machine size and check and rewrite the offset for narrow access on that basis. We also need to fix corresponding pe_prog_is_valid_access(), since we hit the check for off % size != 0 (e.g. 68 % 8 -> 4) in the first and last test. With that in place, progs for tracing work on x86_32. Reported-by: Wang YanQing <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Tested-by: Wang YanQing <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-06-03bpf: avoid retpoline for lookup/update/delete calls on mapsDaniel Borkmann2-22/+58
While some of the BPF map lookup helpers provide a ->map_gen_lookup() callback for inlining the map lookup altogether it is not available for every map, so the remaining ones have to call bpf_map_lookup_elem() helper which does a dispatch to map->ops->map_lookup_elem(). In times of retpolines, this will control and trap speculative execution rather than letting it do its work for the indirect call and will therefore cause a slowdown. Likewise, bpf_map_update_elem() and bpf_map_delete_elem() do not have an inlined version and need to call into their map->ops->map_update_elem() resp. map->ops->map_delete_elem() handlers. Before: # bpftool prog dump xlated id 1 0: (bf) r2 = r10 1: (07) r2 += -8 2: (7a) *(u64 *)(r2 +0) = 0 3: (18) r1 = map[id:1] 5: (85) call __htab_map_lookup_elem#232656 6: (15) if r0 == 0x0 goto pc+4 7: (71) r1 = *(u8 *)(r0 +35) 8: (55) if r1 != 0x0 goto pc+1 9: (72) *(u8 *)(r0 +35) = 1 10: (07) r0 += 56 11: (15) if r0 == 0x0 goto pc+4 12: (bf) r2 = r0 13: (18) r1 = map[id:1] 15: (85) call bpf_map_delete_elem#215008 <-- indirect call via 16: (95) exit helper After: # bpftool prog dump xlated id 1 0: (bf) r2 = r10 1: (07) r2 += -8 2: (7a) *(u64 *)(r2 +0) = 0 3: (18) r1 = map[id:1] 5: (85) call __htab_map_lookup_elem#233328 6: (15) if r0 == 0x0 goto pc+4 7: (71) r1 = *(u8 *)(r0 +35) 8: (55) if r1 != 0x0 goto pc+1 9: (72) *(u8 *)(r0 +35) = 1 10: (07) r0 += 56 11: (15) if r0 == 0x0 goto pc+4 12: (bf) r2 = r0 13: (18) r1 = map[id:1] 15: (85) call htab_lru_map_delete_elem#238240 <-- direct call 16: (95) exit In all three lookup/update/delete cases however we can use the actual address of the map callback directly if we find that there's only a single path with a map pointer leading to the helper call, meaning when the map pointer has not been poisoned from verifier side. Example code can be seen above for the delete case. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-06-03bpf: show prog and map id in fdinfoDaniel Borkmann1-4/+8
Its trivial and straight forward to expose it for scripts that can then use it along with bpftool in order to inspect an individual application's used maps and progs. Right now we dump some basic information in the fdinfo file but with the help of the map/prog id full introspection becomes possible now. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Song Liu <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-06-03bpf: fixup error message from gpl helpers on license mismatchDaniel Borkmann1-1/+1
Stating 'proprietary program' in the error is just silly since it can also be a different open source license than that which is just not compatible. Reference: https://twitter.com/majek04/status/998531268039102465 Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-06-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller5-15/+31
Filling in the padding slot in the bpf structure as a bug fix in 'ne' overlapped with actually using that padding area for something in 'net-next'. Signed-off-by: David S. Miller <[email protected]>
2018-06-02libnvdimm, e820: Register all pmem resourcesDan Williams1-0/+1
There is currently a mismatch between the resources that will trigger the e820_pmem driver to register/load and the resources that will actually be surfaced as pmem ranges. register_e820_pmem() uses walk_iomem_res_desc() which includes children and siblings. In contrast, e820_pmem_probe() only considers top level resources. For example the following resource tree results in the driver being loaded, but no resources being registered: 398000000000-39bfffffffff : PCI Bus 0000:ae 39be00000000-39bf07ffffff : PCI Bus 0000:af 39be00000000-39beffffffff : 0000:af:00.0 39be10000000-39beffffffff : Persistent Memory (legacy) Fix this up to allow definitions of "legacy" pmem ranges anywhere in system-physical address space. Not that it is a recommended or safe to define a pmem range in PCI space, but it is useful for debug / experimentation, and the restriction on being a top-level resource was arbitrary. Cc: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2018-06-02bpf: btf: Ensure t->type == 0 for BTF_KIND_FWDMartin KaFai Lau1-1/+20
The t->type in BTF_KIND_FWD is not used. It must be 0. This patch ensures that and also adds a test case in test_btf.c Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-06-02bpf: btf: Check array t->sizeMartin KaFai Lau1-0/+5
This patch ensures array's t->size is 0. The array size is decided by its individual elem's size and the number of elements. Hence, t->size is not used and it must be 0. A test case is added to test_btf.c Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-05-31Merge branch 'linus' into perf/core, to pick up fixesIngo Molnar4-27/+72
Signed-off-by: Ingo Molnar <[email protected]>
2018-05-31sched/headers: Fix typoDavidlohr Bueso1-1/+1
I cannot spell 'throttling'. Signed-off-by: Davidlohr Bueso <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-05-31sched/deadline: Fix missing clock updateJuri Lelli1-3/+3
A missing clock update is causing the following warning: rq->clock_update_flags < RQCF_ACT_SKIP WARNING: CPU: 10 PID: 0 at kernel/sched/sched.h:963 inactive_task_timer+0x5d6/0x720 Call Trace: <IRQ> __hrtimer_run_queues+0x10f/0x530 hrtimer_interrupt+0xe5/0x240 smp_apic_timer_interrupt+0x79/0x2b0 apic_timer_interrupt+0xf/0x20 </IRQ> do_idle+0x203/0x280 cpu_startup_entry+0x6f/0x80 start_secondary+0x1b0/0x200 secondary_startup_64+0xa5/0xb0 hardirqs last enabled at (793919): [<ffffffffa27c5f6e>] cpuidle_enter_state+0x9e/0x360 hardirqs last disabled at (793920): [<ffffffffa2a0096e>] interrupt_entry+0xce/0xe0 softirqs last enabled at (793922): [<ffffffffa20bef78>] irq_enter+0x68/0x70 softirqs last disabled at (793921): [<ffffffffa20bef5d>] irq_enter+0x4d/0x70 This happens because inactive_task_timer() calls sub_running_bw() (if TASK_DEAD and non_contending) that might trigger a schedutil update, which might access the clock. Clock is however currently updated only later in inactive_task_timer() function. Fix the problem by updating the clock right after task_rq_lock(). Reported-by: kernel test robot <[email protected]> Signed-off-by: Juri Lelli <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Claudio Scordino <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luca Abeni <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-05-31sched/core: Require cpu_active() in select_task_rq(), for user tasksPaul Burton1-2/+1
select_task_rq() is used in a few paths to select the CPU upon which a thread should be run - for example it is used by try_to_wake_up() & by fork or exec balancing. As-is it allows use of any online CPU that is present in the task's cpus_allowed mask. This presents a problem because there is a period whilst CPUs are brought online where a CPU is marked online, but is not yet fully initialized - ie. the period where CPUHP_AP_ONLINE_IDLE <= state < CPUHP_ONLINE. Usually we don't run any user tasks during this window, but there are corner cases where this can happen. An example observed is: - Some user task A, running on CPU X, forks to create task B. - sched_fork() calls __set_task_cpu() with cpu=X, setting task B's task_struct::cpu field to X. - CPU X is offlined. - Task A, currently somewhere between the __set_task_cpu() in copy_process() and the call to wake_up_new_task(), is migrated to CPU Y by migrate_tasks() when CPU X is offlined. - CPU X is onlined, but still in the CPUHP_AP_ONLINE_IDLE state. The scheduler is now active on CPU X, but there are no user tasks on the runqueue. - Task A runs on CPU Y & reaches wake_up_new_task(). This calls select_task_rq() with cpu=X, taken from task B's task_struct, and select_task_rq() allows CPU X to be returned. - Task A enqueues task B on CPU X's runqueue, via activate_task() & enqueue_task(). - CPU X now has a user task on its runqueue before it has reached the CPUHP_ONLINE state. In most cases, the user tasks that schedule on the newly onlined CPU have no idea that anything went wrong, but one case observed to be problematic is if the task goes on to invoke the sched_setaffinity syscall. The newly onlined CPU reaches the CPUHP_AP_ONLINE_IDLE state before the CPU that brought it online calls stop_machine_unpark(). This means that for a portion of the window of time between CPUHP_AP_ONLINE_IDLE & CPUHP_ONLINE the newly onlined CPU's struct cpu_stopper has its enabled field set to false. If a user thread is executed on the CPU during this window and it invokes sched_setaffinity with a CPU mask that does not include the CPU it's running on, then when __set_cpus_allowed_ptr() calls stop_one_cpu() intending to invoke migration_cpu_stop() and perform the actual migration away from the CPU it will simply return -ENOENT rather than calling migration_cpu_stop(). We then return from the sched_setaffinity syscall back to the user task that is now running on a CPU which it just asked not to run on, and which is not present in its cpus_allowed mask. This patch resolves the problem by having select_task_rq() enforce that user tasks run on CPUs that are active - the same requirement that select_fallback_rq() already enforces. This should ensure that newly onlined CPUs reach the CPUHP_AP_ACTIVE state before being able to schedule user tasks, and also implies that bringup_wait_for_ap() will have called stop_machine_unpark() which resolves the sched_setaffinity issue above. I haven't yet investigated them, but it may be of interest to review whether any of the actions performed by hotplug states between CPUHP_AP_ONLINE_IDLE & CPUHP_AP_ACTIVE could have similar unintended effects on user tasks that might schedule before they are reached, which might widen the scope of the problem from just affecting the behaviour of sched_setaffinity. Signed-off-by: Paul Burton <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-05-31sched/core: Fix rules for running on online && !active CPUsPeter Zijlstra1-12/+30
As already enforced by the WARN() in __set_cpus_allowed_ptr(), the rules for running on an online && !active CPU are stricter than just being a kthread, you need to be a per-cpu kthread. If you're not strictly per-CPU, you have better CPUs to run on and don't need the partially booted one to get your work done. The exception is to allow smpboot threads to bootstrap the CPU itself and get kernel 'services' initialized before we allow userspace on it. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Fixes: 955dbdf4ce87 ("sched: Allow migrating kthreads into online but inactive CPUs") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-05-30bpf: devmap: remove redundant assignment of dev = devColin Ian King1-1/+1
The assignment dev = dev is redundant and should be removed. Detected by CoverityScan, CID#1469486 ("Evaluation order violation") Signed-off-by: Colin Ian King <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-05-30media: rc: introduce BPF_PROG_LIRC_MODE2Sean Young1-0/+7
Add support for BPF_PROG_LIRC_MODE2. This type of BPF program can call rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report that the last key should be repeated. The bpf program can be attached to using the bpf(BPF_PROG_ATTACH) syscall; the target_fd must be the /dev/lircN device. Acked-by: Yonghong Song <[email protected]> Signed-off-by: Sean Young <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-05-30bpf: bpf_prog_array_copy() should return -ENOENT if exclude_prog not foundSean Young2-2/+11
This makes is it possible for bpf prog detach to return -ENOENT. Acked-by: Yonghong Song <[email protected]> Signed-off-by: Sean Young <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-05-30Merge branch 'for-mingo' of ↵Ingo Molnar1-0/+9
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU fix from Paul E. McKenney: "This additional v4.18 pull request contains a single commit that fell through the cracks: Provide early rcu_cpu_starting() callback for the benefit of the x86/mtrr code, which needs RCU to be available on incoming CPUs earlier than has been the case in the past." Signed-off-by: Ingo Molnar <[email protected]>
2018-05-29tracing: Allow histogram triggers to access ftrace internal eventsSteven Rostedt (VMware)1-1/+1
Now that trace_marker can have triggers, including a histogram triggers, the onmatch() and onmax() access the trace event. To do so, the search routine to find the event file needs to use the raw __find_event_file() that does not filter out ftrace events. Reviewed-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-29tracing: Have zero size length in filter logic be full stringSteven Rostedt (VMware)1-11/+12
As strings in trace events may not have a nul terminating character, the filter string compares use the defined string length for the field for the compares. The trace_marker records data slightly different than do normal events. It's size is zero, meaning that the string is the rest of the array, and that the string also ends with '\0'. If the size is zero, assume that the string is nul terminated and read the string in question as is. Reviewed-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-29tracing: Add trigger file for trace_markers tracefs/ftrace/printSteven Rostedt (VMware)4-2/+29
Allow writing to the trace_markers file initiate triggers defined in tracefs/ftrace/print/trigger file. This will allow of user space to trigger the same type of triggers (including histograms) that the trace events use. Had to create a ftrace_event_register() function that will become the trace_marker print event's reg() function. This is required because of how triggers are enabled: event_trigger_write() { event_trigger_regex_write() { trigger_process_regex() { for p in trigger_commands { p->func(); /* trigger_snapshot_cmd->func */ event_trigger_callback() { cmd_ops->reg() /* register_trigger() */ { trace_event_trigger_enable_disable() { trace_event_enable_disable() { call->class->reg(); Without the reg() function, the trigger code will call a NULL pointer and crash the system. Cc: Tom Zanussi <[email protected]> Cc: Clark Williams <[email protected]> Cc: Karim Yaghmour <[email protected]> Cc: Brendan Gregg <[email protected]> Suggested-by: Joel Fernandes <[email protected]> Reviewed-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-29Merge tag 'trace-v4.17-rc4-3' of ↵Linus Torvalds3-10/+28
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "While writing selftests for a new feature, I triggered two existing bugs that deal with triggers and instances. - a generic trigger bug where the triggers are not removed from a linked list properly when deleting an instance. - a bug specific to snapshots, where the snapshot is done in the top level buffer, when it is supposed to snapshot the buffer associated to the instance the snapshot trigger exists in" * tag 'trace-v4.17-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Make the snapshot trigger work with instances tracing: Fix crash when freeing instances with event triggers
2018-05-29tracing: Do not show filter file for ftrace internal eventsSteven Rostedt (VMware)1-4/+6
The filter file in the ftrace internal events, like in /sys/kernel/tracing/events/ftrace/function/filter is not attached to any functionality. Do not create them as they are meaningless. In the future, if an ftrace internal event gets filter functionality, then it will need to create it directly. Reviewed-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-29tracing: Add brackets in ftrace event dynamic arraysSteven Rostedt (VMware)1-1/+1
The dynamic arrays defined for ftrace internal events, such as the buf field for trace_marker (ftrace/print) did not have brackets which makes the filter code not accept it as a string. This is not currently an issues because the filter code doesn't do anything for these events, but they will in the future, and this needs to be fixed for when it does. Reviewed-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-29tracing: Have event_trace_init() called by trace_init_tracefs()Steven Rostedt (VMware)3-3/+4
Instead of having both trace_init_tracefs() and event_trace_init() be called by fs_initcall() routines, have event_trace_init() called directly by trace_init_tracefs(). This will guarantee order of how the events are created with respect to the rest of the ftrace infrastructure. This is needed to be able to assoctiate event files with ftrace internal events, such as the trace_marker. Reviewed-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-29tracing: Add __find_event_file() to find event files without restrictionsSteven Rostedt (VMware)2-5/+20
By adding the function __find_event_file() that can search for files without restrictions, such as if the event associated with the file has a reg function, or if it has the "ignore" flag set, the files that are associated to ftrace internal events (like trace_marker and function events) can be found and used. find_event_file() still returns a "filtered" file, as most callers need a valid trace event file. One created by the trace_events.h macros and not one created for parsing ftrace specific events. Reviewed-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-29tracing: Do not reference event data in post call triggersSteven Rostedt (VMware)2-6/+4
Trace event triggers can be called before or after the event has been committed. If it has been called after the commit, there's a possibility that the event no longer exists. Currently, the two post callers is the trigger to disable tracing (traceoff) and the one that will record a stack dump (stacktrace). Neither of them reference the trace event entry record, as that would lead to a race condition that could pass in corrupted data. To prevent any other users of the post data triggers from using the trace event record, pass in NULL to the post call trigger functions for the event record, as they should never need to use them in the first place. This does not fix any bug, but prevents bugs from happening by new post call trigger users. Reviewed-by: Masami Hiramatsu <[email protected]> Reviewed-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-28PM / QoS: Drop redundant declaration of pm_qos_get_value()Rafael J. Wysocki1-1/+0
The extra forward declaration of pm_qos_get_value() is redundant, so drop it. Signed-off-by: Rafael J. Wysocki <[email protected]>
2018-05-28tracepoints: Fix the descriptions of tracepoint_probe_register{_prio}Lee, Chun-Yi1-2/+1
The description of tracepoint_probe_register duplicates with tracepoint_probe_register_prio. This patch cleans up the descriptions. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Lee, Chun-Yi" <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-28tracing: Make the snapshot trigger work with instancesSteven Rostedt (VMware)3-8/+25
The snapshot trigger currently only affects the main ring buffer, even when it is used by the instances. This can be confusing as the snapshot trigger is listed in the instance. > # cd /sys/kernel/tracing > # mkdir instances/foo > # echo snapshot > instances/foo/events/syscalls/sys_enter_fchownat/trigger > # echo top buffer > trace_marker > # echo foo buffer > instances/foo/trace_marker > # touch /tmp/bar > # chown rostedt /tmp/bar > # cat instances/foo/snapshot # tracer: nop # # # * Snapshot is freed * # # Snapshot commands: # echo 0 > snapshot : Clears and frees snapshot buffer # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated. # Takes a snapshot of the main buffer. # echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free) # (Doesn't have to be '2' works with any number that # is not a '0' or '1') > # cat snapshot # tracer: nop # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | bash-1189 [000] .... 111.488323: tracing_mark_write: top buffer Not only did the snapshot occur in the top level buffer, but the instance snapshot buffer should have been allocated, and it is still free. Cc: [email protected] Fixes: 85f2b08268c01 ("tracing: Add basic event trigger framework") Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-28bpf: Hooks for sys_sendmsgAndrey Ignatov2-1/+18
In addition to already existing BPF hooks for sys_bind and sys_connect, the patch provides new hooks for sys_sendmsg. It leverages existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that provides access to socket itlself (properties like family, type, protocol) and user-passed `struct sockaddr *` so that BPF program can override destination IP and port for system calls such as sendto(2) or sendmsg(2) and/or assign source IP to the socket. The hooks are implemented as two new attach types: `BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and UDPv6 correspondingly. UDPv4 and UDPv6 separate attach types for same reason as sys_bind and sys_connect hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound. The difference with already existing hooks is sys_sendmsg are implemented only for unconnected UDP. For TCP it doesn't make sense to change user-provided `struct sockaddr *` at sendto(2)/sendmsg(2) time since socket either was already connected and has source/destination set or wasn't connected and call to sendto(2)/sendmsg(2) would lead to ENOTCONN anyway. Connected UDP is already handled by sys_connect hooks that can override source/destination at connect time and use fast-path later, i.e. these hooks don't affect UDP fast-path. Rewriting source IP is implemented differently than that in sys_connect hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to just bind socket to desired local IP address since source IP can be set on per-packet basis by using ancillary data (cmsg(3)). So no matter if socket is bound or not, source IP has to be rewritten on every call to sys_sendmsg. To do so two new fields are added to UAPI `struct bpf_sock_addr`; * `msg_src_ip4` to set source IPv4 for UDPv4; * `msg_src_ip6` to set source IPv6 for UDPv6. Signed-off-by: Andrey Ignatov <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-05-28bpf: avoid -Wmaybe-uninitialized warningArnd Bergmann1-4/+3
The stack_map_get_build_id_offset() function is too long for gcc to track whether 'work' may or may not be initialized at the end of it, leading to a false-positive warning: kernel/bpf/stackmap.c: In function 'stack_map_get_build_id_offset': kernel/bpf/stackmap.c:334:13: error: 'work' may be used uninitialized in this function [-Werror=maybe-uninitialized] This removes the 'in_nmi_ctx' flag and uses the state of that variable itself to see if it got initialized. Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context") Signed-off-by: Arnd Bergmann <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-05-28bpf: btf: avoid -Wreturn-type warningArnd Bergmann1-1/+1
gcc warns about a noreturn function possibly returning in some configurations: kernel/bpf/btf.c: In function 'env_type_is_resolve_sink': kernel/bpf/btf.c:729:1: error: control reaches end of non-void function [-Werror=return-type] Using BUG() instead of BUG_ON() avoids that warning and otherwise does the exact same thing. Fixes: eb3f595dab40 ("bpf: btf: Validate type reference") Signed-off-by: Arnd Bergmann <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-05-27tracing: Fix crash when freeing instances with event triggersSteven Rostedt (VMware)1-2/+3
If a instance has an event trigger enabled when it is freed, it could cause an access of free memory. Here's the case that crashes: # cd /sys/kernel/tracing # mkdir instances/foo # echo snapshot > instances/foo/events/initcall/initcall_start/trigger # rmdir instances/foo Would produce: general protection fault: 0000 [#1] PREEMPT SMP PTI Modules linked in: tun bridge ... CPU: 5 PID: 6203 Comm: rmdir Tainted: G W 4.17.0-rc4-test+ #933 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 RIP: 0010:clear_event_triggers+0x3b/0x70 RSP: 0018:ffffc90003783de0 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b2b RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800c7130ba0 RBP: ffffc90003783e00 R08: ffff8801131993f8 R09: 0000000100230016 R10: ffffc90003783d80 R11: 0000000000000000 R12: ffff8800c7130ba0 R13: ffff8800c7130bd8 R14: ffff8800cc093768 R15: 00000000ffffff9c FS: 00007f6f4aa86700(0000) GS:ffff88011eb40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6f4a5aed60 CR3: 00000000cd552001 CR4: 00000000001606e0 Call Trace: event_trace_del_tracer+0x2a/0xc5 instance_rmdir+0x15c/0x200 tracefs_syscall_rmdir+0x52/0x90 vfs_rmdir+0xdb/0x160 do_rmdir+0x16d/0x1c0 __x64_sys_rmdir+0x17/0x20 do_syscall_64+0x55/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe This was due to the call the clears out the triggers when an instance is being deleted not removing the trigger from the link list. Cc: [email protected] Fixes: 85f2b08268c01 ("tracing: Add basic event trigger framework") Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-05-27PM / hibernate: Fix oops at snapshot_write()Tetsuo Handa1-0/+5
syzbot is reporting NULL pointer dereference at snapshot_write() [1]. This is because data->handle is zero-cleared by ioctl(SNAPSHOT_FREE). Fix this by checking data_of(data->handle) != NULL before using it. [1] https://syzkaller.appspot.com/bug?id=828a3c71bd344a6de8b6a31233d51a72099f27fd Signed-off-by: Tetsuo Handa <[email protected]> Reported-by: syzbot <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2018-05-27PM / wakeup: Make s2idle_lock a RAW_SPINLOCKSebastian Andrzej Siewior1-7/+7
The `s2idle_lock' is acquired during suspend while interrupts are disabled even on RT. The lock is acquired for short sections only. Make it a RAW lock which avoids "sleeping while atomic" warnings on RT. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2018-05-27PM / s2idle: Make s2idle_wait_head swait basedSebastian Andrzej Siewior1-4/+5
s2idle_wait_head is used during s2idle with interrupts disabled even on RT. There is no "custom" wake up function so swait could be used instead which is also lower weight compared to the wait_queue. Make s2idle_wait_head a swait_queue_head. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2018-05-27PM / suspend: Prevent might sleep splatsThomas Gleixner3-0/+13
timekeeping suspend/resume calls read_persistent_clock() which takes rtc_lock. That results in might sleep warnings because at that point we run with interrupts disabled. We cannot convert rtc_lock to a raw spinlock as that would trigger other might sleep warnings. As a workaround we disable the might sleep warnings by setting system_state to SYSTEM_SUSPEND before calling sysdev_suspend() and restoring it to SYSTEM_RUNNING afer sysdev_resume(). There is no lock contention because hibernate / suspend to RAM is single-CPU at this point. In s2idle's case the system_state is set to SYSTEM_SUSPEND before timekeeping_suspend() which is invoked by the last CPU. In the resume case it set back to SYSTEM_RUNNING after timekeeping_resume() which is invoked by the first CPU in the resume case. The other CPUs will block on tick_freeze_lock. Signed-off-by: Thomas Gleixner <[email protected]> [bigeasy: cover s2idle in tick_freeze() / tick_unfreeze()] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2018-05-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller3-31/+163
Lots of easy overlapping changes in the confict resolutions here. Signed-off-by: David S. Miller <[email protected]>
2018-05-26Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds2-5/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "Three fixes for scheduler and kthread code: - allow calling kthread_park() on an already parked thread - restore the sched_pi_setprio() tracepoint behaviour - clarify the unclear string for the scheduling domain debug output" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched, tracing: Fix trace_sched_pi_setprio() for deboosting kthread: Allow kthread_park() on a parked kthread sched/topology: Clarify root domain(s) debug string
2018-05-25Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-0/+5
Merge misc fixes from Andrew Morton: "16 fixes" * emailed patches from Andrew Morton <[email protected]>: kasan: fix memory hotplug during boot kasan: free allocated shadow memory on MEM_CANCEL_ONLINE checkpatch: fix macro argument precedence test init/main.c: include <linux/mem_encrypt.h> kernel/sys.c: fix potential Spectre v1 issue mm/memory_hotplug: fix leftover use of struct page during hotplug proc: fix smaps and meminfo alignment mm: do not warn on offline nodes unless the specific node is explicitly requested mm, memory_hotplug: make has_unmovable_pages more robust mm/kasan: don't vfree() nonexistent vm_area MAINTAINERS: change hugetlbfs maintainer and update files ipc/shm: fix shmat() nil address after round-down when remapping Revert "ipc/shm: Fix shmat mmap nil-page protection" idr: fix invalid ptr dereference on item delete ocfs2: revert "ocfs2/o2hb: check len for bio_add_page() to avoid getting incorrect bio" mm: fix nr_rotate_swap leak in swapon() error case
2018-05-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-22/+64
Pull networking fixes from David Miller: "Let's begin the holiday weekend with some networking fixes: 1) Whoops need to restrict cfg80211 wiphy names even more to 64 bytes. From Eric Biggers. 2) Fix flags being ignored when using kernel_connect() with SCTP, from Xin Long. 3) Use after free in DCCP, from Alexey Kodanev. 4) Need to check rhltable_init() return value in ipmr code, from Eric Dumazet. 5) XDP handling fixes in virtio_net from Jason Wang. 6) Missing RTA_TABLE in rtm_ipv4_policy[], from Roopa Prabhu. 7) Need to use IRQ disabling spinlocks in mlx4_qp_lookup(), from Jack Morgenstein. 8) Prevent out-of-bounds speculation using indexes in BPF, from Daniel Borkmann. 9) Fix regression added by AF_PACKET link layer cure, from Willem de Bruijn. 10) Correct ENIC dma mask, from Govindarajulu Varadarajan. 11) Missing config options for PMTU tests, from Stefano Brivio" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits) ibmvnic: Fix partial success login retries selftests/net: Add missing config options for PMTU tests mlx4_core: allocate ICM memory in page size chunks enic: set DMA mask to 47 bit ppp: remove the PPPIOCDETACH ioctl ipv4: remove warning in ip_recv_error net : sched: cls_api: deal with egdev path only if needed vhost: synchronize IOTLB message with dev cleanup packet: fix reserve calculation net/mlx5: IPSec, Fix a race between concurrent sandbox QP commands net/mlx5e: When RXFCS is set, add FCS data into checksum calculation bpf: properly enforce index mask to prevent out-of-bounds speculation net/mlx4: Fix irq-unsafe spinlock usage net: phy: broadcom: Fix bcm_write_exp() net: phy: broadcom: Fix auxiliary control register reads net: ipv4: add missing RTA_TABLE to rtm_ipv4_policy net/mlx4: fix spelling mistake: "Inrerface" -> "Interface" and rephrase message ibmvnic: Only do H_EOI for mobility events tuntap: correctly set SOCKWQ_ASYNC_NOSPACE virtio-net: fix leaking page for gso packet during mergeable XDP ...
2018-05-25kernel/sys.c: fix potential Spectre v1 issueGustavo A. R. Silva1-0/+5
`resource' can be controlled by user-space, hence leading to a potential exploitation of the Spectre variant 1 vulnerability. This issue was detected with the help of Smatch: kernel/sys.c:1474 __do_compat_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap) kernel/sys.c:1455 __do_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap) Fix this by sanitizing *resource* before using it to index current->signal->rlim Notice that given that speculation windows are large, the policy is to kill the speculation on the first load and not worry if it can be completed with a dependent load/store [1]. [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2 Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Gustavo A. R. Silva <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Dan Williams <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-05-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller1-22/+64
Daniel Borkmann says: ==================== pull-request: bpf 2018-05-24 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix a bug in the original fix to prevent out of bounds speculation when multiple tail call maps from different branches or calls end up at the same tail call helper invocation, from Daniel. 2) Two selftest fixes, one in reuseport_bpf_numa where test is skipped in case of missing numa support and another one to update kernel config to properly support xdp_meta.sh test, from Anders. ... Would be great if you have a chance to merge net into net-next after that. The verifier fix would be needed later as a dependency in bpf-next for upcomig work there. When you do the merge there's a trivial conflict on BPF side with 849fa50662fb ("bpf/verifier: refine retval R0 state for bpf_get_stack helper"): Resolution is to keep both functions, the do_refine_retval_range() and record_func_map(). ==================== Signed-off-by: David S. Miller <[email protected]>
2018-05-25Merge back PM core material for v4.18.Rafael J. Wysocki1-0/+1
2018-05-25locking/rwsem: Simplify the is-owner-spinnable checksOleg Nesterov1-12/+13
Add the trivial owner_on_cpu() helper for rwsem_can_spin_on_owner() and rwsem_spin_on_owner(), it also allows to make rwsem_can_spin_on_owner() a bit more clear. Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Waiman Long <[email protected]> Cc: Amir Goldstein <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Jan Kara <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Theodore Y. Ts'o <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-05-25Merge branch 'linus' into locking/core, to pick up fixesIngo Molnar12-71/+220
Signed-off-by: Ingo Molnar <[email protected]>
2018-05-25perf/core: Wire up compat PERF_EVENT_IOC_QUERY_BPF, ↵Eugene Syromiatnikov1-0/+2
PERF_EVENT_IOC_MODIFY_ATTRIBUTES Since pointer size is different in compat, and switching in _perf_ioctl is done using exact ioctl numbers, all new ioctl numbers that use pointer should be added to perf_compat_ioctl for _IOC_SIZE fixup before passing to perf_ioctl routine (this shouldn't be needed if semantics of the size argument of _IO* macros was honored). Signed-off-by: Eugene Syromiatnikov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>