path: root/kernel
Age | Commit message | Author | Files | Lines
2020-03-21sched/swait: Prepare usage in completionsThomas Gleixner2-1/+17
As a preparation to use simple wait queues for completions:
 - Provide swake_up_all_locked() to support complete_all()
 - Make __prepare_to_swait() publicly available
This is done to enable the usage of complete() within truly atomic contexts on a PREEMPT_RT enabled kernel. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-21timekeeping: Split jiffies seqlockThomas Gleixner5-17/+28
seqlock consists of a sequence counter and a spinlock_t which is used to serialize the writers. spinlock_t is substituted by a "sleeping" spinlock on PREEMPT_RT enabled kernels which breaks the usage in the timekeeping code as the writers are executed in hard interrupt and therefore non-preemptible context even on PREEMPT_RT. The spinlock in seqlock cannot be unconditionally replaced by a raw_spinlock_t as many seqlock users have nesting spinlock sections or other code which is not suitable to run in truly atomic context on RT. Instead of providing a raw_seqlock API for a single use case, open code the seqlock for the jiffies use case and implement it with a raw_spinlock_t and a sequence counter. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
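For illustration, a minimal sketch of the open-coded scheme described above (illustrative function names, not the literal patch): a raw_spinlock_t serializes the writers while a bare seqcount_t keeps readers lockless, so the write side stays usable from hard interrupt context on PREEMPT_RT.

	static DEFINE_RAW_SPINLOCK(jiffies_lock);
	static seqcount_t jiffies_seq = SEQCNT_ZERO(jiffies_seq);

	static void jiffies_advance(u64 delta)
	{
		/* writer: hardirq context, raw_spinlock_t stays a real spinlock on RT */
		raw_spin_lock(&jiffies_lock);
		write_seqcount_begin(&jiffies_seq);
		jiffies_64 += delta;
		write_seqcount_end(&jiffies_seq);
		raw_spin_unlock(&jiffies_lock);
	}

	static u64 jiffies_snapshot(void)
	{
		unsigned int seq;
		u64 ret;

		do {	/* reader: retry if a writer was active */
			seq = read_seqcount_begin(&jiffies_seq);
			ret = jiffies_64;
		} while (read_seqcount_retry(&jiffies_seq, seq));

		return ret;
	}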
2020-03-21rcuwait: Add @state argument to rcuwait_wait_event()Peter Zijlstra (Intel)1-1/+1
Extend rcuwait_wait_event() with a state variable so that it is not restricted to UNINTERRUPTIBLE waits. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-20bpf: Explicitly memset some bpf info structures declared on the stackGreg Kroah-Hartman2-3/+6
Trying to initialize a structure with "= {};" will not always clean out all padding locations in a structure. So be explicit and call memset to initialize everything for a number of bpf information structures that are then copied from userspace, sometimes from smaller memory locations than the size of the structure. Reported-by: Daniel Borkmann <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-03-20bpf: Explicitly memset the bpf_attr structureGreg Kroah-Hartman1-1/+2
For the bpf syscall, we are relying on the compiler to properly zero out the bpf_attr union that we copy userspace data into. Unfortunately that doesn't always work properly, padding and other oddities might not be correctly zeroed, and in some tests odd things have been found when the stack is pre-initialized to other values. Fix this by explicitly memsetting the structure to 0 before using it. Reported-by: Maciej Żenczykowski <[email protected]> Reported-by: John Stultz <[email protected]> Reported-by: Alexander Potapenko <[email protected]> Reported-by: Alistair Delva <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://android-review.googlesource.com/c/kernel/common/+/1235490 Link: https://lore.kernel.org/bpf/[email protected]
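For illustration, a minimal sketch of the pattern (hypothetical helper name, not the exact hunk): zero the whole union before copying user data in, so padding bytes cannot carry stale stack contents.

	static int copy_bpf_attr_from_user(union bpf_attr *attr,
					   const union bpf_attr __user *uattr,
					   unsigned int size)
	{
		/* "= {}" may leave padding uninitialized; memset() does not */
		memset(attr, 0, sizeof(*attr));
		if (copy_from_user(attr, uattr, size))
			return -EFAULT;
		return 0;
	}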
2020-03-20lockdep: Teach lockdep about "USED" <- "IN-NMI" inversionsPeter Zijlstra1-3/+59
nmi_enter() does lockdep_off() and hence lockdep ignores everything. And NMI context makes it impossible to do full IN-NMI tracking like we do IN-HARDIRQ, that could result in graph_lock recursion. However, since look_up_lock_class() is lockless, we can find the class of a lock that has prior use and detect IN-NMI after USED, just not USED after IN-NMI. NOTE: By shifting the lockdep_off() recursion count to bit-16, we can easily differentiate between actual recursion and off. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Reviewed-by: Joel Fernandes (Google) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-20locking/lockdep: Rework lockdep_lockPeter Zijlstra1-41/+48
A few sites want to assert we own the graph_lock/lockdep_lock, provide a more conventional lock interface for it with a number of trivial debug checks. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-20locking/lockdep: Fix bad recursion patternPeter Zijlstra1-34/+40
There were two patterns for lockdep_recursion:

Pattern-A:
	if (current->lockdep_recursion)
		return

	current->lockdep_recursion = 1;
	/* do stuff */
	current->lockdep_recursion = 0;

Pattern-B:
	current->lockdep_recursion++;
	/* do stuff */
	current->lockdep_recursion--;

But a third pattern has emerged:

Pattern-C:
	current->lockdep_recursion = 1;
	/* do stuff */
	current->lockdep_recursion = 0;

And while this isn't broken per se, it is highly dangerous because it doesn't nest properly. Get rid of all Pattern-C instances and shore up Pattern-A with a warning. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-20locking/lockdep: Avoid recursion in lockdep_count_{for,back}ward_deps()Boqun Feng1-0/+4
Qian Cai reported a bug when PROVE_RCU_LIST=y: a read of /proc/lockdep triggered a warning: [ ] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled) ... [ ] Call Trace: [ ] lock_is_held_type+0x5d/0x150 [ ] ? rcu_lockdep_current_cpu_online+0x64/0x80 [ ] rcu_read_lock_any_held+0xac/0x100 [ ] ? rcu_read_lock_held+0xc0/0xc0 [ ] ? __slab_free+0x421/0x540 [ ] ? kasan_kmalloc+0x9/0x10 [ ] ? __kmalloc_node+0x1d7/0x320 [ ] ? kvmalloc_node+0x6f/0x80 [ ] __bfs+0x28a/0x3c0 [ ] ? class_equal+0x30/0x30 [ ] lockdep_count_forward_deps+0x11a/0x1a0 The warning was triggered because lockdep_count_forward_deps() calls __bfs() without current->lockdep_recursion being set; as a result a lockdep internal function (__bfs()) is checked by lockdep, which is unexpected, and the inconsistency between the irq-off state and the state traced by lockdep caused the warning. Apart from this warning, lockdep internal functions like __bfs() should always be protected by current->lockdep_recursion to avoid potential deadlocks and data inconsistency; therefore add the current->lockdep_recursion on-and-off section to protect __bfs() in both lockdep_count_forward_deps() and lockdep_count_backward_deps(). Reported-by: Qian Cai <[email protected]> Signed-off-by: Boqun Feng <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
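Roughly the shape of the fix (a simplified sketch, not the exact hunk): set lockdep's recursion guard around the graph walk so lockdep does not check its own internals.

	unsigned long lockdep_count_forward_deps(struct lock_class *class)
	{
		unsigned long ret, flags;

		raw_local_irq_save(flags);
		current->lockdep_recursion = 1;	/* keep lockdep out of __bfs() */
		arch_spin_lock(&lockdep_lock);
		ret = __lockdep_count_forward_deps(class);
		arch_spin_unlock(&lockdep_lock);
		current->lockdep_recursion = 0;
		raw_local_irq_restore(flags);

		return ret;
	}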
2020-03-20perf/core: Fix reversed NULL check in perf_event_groups_less()Dan Carpenter1-1/+1
This NULL check is reversed so it leads to a Smatch warning and presumably a NULL dereference. kernel/events/core.c:1598 perf_event_groups_less() error: we previously assumed 'right->cgrp->css.cgroup' could be null (see line 1590) Fixes: 95ed6c707f26 ("perf/cgroup: Order events in RB tree by cgroup id") Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/20200312105637.GA8960@mwanda
2020-03-20perf/core: Fix endless multiplex timerPeter Zijlstra1-6/+14
Kan and Andi reported that we fail to kill rotation when the flexible events go empty, but the context does not. XXX moar Fixes: fd7d55172d1e ("perf/cgroups: Don't rotate events for cgroups unnecessarily") Reported-by: Andi Kleen <[email protected]> Reported-by: Kan Liang <[email protected]> Tested-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-20sched/fair: Fix condition of avg_load calculationTao Zhou1-1/+2
In update_sg_wakeup_stats(), the comment says: Computing avg_load makes sense only when group is fully busy or overloaded. But the code below this comment does not perform that check. From reading the code about avg_load in other functions, I confirm that avg_load should be calculated in the fully busy or overloaded case. The comment is correct and the checking condition is wrong, so change that condition. Fixes: 57abff067a08 ("sched/fair: Rework find_idlest_group()") Signed-off-by: Tao Zhou <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Acked-by: Mel Gorman <[email protected]> Link: https://lkml.kernel.org/r/Message-ID:
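The corrected condition looks roughly like this (a sketch, assuming the group_type enum and field names used by current fair.c):

	/* Computing avg_load makes sense only when group is fully busy or overloaded */
	if (sgs->group_type == group_fully_busy ||
	    sgs->group_type == group_overloaded)
		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
				sgs->group_capacity;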
2020-03-20sched/rt: cpupri_find: Trigger a full search as fallbackQais Yousef1-23/+6
If we failed to find a fitting CPU in cpupri_find(), we only fall back to the level we found a hit at. But Steve suggested falling back to a second full scan instead, as this could be a better effort. https://lore.kernel.org/lkml/[email protected]/ We trigger the 2nd search unconditionally since the argument for triggering a full search is that the recorded fallback level might have become empty by then, which means storing any data about what happened would be meaningless and stale. I had a humble try at timing it and it seemed okay for the small 6-CPU system I was running on https://lore.kernel.org/lkml/[email protected]/ On large systems this second full scan could be expensive, but there are no users outside capacity awareness for this fitness function at the moment. Heterogeneous systems tend to be small, with 8 cores in total. Suggested-by: Steven Rostedt <[email protected]> Signed-off-by: Qais Yousef <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-20kthread: Do not preempt current task if it is going to call schedule()Liang Chen1-2/+15
When we create a kthread with kthread_create_on_cpu(), the child thread entry is kthread.c:kthread(), which will be preempted by the parent after it calls complete(done) but before schedule() is called. The parent will then call wait_task_inactive(child), but the child is still on the runqueue, so the parent will schedule_hrtimeout() for 1 jiffy, which wastes a lot of time, especially during startup.
	parent                          child
	kthread_create_on_cpu()
	  wait_for_completion(&done) -----> kthread.c:kthread()
	                             |----- complete(done); -- wakeup and preempted by parent
	  kthread_bind() <-----------|
	                             |-> schedule(); -- dequeue here
	  wait_task_inactive(child)  |
	  schedule_hrtimeout(1 jiffy)-|
So we want the child to just wake up the parent without being preempted by it. Since the child is going to call schedule() soon, the parent will then not need to call schedule_hrtimeout(1 jiffy), as the child has already been dequeued. The same issue exists for kthread_park() && kthread_parkme(). This patch can save 120ms on rk312x startup with CONFIG_HZ=300. Signed-off-by: Liang Chen <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-20sched/fair: Improve spreading of utilizationVincent Guittot1-0/+8
During load balancing, a group with spare capacity will try to pull some utilization from an overloaded group. In that case, the load balancer looks for the runqueue with the highest utilization. Nevertheless, it should also ensure that there are some pending tasks to pull, otherwise the load balance will fail to pull a task and the spread of the load will be delayed. This situation is quite transient, but it's possible to highlight the effect with a short run of a sysbench test, where the time to spread tasks impacts the global result significantly. Below are the average results for 15 iterations on an arm64 octo core: sysbench --test=cpu --num-threads=8 --max-requests=1000 run
				tip/sched/core	+patchset
	total time:		172ms		158ms
	per-request statistics:
		avg:		1.337ms		1.244ms
		max:		21.191ms	10.753ms
The average max doesn't fully reflect the wide spread of the values, which range from 1.350ms to more than 41ms for tip/sched/core and from 1.350ms to 21ms with the patch. Other factors like waiting for an idle load balance or cache hotness can delay the spreading of the tasks, which explains why we can still see up to 21ms with the patch. Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-20sched: Avoid scale real weight down to zeroMichael Wang1-1/+7
During our testing, we found a case where shares no longer work correctly. The cgroup topology is:
	/sys/fs/cgroup/cpu/A		(shares=102400)
	/sys/fs/cgroup/cpu/A/B		(shares=2)
	/sys/fs/cgroup/cpu/A/B/C	(shares=1024)
	/sys/fs/cgroup/cpu/D		(shares=1024)
	/sys/fs/cgroup/cpu/D/E		(shares=1024)
	/sys/fs/cgroup/cpu/D/E/F	(shares=1024)
The same benchmark runs in groups C and F, no other tasks are running, and the benchmark is capable of consuming all the CPUs. We expect group C to win more CPU resources since it can enjoy all the shares of group A, but it's F that wins much more. The reason is group B with its shares of 2: since A->cfs_rq.load.weight == B->se.load.weight == B->shares/nr_cpus, A->cfs_rq.load.weight becomes very small. And in calc_group_shares() we calculate shares as:
	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
	shares = (tg_shares * load) / tg_weight;
Since 'cfs_rq->load.weight' is so small, the load becomes 0 after scale down; although 'tg_shares' is 102400, the shares of the se which stands for group A on the root cfs_rq become 2. The se of D on the root cfs_rq is far bigger than 2, so it wins the battle. Thus when scale_load_down() scales the real weight down to 0, it's no longer telling the real story: the caller gets the wrong information and the calculation becomes buggy. This patch adds a check in scale_load_down() so the real weight will be >= MIN_SHARES after scaling; with it applied, group C wins as expected. Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Michael Wang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
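Roughly the shape of the check added to scale_load_down() (a sketch, assuming the 64-bit load resolution configuration): a non-zero weight is never scaled below MIN_SHARES.

	#define scale_load_down(w)					\
	({								\
		unsigned long __w = (w);				\
		if (__w)						\
			__w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT);	\
		__w;							\
	})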
2020-03-20psi: Move PF_MEMSTALL out of task->flagsYafang Shao2-11/+11
task->flags is a 32-bit field of which 31 bits have already been consumed, so it is hard to introduce another per-process flag. There is still enough space in the bit-field section of task_struct, though, so we can define the memstall state as a single bit there instead. This patch also removes an out-of-date comment pointed out by Matthew. Suggested-by: Johannes Weiner <[email protected]> Signed-off-by: Yafang Shao <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Johannes Weiner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-20psi: Optimize switching tasks inside shared cgroupsJohannes Weiner2-28/+68
When switching tasks running on a CPU, the psi state of a cgroup containing both of these tasks does not change. Right now, we don't exploit that, and can perform many unnecessary state changes in nested hierarchies, especially when most activity comes from one leaf cgroup. This patch implements an optimization where we only update cgroups whose state actually changes during a task switch. These are all cgroups that contain one task but not the other, up to the first shared ancestor. When both tasks are in the same group, we don't need to update anything at all. We can identify the first shared ancestor by walking the groups of the incoming task until we see TSK_ONCPU set on the local CPU; that's the first group that also contains the outgoing task. The new psi_task_switch() is similar to psi_task_change(). To allow code reuse, move the task flag maintenance code into a new function and the poll/avg worker wakeups into the shared psi_group_change(). Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-20psi: Fix cpu.pressure for cpu.max and competing cgroupsJohannes Weiner3-5/+37
For simplicity, cpu pressure is defined as having more than one runnable task on a given CPU. This works on the system-level, but it has limitations in a cgrouped reality: When cpu.max is in use, it doesn't capture the time in which a task is not executing on the CPU due to throttling. Likewise, it doesn't capture the time in which a competing cgroup is occupying the CPU - meaning it only reflects cgroup-internal competitive pressure, not outside pressure. Enable tracking of currently executing tasks, and then change the definition of cpu pressure in a cgroup from NR_RUNNING > 1 to NR_RUNNING > ON_CPU which will capture the effects of cpu.max as well as competition from outside the cgroup. After this patch, a cgroup running `stress -c 1` with a cpu.max setting of 5000 10000 shows ~50% continuous CPU pressure. Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
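An illustrative sketch of the new definition (hypothetical helper name; the real test lives in psi's state machine): with an on-CPU count tracked per group, "some" cpu pressure means more runnable tasks than tasks currently executing.

	static bool cpu_some_pressure(const unsigned int *tasks)
	{
		/* was: tasks[NR_RUNNING] > 1 */
		return tasks[NR_RUNNING] > tasks[NR_ONCPU];
	}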
2020-03-20sched/core: Distribute tasks within affinity masksPaul Turner1-1/+6
Currently, when updating the affinity of tasks via either cpusets.cpus or sched_setaffinity(), tasks not currently running within the newly specified mask will be arbitrarily assigned to the first CPU within the mask. This (particularly in the case that we are restricting masks) can result in many tasks being assigned to the first CPUs of their new masks. This: 1) Can induce scheduling delays while the load-balancer has a chance to spread them between their new CPUs. 2) Can antagonize a poor load-balancer behavior where it has a difficult time recognizing that a cross-socket imbalance has been forced by an affinity mask. This change adds a new cpumask interface to allow iterated calls to distribute within the intersection of the provided masks. The cases that this mainly affects are: - modifying cpuset.cpus - when tasks join a cpuset - when modifying a task's affinity via sched_setaffinity(2) Signed-off-by: Paul Turner <[email protected]> Signed-off-by: Josh Don <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Qais Yousef <[email protected]> Tested-by: Qais Yousef <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
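Roughly the shape of the new helper (a sketch, not necessarily the exact implementation): remember the last CPU handed out and start the next search after it, so successive calls spread tasks across the intersection of the two masks instead of piling them onto the first CPU.

	static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);

	int cpumask_any_and_distribute(const struct cpumask *src1p,
				       const struct cpumask *src2p)
	{
		int next, prev;

		prev = __this_cpu_read(distribute_cpu_mask_prev);

		next = cpumask_next_and(prev, src1p, src2p);
		if (next >= nr_cpu_ids)
			next = cpumask_first_and(src1p, src2p);

		if (next < nr_cpu_ids)
			__this_cpu_write(distribute_cpu_mask_prev, next);

		return next;
	}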
2020-03-20sched/fair: Fix enqueue_task_fair warningVincent Guittot1-4/+22
When a cfs_rq is throttled, it and its children are removed from the leaf list, but their nr_running is not changed, which means it can stay higher than 1. When a task is enqueued in this throttled branch, the cfs_rqs must be added back in order to ensure correct ordering in the list, but this can only happen if nr_running == 1. When cfs bandwidth is used, we unconditionally call list_add_leaf_cfs_rq() when enqueuing an entity to make sure that the complete branch will be added. Similarly, unthrottle_cfs_rq() can stop adding cfs_rqs to the list when a parent is throttled; iterate over the remaining entities to ensure that the complete branch will be added to the list. Reported-by: Christian Borntraeger <[email protected]> Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Tested-by: Christian Borntraeger <[email protected]> Tested-by: Dietmar Eggemann <[email protected]> Cc: [email protected] Cc: [email protected] #v5.1+ Link: https://lkml.kernel.org/r/[email protected]
2020-03-19ring-buffer: Optimize rb_iter_head_event()Steven Rostedt (VMware)1-0/+3
As it is fine to perform several "peeks" of event data in the ring buffer via the iterator before moving it forward, do not re-read the event, just return what was read before. Otherwise, it can cause inconsistent results, especially when testing multiple CPU buffers to interleave them. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-03-19ring-buffer: Do not die if rb_iter_peek() fails more than thriceSteven Rostedt (VMware)1-3/+10
As the iterator will be reading a live buffer, and if the event being read is on a page that a writer crosses, it will fail and try again, the condition in rb_iter_peek() that only allows a retry to happen three times is no longer valid. Allow rb_iter_peek() to retry more than three times without killing the ring buffer, but only if rb_iter_head_event() had failed at least once. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-03-19ring-buffer: Have rb_iter_head_event() handle concurrent writerSteven Rostedt (VMware)1-31/+75
Have the ring_buffer_iter structure have a place to store an event, such that it can not be overwritten by a writer, and load it in such a way via rb_iter_head_event() that it will return NULL and reset the iter to the start of the current page if a writer updated the page. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-03-19ring-buffer: Add page_stamp to iterator for synchronizationSteven Rostedt (VMware)1-3/+7
Have the ring_buffer_iter structure contain a page_stamp, such that it can be used to see if the writer entered the page the iterator is on. When going to a new page, the iterator will record the time stamp of that page. When reading events, it can copy the event to an internal buffer on the iterator (to be implemented later), then check the page's time stamp with its own to see if the writer entered the page. If so, it will need to try to read the event again. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-03-19ring-buffer: Rename ring_buffer_read() to read_buffer_iter_advance()Steven Rostedt (VMware)3-20/+9
When the ring buffer was first created, the iterator followed the normal producer/consumer operations where it had both a peek() operation, that just returned the event at the current location, and a read(), that would return the event at the current location and also increment the iterator such that the next peek() or read() will return the next event. The only use of ring_buffer_read() is currently to move the iterator to the next location, and nothing now actually reads the event it returns. Rename this function to match its actual use case: ring_buffer_iter_advance(), which also adds the "iter" part to the name, which is more meaningful. As the timestamp returned by ring_buffer_read() was never used, there's no reason that this new version should bother returning it. It will also become a void function. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-03-19ring-buffer: Have ring_buffer_empty() not depend on tracing stoppedSteven Rostedt (VMware)1-2/+23
It was reported that when the trace file is read, tracing is disabled, because the iterator expects that the buffer it reads is not updated while it reads it. Several steps are needed to make the iterator handle a writer, by testing if things have changed as it reads. This step is to make ring_buffer_empty() expect the buffer to be changing. Note that if the current location of the iterator is overwritten, it will return false as new data is being added; this means that data will be skipped. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-03-19tracing: Save off entry when peeking at next entrySteven Rostedt (VMware)2-10/+45
In order to have the iterator read the buffer even when it's still updating, it requires that the ring buffer iterator saves each event in a separate location outside the ring buffer such that its use is immutable. There's one use case that saves off the event returned from the ring buffer iterator and calls it again to look at the next event, before going back to use the first event. As the ring buffer iterator will only have a single copy, this use case will no longer be supported. Instead, have the one use case create its own buffer to store the first event when looking at the next event. This way, when looking at the first event again, it won't be corrupted by the second read. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-03-19tracing: Use address-of operator on section symbolsNathan Chancellor1-1/+1
Clang warns: ../kernel/trace/trace.c:9335:33: warning: array comparison always evaluates to true [-Wtautological-compare] if (__stop___trace_bprintk_fmt != __start___trace_bprintk_fmt) ^ 1 warning generated. These are not true arrays, they are linker defined symbols, which are just addresses. Using the address of operator silences the warning and does not change the runtime result of the check (tested with some print statements compiled in with clang + ld.lld and gcc + ld.bfd in QEMU). Link: http://lkml.kernel.org/r/[email protected] Link: https://github.com/ClangBuiltLinux/linux/issues/893 Suggested-by: Nick Desaulniers <[email protected]> Signed-off-by: Nathan Chancellor <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
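A minimal sketch of the pattern (hypothetical wrapper name, declarations assumed to be the usual extern arrays): take the address of the linker-defined symbols explicitly instead of letting the array names decay, which clang flags as a tautological comparison.

	extern const char __start___trace_bprintk_fmt[];
	extern const char __stop___trace_bprintk_fmt[];

	static bool have_bprintk_fmts(void)
	{
		return &__stop___trace_bprintk_fmt != &__start___trace_bprintk_fmt;
	}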
2020-03-19Revert "tick/common: Make tick_periodic() check for missing ticks"Thomas Gleixner1-33/+3
This reverts commit d441dceb5dce71150f28add80d36d91bbfccba99 due to boot failures. Reported-by: Qian Cai <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Waiman Long <[email protected]>
2020-03-19Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar10-62/+116
Signed-off-by: Ingo Molnar <[email protected]>
2020-03-19bpf: Support llvm-objcopy for vmlinux BTFFangrui Song2-11/+9
Simplify gen_btf logic to make it work with llvm-objcopy. The existing 'file format' and 'architecture' parsing logic is brittle and does not work with llvm-objcopy/llvm-objdump. 'file format' output of llvm-objdump>=11 will match GNU objdump, but 'architecture' (bfdarch) may not. .BTF in .tmp_vmlinux.btf is non-SHF_ALLOC. Add the SHF_ALLOC flag because it is part of vmlinux image used for introspection. C code can reference the section via linker script defined __start_BTF and __stop_BTF. This fixes a small problem that previous .BTF had the SHF_WRITE flag (objcopy -I binary -O elf* synthesized .data). Additionally, `objcopy -I binary` synthesized symbols _binary__btf_vmlinux_bin_start and _binary__btf_vmlinux_bin_stop (not used elsewhere) are replaced with more commonplace __start_BTF and __stop_BTF. Add 2>/dev/null because GNU objcopy (but not llvm-objcopy) warns "empty loadable segment detected at vaddr=0xffffffff81000000, is this intentional?" We use a dd command to change the e_type field in the ELF header from ET_EXEC to ET_REL so that lld will accept .btf.vmlinux.bin.o. Accepting ET_EXEC as an input file is an extremely rare GNU ld feature that lld does not intend to support, because this is error-prone. The output section description .BTF in include/asm-generic/vmlinux.lds.h avoids potential subtle orphan section placement issues and suppresses --orphan-handling=warn warnings. Fixes: df786c9b9476 ("bpf: Force .BTF section start to zero when dumping from vmlinux") Fixes: cb0cc635c7a9 ("powerpc: Include .BTF section") Reported-by: Nathan Chancellor <[email protected]> Signed-off-by: Fangrui Song <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Tested-by: Stanislav Fomichev <[email protected]> Tested-by: Andrii Nakryiko <[email protected]> Reviewed-by: Stanislav Fomichev <[email protected]> Reviewed-by: Kees Cook <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Michael Ellerman <[email protected]> (powerpc) Link: https://github.com/ClangBuiltLinux/linux/issues/871 Link: https://lore.kernel.org/bpf/[email protected]
2020-03-19time/sched_clock: Expire timer in hardirq contextAhmed S. Darwish1-4/+5
To minimize latency, PREEMPT_RT kernels expire hrtimers in preemptible softirq context by default. This can be overridden by marking the timer's expiry with HRTIMER_MODE_HARD. sched_clock_timer is missing this annotation: if its callback is preempted and the duration of the preemption exceeds the wrap around time of the underlying clocksource, sched clock will get out of sync. Mark the sched_clock_timer for expiry in hard interrupt context. Signed-off-by: Ahmed S. Darwish <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
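Roughly the shape of the change (a sketch of the relevant calls in kernel/time/sched_clock.c, not the full context): initialize and arm the wrap timer with a _HARD expiry mode so PREEMPT_RT cannot defer it into preemptible softirq context.

	hrtimer_init(&sched_clock_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
	sched_clock_timer.function = sched_clock_poll;
	hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL_HARD);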
2020-03-18misc: cleanup minor number definitions in c file into miscdevice.hZhenzhong Duan1-2/+0
HWRNG_MINOR and RNG_MISCDEV_MINOR are duplicate definitions; use the unified HWRNG_MINOR instead and move it into miscdevice.h. ANSLCD_MINOR and LCD_MINOR are duplicate definitions; use the unified LCD_MINOR instead and move it into miscdevice.h. MISCDEV_MINOR is renamed to PXA3XX_GCU_MINOR and moved into miscdevice.h. Other definitions are just moved without any change. Link: https://lore.kernel.org/lkml/[email protected]/t/ Suggested-by: Arnd Bergmann <[email protected]> Build-tested-by: Willy TARREAU <[email protected]> Build-tested-by: Miguel Ojeda <[email protected]> Signed-off-by: Zhenzhong Duan <[email protected]> Acked-by: Miguel Ojeda <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Herbert Xu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2020-03-17bpf: Sanitize the bpf_struct_ops tcp-cc nameMartin KaFai Lau1-11/+14
The bpf_struct_ops tcp-cc name should be sanitized in order to avoid problematic chars (e.g. whitespaces). This patch reuses the bpf_obj_name_cpy() for accepting the same set of characters in order to keep a consistent bpf programming experience. A "size" param is added. Also, the strlen is returned on success so that the caller (like the bpf_tcp_ca here) can error out on empty name. The existing callers of the bpf_obj_name_cpy() only need to change the testing statement to "if (err < 0)". For all these existing callers, the err will be overwritten later, so no extra change is needed for the new strlen return value. v3: - reverse xmas tree style v2: - Save the orig_src to avoid "end - size" (Andrii) Fixes: 0baf26b0fcd7 ("bpf: tcp: Support tcp_congestion_ops in bpf") Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-03-17Merge branch 'for-5.7-console-exit' of ↵Greg Kroah-Hartman1-20/+29
git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk into tty-next We need the console patches in here as well for futher work from Andy. * 'for-5.7-console-exit' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: console: Introduce ->exit() callback console: Don't notify user space when unregister non-listed console console: Avoid positive return code from unregister_console() console: Drop misleading comment console: Use for_each_console() helper in unregister_console() console: Drop double check for console_drivers being non-NULL console: Don't perform test for CON_BRL flag Signed-off-by: Greg Kroah-Hartman <[email protected]>
2020-03-17dma-mapping: Fix dma_pgprot() for unencrypted coherent pagesThomas Hellstrom1-0/+2
When dma_mmap_coherent() sets up a mapping to unencrypted coherent memory under SEV encryption and sometimes under SME encryption, it will actually set up an encrypted mapping rather than an unencrypted one, causing devices that DMA from that memory to read encrypted contents. Fix this. When force_dma_unencrypted() returns true, the linear kernel map of the coherent pages has had the encryption bit explicitly cleared and the page content is unencrypted. Make sure that any additional PTEs we set up to these pages also have the encryption bit cleared by having dma_pgprot() return a protection with the encryption bit cleared in this case. Signed-off-by: Thomas Hellstrom <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Tom Lendacky <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
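The fix boils down to something like the following at the top of dma_pgprot() (a sketch of the hunk, not the whole function): when DMA memory is forced to be unencrypted, clear the encryption bit from any new mapping of it as well.

	if (force_dma_unencrypted(dev))
		prot = pgprot_decrypted(prot);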
2020-03-16cgroupfs: Support user xattrsDaniel Xu1-1/+2
This patch turns on xattr support for cgroupfs. This is useful for letting non-root owners of delegated subtrees attach metadata to cgroups. One use case is for subtree owners to tell a userspace out of memory killer to bias away from killing specific subtrees. Tests: [/sys/fs/cgroup]# for i in $(seq 0 130); \ do setfattr workload.slice -n user.name$i -v wow; done setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device [/sys/fs/cgroup]# for i in $(seq 0 130); \ do setfattr workload.slice --remove user.name$i; done setfattr: workload.slice: No such attribute setfattr: workload.slice: No such attribute setfattr: workload.slice: No such attribute [/sys/fs/cgroup]# for i in $(seq 0 130); \ do setfattr workload.slice -n user.name$i -v wow; done setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device `seq 0 130` is inclusive, and 131 - 128 = 3, which is the number of errors we expect to see. [/data]# cat testxattr.c #include <sys/types.h> #include <sys/xattr.h> #include <stdio.h> #include <stdlib.h> int main() { char name[256]; char *buf = malloc(64 << 10); if (!buf) { perror("malloc"); return 1; } for (int i = 0; i < 4; ++i) { snprintf(name, 256, "user.bigone%d", i); if (setxattr("/sys/fs/cgroup/system.slice", name, buf, 64 << 10, 0)) { printf("setxattr failed on iteration=%d\n", i); return 1; } } return 0; } [/data]# ./a.out setxattr failed on iteration=2 [/data]# ./a.out setxattr failed on iteration=0 [/sys/fs/cgroup]# setfattr -x user.bigone0 system.slice/ [/sys/fs/cgroup]# setfattr -x user.bigone1 system.slice/ [/data]# ./a.out setxattr failed on iteration=2 Signed-off-by: Daniel Xu <[email protected]> Acked-by: Chris Down <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2020-03-16dma-direct: provide a arch_dma_clear_uncached hookChristoph Hellwig1-0/+2
This allows the arch code to reset the page tables to cached access when freeing a dma coherent allocation that was set to uncached using arch_dma_set_uncached. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Robin Murphy <[email protected]>
2020-03-16dma-direct: make uncached_kernel_address more generalChristoph Hellwig1-4/+6
Rename the symbol to arch_dma_set_uncached, and pass a size to it as well as allow an error return. That will allow reusing this hook for in-place pagetable remapping. As the in-place remap doesn't always require an explicit cache flush, also detangle ARCH_HAS_DMA_PREP_COHERENT from ARCH_HAS_DMA_SET_UNCACHED. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Robin Murphy <[email protected]>
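For reference, a sketch of the renamed hook and how a caller might use it (prototype as described above; the error-handling label is illustrative):

	void *arch_dma_set_uncached(void *addr, size_t size);

	/* in the DMA allocator: */
	ret = arch_dma_set_uncached(ret, size);
	if (IS_ERR(ret))
		goto out_free_pages;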
2020-03-16dma-direct: consolidate the error handling in dma_direct_alloc_pagesChristoph Hellwig1-7/+6
Use a goto label to merge two error return cases. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Robin Murphy <[email protected]>
2020-03-16dma-coherent: fix integer overflow in the reserved-memory dma allocationKevin Grandemange1-6/+7
pageno is an int and the PAGE_SHIFT shift is done on an int, overflowing if the memory is bigger than 2G. This can be reproduced using, for example, a reserved-memory node of 4G: reserved-memory { #address-cells = <2>; #size-cells = <2>; ranges; reserved_dma: buffer@0 { compatible = "shared-dma-pool"; no-map; reg = <0x5 0x00000000 0x1 0x0>; }; }; Signed-off-by: Kevin Grandemange <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
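A sketch of the overflow and its fix (field names follow the reserved-memory allocator; simplified): widen pageno before shifting so the offset arithmetic is not done in 32 bits.

	int pageno = bitmap_find_free_region(mem->bitmap, mem->size, order);

	/* before: the shift happens on an int and overflows past 2G */
	*dma_handle = mem->device_base + (pageno << PAGE_SHIFT);

	/* after: do the shift in a 64-bit type */
	*dma_handle = mem->device_base + ((dma_addr_t)pageno << PAGE_SHIFT);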
2020-03-15Merge tag 'locking-urgent-2020-03-15' of ↵Linus Torvalds1-38/+55
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull futex fix from Thomas Gleixner: "Fix for yet another subtle futex issue. The futex code used ihold() to prevent inodes from vanishing, but ihold() does not guarantee inode persistence. Replace the inode pointer with a per boot, machine wide, unique inode identifier. The second commit fixes the breakage of the hash mechanism which causes a 100% performance regression" * tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Unbreak futex hashing futex: Fix inode life-time issue
2020-03-15Merge tag 'timers-urgent-2020-03-15' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "A single fix adding the missing time namespace adjustment in sys/sysinfo which caused sys/sysinfo to be inconsistent with /proc/uptime when read from a task inside a time namespace" * tag 'timers-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sys/sysinfo: Respect boottime inside time namespace
2020-03-14PM / hibernate: Remove unnecessary compat ioctl overridesEric Biggers1-15/+1
Since the SNAPSHOT_GET_IMAGE_SIZE, SNAPSHOT_AVAIL_SWAP_SIZE, and SNAPSHOT_ALLOC_SWAP_PAGE ioctls produce an loff_t result, and loff_t is always 64-bit even in the compat case, there's no reason to have the special compat handling for these ioctls. Just remove this unneeded code so that these ioctls call into snapshot_ioctl() directly, doing just the compat_ptr() conversion on the argument. (This unnecessary code was also causing a KMSAN false positive.) Signed-off-by: Eric Biggers <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2020-03-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller12-219/+542
Daniel Borkmann says: ==================== pull-request: bpf-next 2020-03-13 The following pull-request contains BPF updates for your *net-next* tree. We've added 86 non-merge commits during the last 12 day(s) which contain a total of 107 files changed, 5771 insertions(+), 1700 deletions(-). The main changes are: 1) Add modify_return attach type which allows attaching to a function via BPF trampoline and is run after the fentry and before the fexit programs and can pass a return code to the original caller, from KP Singh. 2) Generalize BPF's kallsyms handling and add BPF trampoline and dispatcher objects to be visible in /proc/kallsyms so they can be annotated in stack traces, from Jiri Olsa. 3) Extend BPF sockmap to allow for UDP next to existing TCP support in order to enable this for BPF based socket dispatch, from Lorenz Bauer. 4) Introduce a new bpftool 'prog profile' command which attaches to existing BPF programs via fentry and fexit hooks and reads out hardware counters during that period, from Song Liu. Example usage: bpftool prog profile id 337 duration 3 cycles instructions llc_misses 4228 run_cnt 3403698 cycles (84.08%) 3525294 instructions # 1.04 insn per cycle (84.05%) 13 llc_misses # 3.69 LLC misses per million isns (83.50%) 5) Batch of improvements to libbpf, bpftool and BPF selftests. Also addition of a new bpf_link abstraction to keep in particular BPF tracing programs attached even when the application owning them exits, from Andrii Nakryiko. 6) New bpf_get_current_pid_tgid() helper for tracing to perform PID filtering and which returns the PID as seen by the init namespace, from Carlos Neira. 7) Refactor of RISC-V JIT code to move out common pieces and addition of a new RV32G BPF JIT compiler, from Luke Nelson. 8) Add gso_size context member to __sk_buff in order to be able to know whether a given skb is GSO or not, from Willem de Bruijn. 9) Add a new bpf_xdp_output() helper which reuses XDP's existing perf RB output implementation but can be called from tracepoint programs, from Eelco Chaudron. ==================== Signed-off-by: David S. Miller <[email protected]>
2020-03-13bpf: Add missing annotations for __bpf_prog_enter() and __bpf_prog_exit()Jules Irenge1-0/+2
Sparse reports warnings at __bpf_prog_enter() and __bpf_prog_exit(): warning: context imbalance in __bpf_prog_enter() - wrong count at exit warning: context imbalance in __bpf_prog_exit() - unexpected unlock The root cause is the missing annotations at __bpf_prog_enter() and __bpf_prog_exit(); add the missing __acquires(RCU) and __releases(RCU) annotations. Signed-off-by: Jules Irenge <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
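The added annotations are of this shape (a sketch of the prototypes; the real annotations sit on the definitions in the trampoline code):

	u64 notrace __bpf_prog_enter(void) __acquires(RCU);
	void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start) __releases(RCU);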
2020-03-13bpf: Remove bpf_image treeJiri Olsa3-77/+12
Now that we have all the objects (bpf_prog, bpf_trampoline, bpf_dispatcher) linked in bpf_tree, there's no need to have separate bpf_image tree for images. Reverting the bpf_image tree together with struct bpf_image, because it's no longer needed. Also removing bpf_image_alloc function and adding the original bpf_jit_alloc_exec_page interface instead. The kernel_text_address function can now rely only on is_bpf_text_address, because it checks the bpf_tree that contains all the objects. Keeping bpf_image_ksym_add and bpf_image_ksym_del because they are useful wrappers with perf's ksymbol interface calls. Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2020-03-13bpf: Add dispatchers to kallsymsJiri Olsa1-0/+1
Adding dispatchers to kallsyms. It's displayed as bpf_dispatcher_<NAME> where NAME is the name of dispatcher. Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2020-03-13bpf: Add trampolines to kallsymsJiri Olsa1-0/+28
Adding trampolines to kallsyms. It's displayed as bpf_trampoline_<ID> [bpf] where ID is the BTF id of the trampoline function. Adding bpf_image_ksym_add/del functions that setup the start/end values and call KSYMBOL perf events handlers. Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>