aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2019-04-19cgroup: protect cgroup->nr_(dying_)descendants by css_set_lockRoman Gushchin1-0/+6
The number of descendant cgroups and the number of dying descendant cgroups are currently synchronized using the cgroup_mutex. The number of descendant cgroups will be required by the cgroup v2 freezer, which will use it to determine if a cgroup is frozen (depending on total number of descendants and number of frozen descendants). It's not always acceptable to grab the cgroup_mutex, especially from quite hot paths (e.g. exit()). To avoid this, let's additionally synchronize these counters using the css_set_lock. So, it's safe to read these counters with either cgroup_mutex or css_set_lock locked, and for changing both locks should be acquired. Signed-off-by: Roman Gushchin <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Cc: [email protected]
2019-04-19cgroup: implement __cgroup_task_count() helperRoman Gushchin3-16/+34
The helper is identical to the existing cgroup_task_count() except it doesn't take the css_set_lock by itself, assuming that the caller does. Also, move cgroup_task_count() implementation into kernel/cgroup/cgroup.c, as there is nothing specific to cgroup v1. Signed-off-by: Roman Gushchin <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Cc: [email protected]
2019-04-19cgroup: rename freezer.c into legacy_freezer.cRoman Gushchin2-1/+1
Freezer.c will contain an implementation of cgroup v2 freezer, so let's rename the v1 freezer to avoid naming conflicts. Signed-off-by: Roman Gushchin <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Cc: [email protected]
2019-04-19sched/topology: Update init_sched_domains() commentJuri Lelli1-3/+2
Holding hotplug lock is not a requirement anymore for callers of sched_ init_domains after commit: 6acce3ef8452 ("sched: Remove get_online_cpus() usage") Update the relative comment preceding init_sched_domains(). Signed-off-by: Juri Lelli <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-19cgroup/cpuset: Update stale generate_sched_domains() commentsJuri Lelli1-6/+5
Commit: fc560a26acce ("cpuset: replace cpuset->stack_list with cpuset_for_each_descendant_pre()") removed the local list (q) that was used to perform a top-down scan of all cpusets; however, comments mentioning it were not updated. Update comments to reflect current implementation. Signed-off-by: Juri Lelli <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-19tick: Fix typos in commentsLaurent Gauthier1-2/+2
Signed-off-by: Laurent Gauthier <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2019-04-19kernel/watchdog_hld.c: hard lockup message should end with a newlineSergey Senozhatsky1-1/+2
Separate print_modules() and hard lockup error message. Before the patch: NMI watchdog: Watchdog detected hard LOCKUP on cpu 1Modules linked in: nls_cp437 Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-04-19kprobes: Mark ftrace mcount handler functions nokprobeMasami Hiramatsu1-1/+5
Mark ftrace mcount handler functions nokprobe since probing on these functions with kretprobe pushes return address incorrectly on kretprobe shadow stack. Reported-by: Francis Deslauriers <[email protected]> Tested-by: Andrea Righi <[email protected]> Signed-off-by: Masami Hiramatsu <[email protected]> Acked-by: Steven Rostedt <[email protected]> Acked-by: Steven Rostedt (VMware) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/155094062044.6137.6419622920568680640.stgit@devbox Signed-off-by: Ingo Molnar <[email protected]>
2019-04-19sched/core: Check quota and period overflow at usec to nsec conversionKonstantin Khlebnikov1-1/+6
Large values could overflow u64 and pass following sanity checks. # echo 18446744073750000 > cpu.cfs_period_us # cat cpu.cfs_period_us 40448 # echo 18446744073750000 > cpu.cfs_quota_us # cat cpu.cfs_quota_us 40448 After this patch they will fail with -EINVAL. Signed-off-by: Konstantin Khlebnikov <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/155125502079.293431.3947497929372138600.stgit@buzz Signed-off-by: Ingo Molnar <[email protected]>
2019-04-19sched/core: Handle overflow in cpu_shares_write_u64Konstantin Khlebnikov1-0/+2
Bit shift in scale_load() could overflow shares. This patch saturates it to MAX_SHARES like following sched_group_set_shares(). Example: # echo 9223372036854776832 > cpu.shares # cat cpu.shares Before patch: 1024 After pattch: 262144 Signed-off-by: Konstantin Khlebnikov <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/155125501891.293431.3345233332801109696.stgit@buzz Signed-off-by: Ingo Molnar <[email protected]>
2019-04-19sched/rt: Check integer overflow at usec to nsec conversionKonstantin Khlebnikov1-0/+5
Example of unhandled overflows: # echo 18446744073709651 > cpu.rt_runtime_us # cat cpu.rt_runtime_us 99 # echo 18446744073709900 > cpu.rt_period_us # cat cpu.rt_period_us 348 After this patch they will fail with -EINVAL. Signed-off-by: Konstantin Khlebnikov <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/155125501739.293431.5252197504404771496.stgit@buzz Signed-off-by: Ingo Molnar <[email protected]>
2019-04-19mm/resource: Use resource_overlaps() to simplify region_intersects()Wei Yang2-8/+7
The three checks in region_intersects() are basically an open-coded version of resource_overlaps() - so use the real thing. Also fix typos in comments while at it. Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Like Xu <[email protected]> Reviewed-by: Yuan Yao <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] [ Rewrote the changelog. ] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-19rseq: Remove superfluous rseq_len from task_structMathieu Desnoyers1-4/+2
The rseq system call, when invoked with flags of "0" or "RSEQ_FLAG_UNREGISTER" values, expects the rseq_len parameter to be equal to sizeof(struct rseq), which is fixed-size and fixed-layout, specified in uapi linux/rseq.h. Expecting a fixed size for rseq_len is a design choice that ensures multiple libraries and application defining __rseq_abi in the same process agree on its exact size. Considering that this size is and will always be the same value, there is no point in saving this value within task_struct rseq_len. Remove this field from task_struct. No change in functionality intended. Signed-off-by: Mathieu Desnoyers <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Ben Maurer <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chris Lameter <[email protected]> Cc: Dave Watson <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Paul Turner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Russell King <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-19rseq: Clean up comments by reflecting removal of event counterMathieu Desnoyers1-2/+1
The "event counter" was removed from rseq before it was merged upstream. However, a few comments in the source code still refer to it. Adapt the comments to match reality. Signed-off-by: Mathieu Desnoyers <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Ben Maurer <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chris Lameter <[email protected]> Cc: Dave Watson <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Paul Turner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Russell King <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-19sched/core: Fix typo in commentJoel Savitz1-1/+1
Signed-off-by: Joel Savitz <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-18ipv6: Add rate limit mask for ICMPv6 messagesStephen Suryaputra1-0/+6
To make ICMPv6 closer to ICMPv4, add ratemask parameter. Since the ICMP message types use larger numeric values, a simple bitmask doesn't fit. I use large bitmap. The input and output are the in form of list of ranges. Set the default to rate limit all error messages but Packet Too Big. For Packet Too Big, use ratemask instead of hard-coded. There are functions where icmpv6_xrlim_allow() and icmpv6_global_allow() aren't called. This patch only adds them to icmpv6_echo_reply(). Rate limiting error messages is mandated by RFC 4443 but RFC 4890 says that it is also acceptable to rate limit informational messages. Thus, I removed the current hard-coded behavior of icmpv6_mask_allow() that doesn't rate limit informational messages. v2: Add dummy function proc_do_large_bitmap() if CONFIG_PROC_SYSCTL isn't defined, expand the description in ip-sysctl.txt and remove unnecessary conditional before kfree(). v3: Inline the bitmap instead of dynamically allocated. Still is a pointer to it is needed because of the way proc_do_large_bitmap work. Signed-off-by: Stephen Suryaputra <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-04-18sched/core: Make some functions staticYueHaibing2-6/+6
Fix these sparse warnings: kernel/sched/core.c:6577:11: warning: symbol 'min_cfs_quota_period' was not declared. Should it be static? kernel/sched/core.c:6657:5: warning: symbol 'tg_set_cfs_quota' was not declared. Should it be static? kernel/sched/core.c:6670:6: warning: symbol 'tg_get_cfs_quota' was not declared. Should it be static? kernel/sched/core.c:6683:5: warning: symbol 'tg_set_cfs_period' was not declared. Should it be static? kernel/sched/core.c:6693:6: warning: symbol 'tg_get_cfs_period' was not declared. Should it be static? kernel/sched/fair.c:2596:6: warning: symbol 'task_tick_numa' was not declared. Should it be static? Signed-off-by: YueHaibing <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-18signal: use fdget() since we don't allow O_PATHChristian Brauner1-1/+1
As stated in the original commit for pidfd_send_signal() we don't allow to signal processes through O_PATH file descriptors since it is semantically equivalent to a write on the pidfd. We already correctly error out right now and return EBADF if an O_PATH fd is passed. This is because we use file->f_op to detect whether a pidfd is passed and O_PATH fds have their file->f_op set to empty_fops in do_dentry_open() and thus fail the test. Thus, there is no regression. It's just semantically correct to use fdget() and return an error right from there instead of taking a reference and returning an error later. Signed-off-by: Christian Brauner <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Kees Cook <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Jann Horn <[email protected]> Cc: David Howells <[email protected]> Cc: "Michael Kerrisk (man-pages)" <[email protected]> Cc: Andy Lutomirsky <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Aleksa Sarai <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-04-18Merge branch 'for-mingo' of ↵Ingo Molnar14-826/+831
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU and LKMM commits from Paul E. McKenney: - An LKMM commit adding support for synchronize_srcu_expedited() - A couple of straggling RCU flavor consolidation updates - Documentation updates. - Miscellaneous fixes - SRCU updates - RCU CPU stall-warning updates - Torture-test updates Signed-off-by: Ingo Molnar <[email protected]>
2019-04-18timers/sched_clock: Prevent generic sched_clock wrap caused by tick_freeze()Chang-An Chen3-2/+11
tick_freeze() introduced by suspend-to-idle in commit 124cf9117c5f ("PM / sleep: Make it possible to quiesce timers during suspend-to-idle") uses timekeeping_suspend() instead of syscore_suspend() during suspend-to-idle. As a consequence generic sched_clock will keep going because sched_clock_suspend() and sched_clock_resume() are not invoked during suspend-to-idle which can result in a generic sched_clock wrap. On a ARM system with suspend-to-idle enabled, sched_clock is registered as "56 bits at 13MHz, resolution 76ns, wraps every 4398046511101ns", which means the real wrapping duration is 8796093022202ns. [ 134.551779] suspend-to-idle suspend (timekeeping_suspend()) [ 1204.912239] suspend-to-idle resume (timekeeping_resume()) ...... [ 1206.912239] suspend-to-idle suspend (timekeeping_suspend()) [ 5880.502807] suspend-to-idle resume (timekeeping_resume()) ...... [ 6000.403724] suspend-to-idle suspend (timekeeping_suspend()) [ 8035.753167] suspend-to-idle resume (timekeeping_resume()) ...... [ 8795.786684] (2)[321:charger_thread]...... [ 8795.788387] (2)[321:charger_thread]...... [ 0.057226] (0)[0:swapper/0]...... [ 0.061447] (2)[0:swapper/2]...... sched_clock was not stopped during suspend-to-idle, and sched_clock_poll hrtimer was not expired because timekeeping_suspend() was invoked during suspend-to-idle. It makes sched_clock wrap at kernel time 8796s. To prevent this, invoke sched_clock_suspend() and sched_clock_resume() in tick_freeze() together with timekeeping_suspend() and timekeeping_resume(). Fixes: 124cf9117c5f (PM / sleep: Make it possible to quiesce timers during suspend-to-idle) Signed-off-by: Chang-An Chen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Matthias Brugger <[email protected]> Cc: John Stultz <[email protected]> Cc: Kees Cook <[email protected]> Cc: Corey Minyard <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Stanley Chu <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2019-04-18irq_work: Do not raise an IPI when queueing work on the local CPUNicholas Piggin1-33/+42
The QEMU PowerPC/PSeries machine model was not expecting a self-IPI, and it may be a bit surprising thing to do, so have irq_work_queue_on do local queueing when target is the current CPU. Suggested-by: Steven Rostedt <[email protected]> Reported-by: Sebastian Andrzej Siewior <[email protected]> Tested-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: =?UTF-8?q?C=C3=A9dric=20Le=20Goater?= <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Suraj Jitindar Singh <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] [ Simplified the preprocessor comments. Fixed unbalanced curly brackets pointed out by Thomas. ] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-18watchdog: Fix typo in commentArash Fotouhi1-1/+1
Signed-off-by: Arash Fotouhi <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-18locking/lockdep: Generate LOCKF_ bit compositesPeter Zijlstra1-6/+22
Instead of open-coding the bitmasks, generate them using the lockdep_states.h header. This prepares for additional states, which would make the manual masks tedious and error prone. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2019-04-18locking/lockdep: Use expanded masks on find_usage_*() functionsFrederic Weisbecker1-10/+10
In order to optimize check_irq_usage() and factorize all the IRQ usage validations we'll need to be able to check multiple lock usage bits at once. Prepare the low level usage mask check functions for that purpose. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-18locking/lockdep: Map remaining magic numbers to lock usage mask namesFrederic Weisbecker1-4/+7
Clarify the code with mapping some more constant numbers that haven't been named after their corresponding LOCK_USAGE_* symbol. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-18locking/lockdep: Move valid_state() inside CONFIG_TRACE_IRQFLAGS && ↵Frederic Weisbecker1-4/+6
CONFIG_PROVE_LOCKING valid_state() and print_usage_bug*() functions are not used beyond irq locking correctness checks under CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING. Sadly the "unused function" warning wouldn't fire because valid_state() is inline so the unused case has remained unseen until now. So move them inside the appropriate CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING section. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-17bpf: cpumap memory prefetchw optimizations for struct pageJesper Dangaard Brouer1-0/+12
A lot of the performance gain comes from this patch. While analysing performance overhead it was found that the largest CPU stalls were caused when touching the struct page area. It is first read with a READ_ONCE from build_skb_around via page_is_pfmemalloc(), and when freed written by page_frag_free() call. Measurements show that the prefetchw (W) variant operation is needed to achieve the performance gain. We believe this optimization it two fold, first the W-variant saves one step in the cache-coherency protocol, and second it helps us to avoid the non-temporal prefetch HW optimizations and bring this into all cache-levels. It might be worth investigating if prefetch into L2 will have the same benefit. Signed-off-by: Jesper Dangaard Brouer <[email protected]> Acked-by: Ilias Apalodimas <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2019-04-17bpf: cpumap do bulk allocation of SKBsJesper Dangaard Brouer1-7/+15
As cpumap now batch consume xdp_frame's from the ptr_ring, it knows how many SKBs it need to allocate. Thus, lets bulk allocate these SKBs via kmem_cache_alloc_bulk() API, and use the previously introduced function build_skb_around(). Notice that the flag __GFP_ZERO asks the slab/slub allocator to clear the memory for us. This does clear a larger area than needed, but my micro benchmarks on Intel CPUs show that this is slightly faster due to being a cacheline aligned area is cleared for the SKBs. (For SLUB allocator, there is a future optimization potential, because SKBs will with high probability originate from same page. If we can find/identify continuous memory areas then the Intel CPU memset rep stos will have a real performance gain.) Signed-off-by: Jesper Dangaard Brouer <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2019-04-17bpf: cpumap use ptr_ring_consume_batchedJesper Dangaard Brouer1-10/+11
Move ptr_ring dequeue outside loop, that allocate SKBs and calls network stack, as these operations that can take some time. The ptr_ring is a communication channel between CPUs, where we want to reduce/limit any cacheline bouncing. Do a concentrated bulk dequeue via ptr_ring_consume_batched, to shorten the period and times the remote cacheline in ptr_ring is read Batch size 8 is both to (1) limit BH-disable period, and (2) consume one cacheline on 64-bit archs. After reducing the BH-disable section further then we can consider changing this, while still thinking about L1 cacheline size being active. Signed-off-by: Jesper Dangaard Brouer <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2019-04-17cpu/speculation: Add 'mitigations=' cmdline optionJosh Poimboeuf1-0/+15
Keeping track of the number of mitigations for all the CPU speculation bugs has become overwhelming for many users. It's getting more and more complicated to decide which mitigations are needed for a given architecture. Complicating matters is the fact that each arch tends to have its own custom way to mitigate the same vulnerability. Most users fall into a few basic categories: a) they want all mitigations off; b) they want all reasonable mitigations on, with SMT enabled even if it's vulnerable; or c) they want all reasonable mitigations on, with SMT disabled if vulnerable. Define a set of curated, arch-independent options, each of which is an aggregation of existing options: - mitigations=off: Disable all mitigations. - mitigations=auto: [default] Enable all the default mitigations, but leave SMT enabled, even if it's vulnerable. - mitigations=auto,nosmt: Enable all the default mitigations, disabling SMT if needed by a mitigation. Currently, these options are placeholders which don't actually do anything. They will be fleshed out in upcoming patches. Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Jiri Kosina <[email protected]> (on x86) Reviewed-by: Jiri Kosina <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H . Peter Anvin" <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Waiman Long <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Jon Masters <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: [email protected] Cc: Martin Schwidefsky <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: [email protected] Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Greg Kroah-Hartman <[email protected]> Cc: Tyler Hicks <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Steven Price <[email protected]> Cc: Phil Auld <[email protected]> Link: https://lkml.kernel.org/r/b07a8ef9b7c5055c3a4637c87d07c296d5016fe0.1555085500.git.jpoimboe@redhat.com
2019-04-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller9-34/+72
Conflict resolution of af_smc.c from Stephen Rothwell. Signed-off-by: David S. Miller <[email protected]>
2019-04-16bpf: use BPF_CAST_CALL for casting bpf callPrashant Bhole1-3/+2
verifier.c uses BPF_CAST_CALL for casting bpf call except at one place in jit_subprogs(). Let's use the macro for consistency. Signed-off-by: Prashant Bhole <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2019-04-16genirq/devres: Use struct_size() in devm_kzalloc()Gustavo A. R. Silva1-2/+1
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; size = sizeof(struct foo) + count * sizeof(struct boo); instance = devm_kzalloc(dev, size, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper. instance = devm_kzalloc(dev, struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/20190410170914.GA16161@embeddedor
2019-04-16sched/core: Unify p->on_rq updatesPeter Zijlstra2-7/+4
Almost all {,de}activate_task() invocations pair with p->on_rq updates, the exception being the usage in rt/deadline which hold both rq locks and therefore don't strictly need to set TASK_ON_RQ_MIGRATING, but it is harmless if we do anyway. Put the updates in {,de}activate_task() and cut down on repetition. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2019-04-16sched/core: Remove ttwu_activate()Peter Zijlstra1-7/+2
After the removal of try_to_wake_up_local(), there is only one user of ttwu_activate() left, and since it is a trivial function, remove it. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2019-04-16sched/core, workqueues: Distangle worker accounting from rq lockThomas Gleixner3-99/+48
The worker accounting for CPU bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself. Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. When nr_running is updated when the task is retunrning from schedule() then it is later compared when it is done from ttwu(). [ bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira ] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/ad2b29b5715f970bffc1a7026cabd6ff0b24076a.1532952814.git.bristot@redhat.com Signed-off-by: Ingo Molnar <[email protected]>
2019-04-16sched/deadline: Correctly handle active 0-lag timersluca abeni1-2/+1
syzbot reported the following warning: [ ] WARNING: CPU: 4 PID: 17089 at kernel/sched/deadline.c:255 task_non_contending+0xae0/0x1950 line 255 of deadline.c is: WARN_ON(hrtimer_active(&dl_se->inactive_timer)); in task_non_contending(). Unfortunately, in some cases (for example, a deadline task continuosly blocking and waking immediately) it can happen that a task blocks (and task_non_contending() is called) while the 0-lag timer is still active. In this case, the safest thing to do is to immediately decrease the running bandwidth of the task, without trying to re-arm the 0-lag timer. Signed-off-by: luca abeni <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Juri Lelli <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: chengjian (D) <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-16sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockupPhil Auld1-0/+25
With extremely short cfs_period_us setting on a parent task group with a large number of children the for loop in sched_cfs_period_timer() can run until the watchdog fires. There is no guarantee that the call to hrtimer_forward_now() will ever return 0. The large number of children can make do_sched_cfs_period_timer() take longer than the period. NMI watchdog: Watchdog detected hard LOCKUP on cpu 24 RIP: 0010:tg_nop+0x0/0x10 <IRQ> walk_tg_tree_from+0x29/0xb0 unthrottle_cfs_rq+0xe0/0x1a0 distribute_cfs_runtime+0xd3/0xf0 sched_cfs_period_timer+0xcb/0x160 ? sched_cfs_slack_timer+0xd0/0xd0 __hrtimer_run_queues+0xfb/0x270 hrtimer_interrupt+0x122/0x270 smp_apic_timer_interrupt+0x6a/0x140 apic_timer_interrupt+0xf/0x20 </IRQ> To prevent this we add protection to the loop that detects when the loop has run too many times and scales the period and quota up, proportionally, so that the timer can complete before then next period expires. This preserves the relative runtime quota while preventing the hard lockup. A warning is issued reporting this state and the new values. Signed-off-by: Phil Auld <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: Anton Blanchard <[email protected]> Cc: Ben Segall <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-16sched/fair: Remove unneeded prototype of capacity_of()Valentin Schneider1-1/+0
The prototype of that function was already hoisted up in: commit 3b1baa6496e6 ("sched/fair: Add 'group_misfit_task' load-balance type") but that seems to have been missed. Get rid of the extra prototype. Signed-off-by: Valentin Schneider <[email protected]> Acked-by: Quentin Perret <[email protected]> Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-16perf/core: Add perf_pmu_resched() as global functionStephane Eranian1-0/+10
This patch add perf_pmu_resched() a global function that can be called to force rescheduling of events for a given PMU. The function locks both cpuctx and task_ctx internally. This will be used by a subsequent patch. Signed-off-by: Stephane Eranian <[email protected]> [ Simplified the calling convention. ] Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-16Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar3-38/+38
Signed-off-by: Ingo Molnar <[email protected]>
2019-04-16perf/ring_buffer: Fix AUX record suppressionAlexander Shishkin1-18/+15
The following commit: 1627314fb54a33e ("perf: Suppress AUX/OVERWRITE records") has an unintended side-effect of also suppressing all AUX records with no flags and non-zero size, so all the regular records in the full trace mode. This breaks some use cases for people. Fix this by restoring "regular" AUX records. Reported-by: Ben Gainey <[email protected]> Tested-by: Ben Gainey <[email protected]> Signed-off-by: Alexander Shishkin <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Fixes: 1627314fb54a33e ("perf: Suppress AUX/OVERWRITE records") Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-16perf/core: Fix the address filtering fixAlexander Shishkin1-16/+21
The following recent commit: c60f83b813e5 ("perf, pt, coresight: Fix address filters for vmas with non-zero offset") changes the address filtering logic to communicate filter ranges to the PMU driver via a single address range object, instead of having the driver do the final bit of math. That change forgets to take into account kernel filters, which are not calculated the same way as DSO based filters. Fix that by passing the kernel filters the same way as file-based filters. This doesn't require any additional changes in the drivers. Reported-by: Adrian Hunter <[email protected]> Signed-off-by: Alexander Shishkin <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Fixes: c60f83b813e5 ("perf, pt, coresight: Fix address filters for vmas with non-zero offset") Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-16Merge branch 'linus' into perf/core, to pick up fixesIngo Molnar16-67/+116
Signed-off-by: Ingo Molnar <[email protected]>
2019-04-16bpf: add map helper functions push, pop, peek in more BPF programsAlban Crequy2-0/+12
commit f1a2e44a3aec ("bpf: add queue and stack maps") introduced new BPF helper functions: - BPF_FUNC_map_push_elem - BPF_FUNC_map_pop_elem - BPF_FUNC_map_peek_elem but they were made available only for network BPF programs. This patch makes them available for tracepoint, cgroup and lirc programs. Signed-off-by: Alban Crequy <[email protected]> Cc: Mauricio Vasquez B <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2019-04-16kprobes: Fix error check when reusing optimized probesMasami Hiramatsu1-4/+2
The following commit introduced a bug in one of our error paths: 819319fc9346 ("kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()") it missed to handle the return value of kprobe_optready() as error-value. In reality, the kprobe_optready() returns a bool result, so "true" case must be passed instead of 0. This causes some errors on kprobe boot-time selftests on ARM: [ ] Beginning kprobe tests... [ ] Probe ARM code [ ] kprobe [ ] kretprobe [ ] ARM instruction simulation [ ] Check decoding tables [ ] Run test cases [ ] FAIL: test_case_handler not run [ ] FAIL: Test andge r10, r11, r14, asr r7 [ ] FAIL: Scenario 11 ... [ ] FAIL: Scenario 7 [ ] Total instruction simulation tests=1631, pass=1433 fail=198 [ ] kprobe tests failed This can happen if an optimized probe is unregistered and next kprobe is registered on same address until the previous probe is not reclaimed. If this happens, a hidden aggregated probe may be kept in memory, and no new kprobe can probe same address. Also, in that case register_kprobe() will return "1" instead of minus error value, which can mislead caller logic. Signed-off-by: Masami Hiramatsu <[email protected]> Cc: Anil S Keshavamurthy <[email protected]> Cc: David S . Miller <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Naveen N . Rao <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] # v5.0+ Fixes: 819319fc9346 ("kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()") Link: http://lkml.kernel.org/r/155530808559.32517.539898325433642204.stgit@devnote2 Signed-off-by: Ingo Molnar <[email protected]>
2019-04-16locking/lockdep: Make lockdep_unregister_key() honor 'debug_locks' againBart Van Assche1-4/+5
If lockdep_register_key() and lockdep_unregister_key() are called with debug_locks == false then the following warning is reported: WARNING: CPU: 2 PID: 15145 at kernel/locking/lockdep.c:4920 lockdep_unregister_key+0x1ad/0x240 That warning is reported because lockdep_unregister_key() ignores the value of 'debug_locks' and because the behavior of lockdep_register_key() depends on whether or not 'debug_locks' is set. Fix this inconsistency by making lockdep_unregister_key() take 'debug_locks' again into account. Signed-off-by: Bart Van Assche <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Waiman Long <[email protected]> Cc: Will Deacon <[email protected]> Cc: shenghui <[email protected]> Fixes: 90c1cba2b3b3 ("locking/lockdep: Zap lock classes even with lock debugging disabled") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-15ntp: Audit NTP parameters adjustmentOndrej Mosnacek4-5/+50
Emit an audit record every time selected NTP parameters are modified from userspace (via adjtimex(2) or clock_adjtime(2)). These parameters may be used to indirectly change system clock, and thus their modifications should be audited. Such events will now generate records of type AUDIT_TIME_ADJNTPVAL containing the following fields: - op -- which value was adjusted: - offset -- corresponding to the time_offset variable - freq -- corresponding to the time_freq variable - status -- corresponding to the time_status variable - adjust -- corresponding to the time_adjust variable - tick -- corresponding to the tick_usec variable - tai -- corresponding to the timekeeping's TAI offset - old -- the old value - new -- the new value Example records: type=TIME_ADJNTPVAL msg=audit(1530616044.507:7): op=status old=64 new=8256 type=TIME_ADJNTPVAL msg=audit(1530616044.511:11): op=freq old=0 new=49180377088000 The records of this type will be associated with the corresponding syscall records. An overview of parameter changes that can be done via do_adjtimex() (based on information from Miroslav Lichvar) and whether they are audited: __timekeeping_set_tai_offset() -- sets the offset from the International Atomic Time (AUDITED) NTP variables: time_offset -- can adjust the clock by up to 0.5 seconds per call and also speed it up or slow down by up to about 0.05% (43 seconds per day) (AUDITED) time_freq -- can speed up or slow down by up to about 0.05% (AUDITED) time_status -- can insert/delete leap seconds and it also enables/ disables synchronization of the hardware real-time clock (AUDITED) time_maxerror, time_esterror -- change error estimates used to inform userspace applications (NOT AUDITED) time_constant -- controls the speed of the clock adjustments that are made when time_offset is set (NOT AUDITED) time_adjust -- can temporarily speed up or slow down the clock by up to 0.05% (AUDITED) tick_usec -- a more extreme version of time_freq; can speed up or slow down the clock by up to 10% (AUDITED) Signed-off-by: Ondrej Mosnacek <[email protected]> Reviewed-by: Richard Guy Briggs <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Signed-off-by: Paul Moore <[email protected]>
2019-04-15timekeeping: Audit clock adjustmentsOndrej Mosnacek2-0/+13
Emit an audit record whenever the system clock is changed (i.e. shifted by a non-zero offset) by a syscall from userspace. The syscalls than can (at the time of writing) trigger such record are: - settimeofday(2), stime(2), clock_settime(2) -- via do_settimeofday64() - adjtimex(2), clock_adjtime(2) -- via do_adjtimex() The new records have type AUDIT_TIME_INJOFFSET and contain the following fields: - sec -- the 'seconds' part of the offset - nsec -- the 'nanoseconds' part of the offset Example record (time was shifted backwards by ~15.875 seconds): type=TIME_INJOFFSET msg=audit(1530616049.652:13): sec=-16 nsec=124887145 The records of this type will be associated with the corresponding syscall records. Signed-off-by: Ondrej Mosnacek <[email protected]> Reviewed-by: Richard Guy Briggs <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> [PM: fixed a line width problem in __audit_tk_injoffset()] Signed-off-by: Paul Moore <[email protected]>
2019-04-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller1-0/+10
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for net-next: 1) Remove the broute pseudo hook, implement this from the bridge prerouting hook instead. Now broute becomes real table in ebtables, from Florian Westphal. This also includes a size reduction patch for the bridge control buffer area via squashing boolean into bitfields and a selftest. 2) Add OS passive fingerprint version matching, from Fernando Fernandez. 3) Support for gue encapsulation for IPVS, from Jacky Hu. 4) Add support for NAT to the inet family, from Florian Westphal. This includes support for masquerade, redirect and nat extensions. 5) Skip interface lookup in flowtable, use device in the dst object. 6) Add jiffies64_to_msecs() and use it, from Li RongQing. 7) Remove unused parameter in nf_tables_set_desc_parse(), from Colin Ian King. 8) Statify several functions, patches from YueHaibing and Florian Westphal. 9) Add an optimized version of nf_inet_addr_cmp(), from Li RongQing. 10) Merge route extension to core, also from Florian. 11) Use IS_ENABLED(CONFIG_NF_NAT) instead of NF_NAT_NEEDED, from Florian. 12) Merge ip/ip6 masquerade extensions, from Florian. This includes netdevice notifier unification. ==================== Signed-off-by: David S. Miller <[email protected]>