path: root/kernel
Age | Commit message | Author | Files | Lines
2020-01-24tracing: Fix very unlikely race of registering two stat tracersSteven Rostedt (VMware)1-10/+9
Looking through old emails in my INBOX, I came across a patch from Luis Henriques that attempted to fix a race of two stat tracers registering the same stat trace (extremely unlikely, as this is done in the kernel, and probably doesn't even exist). The submitted patch wasn't quite right as it needed to handle the cleanup a bit better (if two stat tracers were the same, it would have the same files). But to make the code cleaner, all we needed to do was keep the all_stat_sessions_mutex held for most of the registering function. Link: http://lkml.kernel.org/r/[email protected] Fixes: 002bb86d8d42f ("tracing/ftrace: separate events tracing and stats tracing engine") Reported-by: Luis Henriques <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-01-24alarmtimer: Make alarmtimer_get_rtcdev() a stub when CONFIG_RTC_CLASS=nStephen Boyd1-5/+0
The stubbed version of alarmtimer_get_rtcdev() is not exported, so this won't work if the function is used in a module when CONFIG_RTC_CLASS=n. Move the stub function to the header file and make it inline so that callers don't have to worry about linking against this symbol. rtcdev isn't used outside of this ifdef, so it's not required to be redefined to NULL. Drop that while touching this area. Signed-off-by: Stephen Boyd <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
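A minimal sketch of the resulting header arrangement (context abbreviated; placement per include/linux/alarmtimer.h):

    #ifdef CONFIG_RTC_CLASS
    extern struct rtc_device *alarmtimer_get_rtcdev(void);
    #else
    static inline struct rtc_device *alarmtimer_get_rtcdev(void)
    {
            return NULL;    /* no RTC class, so no RTC device to return */
    }
    #endif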
2020-01-24alarmtimer: Use wakeup source from alarmtimer platform deviceStephen Boyd1-10/+5
Use the wakeup source that can be associated with the 'alarmtimer' platform device instead of registering another one by hand. Signed-off-by: Stephen Boyd <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Douglas Anderson <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-01-24alarmtimer: Make alarmtimer platform device child of RTC deviceStephen Boyd1-11/+9
The alarmtimer_suspend() function will fail if an RTC device is on a bus such as SPI or i2c and that RTC device registers and probes after alarmtimer_init() registers and probes the 'alarmtimer' platform device. This is because system-wide suspend suspends devices in the reverse order of their probe. When alarmtimer_suspend() attempts to program the RTC for a wakeup, it will try to program an RTC device on a bus that has already been suspended. Move the alarmtimer device registration to happen when the RTC which is used for wakeup is registered. Register the 'alarmtimer' platform device as a child of the RTC device too, so that the RTC device is guaranteed not to be suspended when alarmtimer_suspend() is called. Reported-by: Douglas Anderson <[email protected]> Signed-off-by: Stephen Boyd <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Douglas Anderson <[email protected]> Link: https://lore.kernel.org/r/[email protected]
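A sketch of the child registration, assuming (as described) it runs from the RTC class-interface add hook where dev is the RTC device:

    /* Register 'alarmtimer' as a child of the wakeup-capable RTC so
     * that suspend ordering guarantees the RTC is still awake when
     * alarmtimer_suspend() programs it. */
    pdev = platform_device_register_data(dev, "alarmtimer",
                                         PLATFORM_DEVID_AUTO, NULL, 0);
    if (IS_ERR(pdev))
            return PTR_ERR(pdev);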
2020-01-24alarmtimer: Update alarmtimer_get_rtcdev() docs to reflect realityStephen Boyd1-2/+0
This function doesn't do anything like this comment says when an RTC device hasn't been chosen. It looks like we used to do something like that before commit 8bc0dafb5cf3 ("alarmtimers: Rework RTC device selection using class interface") but that's long gone now. Remove this sentence to avoid confusing the reader. Signed-off-by: Stephen Boyd <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-01-24smp: Remove allocation mask from on_each_cpu_cond.*()Sebastian Andrzej Siewior2-14/+6
The allocation mask is no longer used by on_each_cpu_cond() and on_each_cpu_cond_mask() and can be removed. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-01-24smp: Add a smp_cond_func_t argument to smp_call_function_many()Sebastian Andrzej Siewior1-43/+38
on_each_cpu_cond_mask() allocates a new CPU mask. The newly allocated mask is a subset of the provided mask based on the conditional function. This memory allocation can be avoided by extending smp_call_function_many() with the conditional function and performing the remote function call based on the mask and the conditional function. Rename smp_call_function_many() to smp_call_function_many_cond() and add the smp_cond_func_t argument. If smp_cond_func_t is provided then it is used before invoking the function. Provide smp_call_function_many() with cond_func set to NULL. Let on_each_cpu_cond_mask() use smp_call_function_many_cond(). Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
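A simplified sketch of the resulting shape (IPI bookkeeping elided):

    static void smp_call_function_many_cond(const struct cpumask *mask,
                                            smp_call_func_t func, void *info,
                                            bool wait, smp_cond_func_t cond_func)
    {
            int cpu;

            for_each_cpu(cpu, mask) {
                    /* Skip CPUs rejected by the conditional function. */
                    if (cond_func && !cond_func(cpu, info))
                            continue;
                    /* ... queue csd and send the IPI as before ... */
            }
    }

    void smp_call_function_many(const struct cpumask *mask,
                                smp_call_func_t func, void *info, bool wait)
    {
            smp_call_function_many_cond(mask, func, info, wait, NULL);
    }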
2020-01-24smp: Use smp_cond_func_t as type for the conditional functionSebastian Andrzej Siewior2-12/+10
Use a typedef for the conditional function instead of defining it each time in the function prototype. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
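The typedef looks like this:

    typedef bool (*smp_cond_func_t)(int cpu, void *info);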
2020-01-24Merge tag 'irqchip-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/coreThomas Gleixner1-0/+18
Pull irqchip updates from Marc Zyngier: - Conversion of the SiFive PLIC to hierarchical domains - New SiFive GPIO irqchip driver - New Aspeed SCI irqchip driver - New NXP INTMUX irqchip driver - Additional support for the Meson A1 GPIO irqchip - First part of the GICv4.1 support - Assorted fixes
2020-01-24Merge branches 'doc.2019.12.10a', 'exp.2019.12.09a', 'fixes.2020.01.24a', 'kfree_rcu.2020.01.24a', 'list.2020.01.10a', 'preempt.2020.01.24a' and 'torture.2019.12.09a' into HEADPaul E. McKenney17-385/+778
doc.2019.12.10a: Documentation updates exp.2019.12.09a: Expedited grace-period updates fixes.2020.01.24a: Miscellaneous fixes kfree_rcu.2020.01.24a: Batch kfree_rcu() work list.2020.01.10a: RCU-protected-list updates preempt.2020.01.24a: Preemptible RCU updates torture.2019.12.09a: Torture-test updates
2020-01-24rcu: Remove unused stop-machine #includePaul E. McKenney2-2/+0
Long ago, RCU used the stop-machine mechanism to implement expedited grace periods, but no longer does so. This commit therefore removes the no-longer-needed #includes of linux/stop_machine.h. Link: https://lwn.net/Articles/805317/ Reported-by: Viresh Kumar <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-01-24srcu: Apply *_ONCE() to ->srcu_last_gp_endPaul E. McKenney1-3/+4
The ->srcu_last_gp_end field is accessed from any CPU at any time by synchronize_srcu(), so non-initialization references need to use READ_ONCE() and WRITE_ONCE(). This commit therefore makes that change. Reported-by: [email protected] Acked-by: Marco Elver <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
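The pattern, sketched (ssp is the usual srcu_struct pointer name):

    /* Update at grace-period end: */
    WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());

    /* Lockless read from any CPU on the synchronize_srcu() path: */
    t = READ_ONCE(ssp->srcu_last_gp_end);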
2020-01-24rcu: Switch force_qs_rnp() to for_each_leaf_node_cpu_mask()Paul E. McKenney1-8/+5
Currently, force_qs_rnp() uses a for_each_leaf_node_possible_cpu() loop containing a check of the current CPU's bit in ->qsmask. This works, but this commit saves three lines by instead using for_each_leaf_node_cpu_mask(), which combines the functionality of for_each_leaf_node_possible_cpu() and leaf_node_cpu_bit(). This commit also replaces the use of the local variable "bit" with rdp->grpmask. Signed-off-by: Paul E. McKenney <[email protected]>
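The resulting loop shape, sketched:

    int cpu;

    for_each_leaf_node_cpu_mask(rnp, cpu, rnp->qsmask) {
            struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);

            /* ... use rdp->grpmask where the local "bit" used to be ... */
    }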
2020-01-24rcu: Move rcu_{expedited,normal} definitions into rcupdate.hBen Dooks1-2/+0
This commit moves the rcu_{expedited,normal} definitions from kernel/rcu/update.c to include/linux/rcupdate.h to make sure they are in sync, and also to avoid the following warning from sparse: kernel/ksysfs.c:150:5: warning: symbol 'rcu_expedited' was not declared. Should it be static? kernel/ksysfs.c:167:5: warning: symbol 'rcu_normal' was not declared. Should it be static? Signed-off-by: Ben Dooks <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
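The declarations end up in the header roughly as:

    /* include/linux/rcupdate.h; definitions stay in kernel/rcu/update.c */
    extern int rcu_expedited;
    extern int rcu_normal;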
2020-01-24rcu: Move gp_state_names[] and gp_state_getname() to tree_stall.hLai Jiangshan3-22/+22
Only tree_stall.h needs to get the name of a GP state, so this commit moves the gp_state_names[] array and the gp_state_getname() function from kernel/rcu/tree.h and kernel/rcu/tree.c, respectively, to kernel/rcu/tree_stall.h. While moving gp_state_names[], this commit uses GCC's designated-initializer syntax to ensure that the right string is associated with the right CPP macro. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
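A sketch of the designated-initializer form referred to above (state list abridged):

    static const char * const gp_state_names[] = {
            [RCU_GP_IDLE]     = "RCU_GP_IDLE",
            [RCU_GP_WAIT_GPS] = "RCU_GP_WAIT_GPS",
            [RCU_GP_CLEANUP]  = "RCU_GP_CLEANUP",
            /* ... one entry per RCU_GP_* state ... */
    };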
2020-01-24rcu: Remove the declaration of call_rcu() in tree.hLai Jiangshan1-1/+0
The call_rcu() function is an external RCU API that is declared in include/linux/rcupdate.h. There is thus no point in redeclaring it in kernel/rcu/tree.h, so this commit removes that redundant declaration. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-01-24rcu: Fix tracepoint tracking RCU CPU kthread utilizationLai Jiangshan1-1/+1
In the call to trace_rcu_utilization() at the start of the loop in rcu_cpu_kthread(), "rcu_wait" is incorrect, plus this trace event needs to be hoisted above the loop to balance with either the "rcu_wait" or "rcu_yield", depending on how the loop exits. This commit therefore makes these changes. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-01-24rcu: Fix harmless omission of "CONFIG_" from #if conditionLai Jiangshan1-2/+2
The C preprocessor macros SRCU and TINY_RCU should instead be CONFIG_SRCU and CONFIG_TINY_RCU, respectively, in the #if in kernel/rcu/rcu.h. But there is no harm when "TINY_RCU" is wrongly used: it is never defined, which makes "!defined(TINY_RCU)" always true, which means the code block is always included, and the included code block doesn't cause any compilation error so far in CONFIG_TINY_RCU builds. This is also the reason the change should not be taken in -stable. This commit adds the needed "CONFIG_" prefix to both macros. Not for -stable. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
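The condition before and after, sketched:

    /* Before: neither SRCU nor TINY_RCU is ever defined, so this is always true. */
    #if defined(SRCU) || !defined(TINY_RCU)

    /* After: */
    #if defined(CONFIG_SRCU) || !defined(CONFIG_TINY_RCU)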
2020-01-24rcu: Avoid tick_dep_set_cpu() misorderingPaul E. McKenney1-3/+9
In the current code, rcu_nmi_enter_common() might decide to turn on the tick using tick_dep_set_cpu(), but be delayed just before doing so. Then the grace-period kthread might notice that the CPU in question had in fact gone through a quiescent state, thus turning off the tick using tick_dep_clear_cpu(). The later invocation of tick_dep_set_cpu() would then incorrectly leave the tick on. This commit therefore enlists the aid of the leaf rcu_node structure's ->lock to ensure that decisions to enable or disable the tick are carried out before they can be reversed. Signed-off-by: Paul E. McKenney <[email protected]>
2020-01-24rcu: Provide wrappers for uses of ->rcu_read_lock_nestingLai Jiangshan2-21/+36
This commit provides wrapper functions for uses of ->rcu_read_lock_nesting to improve readability and to ease future changes to support inlining of __rcu_read_lock() and __rcu_read_unlock(). Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
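A sketch of the kind of wrappers meant (names as described by the patch):

    static void rcu_preempt_read_enter(void)
    {
            current->rcu_read_lock_nesting++;
    }

    static void rcu_preempt_read_exit(void)
    {
            current->rcu_read_lock_nesting--;
    }

    static void rcu_preempt_depth_set(int val)
    {
            current->rcu_read_lock_nesting = val;
    }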
2020-01-24rcu: Use READ_ONCE() for ->expmask in rcu_read_unlock_special()Paul E. McKenney1-1/+1
The rcu_node structure's ->expmask field is updated only when holding the ->lock, but is also accessed locklessly. This means that all ->expmask updates must use WRITE_ONCE() and all reads carried out without holding ->lock must use READ_ONCE(). This commit therefore changes the lockless ->expmask read in rcu_read_unlock_special() to use READ_ONCE(). Reported-by: [email protected] Signed-off-by: Paul E. McKenney <[email protected]> Acked-by: Marco Elver <[email protected]>
2020-01-24rcu: Clear ->rcu_read_unlock_special only onceLai Jiangshan1-16/+3
In rcu_preempt_deferred_qs_irqrestore(), ->rcu_read_unlock_special is cleared one piece at a time. Given that the "if" statements in this function use the copy in "special", this commit removes the clearing of the individual pieces in favor of clearing ->rcu_read_unlock_special in one go just after it has been determined to be non-zero. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-01-24rcu: Clear .exp_hint only when deferred quiescent state has been reportedLai Jiangshan1-2/+1
Currently, the .exp_hint flag is cleared in rcu_read_unlock_special(), which works, but which can also prevent subsequent rcu_read_unlock() calls from helping expedite the quiescent state needed by an ongoing expedited RCU grace period. This commit therefore defers clearing of .exp_hint from rcu_read_unlock_special() to rcu_preempt_deferred_qs_irqrestore(), thus ensuring that intervening calls to rcu_read_unlock() have a chance to help end the expedited grace period. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-01-24rcu: Rename some instance of CONFIG_PREEMPTION to CONFIG_PREEMPT_RCULai Jiangshan2-5/+5
CONFIG_PREEMPTION and CONFIG_PREEMPT_RCU are always identical, but some code depends on CONFIG_PREEMPTION to access rcu_preempt functionality. This patch changes CONFIG_PREEMPTION to CONFIG_PREEMPT_RCU in these cases. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-01-24rcu: Remove kfree_call_rcu_nobatch()Joel Fernandes (Google)2-23/+5
Now that the kfree_rcu() special-casing has been removed from tree RCU, this commit removes kfree_call_rcu_nobatch() since it is no longer needed. Signed-off-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-01-24rcu: Remove kfree_rcu() special casing and lazy-callback handlingJoel Fernandes (Google)9-129/+75
This commit removes kfree_rcu() special-casing and the lazy-callback handling from Tree RCU. It moves some of this special casing to Tiny RCU, the removal of which will be the subject of later commits. This results in a nice negative delta. Suggested-by: Paul E. McKenney <[email protected]> Signed-off-by: Joel Fernandes (Google) <[email protected]> [ paulmck: Add slab.h #include, thanks to kbuild test robot <[email protected]>. ] Signed-off-by: Paul E. McKenney <[email protected]>
2020-01-24rcu: Add support for debug_objects debugging for kfree_rcu()Joel Fernandes (Google)1-0/+8
This commit applies RCU's debug_objects debugging to the new batched kfree_rcu() implementations. The object is queued at the kfree_rcu() call and dequeued during reclaim. Tested that enabling CONFIG_DEBUG_OBJECTS_RCU_HEAD successfully detects double kfree_rcu() calls. Signed-off-by: Joel Fernandes (Google) <[email protected]> [ paulmck: Fix IRQ per kbuild test robot <[email protected]> feedback. ] Signed-off-by: Paul E. McKenney <[email protected]>
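A sketch of the pairing, using RCU's existing debug-objects hooks:

    /* At kfree_rcu() time: a nonzero return means this rcu_head is
     * already queued, i.e. a double kfree_rcu(). */
    if (debug_rcu_head_queue(head)) {
            WARN_ONCE(1, "kfree_rcu(): double free of %p\n", head);
            return;
    }

    /* At reclaim time, just before the actual kfree(): */
    debug_rcu_head_unqueue(head);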
2020-01-24rcu: Add multiple in-flight batches of kfree_rcu() workJoel Fernandes (Google)1-12/+39
During testing, it was observed that the amount of memory consumed due to kfree_rcu() batching is 300-400MB. Previously we had only a single head_free pointer pointing to the list of rcu_head(s) that are to be freed after a grace period. Until this list is drained, we cannot queue any more objects on it, since such objects may not be ready to be reclaimed when the worker thread eventually gets to draining the head_free list. We can do better by maintaining multiple lists as done by this patch. Testing shows that memory consumption came down by around 100-150MB with just one additional list. Adding more than one additional list did not show any improvement. Suggested-by: Paul E. McKenney <[email protected]> Signed-off-by: Joel Fernandes (Google) <[email protected]> [ paulmck: Code style and initialization handling. ] [ paulmck: Fix field name, reported by kbuild test robot <[email protected]>. ] Signed-off-by: Paul E. McKenney <[email protected]>
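A sketch of the resulting per-CPU layout (simplified; KFREE_N_BATCHES and field names follow the patch description):

    #define KFREE_N_BATCHES 2

    struct kfree_rcu_cpu_work {
            struct rcu_work rcu_work;       /* one grace period in flight */
            struct rcu_head *head_free;     /* objects waiting on that GP */
            struct kfree_rcu_cpu *krcp;
    };

    struct kfree_rcu_cpu {
            struct rcu_head *head;          /* objects still being batched */
            struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
            spinlock_t lock;
            struct delayed_work monitor_work;
            bool monitor_todo;
    };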
2020-01-24rcu: Make kfree_rcu() use a non-atomic ->monitor_todoJoel Fernandes1-6/+10
Because the ->monitor_todo field is always protected by krcp->lock, this commit downgrades from xchg() to non-atomic unmarked assignment statements. Signed-off-by: Joel Fernandes <[email protected]> [ paulmck: Update to include early-boot kick code. ] Signed-off-by: Paul E. McKenney <[email protected]>
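The downgrade, sketched; both forms run with krcp->lock held:

    /* Before: needlessly atomic. */
    if (xchg(&krcp->monitor_todo, true) == false)
            schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);

    /* After: a plain assignment suffices under ->lock. */
    if (!krcp->monitor_todo) {
            krcp->monitor_todo = true;
            schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
    }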
2020-01-24rcuperf: Add kfree_rcu() performance TestsJoel Fernandes (Google)1-8/+173
This test runs kfree_rcu() in a loop to measure performance of the new kfree_rcu() batching functionality. The following table shows results when booting with arguments: rcuperf.kfree_loops=20000 rcuperf.kfree_alloc_num=8000 rcuperf.kfree_rcu_test=1 rcuperf.kfree_no_batch=X

    rcuperf.kfree_no_batch=X   # Grace Periods   Test Duration (s)
    X=1 (old behavior)         9133              11.5
    X=0 (new behavior)         1732              12.5

On a 16 CPU system with the above boot parameters, we see that the total number of grace periods that elapse during the test drops from 9133 when not batching to 1732 when batching (a 5X improvement). The kfree_rcu() flood itself slows down a bit when batching, though, as shown. Note that the active memory consumption during the kfree_rcu() flood does increase to around 200-250MB due to the batching (from around 50MB without batching). However, this memory consumption is relatively constant. In other words, the system is able to keep up with the kfree_rcu() load. The memory consumption comes down considerably if KFREE_DRAIN_JIFFIES is increased from HZ/50 to HZ/80. A later patch will reduce memory consumption further by using multiple lists. Also, when running the test, please disable CONFIG_DEBUG_PREEMPT and CONFIG_PROVE_RCU for realistic comparisons with/without batching. Signed-off-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-01-24rcu: Add basic support for kfree_rcu() batchingByungchul Park2-6/+198
Recently a discussion about stability and performance of a system involving a high rate of kfree_rcu() calls surfaced on the list [1], which led to another discussion about how to prepare for this situation. This patch adds basic batching support for kfree_rcu(). It is "basic" because we do none of the slab management, dynamic allocation, code moving or any of the other things, some of which previous attempts did [2]. These fancier improvements can be follow-up patches, and there are different ideas being discussed in those regards. This is an effort to start simple, and build up from there. In the future, an extension to use kfree_bulk and possibly per-slab batching could be done to further improve performance due to cache locality and slab-specific bulk-free optimizations. By using an array of pointers, the worker thread processing the work would need to read less data, since it no longer needs to deal with large rcu_head(s). Torture tests follow in the next patch and show improvements of around a 5x reduction in number of grace periods on a 16 CPU system. More details and test data are in that patch. There is an implication for rcu_barrier() with this patch. Since kfree_rcu() calls can be batched and may not yet have been handed to the RCU machinery (in fact, the monitor may not even have run yet to do the queue_rcu_work()), there seems to be no easy way of implementing rcu_barrier() to wait for those kfree_rcu()s that have already been made. So this means a kfree_rcu() followed by an rcu_barrier() does not imply that memory will be freed once rcu_barrier() returns. Another implication is higher active memory usage (although not run-away) until the kfree_rcu() flooding ends, in comparison to without batching. More details about this are in the second patch, which adds an rcuperf test. Finally, in the near future we will get rid of kfree_rcu() special casing within RCU such as in rcu_do_batch and switch everything to just batching. Currently we don't do that since the timer subsystem is not yet up and we cannot schedule the kfree_rcu() monitor as the timer subsystem's locks are not initialized. That would also mean getting rid of kfree_call_rcu_nobatch() entirely. [1] http://lore.kernel.org/lkml/[email protected] [2] https://lkml.org/lkml/2017/12/19/824 Cc: [email protected] Cc: [email protected] Co-developed-by: Byungchul Park <[email protected]> Signed-off-by: Byungchul Park <[email protected]> Signed-off-by: Joel Fernandes (Google) <[email protected]> [ paulmck: Applied 0day and Paul Walmsley feedback on ->monitor_todo. ] [ paulmck: Make it work during early boot. ] [ paulmck: Add a crude early boot self-test. ] [ paulmck: Style adjustments and experimental docbook structure header. ] Link: https://lore.kernel.org/lkml/[email protected]/T/#me9956f66cb611b95d26ae92700e1d901f46e8c59 Signed-off-by: Paul E. McKenney <[email protected]>
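A simplified sketch of the queueing path (per-CPU variable name krc assumed; the no-batch fallback and early-boot handling are elided):

    void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
    {
            unsigned long flags;
            struct kfree_rcu_cpu *krcp;

            local_irq_save(flags);          /* stay on this CPU's krcp */
            krcp = this_cpu_ptr(&krc);
            spin_lock(&krcp->lock);

            /* Queue onto the batch list instead of calling __call_rcu();
             * the delayed monitor work drains the list later. */
            head->func = func;
            head->next = krcp->head;
            krcp->head = head;

            spin_unlock(&krcp->lock);
            local_irq_restore(flags);
    }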
2020-01-23bpf, devmap: Pass lockdep expression to RCU listsAmol Grover1-1/+2
head is traversed using hlist_for_each_entry_rcu outside an RCU read-side critical section but under the protection of dtab->index_lock. Hence, add corresponding lockdep expression to silence false-positive lockdep warnings, and harden RCU lists. Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index") Signed-off-by: Amol Grover <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Acked-by: Toke Høiland-Jørgensen <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
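The hardened traversal, sketched:

    hlist_for_each_entry_rcu(dev, head, index_hlist,
                             lockdep_is_held(&dtab->index_lock))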
2020-01-23Merge tag 'trace-v5.5-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-traceLinus Torvalds7-70/+158
Pull tracing fixes from Steven Rostedt: "Various tracing fixes: - Fix a function comparison warning for a xen trace event macro - Fix a double perf_event linking to a trace_uprobe_filter for multiple events - Fix suspicious RCU warnings in trace event code for using list_for_each_entry_rcu() when the "_rcu" portion wasn't needed. - Fix a bug in the histogram code when using the same variable - Fix a NULL pointer dereference when tracefs lockdown enabled and calling trace_set_default_clock() - A fix to a bug found with the double perf_event linking patch" * tag 'trace-v5.5-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/uprobe: Fix to make trace_uprobe_filter alignment safe tracing: Do not set trace clock if tracefs lockdown is in effect tracing: Fix histogram code when expression has same var as value tracing: trigger: Replace unneeded RCU-list traversals tracing/uprobe: Fix double perf_event linking on multiprobe uprobe tracing: xen: Ordered comparison of function pointers
2020-01-23Merge tag 'pm-5.5-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds1-10/+10
Pull power management fix from Rafael Wysocki: "Prevent the kernel from crashing during resume from hibernation if free pages contain leftover data from the restore kernel and init_on_free is set (Alexander Potapenko)" * tag 'pm-5.5-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: hibernate: fix crashes with init_on_free=1
2020-01-23Merge back new material related to system-wide PM for v5.6.Rafael J. Wysocki6-28/+69
2020-01-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller15-411/+2365
Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-01-22 The following pull-request contains BPF updates for your *net-next* tree. We've added 92 non-merge commits during the last 16 day(s) which contain a total of 320 files changed, 7532 insertions(+), 1448 deletions(-). The main changes are: 1) function by function verification and program extensions from Alexei. 2) massive cleanup of selftests/bpf from Toke and Andrii. 3) batched bpf map operations from Brian and Yonghong. 4) tcp congestion control in bpf from Martin. 5) bulking for non-map xdp_redirect from Toke. 6) bpf_send_signal_thread helper from Yonghong. ==================== Signed-off-by: David S. Miller <[email protected]>
2020-01-22bpf: Add BPF_FUNC_jiffies64Martin KaFai Lau3-0/+37
This patch adds a helper to read the 64-bit jiffies. It will be used in a later patch to implement bpf_cubic.c. The helper is inlined when jit_requested is set and BITS_PER_LONG is 64, in the same way as map_gen_lookup(). Other cases could be considered together with map_gen_lookup() if needed. Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
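A sketch of the helper's shape:

    BPF_CALL_0(bpf_jiffies64)
    {
            return get_jiffies_64();
    }

    const struct bpf_func_proto bpf_jiffies64_proto = {
            .func     = bpf_jiffies64,
            .gpl_only = false,
            .ret_type = RET_INTEGER,
    };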
2020-01-22bpf: Introduce dynamic program extensionsAlexei Starovoitov4-27/+266
Introduce dynamic program extensions. Users can load additional BPF functions and replace global functions in previously loaded BPF programs while these programs are executing. Global functions are verified individually by the verifier based on their types only. Hence a global function in the new program whose type matches an older function can safely replace that corresponding function. This new function/program is called 'an extension' of the old program. At load time the verifier uses the (attach_prog_fd, attach_btf_id) pair to identify the function to be replaced. The extension program's BPF program type is derived from the target program. Technically, bpf_verifier_ops is copied from the target program. The BPF_PROG_TYPE_EXT program type is a placeholder. It has empty verifier_ops. The extension program can call the same bpf helper functions as the target program. A single BPF_PROG_TYPE_EXT type is used to extend XDP, SKB and all other program types. The verifier allows only one level of replacement, meaning that an extension program cannot recursively extend an extension. That also means that the maximum stack size is increasing from 512 to 1024 bytes and the maximum function nesting level from 8 to 16. The programs don't always consume that much. The stack usage is determined by the number of on-stack variables used by the program. The verifier could have enforced a 512 limit for the combined original plus extension program, but it makes for a difficult user experience. The main use case for extensions is to provide a generic mechanism to plug external programs into a policy program or function call chaining. BPF trampoline is used to track both fentry/fexit and program extensions because both are using the same nop slot at the beginning of every BPF function. Attaching fentry/fexit to a function that was replaced is not allowed. The opposite is true as well: replacing a function that is currently being analyzed with fentry/fexit is not allowed. The executable page allocated by the BPF trampoline is not used by program extensions. This inefficiency will be optimized in future patches. Function-by-function verification of global functions supports scalars and a pointer to context only. Hence program extensions are supported for this class of global functions only. In the future the verifier will be extended with support for pointers to structures, arrays with sizes, etc. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: John Fastabend <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Toke Høiland-Jørgensen <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
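On the user side, an extension is a BPF program attached via the target's (attach_prog_fd, attach_btf_id); with libbpf this is commonly expressed through an freplace section. A sketch with hypothetical function names:

    /* Replaces the global function handle_policy() in an
     * already-loaded target program (here an XDP program). */
    SEC("freplace/handle_policy")
    int new_handle_policy(struct xdp_md *ctx)
    {
            return XDP_PASS;
    }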
2020-01-22genirq, sched/isolation: Isolate from handling managed interruptsMing Lei3-3/+65
The affinity of managed interrupts is completely handled in the kernel and cannot be changed via the /proc/irq/* interfaces from user space. As the kernel tries to spread out interrupts evenly across CPUs on x86 to prevent vector exhaustion, it can happen that a managed interrupt whose affinity mask contains both isolated and housekeeping CPUs is routed to an isolated CPU. As a consequence IO submitted on a housekeeping CPU causes interrupts on the isolated CPU. Add a new sub-parameter 'managed_irq' for 'isolcpus' and the corresponding logic in the interrupt affinity selection code. The subparameter indicates to the interrupt affinity selection logic that it should try to avoid the above scenario. This isolation is best effort and only effective if the automatically assigned interrupt mask of a device queue contains isolated and housekeeping CPUs. If housekeeping CPUs are online then such interrupts are directed to the housekeeping CPU so that IO submitted on the housekeeping CPU cannot disturb the isolated CPU. If a queue's affinity mask contains only isolated CPUs then this parameter has no effect on the interrupt routing decision, though interrupts are only happening when tasks running on those isolated CPUs submit IO. IO submitted on housekeeping CPUs has no influence on those queues. If the affinity mask contains both housekeeping and isolated CPUs, but none of the contained housekeeping CPUs is online, then the interrupt is also routed to an isolated CPU. Interrupts are only delivered when one of the isolated CPUs in the affinity mask submits IO. If one of the contained housekeeping CPUs comes online, the CPU hotplug logic migrates the interrupt automatically back to the upcoming housekeeping CPU. Depending on the type of interrupt controller, this can require that at least one interrupt is delivered to the isolated CPU in order to complete the migration. [ tglx: Removed unused parameter, added and edited comments/documentation and rephrased the changelog so it contains more details. ] Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
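Usage, sketched as a boot-command-line example (CPU list hypothetical):

    isolcpus=managed_irq,domain,2-15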
2020-01-22hrtimer: Add missing sparse annotation for __run_timer()Jules Irenge1-1/+1
Sparse reports a warning at __run_hrtimer(): |warning: context imbalance in __run_hrtimer() - unexpected unlock Add the missing __must_hold() annotation. Signed-off-by: Jules Irenge <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
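The annotation, sketched on the function in question (argument list per the current definition):

    static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
                              struct hrtimer_clock_base *base,
                              struct hrtimer *timer, ktime_t *now,
                              unsigned long flags) __must_hold(&cpu_base->lock)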
2020-01-22tracing/uprobe: Fix to make trace_uprobe_filter alignment safeMasami Hiramatsu4-28/+22
Commit 99c9a923e97a ("tracing/uprobe: Fix double perf_event linking on multiprobe uprobe") moved trace_uprobe_filter onto trace_probe_event. However, since it introduced a flexible data structure with a char array and type casting, the alignment of trace_uprobe_filter can be broken. This changes the type of the array to the trace_uprobe_filter data structure to fix it. Link: http://lore.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/157966340499.5107.10978352478952144902.stgit@devnote2 Fixes: 99c9a923e97a ("tracing/uprobe: Fix double perf_event linking on multiprobe uprobe") Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-01-22trace/kprobe: Remove unused MAX_KPROBE_CMDLINE_SIZEAlex Shi1-1/+0
This limit has never been used since it was introduced in commit 970988e19eb0 ("tracing/kprobe: Add kprobe_event= boot parameter"). Could we remove it, given that there is no intention to implement it? Link: http://lkml.kernel.org/r/[email protected] Acked-by: Masami Hiramatsu <[email protected]> Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-01-22tracing: Fix uninitialized buffer var on early exit to trace_vbprintk()Steven Rostedt (VMware)1-1/+2
If we exit due to a bad input to trace_printk() (highly unlikely), then the buffer variable will not be initialized when we unnest the ring buffer. Reported-by: kbuild test robot <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-01-22bpf: Fix error path under memory pressureAlexei Starovoitov1-2/+7
Restore the 'if (env->cur_state)' check that was incorrectly removed during code move. Under memory pressure env->cur_state can be freed and zeroed inside do_check(). Hence the check is necessary. Fixes: 51c39bb1d5d1 ("bpf: Introduce function-by-function verification") Reported-by: [email protected] Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Song Liu <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
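The restored check, sketched:

    if (env->cur_state) {
            free_verifier_state(env->cur_state, true);
            env->cur_state = NULL;
    }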
2020-01-22bpf: Fix trampoline usage in preemptAlexei Starovoitov1-0/+10
Though the second half of the trampoline page is unused, a task could be preempted in the middle of the first half of the trampoline, and two updates to the trampoline would change the code from underneath the preempted task. Hence wait for tasks to voluntarily schedule or go to userspace. Add a similar wait before freeing the trampoline. Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline") Reported-by: Jann Horn <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
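The wait is RCU-tasks based; a sketch of the idea (placement simplified):

    /* Make sure no task is preempted inside the trampoline half being
     * rewritten: wait until every task has voluntarily scheduled or
     * entered userspace. */
    synchronize_rcu_tasks();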
2020-01-21tracing/boot: Fix an IS_ERR() vs NULL bugDan Carpenter1-1/+1
The trace_array_get_by_name() function doesn't return error pointers, it returns NULL on error. Link: http://lkml.kernel.org/r/[email protected] Fixes: 4f712a4d04a4 ("tracing/boot: Add instance node support") Acked-by: Masami Hiramatsu <[email protected]> Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
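The corrected error check, sketched (instance name hypothetical):

    /* trace_array_get_by_name() returns NULL on error, never ERR_PTR(). */
    tr = trace_array_get_by_name(instance_name);
    if (!tr) {
            pr_err("Failed to get trace instance %s\n", instance_name);
            return;
    }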
2020-01-21tracing: Remove unused TRACE_SEQ_BUF_USEDAlex Shi1-3/+0
This macro hasn't been used since commit 3a161d99c43c ("tracing: Create seq_buf layer in trace_seq"), so there is no need to keep it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-01-21ring-buffer: Remove abandoned macro RB_MISSED_FLAGSAlex Shi1-2/+0
This macro isn't used since commit d325c402964e ("ring-buffer: Remove unused function ring_buffer_page_len()"), so better to remove it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-01-21bpf: don't bother with getname/kern_path - use user_path_atAl Viro1-30/+13
kernel/bpf/inode.c misuses kern_path...() - it's much simpler (and more efficient, on top of that) to use user_path...() counterparts rather than bothering with doing getname() manually. Signed-off-by: Al Viro <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
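The simplification, sketched (LOOKUP_FOLLOW illustrative; the actual flags depend on the call site):

    struct path path;
    int err;

    /* One call replaces the manual getname() + kern_path() dance. */
    err = user_path_at(AT_FDCWD, pathname, LOOKUP_FOLLOW, &path);
    if (err)
            return err;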
2020-01-21ftrace: Remove NR_TO_INIT macroAlex Shi1-3/+0
This macro hasn't been used since commit cb7be3b2fc2c ("ftrace: remove daemon"), so there is no need to keep it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>