aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2021-06-18tracing: Do no increment trace_clock_global() by oneSteven Rostedt (VMware)1-3/+3
The trace_clock_global() tries to make sure the events between CPUs is somewhat in order. A global value is used and updated by the latest read of a clock. If one CPU is ahead by a little, and is read by another CPU, a lock is taken, and if the timestamp of the other CPU is behind, it will simply use the other CPUs timestamp. The lock is also only taken with a "trylock" due to tracing, and strange recursions can happen. The lock is not taken at all in NMI context. In the case where the lock is not able to be taken, the non synced timestamp is returned. But it will not be less than the saved global timestamp. The problem arises because when the time goes "backwards" the time returned is the saved timestamp plus 1. If the lock is not taken, and the plus one to the timestamp is returned, there's a small race that can cause the time to go backwards! CPU0 CPU1 ---- ---- trace_clock_global() { ts = clock() [ 1000 ] trylock(clock_lock) [ success ] global_ts = ts; [ 1000 ] <interrupted by NMI> trace_clock_global() { ts = clock() [ 999 ] if (ts < global_ts) ts = global_ts + 1 [ 1001 ] trylock(clock_lock) [ fail ] return ts [ 1001] } unlock(clock_lock); return ts; [ 1000 ] } trace_clock_global() { ts = clock() [ 1000 ] if (ts < global_ts) [ false 1000 == 1000 ] trylock(clock_lock) [ success ] global_ts = ts; [ 1000 ] unlock(clock_lock) return ts; [ 1000 ] } The above case shows to reads of trace_clock_global() on the same CPU, but the second read returns one less than the first read. That is, time when backwards, and this is not what is allowed by trace_clock_global(). This was triggered by heavy tracing and the ring buffer checker that tests for the clock going backwards: Ring buffer clock went backwards: 20613921464 -> 20613921463 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 0 at kernel/trace/ring_buffer.c:3412 check_buffer+0x1b9/0x1c0 Modules linked in: [..] [CPU: 2]TIME DOES NOT MATCH expected:20620711698 actual:20620711697 delta:6790234 before:20613921463 after:20613921463 [20613915818] PAGE TIME STAMP [20613915818] delta:0 [20613915819] delta:1 [20613916035] delta:216 [20613916465] delta:430 [20613916575] delta:110 [20613916749] delta:174 [20613917248] delta:499 [20613917333] delta:85 [20613917775] delta:442 [20613917921] delta:146 [20613918321] delta:400 [20613918568] delta:247 [20613918768] delta:200 [20613919306] delta:538 [20613919353] delta:47 [20613919980] delta:627 [20613920296] delta:316 [20613920571] delta:275 [20613920862] delta:291 [20613921152] delta:290 [20613921464] delta:312 [20613921464] delta:0 TIME EXTEND [20613921464] delta:0 This happened more than once, and always for an off by one result. It also started happening after commit aafe104aa9096 was added. Cc: [email protected] Fixes: aafe104aa9096 ("tracing: Restructure trace_clock_global() to never block") Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-06-18tracing: Do not stop recording comms if the trace file is being readSteven Rostedt (VMware)1-9/+0
A while ago, when the "trace" file was opened, tracing was stopped, and code was added to stop recording the comms to saved_cmdlines, for mapping of the pids to the task name. Code has been added that only records the comm if a trace event occurred, and there's no reason to not trace it if the trace file is opened. Cc: [email protected] Fixes: 7ffbd48d5cab2 ("tracing: Cache comms only after an event occurred") Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-06-18tracing: Do not stop recording cmdlines when tracing is offSteven Rostedt (VMware)1-2/+0
The saved_cmdlines is used to map pids to the task name, such that the output of the tracing does not just show pids, but also gives a human readable name for the task. If the name is not mapped, the output looks like this: <...>-1316 [005] ...2 132.044039: ... Instead of this: gnome-shell-1316 [005] ...2 132.044039: ... The names are updated when tracing is running, but are skipped if tracing is stopped. Unfortunately, this stops the recording of the names if the top level tracer is stopped, and not if there's other tracers active. The recording of a name only happens when a new event is written into a ring buffer, so there is no need to test if tracing is on or not. If tracing is off, then no event is written and no need to test if tracing is off or not. Remove the check, as it hides the names of tasks for events in the instance buffers. Cc: [email protected] Fixes: 7ffbd48d5cab2 ("tracing: Cache comms only after an event occurred") Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-06-18sched: Change task_struct::statePeter Zijlstra14-68/+76
Change the type and name of task_struct::state. Drop the volatile and shrink it to an 'unsigned int'. Rename it in order to find all uses such that we can use READ_ONCE/WRITE_ONCE as appropriate. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Daniel Bristot de Oliveira <[email protected]> Acked-by: Will Deacon <[email protected]> Acked-by: Daniel Thompson <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-18sched,timer: Use __set_current_state()Peter Zijlstra1-1/+1
There's an existing helper for setting TASK_RUNNING; must've gotten lost last time we did this cleanup. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Acked-by: Will Deacon <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-18sched: Add get_current_state()Peter Zijlstra2-4/+4
Remove yet another few p->state accesses. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Will Deacon <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-18sched,perf,kvm: Fix preemption conditionPeter Zijlstra1-4/+3
When ran from the sched-out path (preempt_notifier or perf_event), p->state is irrelevant to determine preemption. You can get preempted with !task_is_running() just fine. The right indicator for preemption is if the task is still on the runqueue in the sched-out path. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Mark Rutland <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-18sched: Introduce task_is_running()Peter Zijlstra7-10/+9
Replace a bunch of 'p->state == TASK_RUNNING' with a new helper: task_is_running(p). Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Davidlohr Bueso <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Acked-by: Will Deacon <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-18sched: Unbreak wakeupsPeter Zijlstra1-1/+1
Remove broken task->state references and let wake_up_process() DTRT. The anti-pattern in these patches breaks the ordering of ->state vs COND as described in the comment near set_current_state() and can lead to missed wakeups: (OoO load, observes RUNNING)<-. for (;;) { | t->state = UNINTERRUPTIBLE; | smp_mb(); ,-----> | (observes !COND) | / if (COND) ---------' | COND = 1; break; `- if (t->state != RUNNING) wake_up_process(t); // not done schedule(); // forever waiting } t->state = TASK_RUNNING; Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Acked-by: Will Deacon <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-18Merge branch 'sched/urgent' into sched/core, to resolve conflictsIngo Molnar10-68/+71
This commit in sched/urgent moved the cfs_rq_is_decayed() function: a7b359fc6a37: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") and this fresh commit in sched/core modified it in the old location: 9e077b52d86a: ("sched/pelt: Check that *_avg are null when *_sum are") Merge the two variants. Conflicts: kernel/sched/fair.c Signed-off-by: Ingo Molnar <[email protected]>
2021-06-17tracing: Have ftrace_dump_on_oops kernel parameter take numbersSteven Rostedt (VMware)1-2/+2
The kernel parameter for ftrace_dump_on_oops can take a single assignment. That is, it can be: ftrace_dump_on_oops or ftrace_dump_on_oops=orig_cpu But the content in the sysctl file is a number. 0 for disabled 1 for ftrace_dump_on_oops (all CPUs) 2 for ftrace_dump_on_oops (orig CPU) Allow the kernel command line to take a number as well to match the sysctl numbers. That is: ftrace_dump_on_oops=1 is the same as ftrace_dump_on_oops and ftrace_dump_on_oops=2 is the same as ftrace_dump_on_oops=orig_cpu Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller13-110/+476
Daniel Borkmann says: ==================== pull-request: bpf-next 2021-06-17 The following pull-request contains BPF updates for your *net-next* tree. We've added 50 non-merge commits during the last 25 day(s) which contain a total of 148 files changed, 4779 insertions(+), 1248 deletions(-). The main changes are: 1) BPF infrastructure to migrate TCP child sockets from a listener to another in the same reuseport group/map, from Kuniyuki Iwashima. 2) Add a provably sound, faster and more precise algorithm for tnum_mul() as noted in https://arxiv.org/abs/2105.05398, from Harishankar Vishwanathan. 3) Streamline error reporting changes in libbpf as planned out in the 'libbpf: the road to v1.0' effort, from Andrii Nakryiko. 4) Add broadcast support to xdp_redirect_map(), from Hangbin Liu. 5) Extends bpf_map_lookup_and_delete_elem() functionality to 4 more map types, that is, {LRU_,PERCPU_,LRU_PERCPU_,}HASH, from Denis Salopek. 6) Support new LLVM relocations in libbpf to make them more linker friendly, also add a doc to describe the BPF backend relocations, from Yonghong Song. 7) Silence long standing KUBSAN complaints on register-based shifts in interpreter, from Daniel Borkmann and Eric Biggers. 8) Add dummy PT_REGS macros in libbpf to fail BPF program compilation when target arch cannot be determined, from Lorenz Bauer. 9) Extend AF_XDP to support large umems with 1M+ pages, from Magnus Karlsson. 10) Fix two minor libbpf tc BPF API issues, from Kumar Kartikeya Dwivedi. 11) Move libbpf BPF_SEQ_PRINTF/BPF_SNPRINTF macros that can be used by BPF programs to bpf_helpers.h header, from Florent Revest. ==================== Signed-off-by: David S. Miller <[email protected]>
2021-06-17tracing: Add tp_printk_stop_on_boot optionSteven Rostedt (VMware)1-11/+29
Add a kernel command line option that disables printing of events to console at late_initcall_sync(). This is useful when needing to see specific events written to console on boot up, but not wanting it when user space starts, as user space may make the console so noisy that the system becomes inoperable. Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-06-17sched/fair: Age the average idle timePeter Zijlstra3-4/+29
This is a partial forward-port of Peter Ziljstra's work first posted at: https://lore.kernel.org/lkml/[email protected]/ Currently select_idle_cpu()'s proportional scheme uses the average idle time *for when we are idle*, that is temporally challenged. When a CPU is not at all idle, we'll happily continue using whatever value we did see when the CPU goes idle. To fix this, introduce a separate average idle and age it (the existing value still makes sense for things like new-idle balancing, which happens when we do go idle). The overall goal is to not spend more time scanning for idle CPUs than we're idle for. Otherwise we're inhibiting work. This means that we need to consider the cost over all the wake-ups between consecutive idle periods. To track this, the scan cost is subtracted from the estimated average idle time. The impact of this patch is related to workloads that have domains that are fully busy or overloaded. Without the patch, the scan depth may be too high because a CPU is not reaching idle. Due to the nature of the patch, this is a regression magnet. It potentially wins when domains are almost fully busy or overloaded -- at that point searches are likely to fail but idle is not being aged as CPUs are active so search depth is too large and useless. It will potentially show regressions when there are idle CPUs and a deep search is beneficial. This tbench result on a 2-socket broadwell machine partially illustates the problem 5.13.0-rc2 5.13.0-rc2 vanilla sched-avgidle-v1r5 Hmean 1 445.02 ( 0.00%) 451.36 * 1.42%* Hmean 2 830.69 ( 0.00%) 846.03 * 1.85%* Hmean 4 1350.80 ( 0.00%) 1505.56 * 11.46%* Hmean 8 2888.88 ( 0.00%) 2586.40 * -10.47%* Hmean 16 5248.18 ( 0.00%) 5305.26 * 1.09%* Hmean 32 8914.03 ( 0.00%) 9191.35 * 3.11%* Hmean 64 10663.10 ( 0.00%) 10192.65 * -4.41%* Hmean 128 18043.89 ( 0.00%) 18478.92 * 2.41%* Hmean 256 16530.89 ( 0.00%) 17637.16 * 6.69%* Hmean 320 16451.13 ( 0.00%) 17270.97 * 4.98%* Note that 8 was a regression point where a deeper search would have helped but it gains for high thread counts when searches are useless. Hackbench is a more extreme example although not perfect as the tasks idle rapidly hackbench-process-pipes 5.13.0-rc2 5.13.0-rc2 vanilla sched-avgidle-v1r5 Amean 1 0.3950 ( 0.00%) 0.3887 ( 1.60%) Amean 4 0.9450 ( 0.00%) 0.9677 ( -2.40%) Amean 7 1.4737 ( 0.00%) 1.4890 ( -1.04%) Amean 12 2.3507 ( 0.00%) 2.3360 * 0.62%* Amean 21 4.0807 ( 0.00%) 4.0993 * -0.46%* Amean 30 5.6820 ( 0.00%) 5.7510 * -1.21%* Amean 48 8.7913 ( 0.00%) 8.7383 ( 0.60%) Amean 79 14.3880 ( 0.00%) 13.9343 * 3.15%* Amean 110 21.2233 ( 0.00%) 19.4263 * 8.47%* Amean 141 28.2930 ( 0.00%) 25.1003 * 11.28%* Amean 172 34.7570 ( 0.00%) 30.7527 * 11.52%* Amean 203 41.0083 ( 0.00%) 36.4267 * 11.17%* Amean 234 47.7133 ( 0.00%) 42.0623 * 11.84%* Amean 265 53.0353 ( 0.00%) 47.7720 * 9.92%* Amean 296 60.0170 ( 0.00%) 53.4273 * 10.98%* Stddev 1 0.0052 ( 0.00%) 0.0025 ( 51.57%) Stddev 4 0.0357 ( 0.00%) 0.0370 ( -3.75%) Stddev 7 0.0190 ( 0.00%) 0.0298 ( -56.64%) Stddev 12 0.0064 ( 0.00%) 0.0095 ( -48.38%) Stddev 21 0.0065 ( 0.00%) 0.0097 ( -49.28%) Stddev 30 0.0185 ( 0.00%) 0.0295 ( -59.54%) Stddev 48 0.0559 ( 0.00%) 0.0168 ( 69.92%) Stddev 79 0.1559 ( 0.00%) 0.0278 ( 82.17%) Stddev 110 1.1728 ( 0.00%) 0.0532 ( 95.47%) Stddev 141 0.7867 ( 0.00%) 0.0968 ( 87.69%) Stddev 172 1.0255 ( 0.00%) 0.0420 ( 95.91%) Stddev 203 0.8106 ( 0.00%) 0.1384 ( 82.92%) Stddev 234 1.1949 ( 0.00%) 0.1328 ( 88.89%) Stddev 265 0.9231 ( 0.00%) 0.0820 ( 91.11%) Stddev 296 1.0456 ( 0.00%) 0.1327 ( 87.31%) Again, higher thread counts benefit and the standard deviation shows that results are also a lot more stable when the idle time is aged. The patch potentially matters when a socket was multiple LLCs as the maximum search depth is lower. However, some of the test results were suspiciously good (e.g. specjbb2005 gaining 50% on a Zen1 machine) and other results were not dramatically different to other mcahines. Given the nature of the patch, Peter's full series is not being forward ported as each part should stand on its own. Preferably they would be merged at different times to reduce the risk of false bisections. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-06-17sched/cpufreq: Consider reduced CPU capacity in energy calculationLukasz Luba2-1/+2
Energy Aware Scheduling (EAS) needs to predict the decisions made by SchedUtil. The map_util_freq() exists to do that. There are corner cases where the max allowed frequency might be reduced (due to thermal). SchedUtil as a CPUFreq governor, is aware of that but EAS is not. This patch aims to address it. SchedUtil stores the maximum allowed frequency in 'sugov_policy::next_freq' field. EAS has to predict that value, which is the real used frequency. That value is made after a call to cpufreq_driver_resolve_freq() which clamps to the CPUFreq policy limits. In the existing code EAS is not able to predict that real frequency. This leads to energy estimation errors. To avoid wrong energy estimation in EAS (due to frequency miss prediction) make sure that the step which calculates Performance Domain frequency, is also aware of the allowed CPU capacity. Furthermore, modify map_util_freq() to not extend the frequency value. Instead, use map_util_perf() to extend the util value in both places: SchedUtil and EAS, but for EAS clamp it to max allowed CPU capacity. In the end, we achieve the same desirable behavior for both subsystems and alignment in regards to the real CPU frequency. Signed-off-by: Lukasz Luba <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> (For the schedutil part) Link: https://lore.kernel.org/r/[email protected]
2021-06-17sched/fair: Take thermal pressure into account while estimating energyLukasz Luba1-3/+8
Energy Aware Scheduling (EAS) needs to be able to predict the frequency requests made by the SchedUtil governor to properly estimate energy used in the future. It has to take into account CPUs utilization and forecast Performance Domain (PD) frequency. There is a corner case when the max allowed frequency might be reduced due to thermal. SchedUtil is aware of that reduced frequency, so it should be taken into account also in EAS estimations. SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping to 'policy::max'. SchedUtil is responsible to respect that upper limit while setting the frequency through CPUFreq drivers. This effective frequency is stored internally in 'sugov_policy::next_freq' and EAS has to predict that value. In the existing code the raw value of arch_scale_cpu_capacity() is used for clamping the returned CPU utilization from effective_cpu_util(). This patch fixes issue with too big single CPU utilization, by introducing clamping to the allowed CPU capacity. The allowed CPU capacity is a CPU capacity reduced by thermal pressure raw value. Thanks to knowledge about allowed CPU capacity, we don't get too big value for a single CPU utilization, which is then added to the util sum. The util sum is used as a source of information for estimating whole PD energy. To avoid wrong energy estimation in EAS (due to capped frequency), make sure that the calculation of util sum is aware of allowed CPU capacity. This thermal pressure might be visible in scenarios where the CPUs are not heavily loaded, but some other component (like GPU) drastically reduced available power budget and increased the SoC temperature. Thus, we still use EAS for task placement and CPUs are not over-utilized. Signed-off-by: Lukasz Luba <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-17sched/fair: Return early from update_tg_cfs_load() if delta == 0Dietmar Eggemann1-1/+4
In case the _avg delta is 0 there is no need to update se's _avg (level n) nor cfs_rq's _avg (level n-1). These values stay the same. Since cfs_rq's _avg isn't changed, i.e. no load is propagated down, cfs_rq's _sum should stay the same as well. So bail out after se's _sum has been updated. Signed-off-by: Dietmar Eggemann <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-17sched/pelt: Check that *_avg are null when *_sum areVincent Guittot1-0/+9
Check that we never break the rule that pelt's avg values are null if pelt's sum are. Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Acked-by: Odin Ugedal <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-17bpf: Fix up register-based shifts in interpreter to silence KUBSANDaniel Borkmann1-18/+43
syzbot reported a shift-out-of-bounds that KUBSAN observed in the interpreter: [...] UBSAN: shift-out-of-bounds in kernel/bpf/core.c:1420:2 shift exponent 255 is too large for 64-bit type 'long long unsigned int' CPU: 1 PID: 11097 Comm: syz-executor.4 Not tainted 5.12.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x141/0x1d7 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:327 ___bpf_prog_run.cold+0x19/0x56c kernel/bpf/core.c:1420 __bpf_prog_run32+0x8f/0xd0 kernel/bpf/core.c:1735 bpf_dispatcher_nop_func include/linux/bpf.h:644 [inline] bpf_prog_run_pin_on_cpu include/linux/filter.h:624 [inline] bpf_prog_run_clear_cb include/linux/filter.h:755 [inline] run_filter+0x1a1/0x470 net/packet/af_packet.c:2031 packet_rcv+0x313/0x13e0 net/packet/af_packet.c:2104 dev_queue_xmit_nit+0x7c2/0xa90 net/core/dev.c:2387 xmit_one net/core/dev.c:3588 [inline] dev_hard_start_xmit+0xad/0x920 net/core/dev.c:3609 __dev_queue_xmit+0x2121/0x2e00 net/core/dev.c:4182 __bpf_tx_skb net/core/filter.c:2116 [inline] __bpf_redirect_no_mac net/core/filter.c:2141 [inline] __bpf_redirect+0x548/0xc80 net/core/filter.c:2164 ____bpf_clone_redirect net/core/filter.c:2448 [inline] bpf_clone_redirect+0x2ae/0x420 net/core/filter.c:2420 ___bpf_prog_run+0x34e1/0x77d0 kernel/bpf/core.c:1523 __bpf_prog_run512+0x99/0xe0 kernel/bpf/core.c:1737 bpf_dispatcher_nop_func include/linux/bpf.h:644 [inline] bpf_test_run+0x3ed/0xc50 net/bpf/test_run.c:50 bpf_prog_test_run_skb+0xabc/0x1c50 net/bpf/test_run.c:582 bpf_prog_test_run kernel/bpf/syscall.c:3127 [inline] __do_sys_bpf+0x1ea9/0x4f00 kernel/bpf/syscall.c:4406 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xae [...] Generally speaking, KUBSAN reports from the kernel should be fixed. However, in case of BPF, this particular report caused concerns since the large shift is not wrong from BPF point of view, just undefined. In the verifier, K-based shifts that are >= {64,32} (depending on the bitwidth of the instruction) are already rejected. The register-based cases were not given their content might not be known at verification time. Ideas such as verifier instruction rewrite with an additional AND instruction for the source register were brought up, but regularly rejected due to the additional runtime overhead they incur. As Edward Cree rightly put it: Shifts by more than insn bitness are legal in the BPF ISA; they are implementation-defined behaviour [of the underlying architecture], rather than UB, and have been made legal for performance reasons. Each of the JIT backends compiles the BPF shift operations to machine instructions which produce implementation-defined results in such a case; the resulting contents of the register may be arbitrary but program behaviour as a whole remains defined. Guard checks in the fast path (i.e. affecting JITted code) will thus not be accepted. The case of division by zero is not truly analogous here, as division instructions on many of the JIT-targeted architectures will raise a machine exception / fault on division by zero, whereas (to the best of my knowledge) none will do so on an out-of-bounds shift. Given the KUBSAN report only affects the BPF interpreter, but not JITs, one solution is to add the ANDs with 63 or 31 into ___bpf_prog_run(). That would make the shifts defined, and thus shuts up KUBSAN, and the compiler would optimize out the AND on any CPU that interprets the shift amounts modulo the width anyway (e.g., confirmed from disassembly that on x86-64 and arm64 the generated interpreter code is the same before and after this fix). The BPF interpreter is slow path, and most likely compiled out anyway as distros select BPF_JIT_ALWAYS_ON to avoid speculative execution of BPF instructions by the interpreter. Given the main argument was to avoid sacrificing performance, the fact that the AND is optimized away from compiler for mainstream archs helps as well as a solution moving forward. Also add a comment on LSH/RSH/ARSH translation for JIT authors to provide guidance when they see the ___bpf_prog_run() interpreter code and use it as a model for a new JIT backend. Reported-by: [email protected] Reported-by: Kurt Manucredo <[email protected]> Signed-off-by: Eric Biggers <[email protected]> Co-developed-by: Eric Biggers <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Tested-by: [email protected] Cc: Edward Cree <[email protected]> Link: https://lore.kernel.org/bpf/[email protected] Link: https://lore.kernel.org/bpf/[email protected]
2021-06-16bpf: Fix typo in kernel/bpf/bpf_lsm.cShuyi Cheng1-1/+1
Fix s/sleeable/sleepable/ typo in a comment. Signed-off-by: Shuyi Cheng <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-06-16crash_core, vmcoreinfo: append 'SECTION_SIZE_BITS' to vmcoreinfoPingfan Liu1-0/+1
As mentioned in kernel commit 1d50e5d0c505 ("crash_core, vmcoreinfo: Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS in the formula: #define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate PAGES_PER_SECTION in makedumpfile just like kernel. Unfortunately, this arch-dependent macro SECTION_SIZE_BITS changes, e.g. recently in kernel commit f0b13ee23241 ("arm64/sparsemem: reduce SECTION_SIZE_BITS"). But user space wants a stable interface to get this info. Such info is impossible to be deduced from a crashdump vmcore. Hence append SECTION_SIZE_BITS to vmcoreinfo. Link: https://lkml.kernel.org/r/[email protected] Link: http://lists.infradead.org/pipermail/kexec/2021-June/022676.html Signed-off-by: Pingfan Liu <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Bhupesh Sharma <[email protected]> Cc: Kazuhito Hagio <[email protected]> Cc: Dave Young <[email protected]> Cc: Boris Petkov <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: James Morse <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Will Deacon <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dave Anderson <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-16printk: Move EXPORT_SYMBOL() closer to vprintk definitionPunit Agrawal1-1/+1
Commit 28e1745b9fa2 ("printk: rename vprintk_func to vprintk") while improving readability by removing vprintk indirection, inadvertently placed the EXPORT_SYMBOL() for the newly renamed function at the end of the file. For reader sanity, and as is convention move the EXPORT_SYMBOL() declaration just after the end of the function. Fixes: 28e1745b9fa2 ("printk: rename vprintk_func to vprintk") Signed-off-by: Punit Agrawal <[email protected]> Acked-by: Rasmus Villemoes <[email protected]> Acked-by: Sergey Senozhatsky <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-15bpf: Support socket migration by eBPF.Kuniyuki Iwashima1-0/+13
This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to check if the attached eBPF program is capable of migrating sockets. When the eBPF program is attached, we run it for socket migration if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE or net.ipv4.tcp_migrate_req is enabled. Currently, the expected_attach_type is not enforced for the BPF_PROG_TYPE_SK_REUSEPORT type of program. Thus, this commit follows the earlier idea in the commit aac3fc320d94 ("bpf: Post-hooks for sys_bind") to fix up the zero expected_attach_type in bpf_prog_load_fixup_attach_type(). Moreover, this patch adds a new field (migrating_sk) to sk_reuseport_md to select a new listener based on the child socket. migrating_sk varies depending on if it is migrating a request in the accept queue or during 3WHS. - accept_queue : sock (ESTABLISHED/SYN_RECV) - 3WHS : request_sock (NEW_SYN_RECV) In the eBPF program, we can select a new listener by BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning SK_DROP. This feature is useful when listeners have different settings at the socket API level or when we want to free resources as soon as possible. - SK_PASS with selected_sk, select it as a new listener - SK_PASS with selected_sk NULL, fallbacks to the random selection - SK_DROP, cancel the migration. There is a noteworthy point. We select a listening socket in three places, but we do not have struct skb at closing a listener or retransmitting a SYN+ACK. On the other hand, some helper functions do not expect skb is NULL (e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer() in BPF_FUNC_skb_load_bytes_relative()). So we allocate an empty skb temporarily before running the eBPF program. Suggested-by: Martin KaFai Lau <[email protected]> Signed-off-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/netdev/[email protected]/ Link: https://lore.kernel.org/netdev/[email protected]/ Link: https://lore.kernel.org/netdev/[email protected]/ Link: https://lore.kernel.org/bpf/[email protected]
2021-06-14bpf: Fix leakage under speculation on mispredicted branchesDaniel Borkmann1-4/+40
The verifier only enumerates valid control-flow paths and skips paths that are unreachable in the non-speculative domain. And so it can miss issues under speculative execution on mispredicted branches. For example, a type confusion has been demonstrated with the following crafted program: // r0 = pointer to a map array entry // r6 = pointer to readable stack slot // r9 = scalar controlled by attacker 1: r0 = *(u64 *)(r0) // cache miss 2: if r0 != 0x0 goto line 4 3: r6 = r9 4: if r0 != 0x1 goto line 6 5: r9 = *(u8 *)(r6) 6: // leak r9 Since line 3 runs iff r0 == 0 and line 5 runs iff r0 == 1, the verifier concludes that the pointer dereference on line 5 is safe. But: if the attacker trains both the branches to fall-through, such that the following is speculatively executed ... r6 = r9 r9 = *(u8 *)(r6) // leak r9 ... then the program will dereference an attacker-controlled value and could leak its content under speculative execution via side-channel. This requires to mistrain the branch predictor, which can be rather tricky, because the branches are mutually exclusive. However such training can be done at congruent addresses in user space using different branches that are not mutually exclusive. That is, by training branches in user space ... A: if r0 != 0x0 goto line C B: ... C: if r0 != 0x0 goto line D D: ... ... such that addresses A and C collide to the same CPU branch prediction entries in the PHT (pattern history table) as those of the BPF program's lines 2 and 4, respectively. A non-privileged attacker could simply brute force such collisions in the PHT until observing the attack succeeding. Alternative methods to mistrain the branch predictor are also possible that avoid brute forcing the collisions in the PHT. A reliable attack has been demonstrated, for example, using the following crafted program: // r0 = pointer to a [control] map array entry // r7 = *(u64 *)(r0 + 0), training/attack phase // r8 = *(u64 *)(r0 + 8), oob address // [...] // r0 = pointer to a [data] map array entry 1: if r7 == 0x3 goto line 3 2: r8 = r0 // crafted sequence of conditional jumps to separate the conditional // branch in line 193 from the current execution flow 3: if r0 != 0x0 goto line 5 4: if r0 == 0x0 goto exit 5: if r0 != 0x0 goto line 7 6: if r0 == 0x0 goto exit [...] 187: if r0 != 0x0 goto line 189 188: if r0 == 0x0 goto exit // load any slowly-loaded value (due to cache miss in phase 3) ... 189: r3 = *(u64 *)(r0 + 0x1200) // ... and turn it into known zero for verifier, while preserving slowly- // loaded dependency when executing: 190: r3 &= 1 191: r3 &= 2 // speculatively bypassed phase dependency 192: r7 += r3 193: if r7 == 0x3 goto exit 194: r4 = *(u8 *)(r8 + 0) // leak r4 As can be seen, in training phase (phase != 0x3), the condition in line 1 turns into false and therefore r8 with the oob address is overridden with the valid map value address, which in line 194 we can read out without issues. However, in attack phase, line 2 is skipped, and due to the cache miss in line 189 where the map value is (zeroed and later) added to the phase register, the condition in line 193 takes the fall-through path due to prior branch predictor training, where under speculation, it'll load the byte at oob address r8 (unknown scalar type at that point) which could then be leaked via side-channel. One way to mitigate these is to 'branch off' an unreachable path, meaning, the current verification path keeps following the is_branch_taken() path and we push the other branch to the verification stack. Given this is unreachable from the non-speculative domain, this branch's vstate is explicitly marked as speculative. This is needed for two reasons: i) if this path is solely seen from speculative execution, then we later on still want the dead code elimination to kick in in order to sanitize these instructions with jmp-1s, and ii) to ensure that paths walked in the non-speculative domain are not pruned from earlier walks of paths walked in the speculative domain. Additionally, for robustness, we mark the registers which have been part of the conditional as unknown in the speculative path given there should be no assumptions made on their content. The fix in here mitigates type confusion attacks described earlier due to i) all code paths in the BPF program being explored and ii) existing verifier logic already ensuring that given memory access instruction references one specific data structure. An alternative to this fix that has also been looked at in this scope was to mark aux->alu_state at the jump instruction with a BPF_JMP_TAKEN state as well as direction encoding (always-goto, always-fallthrough, unknown), such that mixing of different always-* directions themselves as well as mixing of always-* with unknown directions would cause a program rejection by the verifier, e.g. programs with constructs like 'if ([...]) { x = 0; } else { x = 1; }' with subsequent 'if (x == 1) { [...] }'. For unprivileged, this would result in only single direction always-* taken paths, and unknown taken paths being allowed, such that the former could be patched from a conditional jump to an unconditional jump (ja). Compared to this approach here, it would have two downsides: i) valid programs that otherwise are not performing any pointer arithmetic, etc, would potentially be rejected/broken, and ii) we are required to turn off path pruning for unprivileged, where both can be avoided in this work through pushing the invalid branch to the verification stack. The issue was originally discovered by Adam and Ofek, and later independently discovered and reported as a result of Benedict and Piotr's research work. Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation") Reported-by: Adam Morrison <[email protected]> Reported-by: Ofek Kirzner <[email protected]> Reported-by: Benedict Schlueter <[email protected]> Reported-by: Piotr Krysiuk <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: John Fastabend <[email protected]> Reviewed-by: Benedict Schlueter <[email protected]> Reviewed-by: Piotr Krysiuk <[email protected]> Acked-by: Alexei Starovoitov <[email protected]>
2021-06-14bpf: Do not mark insn as seen under speculative path verificationDaniel Borkmann1-2/+18
... in such circumstances, we do not want to mark the instruction as seen given the goal is still to jmp-1 rewrite/sanitize dead code, if it is not reachable from the non-speculative path verification. We do however want to verify it for safety regardless. With the patch as-is all the insns that have been marked as seen before the patch will also be marked as seen after the patch (just with a potentially different non-zero count). An upcoming patch will also verify paths that are unreachable in the non-speculative domain, hence this extension is needed. Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: John Fastabend <[email protected]> Reviewed-by: Benedict Schlueter <[email protected]> Reviewed-by: Piotr Krysiuk <[email protected]> Acked-by: Alexei Starovoitov <[email protected]>
2021-06-14bpf: Inherit expanded/patched seen count from old aux dataDaniel Borkmann1-1/+3
Instead of relying on current env->pass_cnt, use the seen count from the old aux data in adjust_insn_aux_data(), and expand it to the new range of patched instructions. This change is valid given we always expand 1:n with n>=1, so what applies to the old/original instruction needs to apply for the replacement as well. Not relying on env->pass_cnt is a prerequisite for a later change where we want to avoid marking an instruction seen when verified under speculative execution path. Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: John Fastabend <[email protected]> Reviewed-by: Benedict Schlueter <[email protected]> Reviewed-by: Piotr Krysiuk <[email protected]> Acked-by: Alexei Starovoitov <[email protected]>
2021-06-14sched/fair: Correctly insert cfs_rq's to list on unthrottleOdin Ugedal1-19/+25
Fix an issue where fairness is decreased since cfs_rq's can end up not being decayed properly. For two sibling control groups with the same priority, this can often lead to a load ratio of 99/1 (!!). This happens because when a cfs_rq is throttled, all the descendant cfs_rq's will be removed from the leaf list. When they initial cfs_rq is unthrottled, it will currently only re add descendant cfs_rq's if they have one or more entities enqueued. This is not a perfect heuristic. Instead, we insert all cfs_rq's that contain one or more enqueued entities, or it its load is not completely decayed. Can often lead to situations like this for equally weighted control groups: $ ps u -C stress USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 10009 88.8 0.0 3676 100 pts/1 R+ 11:04 0:13 stress --cpu 1 root 10023 3.0 0.0 3676 104 pts/1 R+ 11:04 0:00 stress --cpu 1 Fixes: 31bc6aeaab1d ("sched/fair: Optimize update_blocked_averages()") [vingo: !SMP build fix] Signed-off-by: Odin Ugedal <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-14Revert "cpufreq: CPPC: Add support for frequency invariance"Viresh Kumar1-1/+0
This reverts commit 4c38f2df71c8e33c0b64865992d693f5022eeaad. There are few races in the frequency invariance support for CPPC driver, namely the driver doesn't stop the kthread_work and irq_work on policy exit during suspend/resume or CPU hotplug. A proper fix won't be possible for the 5.13-rc, as it requires a lot of changes. Lets revert the patch instead for now. Fixes: 4c38f2df71c8 ("cpufreq: CPPC: Add support for frequency invariance") Reported-by: Qian Cai <[email protected]> Signed-off-by: Viresh Kumar <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2021-06-12Merge tag 'sched-urgent-2021-06-12' of ↵Linus Torvalds5-24/+24
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Misc fixes: - Fix performance regression caused by lack of intended batching of RCU callbacks by over-eager NOHZ-full code. - Fix cgroups related corruption of load_avg and load_sum metrics. - Three fixes to fix blocked load, util_sum/runnable_sum and util_est tracking bugs" * tag 'sched-urgent-2021-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix util_est UTIL_AVG_UNCHANGED handling sched/pelt: Ensure that *_sum is always synced with *_avg tick/nohz: Only check for RCU deferred wakeup on user/guest entry when needed sched/fair: Make sure to update tg contrib for blocked load sched/fair: Keep load_avg and load_sum synced
2021-06-12Merge tag 'perf-urgent-2021-06-12' of ↵Linus Torvalds2-3/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc fixes: - Fix the NMI watchdog on ancient Intel CPUs - Remove a misguided, NMI-unsafe KASAN callback from the NMI-safe irq_work path used by perf. - Fix uncore events on Ice Lake servers. - Someone booted maxcpus=1 on an SNB-EP, and the uncore driver emitted warnings and was probably buggy. Fix it. - KCSAN found a genuine data race in the core perf code. Somewhat ironically the bug was introduced through a recent race fix. :-/ In our defense, the new race window was much more narrow. Fix it" * tag 'perf-urgent-2021-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/nmi_watchdog: Fix old-style NMI watchdog regression on old Intel CPUs irq_work: Make irq_work_queue() NMI-safe again perf/x86/intel/uncore: Fix M2M event umask for Ice Lake server perf/x86/intel/uncore: Fix a kernel WARNING triggered by maxcpus=1 perf: Fix data race between pin_count increment/decrement
2021-06-11Merge tag 'trace-v5.13-rc5-2' of ↵Linus Torvalds2-2/+8
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: - Fix the length check in the temp buffer filter - Fix build failure in bootconfig tools for "fallthrough" macro - Fix error return of bootconfig apply_xbc() routine * tag 'trace-v5.13-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Correct the length check which causes memory corruption ftrace: Do not blindly read the ip address in ftrace_bug() tools/bootconfig: Fix a build error accroding to undefined fallthrough tools/bootconfig: Fix error return code in apply_xbc()
2021-06-11PM: hibernate: remove leading spaces before tabsZhen Lei1-1/+1
1) Run the following command to find and remove the leading spaces before tabs: $ find kernel/power/ -type f | xargs sed -r -i 's/^[ ]+\t/\t/' 2) Manually check and correct if necessary Signed-off-by: Zhen Lei <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2021-06-11PM: sleep: remove trailing spaces and tabsZhen Lei2-7/+7
Run the following command to find and remove the trailing spaces and tabs: $ find kernel/power/ -type f | xargs sed -r -i 's/[ \t]+$//' Signed-off-by: Zhen Lei <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2021-06-10audit: remove trailing spaces and tabsZhen Lei2-5/+5
Run the following command to find and remove the trailing spaces and tabs: sed -r -i 's/[ \t]+$//' <audit_files> The files to be checked are as follows: kernel/audit* include/linux/audit.h include/uapi/linux/audit.h Signed-off-by: Zhen Lei <[email protected]> Acked-by: Richard Guy Briggs <[email protected]> Signed-off-by: Paul Moore <[email protected]>
2021-06-10Merge branch 'for-5.13-fixes' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "This is a high priority but low risk fix for a cgroup1 bug where rename(2) can change a cgroup's name to something which can break parsing of /proc/PID/cgroup" * 'for-5.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup1: don't allow '\n' in renaming
2021-06-10tracing: Add better comments for the filtering temp buffer use caseSteven Rostedt (VMware)1-1/+35
When filtering is enabled, the event is copied into a temp buffer instead of being written into the ring buffer directly, because the discarding of events from the ring buffer is very expensive, and doing the extra copy is much faster than having to discard most of the time. As that logic is subtle, add comments to explain in more detail to what is going on and how it works. Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-06-10tracing: Simplify the max length test when using the filtering temp bufferSteven Rostedt (VMware)1-1/+3
When filtering trace events, a temp buffer is used because the extra copy from the temp buffer into the ring buffer is still faster than the direct write into the ring buffer followed by a discard if the filter does not match. But the data that can be stored in the temp buffer is a PAGE_SIZE minus the ring buffer event header. The calculation of that header size is complex, but using the helper macro "struct_size()" can simplify it. Link: https://lore.kernel.org/stable/CAHk-=whKbJkuVmzb0hD3N6q7veprUrSpiBHRxVY=AffWZPtxmg@mail.gmail.com/ Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-06-10tracing/boot: Add per-group/all events enablementMasami Hiramatsu1-2/+25
Add ftrace.event.<GROUP>.enable and ftrace.event.enable boot-time tracing, which enables all events under given GROUP and all events respectivly. Link: https://lkml.kernel.org/r/162264438005.302580.12019174481201855444.stgit@devnote2 Signed-off-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-06-10tracing: Add WARN_ON_ONCE when returned value is negativeHyeonggon Yoo1-0/+1
ret is assigned return value of event_hist_trigger_func, but the value is unused. It is better to warn when returned value is negative, rather than just ignoring it. Link: https://lkml.kernel.org/r/20210529061423.GA103954@hyeyoo Signed-off-by: Hyeonggon Yoo <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-06-10tracing: Fix set_named_trigger_data() kernel-doc commentQiujun Huang1-1/+2
Fix the description of the parameters. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qiujun Huang <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-06-10tracing: Remove redundant initialization of variable retColin Ian King1-1/+1
The variable ret is being initialized with a value that is never read, it is being updated later on. The assignment is redundant and can be removed. Link: https://lkml.kernel.org/r/[email protected] Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-06-10ring-buffer: Use fallthrough pseudo-keywordWei Ming Chen1-1/+1
Replace /* fall through */ comment with pseudo-keyword macro fallthrough[1] [1] https://www.kernel.org/doc/html/latest/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Ming Chen <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-06-10tracing: Remove redundant assignment to event_varJiapeng Chong1-1/+1
Variable event_var is set to 'ERR_PTR(-EINVAL)', but this value is never read as it is overwritten or not used later on, hence it is a redundant assignment and can be removed. Clean up the following clang-analyzer warning: kernel/trace/trace_events_hist.c:2437:21: warning: Value stored to 'event_var' during its initialization is never read [clang-analyzer-deadcode.DeadStores]. Link: https://lkml.kernel.org/r/1620470236-26562-1-git-send-email-jiapeng.chong@linux.alibaba.com Reported-by: Abaci Robot <[email protected]> Signed-off-by: Jiapeng Chong <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-06-10scsi: cgroup: Add cgroup_get_from_id()Muneendra Kumar1-0/+26
Add a new function, cgroup_get_from_id(), to retrieve the cgroup associated with a cgroup id. Also export the function cgroup_get_e_css() as this is needed in blk-cgroup.h. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Himanshu Madhani <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Acked-by: Tejun Heo <[email protected]> Signed-off-by: Muneendra Kumar <[email protected]> Signed-off-by: Martin K. Petersen <[email protected]>
2021-06-10cgroup1: don't allow '\n' in renamingAlexander Kuznetsov1-0/+4
cgroup_mkdir() have restriction on newline usage in names: $ mkdir $'/sys/fs/cgroup/cpu/test\ntest2' mkdir: cannot create directory '/sys/fs/cgroup/cpu/test\ntest2': Invalid argument But in cgroup1_rename() such check is missed. This allows us to make /proc/<pid>/cgroup unparsable: $ mkdir /sys/fs/cgroup/cpu/test $ mv /sys/fs/cgroup/cpu/test $'/sys/fs/cgroup/cpu/test\ntest2' $ echo $$ > $'/sys/fs/cgroup/cpu/test\ntest2' $ cat /proc/self/cgroup 11:pids:/ 10:freezer:/ 9:hugetlb:/ 8:cpuset:/ 7:blkio:/user.slice 6:memory:/user.slice 5:net_cls,net_prio:/ 4:perf_event:/ 3:devices:/user.slice 2:cpu,cpuacct:/test test2 1:name=systemd:/ 0::/ Signed-off-by: Alexander Kuznetsov <[email protected]> Reported-by: Andrey Krasichkov <[email protected]> Acked-by: Dmitry Yakunin <[email protected]> Cc: [email protected] Signed-off-by: Tejun Heo <[email protected]>
2021-06-10genirq: Move non-irqdomain handle_domain_irq() handling into ARM's handle_IRQ()Marc Zyngier1-22/+8
Despite the name, handle_domain_irq() deals with non-irqdomain handling for the sake of a handful of legacy ARM platforms. Move such handling into ARM's handle_IRQ(), allowing for better code generation for everyone else. This allows us get rid of some complexity, and to rearrange the guards on the various helpers in a more logical way. Signed-off-by: Marc Zyngier <[email protected]>
2021-06-10genirq: Add generic_handle_domain_irq() helperMarc Zyngier1-1/+18
Provide generic_handle_domain_irq() as a pendent to handle_domain_irq() for non-root interrupt controllers Signed-off-by: Marc Zyngier <[email protected]>
2021-06-10genirq: Use irq_resolve_mapping() to implement __handle_domain_irq() and coMarc Zyngier1-25/+35
In order to start reaping the benefits of irq_resolve_mapping(), start using it in __handle_domain_irq() and handle_domain_nmi(). This involves splitting generic_handle_irq() to be able to directly provide the irq_desc. Signed-off-by: Marc Zyngier <[email protected]>
2021-06-10irqdomain: Introduce irq_resolve_mapping()Marc Zyngier1-8/+20
Rework irq_find_mapping() to return an both an irq_desc pointer, optionally the virtual irq number, and rename the result to __irq_resolve_mapping(). a new helper called irq_resolve_mapping() is provided for code that doesn't need the virtual irq number. irq_find_mapping() is also rewritten in terms of __irq_resolve_mapping(). Signed-off-by: Marc Zyngier <[email protected]>
2021-06-10irqdomain: Protect the linear revmap with RCUMarc Zyngier1-26/+23
It is pretty odd that the radix tree uses RCU while the linear portion doesn't, leading to potential surprises for the users, depending on how the irqdomain has been created. Fix this by moving the update of the linear revmap under the mutex, and the lookup under the RCU read-side lock. The mutex name is updated to reflect that it doesn't only cover the radix-tree anymore. Signed-off-by: Marc Zyngier <[email protected]>