path: root/kernel
Age | Commit message | Author | Files | Lines
2022-05-12 | stop_machine: Add stop_core_cpuslocked() for per-core operations | Peter Zijlstra | 1 | -0/+21
Hardware core level testing features require near simultaneous execution of WRMSR instructions on all threads of a core to initiate a test. Provide a customized cut down version of stop_machine_cpuslocked() that just operates on the threads of a single core. Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Tony Luck <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Hans de Goede <[email protected]>
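As a rough illustration of how such a per-core stop primitive might be used (a sketch only: the callback, the MSR name, and the assumption that the prototype mirrors stop_machine_cpuslocked() are all illustrative, not taken from the patch):

    /* Assumed prototype, modeled on stop_machine_cpuslocked():
     *   int stop_core_cpuslocked(unsigned int cpu, cpu_stop_fn_t fn, void *data);
     */
    static int dotest_wrmsr(void *data)
    {
            u64 *activate = data;

            wrmsrl(MSR_ACTIVATE_SCAN, *activate);   /* MSR name is illustrative */
            return 0;
    }

    static int start_core_test(unsigned int cpu, u64 activate)
    {
            int ret;

            cpus_read_lock();
            /* Runs dotest_wrmsr() near-simultaneously on all SMT siblings of 'cpu'. */
            ret = stop_core_cpuslocked(cpu, dotest_wrmsr, &activate);
            cpus_read_unlock();
            return ret;
    }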
2022-05-11 | bpf: Fix potential array overflow in bpf_trampoline_get_progs() | Yuntao Wang | 1 | -6/+12
The cnt value in the 'cnt >= BPF_MAX_TRAMP_PROGS' check does not include BPF_TRAMP_MODIFY_RETURN bpf programs, so the number of the attached BPF_TRAMP_MODIFY_RETURN bpf programs in a trampoline can exceed BPF_MAX_TRAMP_PROGS. When this happens, the assignment '*progs++ = aux->prog' in bpf_trampoline_get_progs() will cause progs array overflow as the progs field in the bpf_tramp_progs struct can only hold at most BPF_MAX_TRAMP_PROGS bpf programs. Fixes: 88fd9e5352fe ("bpf: Refactor trampoline update code") Signed-off-by: Yuntao Wang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-05-11 | bpf: add bpf_map_lookup_percpu_elem for percpu map | Feng Zhou | 6 | -2/+83
Add a new eBPF helper, bpf_map_lookup_percpu_elem. The implementation is straightforward: it follows the map_lookup_elem implementation for percpu maps, adds a cpu parameter, and looks the value up on the specified CPU. Signed-off-by: Feng Zhou <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
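A minimal BPF-side sketch of how the new helper might be used, assuming a kernel with this commit, a libbpf recent enough to declare bpf_map_lookup_percpu_elem, and a program type that is allowed to call it; the map and program names are made up:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
            __uint(max_entries, 1);
            __type(key, __u32);
            __type(value, __u64);
    } counters SEC(".maps");

    SEC("tracepoint/syscalls/sys_enter_getpid")   /* attach point is arbitrary here */
    int read_remote_cpu(void *ctx)
    {
            __u32 key = 0, cpu = 1;   /* read CPU 1's slot from whichever CPU runs this */
            __u64 *val;

            val = bpf_map_lookup_percpu_elem(&counters, &key, cpu);
            if (val)
                    bpf_printk("cpu1 counter: %llu", *val);
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";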
2022-05-12 | sched/tracing: Append prev_state to tp args instead | Delyan Kratunov | 7 | -15/+15
Commit fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting sched_switch event, 2022-01-20) added a new prev_state argument to the sched_switch tracepoint, before the prev task_struct pointer. This reordering of arguments broke BPF programs that use the raw tracepoint (e.g. tp_btf programs). The type of the second argument has changed and existing programs that assume a task_struct* argument (e.g. for bpf_task_storage access) will now fail to verify. If we instead append the new argument to the end, all existing programs would continue to work and can conditionally extract the prev_state argument on supported kernel versions. Fixes: fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting sched_switch event, 2022-01-20) Signed-off-by: Delyan Kratunov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
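For illustration, a tp_btf program written against the new argument order would look roughly like the sketch below (assumes a CO-RE build with vmlinux.h; on kernels before this change prev_state sits in the second position instead, and portable programs would detect the layout via BTF):

    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    SEC("tp_btf/sched_switch")
    int BPF_PROG(on_switch, bool preempt, struct task_struct *prev,
                 struct task_struct *next, unsigned int prev_state)
    {
            /* prev is again the second argument, so helpers expecting a
             * task_struct pointer (e.g. bpf_task_storage_get()) keep working;
             * prev_state is appended at the end and can be read conditionally
             * on kernels that provide it.
             */
            bpf_printk("switch from %d (state %u) to %d",
                       prev->pid, prev_state, next->pid);
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";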
2022-05-11 | sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state | Peter Zijlstra | 2 | -5/+21
Currently ptrace_stop() / do_signal_stop() rely on the special states TASK_TRACED and TASK_STOPPED respectively to keep unique state. That is, this state exists only in task->__state and nowhere else. There are two spots of bother with this:

- PREEMPT_RT has task->saved_state which complicates matters, meaning task_is_{traced,stopped}() needs to check an additional variable.
- An alternative freezer implementation that itself relies on a special TASK state would lose TASK_TRACED/TASK_STOPPED and would result in misbehaviour.

As such, add additional state to task->jobctl to track this state outside of task->__state. NOTE: this doesn't actually fix anything yet, just adds extra state.

--EWB
* didn't add an unnecessary newline in signal.h
* Update t->jobctl in signal_wake_up and ptrace_signal_wake_up instead of in signal_wake_up_state. This prevents the clearing of TASK_STOPPED and TASK_TRACED from getting lost.
* Added warnings if JOBCTL_STOPPED or JOBCTL_TRACED are not cleared

Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Tested-by: Kees Cook <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric W. Biederman <[email protected]>
2022-05-11 | ptrace: Always take siglock in ptrace_resume | Eric W. Biederman | 1 | -11/+2
Make code analysis simpler and future changes easier by always taking siglock in ptrace_resume. Tested-by: Kees Cook <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
2022-05-11 | ptrace: Don't change __state | Eric W. Biederman | 3 | -25/+15
Stop playing with tsk->__state to remove TASK_WAKEKILL while a ptrace command is executing. Instead remove TASK_WAKEKILL from the definition of TASK_TRACED, and implement a new jobctl flag JOBCTL_PTRACE_FROZEN. This new flag is set in jobctl_freeze_task and cleared when ptrace_stop is awoken or in jobctl_unfreeze_task (when ptrace_stop remains asleep). In signal_wake_up, add __TASK_TRACED to state along with TASK_WAKEKILL when the wake up is for a fatal signal. Skip adding __TASK_TRACED when JOBCTL_PTRACE_FROZEN is not set. This has the same effect as changing TASK_TRACED to __TASK_TRACED, as all of the wake_ups that use TASK_KILLABLE go through signal_wake_up. Handle a ptrace_stop being called with a pending fatal signal. Previously it would have been handled by schedule simply failing to sleep. As TASK_WAKEKILL is no longer part of TASK_TRACED, schedule will sleep with a fatal_signal_pending. The code in signal_wake_up guarantees that the task will be awakened by any fatal signal that comes after TASK_TRACED is set. Previously the __state value of __TASK_TRACED was changed to TASK_RUNNING when woken up, or back to TASK_TRACED when the code was left in ptrace_stop. Now, when woken up, ptrace_stop clears JOBCTL_PTRACE_FROZEN, and when left sleeping, ptrace_unfreeze_traced clears JOBCTL_PTRACE_FROZEN. Tested-by: Kees Cook <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
2022-05-11 | ptrace: Admit ptrace_stop can generate spurious SIGTRAPs | Eric W. Biederman | 1 | -54/+38
Long ago and far away there was a BUG_ON at the start of ptrace_stop that did "BUG_ON(!(current->ptrace & PT_PTRACED));" [1]. The BUG_ON had never triggered but examination of the code showed that the BUG_ON could actually trigger. To complement removing the BUG_ON an attempt to better handle the race was added. The code detected the tracer had gone away and did not call do_notify_parent_cldstop. The code also attempted to prevent ptrace_report_syscall from sending spurious SIGTRAPs when the tracer went away. The code to detect when the tracer had gone away before sending a signal to tracer was a legitimate fix and continues to work to this date. The code to prevent sending spurious SIGTRAPs is a failure. At the time and until today the code only catches it when the tracer goes away after siglock is dropped and before read_lock is acquired. If the tracer goes away after read_lock is dropped a spurious SIGTRAP can still be sent to the tracee. The tracer going away after read_lock is dropped is the far likelier case as it is the bigger window. Given that the attempt to prevent the generation of a SIGTRAP was a failure and continues to be a failure remove the code that attempts to do that. This simplifies the code in ptrace_stop and makes ptrace_stop much easier to reason about. To successfully deal with the tracer going away, all of the tracer's instrumentation of the child would need to be removed, and reliably detecting when the tracer has set a signal to continue with would need to be implemented. [1] 66519f549ae5 ("[PATCH] fix ptracer death race yielding bogus BUG_ON") History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Tested-by: Kees Cook <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
2022-05-11 | ptrace: Document that wait_task_inactive can't fail | Eric W. Biederman | 1 | -11/+3
After ptrace_freeze_traced succeeds it is known that the tracee has a __state value of __TASK_TRACED and that no __ptrace_unlink will happen, because the tracer is waiting for the tracee, and the tracee is in ptrace_stop. The function ptrace_freeze_traced can succeed at any point after ptrace_stop has set TASK_TRACED and dropped siglock. The read_lock on tasklist_lock only excludes ptrace_attach. This means that the !current->ptrace branch, which executes under a read_lock of tasklist_lock, will never see a ptrace_freeze_traced, as the tracer must have gone away before the tasklist_lock was taken and ptrace_attach cannot occur until the read_lock is dropped. As ptrace_freeze_traced depends upon ptrace_attach running before it can run, that excludes ptrace_freeze_traced until __state is set to TASK_RUNNING. This means that task_is_traced will fail in ptrace_freeze_traced and ptrace_freeze_traced will fail. On the current->ptrace branch of ptrace_stop, which will be reached any time after ptrace_freeze_traced has succeeded, it is known that __state is __TASK_TRACED and schedule() will be called with that state. Use a WARN_ON_ONCE to document that wait_task_inactive(TASK_TRACED) should never fail. Remove the stale comment about may_ptrace_stop. Strictly speaking this is not true, because if PREEMPT_RT is enabled wait_task_inactive can fail because __state can be changed. I don't see this as a problem, as the ptrace code is currently broken on PREEMPT_RT, and this is one of the issues. Failing and warning when the assumptions of the code are broken is good. Tested-by: Kees Cook <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
2022-05-11 | ptrace: Reimplement PTRACE_KILL by always sending SIGKILL | Eric W. Biederman | 1 | -3/+2
The current implementation of PTRACE_KILL is buggy and has been for many years, as it assumes its target has stopped in ptrace_stop. At a quick skim it looks like this assumption has existed since ptrace support was added in linux v1.0. While PTRACE_KILL has been deprecated we can not remove it, as a quick search with google code search reveals many existing programs calling it. When the ptracee is not stopped at ptrace_stop, some fields would be set that are ignored except in ptrace_stop, making the userspace-visible behavior of PTRACE_KILL a noop in those cases. As the usual rules are not obeyed it is not clear what the consequences are of calling PTRACE_KILL on a running process. Presumably userspace does not do this, as it achieves nothing. Replace the implementation of PTRACE_KILL with a simple send_sig_info(SIGKILL) followed by a return 0. This changes the observable user space behavior only in that PTRACE_KILL on a process not stopped in ptrace_stop will also kill it. As that has always been the intent of the code this seems like a reasonable change. Cc: [email protected] Reported-by: Al Viro <[email protected]> Suggested-by: Al Viro <[email protected]> Tested-by: Kees Cook <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
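Based on the description above, the new PTRACE_KILL handling reduces to something like the following sketch inside the ptrace request switch (simplified; see the actual patch for the final form):

    case PTRACE_KILL:
            /* Deprecated: just deliver SIGKILL whether or not the
             * tracee is stopped in ptrace_stop(). */
            send_sig_info(SIGKILL, SEND_SIG_NOINFO, child);
            ret = 0;
            break;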
2022-05-11 | signal: Use lockdep_assert_held instead of assert_spin_locked | Eric W. Biederman | 1 | -2/+2
The distinction is that assert_spin_locked() checks if the lock is held *by*anyone* whereas lockdep_assert_held() asserts the current context holds the lock. Also, the check goes away if you build without lockdep. Suggested-by: Peter Zijlstra <[email protected]> Link: https://lkml.kernel.org/r/Ympr/+PX4XgT/[email protected] Tested-by: Kees Cook <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
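The change is essentially of this shape (illustrative; 't' stands for the target task whose siglock is being asserted):

    /* Before: only checks that *somebody* holds the lock, and still
     * compiles to a real check without lockdep. */
    assert_spin_locked(&t->sighand->siglock);

    /* After: asserts that the *current* context holds the lock, and
     * compiles away entirely when lockdep is disabled. */
    lockdep_assert_held(&t->sighand->siglock);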
2022-05-11 | ptrace: Remove arch_ptrace_attach | Eric W. Biederman | 1 | -18/+0
The last remaining implementation of arch_ptrace_attach is ia64's ptrace_attach_sync_user_rbs, which was added at the end of 2007 in commit aa91a2e90044 ("[IA64] Synchronize RBS on PTRACE_ATTACH"). Reading the comments and examining the code, ptrace_attach_sync_user_rbs has the sole purpose of saving registers to the stack when ptrace_attach changes TASK_STOPPED to TASK_TRACED. In all other cases arch_ptrace_stop takes care of the register saving. Commit d79fdd6d96f4 ("ptrace: Clean transitions between TASK_STOPPED and TRACED") modified ptrace_attach to wake up the thread and enter ptrace_stop normally even when the thread starts out stopped. This makes ptrace_attach_sync_user_rbs completely unnecessary. So just remove it. I read through the code to verify that ptrace_attach_sync_user_rbs is unnecessary. What I found is that the code is quite dead. Reading ptrace_attach_sync_user_rbs, it is easy to see that it does nothing unless __state == TASK_STOPPED. Looking at the call of arch_ptrace_attach (aka ptrace_attach_sync_user_rbs) after ptrace_traceme, it is easy to see that, because we are talking about the current process, the value of __state is TASK_RUNNING. Which means ptrace_attach_sync_user_rbs does nothing. The only other call of arch_ptrace_attach (aka ptrace_attach_sync_user_rbs) is after ptrace_attach. If the task is running (and PTRACE_SEIZE is not specified), a SIGSTOP is sent which results in do_signal_stop setting JOBCTL_TRAP_STOP on the target task (as it is ptraced) and the target task stopping in ptrace_stop with __state == TASK_TRACED. If the task was already stopped then ptrace_attach sets JOBCTL_TRAPPING and JOBCTL_TRAP_STOP, wakes it out of __TASK_STOPPED, and waits until the JOBCTL_TRAPPING_BIT is clear. At which point the task stops in ptrace_stop. In both cases there are a couple of funny exceptions, such as if the traced task receives a SIGCONT or is sent a fatal signal. However in all of those cases the tracee never stops in __state TASK_STOPPED. Which is a long way of saying that ptrace_attach_sync_user_rbs is guaranteed never to do anything. Cc: [email protected] Tested-by: Kees Cook <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
2022-05-11 | signal: Replace __group_send_sig_info with send_signal_locked | Eric W. Biederman | 2 | -10/+4
The function __group_send_sig_info is just a light wrapper around send_signal_locked with one parameter fixed to a constant value. As the wrapper adds no real value, update the code to directly call the wrapped function. Tested-by: Kees Cook <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
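For reference, the wrapper being removed is roughly this thin (a sketch; the fixed parameter is the pid type, assumed here to be PIDTYPE_TGID):

    int __group_send_sig_info(int sig, struct kernel_siginfo *info,
                              struct task_struct *p)
    {
            /* Nothing but a constant final argument. */
            return send_signal_locked(sig, info, p, PIDTYPE_TGID);
    }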
2022-05-11 | signal: Rename send_signal send_signal_locked | Eric W. Biederman | 1 | -12/+12
Rename send_signal and __send_signal to send_signal_locked and __send_signal_locked to make send_signal usable outside of signal.c. Tested-by: Kees Cook <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
2022-05-11 | Merge branch 'exp.2022.05.11a' into HEAD | Paul E. McKenney | 8 | -36/+236
exp.2022.05.11a: Expedited-grace-period latency-reduction updates.
2022-05-11 | rcu: Move expedited grace period (GP) work to RT kthread_worker | Kalesh Singh | 5 | -34/+188
Enabling CONFIG_RCU_BOOST did not reduce RCU expedited grace-period latency because its workqueues run at SCHED_OTHER, and thus can be delayed by normal processes. This commit avoids these delays by moving the expedited GP work items to a real-time-priority kthread_worker. This option is controlled by CONFIG_RCU_EXP_KTHREAD and disabled by default on PREEMPT_RT=y kernels, which disable expedited grace periods after boot by unconditionally setting rcupdate.rcu_normal_after_boot=1. The results were evaluated on arm64 Android devices (6GB RAM) running a 5.10 kernel, capturing trace data in critical user-level code. The table below shows the resulting order-of-magnitude improvements in synchronize_rcu_expedited() latency:

  ------------------------------------------------------------------------
  |                          | workqueues  | kthread_worker | Diff     |
  ------------------------------------------------------------------------
  | Count                    | 725         | 688            |          |
  | Min Duration (ns)        | 326         | 447            | 37.12%   |
  | Q1 (ns)                  | 39,428      | 38,971         | -1.16%   |
  | Q2 - Median (ns)         | 98,225      | 69,743         | -29.00%  |
  | Q3 (ns)                  | 342,122     | 126,638        | -62.98%  |
  | Max Duration (ns)        | 372,766,967 | 2,329,671      | -99.38%  |
  | Avg Duration (ns)        | 2,746,353   | 151,242        | -94.49%  |
  | Standard Deviation (ns)  | 19,327,765  | 294,408        |          |
  ------------------------------------------------------------------------

The table below shows the range of maximums/minimums for synchronize_rcu_expedited() latency from all experiments:

  ------------------------------------------------------------------------
  |                          | workqueues  | kthread_worker | Diff     |
  ------------------------------------------------------------------------
  | Total No. of Experiments | 25          | 23             |          |
  | Largest Maximum (ns)     | 372,766,967 | 2,329,671      | -99.38%  |
  | Smallest Maximum (ns)    | 38,819      | 86,954         | 124.00%  |
  | Range of Maximums (ns)   | 372,728,148 | 2,242,717      |          |
  | Largest Minimum (ns)     | 88,623      | 27,588         | -68.87%  |
  | Smallest Minimum (ns)    | 326         | 447            | 37.12%   |
  | Range of Minimums (ns)   | 88,297      | 27,141         |          |
  ------------------------------------------------------------------------

Cc: "Paul E. McKenney" <[email protected]> Cc: Tejun Heo <[email protected]> Reported-by: Tim Murray <[email protected]> Reported-by: Wei Wang <[email protected]> Tested-by: Kyle Lin <[email protected]> Tested-by: Chunwei Lu <[email protected]> Tested-by: Lulu Wang <[email protected]> Signed-off-by: Kalesh Singh <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
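A schematic of the mechanism described above (not the actual RCU patch): create a kthread_worker, raise it to a real-time priority, and queue the expedited-GP work on it instead of on a workqueue. All names and the chosen priority are illustrative.

    #include <linux/kthread.h>
    #include <uapi/linux/sched/types.h>

    static struct kthread_worker *rcu_exp_gp_kworker;   /* illustrative name */
    static struct kthread_work   rcu_exp_gp_kwork;

    static void rcu_exp_gp_fn(struct kthread_work *work)
    {
            /* expedited grace-period machinery would run here */
    }

    static int __init rcu_exp_worker_init(void)
    {
            struct sched_param param = { .sched_priority = 2 };   /* assumed prio */

            rcu_exp_gp_kworker = kthread_create_worker(0, "rcu_exp_gp_kthread_worker");
            if (IS_ERR(rcu_exp_gp_kworker))
                    return PTR_ERR(rcu_exp_gp_kworker);

            /* SCHED_FIFO keeps expedited GPs from being delayed by SCHED_OTHER load. */
            sched_setscheduler_nocheck(rcu_exp_gp_kworker->task, SCHED_FIFO, &param);
            kthread_init_work(&rcu_exp_gp_kwork, rcu_exp_gp_fn);
            return 0;
    }

    /* later, instead of queue_work():
     *     kthread_queue_work(rcu_exp_gp_kworker, &rcu_exp_gp_kwork);
     */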
2022-05-11 | rcu: Introduce CONFIG_RCU_EXP_CPU_STALL_TIMEOUT | Uladzislau Rezki | 5 | -2/+48
Currently both expedited and regular grace period stall warnings use a single timeout value with units of seconds. However, recent Android use cases require a sub-100-millisecond expedited RCU CPU stall warning. Given that expedited RCU grace periods normally complete in far less than a single millisecond, especially for small systems, this is not unreasonable. Therefore introduce the CONFIG_RCU_EXP_CPU_STALL_TIMEOUT kernel configuration, which defaults to 20 msec on Android and otherwise remains the same as that of the non-expedited stall warnings. It can also be changed at run time via /sys/.../parameters/rcu_exp_cpu_stall_timeout. [ paulmck: Default of zero to use CONFIG_RCU_STALL_TIMEOUT. ] Signed-off-by: Uladzislau Rezki <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2022-05-11 | dma-debug: change allocation mode from GFP_NOWAIT to GFP_ATOMIC | Mikulas Patocka | 1 | -1/+1
We observed the error "cacheline tracking ENOMEM, dma-debug disabled" during a light system load (copying some files). The reason for this error is that the dma_active_cacheline radix tree uses GFP_NOWAIT allocation - so it can't access the emergency memory reserves and it fails as soon as anybody reaches the watermark. This patch changes GFP_NOWAIT to GFP_ATOMIC, so that it can access the emergency memory reserves. Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-05-11 | dma-direct: don't fail on highmem CMA pages in dma_direct_alloc_pages | Christoph Hellwig | 1 | -17/+10
When dma_direct_alloc_pages encounters a highmem page it currently just gives up. But what we really should do is try to allocate the memory using the page allocator instead; without this, platforms with a global highmem CMA pool will fail all dma_alloc_pages allocations. Fixes: efa70f2fdc84 ("dma-mapping: add a new dma_alloc_pages API") Reported-by: Mark O'Neill <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-05-11 | sched: Update task_tick_numa to ignore tasks without an mm | Eric W. Biederman | 1 | -1/+1
Qian Cai <[email protected]> wrote:
> Reverting the last 3 commits of the series fixed a boot crash.
>
> 1b2552cbdbe0 fork: Stop allowing kthreads to call execve
> 753550eb0ce1 fork: Explicitly set PF_KTHREAD
> 68d85f0a33b0 init: Deal with the init process being a user mode process
>
> BUG: KASAN: null-ptr-deref in task_nr_scan_windows.isra.0
> arch_atomic_long_read at ./include/linux/atomic/atomic-long.h:29
> (inlined by) atomic_long_read at ./include/linux/atomic/atomic-instrumented.h:1266
> (inlined by) get_mm_counter at ./include/linux/mm.h:1996
> (inlined by) get_mm_rss at ./include/linux/mm.h:2049
> (inlined by) task_nr_scan_windows at kernel/sched/fair.c:1123
> Read of size 8 at addr 00000000000003d0 by task swapper/0/1

With the change to init and the user mode helper processes to not have PF_KTHREAD set before they call kernel_execve, the PF_KTHREAD test in task_tick_numa became insufficient to detect all tasks that have "->mm == NULL". Correct that by testing for "->mm == NULL" directly. Reported-by: Qian Cai <[email protected]> Tested-by: Qian Cai <[email protected]> Fixes: 1b2552cbdbe0 ("fork: Stop allowing kthreads to call execve") Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
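The fix described above amounts to guarding on the mm pointer rather than on PF_KTHREAD alone, roughly as in this sketch (not the verbatim hunk):

    static void task_tick_numa(struct rq *rq, struct task_struct *curr)
    {
            struct callback_head *work = &curr->numa_work;

            /*
             * A task without an mm (e.g. init before kernel_execve(), user
             * mode helpers, exiting tasks) has no NUMA state to scan;
             * PF_KTHREAD alone no longer catches all of these.
             */
            if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
                work->next != work)
                    return;

            /* ... NUMA scan period accounting continues here ... */
    }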
2022-05-11 | PM: EM: Decrement policy counter | Pierre Gondois | 1 | -0/+2
In commit e458716a92b57 ("PM: EM: Mark inefficiencies in CPUFreq"), cpufreq_cpu_get() is called without a cpufreq_cpu_put(), permanently increasing the reference counts of the policy struct. Decrement the reference count once the policy struct is not used anymore. Fixes: e458716a92b57 ("PM: EM: Mark inefficiencies in CPUFreq") Tested-by: Cristian Marussi <[email protected]> Signed-off-by: Pierre Gondois <[email protected]> Reviewed-by: Vincent Donnefort <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
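The pattern being fixed is the classic get/put imbalance; in sketch form (illustrative, not the exact hunk):

    struct cpufreq_policy *policy;

    policy = cpufreq_cpu_get(cpu);          /* takes a reference */
    if (!policy)
            return;

    /* ... consult the policy to mark inefficient frequencies ... */

    cpufreq_cpu_put(policy);                /* the previously missing release */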
2022-05-11 | sched/deadline: Remove superfluous rq clock update in push_dl_task() | Hao Jia | 1 | -7/+1
The change to call update_rq_clock() before activate_task() commit 840d719604b0 ("sched/deadline: Update rq_clock of later_rq when pushing a task") is no longer needed since commit f4904815f97a ("sched/deadline: Fix double accounting of rq/running bw in push & pull") removed the add_running_bw() before the activate_task(). So we remove some comments that are no longer needed and update rq clock in activate_task(). Signed-off-by: Hao Jia <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Reviewed-by: Daniel Bristot de Oliveira <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-05-11 | sched/core: Avoid obvious double update_rq_clock warning | Hao Jia | 4 | -11/+33
When we use raw_spin_rq_lock() to acquire the rq lock and have to update the rq clock while holding the lock, the kernel may issue a WARN_DOUBLE_CLOCK warning. Since we directly use raw_spin_rq_lock() to acquire rq lock instead of rq_lock(), there is no corresponding change to rq->clock_update_flags. In particular, we have obtained the rq lock of other CPUs, the rq->clock_update_flags of this CPU may be RQCF_UPDATED at this time, and then calling update_rq_clock() will trigger the WARN_DOUBLE_CLOCK warning. So we need to clear RQCF_UPDATED of rq->clock_update_flags to avoid the WARN_DOUBLE_CLOCK warning. For the sched_rt_period_timer() and migrate_task_rq_dl() cases we simply replace raw_spin_rq_lock()/raw_spin_rq_unlock() with rq_lock()/rq_unlock(). For the {pull,push}_{rt,dl}_task() cases, we add the double_rq_clock_clear_update() function to clear RQCF_UPDATED of rq->clock_update_flags, and call double_rq_clock_clear_update() before double_lock_balance()/double_rq_lock() returns to avoid the WARN_DOUBLE_CLOCK warning.

Some call trace reports:

Call Trace 1:
 <IRQ>
 sched_rt_period_timer+0x10f/0x3a0
 ? enqueue_top_rt_rq+0x110/0x110
 __hrtimer_run_queues+0x1a9/0x490
 hrtimer_interrupt+0x10b/0x240
 __sysvec_apic_timer_interrupt+0x8a/0x250
 sysvec_apic_timer_interrupt+0x9a/0xd0
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x12/0x20

Call Trace 2:
 <TASK>
 activate_task+0x8b/0x110
 push_rt_task.part.108+0x241/0x2c0
 push_rt_tasks+0x15/0x30
 finish_task_switch+0xaa/0x2e0
 ? __switch_to+0x134/0x420
 __schedule+0x343/0x8e0
 ? hrtimer_start_range_ns+0x101/0x340
 schedule+0x4e/0xb0
 do_nanosleep+0x8e/0x160
 hrtimer_nanosleep+0x89/0x120
 ? hrtimer_init_sleeper+0x90/0x90
 __x64_sys_nanosleep+0x96/0xd0
 do_syscall_64+0x34/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Call Trace 3:
 <TASK>
 deactivate_task+0x93/0xe0
 pull_rt_task+0x33e/0x400
 balance_rt+0x7e/0x90
 __schedule+0x62f/0x8e0
 do_task_dead+0x3f/0x50
 do_exit+0x7b8/0xbb0
 do_group_exit+0x2d/0x90
 get_signal+0x9df/0x9e0
 ? preempt_count_add+0x56/0xa0
 ? __remove_hrtimer+0x35/0x70
 arch_do_signal_or_restart+0x36/0x720
 ? nanosleep_copyout+0x39/0x50
 ? do_nanosleep+0x131/0x160
 ? audit_filter_inodes+0xf5/0x120
 exit_to_user_mode_prepare+0x10f/0x1e0
 syscall_exit_to_user_mode+0x17/0x30
 do_syscall_64+0x40/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Call Trace 4:
 update_rq_clock+0x128/0x1a0
 migrate_task_rq_dl+0xec/0x310
 set_task_cpu+0x84/0x1e4
 try_to_wake_up+0x1d8/0x5c0
 wake_up_process+0x1c/0x30
 hrtimer_wakeup+0x24/0x3c
 __hrtimer_run_queues+0x114/0x270
 hrtimer_interrupt+0xe8/0x244
 arch_timer_handler_phys+0x30/0x50
 handle_percpu_devid_irq+0x88/0x140
 generic_handle_domain_irq+0x40/0x60
 gic_handle_irq+0x48/0xe0
 call_on_irq_stack+0x2c/0x60
 do_interrupt_handler+0x80/0x84

Steps to reproduce:
1. Enable CONFIG_SCHED_DEBUG when compiling the kernel
2. echo 1 > /sys/kernel/debug/clear_warn_once
   echo "WARN_DOUBLE_CLOCK" > /sys/kernel/debug/sched/features
   echo "NO_RT_PUSH_IPI" > /sys/kernel/debug/sched/features
3. Run some rt/dl tasks that periodically work and sleep, e.g. Create 2*n rt or dl (90% running) tasks via rt-app (on a system with n CPUs), and Dietmar Eggemann reports Call Trace 4 when running on PREEMPT_RT kernel.

Signed-off-by: Hao Jia <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Link: https://lore.kernel.org/r/[email protected]
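A simplified sketch of the helper described above, which clears RQCF_UPDATED on both runqueues before double_lock_balance()/double_rq_lock() return (the real version also has to cope with !CONFIG_SMP, where rq1 == rq2):

    static inline void
    double_rq_clock_clear_update(struct rq *rq1, struct rq *rq2)
    {
    #ifdef CONFIG_SCHED_DEBUG
            /* Keep the REQ/ACT skip bits, drop RQCF_UPDATED so a following
             * update_rq_clock() does not trip WARN_DOUBLE_CLOCK. */
            rq1->clock_update_flags &= (RQCF_REQ_SKIP | RQCF_ACT_SKIP);
            rq2->clock_update_flags &= (RQCF_REQ_SKIP | RQCF_ACT_SKIP);
    #endif
    }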
2022-05-11 | Merge branch 'v5.18-rc5' | Peter Zijlstra | 22 | -759/+667
Obtain the new INTEL_FAM6 stuff required. Signed-off-by: Peter Zijlstra <[email protected]>
2022-05-11 | locking/qrwlock: Change "queue rwlock" to "queued rwlock" | Waiman Long | 1 | -4/+4
Queued rwlock was originally named "queue rwlock" which wasn't quite grammatically correct. However there are still some "queue rwlock" references in the code. Change those to "queued rwlock" for consistency. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-05-10 | bpf, x86: Attach a cookie to fentry/fexit/fmod_ret/lsm. | Kui-Feng Lee | 4 | -6/+47
Pass a cookie along with BPF_LINK_CREATE requests. Add a bpf_cookie field to struct bpf_tracing_link to attach a cookie. The cookie of a bpf_tracing_link is available by calling bpf_get_attach_cookie when running the BPF program of the attached link. The cookie value is stored in the bpf_tramp_run_ctx by the link's trampoline. Signed-off-by: Kui-Feng Lee <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
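On the BPF program side, reading the cookie attached at link-creation time might look like the sketch below (it assumes the loader passed a cookie with the BPF_LINK_CREATE request, e.g. via libbpf's link-create options, and the attach target and program body are illustrative):

    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    SEC("fentry/do_sys_openat2")
    int BPF_PROG(trace_open)
    {
            /* Returns the cookie supplied when this tracing link was
             * created; 0 if none was attached. */
            __u64 cookie = bpf_get_attach_cookie(ctx);

            bpf_printk("openat2 hit, cookie=%llu", cookie);
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";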
2022-05-10 | bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stack | Kui-Feng Lee | 2 | -6/+21
BPF trampolines will create a bpf_tramp_run_ctx, a bpf_run_ctx, on stacks and set/reset the current bpf_run_ctx before/after calling a bpf_prog. Signed-off-by: Kui-Feng Lee <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2022-05-10 | bpf, x86: Generate trampolines from bpf_tramp_links | Kui-Feng Lee | 3 | -69/+98
Replace struct bpf_tramp_progs with struct bpf_tramp_links to collect struct bpf_tramp_link(s) for a trampoline. struct bpf_tramp_link extends bpf_link to act as a linked list node. arch_prepare_bpf_trampoline() accepts a struct bpf_tramp_links to collect all bpf_tramp_link(s) that a trampoline should call. Change BPF trampoline and bpf_struct_ops to pass bpf_tramp_links instead of bpf_tramp_progs. Signed-off-by: Kui-Feng Lee <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2022-05-11 | genirq: Remove WARN_ON_ONCE() in generic_handle_domain_irq() | Lukas Wunner | 1 | -1/+0
Since commit 0953fb263714 ("irq: remove handle_domain_{irq,nmi}()"), generic_handle_domain_irq() warns if called outside hardirq context, even though the function calls down to handle_irq_desc(), which warns about the same, but conditionally on handle_enforce_irqctx(). The newly added warning is a false positive if the interrupt originates from any other irqchip than x86 APIC or ARM GIC/GICv3. Those are the only ones for which handle_enforce_irqctx() returns true. Per commit c16816acd086 ("genirq: Add protection against unsafe usage of generic_handle_irq()"): "In general calling generic_handle_irq() with interrupts disabled from non interrupt context is harmless. For some interrupt controllers like the x86 trainwrecks this is outright dangerous as it might corrupt state if an interrupt affinity change is pending." Examples for interrupt chips where the warning is a false positive are USB-attached GPIO controllers such as drivers/gpio/gpio-dln2.c: USB gadgets are incapable of directly signaling an interrupt because they cannot initiate a bus transaction by themselves. All communication on the bus is initiated by the host controller, which polls a gadget's Interrupt Endpoint in regular intervals. If an interrupt is pending, that information is passed up the stack in softirq context, from which a hardirq is synthesized via generic_handle_domain_irq(). Remove the warning to eliminate such false positives. Fixes: 0953fb263714 ("irq: remove handle_domain_{irq,nmi}()") Signed-off-by: Lukas Wunner <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Jakub Kicinski <[email protected]> CC: Linus Walleij <[email protected]> Cc: Bartosz Golaszewski <[email protected]> Cc: Octavian Purdila <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/c3caf60bfa78e5fdbdf483096b7174da65d1813a.1652168866.git.lukas@wunner.de
2022-05-10 | bpf: Resolve symbols with ftrace_lookup_symbols for kprobe multi link | Jiri Olsa | 1 | -46/+66
Use the ftrace_lookup_symbols function to speed up symbol lookup in kprobe multi link attachment, replacing the current kprobe_multi_resolve_syms function with it. This speeds up bpftrace kprobe attachment:

  # perf stat -r 5 -e cycles ./src/bpftrace -e 'kprobe:x* { } i:ms:1 { exit(); }'
  ...
  6.5681 +- 0.0225 seconds time elapsed ( +- 0.34% )

After:

  # perf stat -r 5 -e cycles ./src/bpftrace -e 'kprobe:x* { } i:ms:1 { exit(); }'
  ...
  0.5661 +- 0.0275 seconds time elapsed ( +- 4.85% )

Acked-by: Andrii Nakryiko <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-05-10 | fprobe: Resolve symbols with ftrace_lookup_symbols | Jiri Olsa | 1 | -20/+12
Using ftrace_lookup_symbols to speed up symbols lookup in register_fprobe_syms API. Acked-by: Masami Hiramatsu <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-05-10 | ftrace: Add ftrace_lookup_symbols function | Jiri Olsa | 2 | -0/+63
Add a ftrace_lookup_symbols function that resolves an array of symbols with a single pass over kallsyms. The user provides an array of string pointers with a count and a pointer to an allocated array for the resolved values.

  int ftrace_lookup_symbols(const char **sorted_syms, size_t cnt, unsigned long *addrs)

It iterates over all kallsyms symbols and tries to look each one up in the provided symbols array with bsearch. The symbols array needs to be sorted by name for this reason. We also check that each symbol passes ftrace_location, because this API will be used for fprobe symbol resolving. Suggested-by: Andrii Nakryiko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
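A usage sketch based on the prototype quoted above (the symbol names are arbitrary examples; the array must be pre-sorted by name and each symbol must pass ftrace_location()):

    static const char *syms[] = {   /* must be sorted by name */
            "ksys_read",
            "ksys_write",
    };
    static unsigned long addrs[ARRAY_SIZE(syms)];

    static int resolve_my_syms(void)
    {
            /* Single pass over kallsyms, bsearch into syms[] for each symbol. */
            return ftrace_lookup_symbols(syms, ARRAY_SIZE(syms), addrs);
    }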
2022-05-10 | kallsyms: Make kallsyms_on_each_symbol generally available | Jiri Olsa | 1 | -2/+0
Making kallsyms_on_each_symbol generally available, so it can be used outside CONFIG_LIVEPATCH option in following changes. Rather than adding another ifdef option let's make the function generally available (when CONFIG_KALLSYMS option is defined). Cc: Christoph Hellwig <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-05-10 | bpf: Add bpf_link iterator | Dmitrii Dolgov | 3 | -1/+127
Implement bpf_link iterator to traverse links via bpf_seq_file operations. The changeset is mostly shamelessly copied from commit a228a64fc1e4 ("bpf: Add bpf_prog iterator") Signed-off-by: Dmitrii Dolgov <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-05-10 | bpf: Extend batch operations for map-in-map bpf-maps | Takshak Chahande | 2 | -2/+13
This patch extends batch operations support to the map-in-map map types BPF_MAP_TYPE_HASH_OF_MAPS and BPF_MAP_TYPE_ARRAY_OF_MAPS. A use case where an outer HASH map holds hundreds of VIP entries, with each VIP's associated reuse-ports stored in a REUSEPORT_SOCKARRAY-type inner map, needs batch operations for performance gains. This patch leverages the existing generic functions for most of the batch operations. As a map-in-map's value contains the actual reference of the inner map, the BPF_MAP_TYPE_HASH_OF_MAPS type needs an extra step to fetch the map_id from the reference value. Selftests are added in the next patch (2/2). Signed-off-by: Takshak Chahande <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2022-05-09 | mm/rmap: drop "compound" parameter from page_add_new_anon_rmap() | David Hildenbrand | 1 | -1/+1
New anonymous pages are always mapped natively: only THP/khugepaged code maps a new compound anonymous page and passes "true". Otherwise, we're just dealing with simple, non-compound pages. Let's give the interface clearer semantics and document these. Remove the PageTransCompound() sanity check from page_add_new_anon_rmap(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Rientjes <[email protected]> Cc: Don Dutile <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Liang Zhang <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Oded Gabbay <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pedro Demarchi Gomes <[email protected]> Cc: Peter Xu <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-09 | bpf: Remove unused parameter from find_kfunc_desc_btf() | Yuntao Wang | 1 | -5/+4
The func_id parameter in find_kfunc_desc_btf() is not used, get rid of it. Fixes: 2357672c54c3 ("bpf: Introduce BPF support for kernel module function calls") Signed-off-by: Yuntao Wang <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Acked-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2022-05-09 | sched: Fix build warning without CONFIG_SYSCTL | YueHaibing | 1 | -29/+36
If CONFIG_SYSCTL is n, the build warns:

  kernel/sched/core.c:1782:12: warning: ‘sysctl_sched_uclamp_handler’ defined but not used [-Wunused-function]
   static int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~

sysctl_sched_uclamp_handler() is only used while CONFIG_SYSCTL is enabled; wrap all related code with CONFIG_SYSCTL to fix this. Fixes: 3267e0156c33 ("sched: Move uclamp_util sysctls to core.c") Signed-off-by: YueHaibing <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]>
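The shape of the fix, here and for the reboot warning in the next entry, is simply keeping the sysctl-only pieces under the existing guard, e.g. (illustrative, with the handler body elided):

    #ifdef CONFIG_SYSCTL
    static int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
                                           void *buffer, size_t *lenp, loff_t *ppos)
    {
            /* ... only reachable through the sysctl table ... */
            return 0;
    }
    /* ... the ctl_table array and its registration live here too ... */
    #endif /* CONFIG_SYSCTL */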
2022-05-09 | reboot: Fix build warning without CONFIG_SYSCTL | YueHaibing | 1 | -27/+27
If CONFIG_SYSCTL is n, the build warns:

  kernel/reboot.c:443:20: error: ‘kernel_reboot_sysctls_init’ defined but not used [-Werror=unused-function]
   static void __init kernel_reboot_sysctls_init(void)
                      ^~~~~~~~~~~~~~~~~~~~~~~~~~

Move kernel_reboot_sysctls_init() into the #ifdef block to fix this. Fixes: 06d177662fb8 ("kernel/reboot: move reboot sysctls to its own file") Signed-off-by: YueHaibing <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]>
2022-05-09 | mm,fs: Remove aops->readpage | Matthew Wilcox (Oracle) | 1 | -3/+2
With all implementations of aops->readpage converted to aops->read_folio, we can stop checking whether it's set and remove the member from aops. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
2022-05-09 | fs: Introduce aops->read_folio | Matthew Wilcox (Oracle) | 1 | -2/+4
Change all the callers of ->readpage to call ->read_folio in preference, if it exists. This is a transitional duplication, and will be removed by the end of the series. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
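The transitional call sites described above take roughly this shape (a sketch; the wrapper name and argument names are made up for illustration):

    static int do_read(struct address_space *mapping, struct file *file,
                       struct folio *folio)
    {
            /* Prefer the new aops->read_folio() when the filesystem provides
             * it; fall back to the legacy aops->readpage() during the
             * transition, before readpage is removed. */
            if (mapping->a_ops->read_folio)
                    return mapping->a_ops->read_folio(file, folio);
            return mapping->a_ops->readpage(file, &folio->page);
    }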
2022-05-09 | entry: Rename arch_check_user_regs() to arch_enter_from_user_mode() | Sven Schnelle | 1 | -1/+1
When entering the kernel from userspace, arch_check_user_regs() is used to verify that struct pt_regs contains valid values. Note that the NMI codepath doesn't call this function. s390 needs a place in the generic entry code to modify a cpu data structure when switching from userspace to kernel mode. As arch_check_user_regs() is exactly this, rename it to arch_enter_from_user_mode(). Signed-off-by: Sven Schnelle <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Heiko Carstens <[email protected]>
2022-05-08 | Merge tag 'timers-urgent-2022-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -2/+2
Pull timer fix from Thomas Gleixner:
 "A fix and an email address update:
  - Mark the NMI safe time accessors notrace to prevent tracer recursion when they are selected as trace clocks.
  - John Stultz has a new email address"
* tag 'timers-urgent-2022-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Mark NMI safe time accessors as notrace
  MAINTAINERS: Update email address for John Stultz
2022-05-08 | Merge tag 'irq-urgent-2022-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 3 | -10/+33
Pull irq fix from Thomas Gleixner:
 "A fix for the threaded interrupt core. A quick sequence of request/free_irq() can result in a hang because the interrupt thread did not reach the thread function and got stopped in the kthread core already. That leaves a state active counter around which makes an invocation of synchronize_irq() on that interrupt hang forever.
  Ensure that the thread reached the thread function in request_irq() to prevent that"
* tag 'irq-urgent-2022-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Synchronize interrupt thread startup
2022-05-08 | stackleak: add on/off stack variants | Mark Rutland | 1 | -3/+32
The stackleak_erase() code dynamically handles being on a task stack or another stack. In most cases, this is a fixed property of the caller, which the caller is aware of, as an architecture might always return using the task stack, or might always return using a trampoline stack. This patch adds stackleak_erase_on_task_stack() and stackleak_erase_off_task_stack() functions which callers can use to avoid the on_thread_stack() check and the associated redundant work when the calling stack is known. The existing stackleak_erase() is retained as a safe default. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> Cc: Alexander Popov <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Kees Cook <[email protected]> Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
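Callers that know which stack they return on could use the specialized variants roughly as below. This is only a sketch: the wrapper function names are hypothetical, the variants are assumed to be declared via linux/stackleak.h, and in practice they are invoked from an architecture's exit path rather than from ad-hoc C wrappers.

    #include <linux/stackleak.h>

    /* Exit path known to run on the task stack (e.g. a direct return). */
    static void my_exit_on_task_stack(void)        /* hypothetical caller */
    {
            stackleak_erase_on_task_stack();
    }

    /* Exit path known to run on a trampoline/IRQ stack. */
    static void my_exit_off_task_stack(void)       /* hypothetical caller */
    {
            stackleak_erase_off_task_stack();
    }

    /* Unsure? Keep the generic form, which checks on_thread_stack() itself:
     *     stackleak_erase();
     */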
2022-05-08 | stackleak: rework poison scanning | Mark Rutland | 1 | -14/+4
Currently we over-estimate the region of stack which must be erased. To determine the region to be erased, we scan downwards for a contiguous block of poison values (or the low bound of the stack). There are a few minor problems with this today:

* When we find a block of poison values, we include this block within the region to erase. As this is included within the region to erase, this causes us to redundantly overwrite 'STACKLEAK_SEARCH_DEPTH' (128) bytes with poison.
* As the loop condition checks 'poison_count <= depth', it will run an additional iteration after finding the contiguous block of poison, decrementing 'erase_low' once more than necessary. As this is included within the region to erase, this causes us to redundantly overwrite an additional unsigned long with poison.
* As we always decrement 'erase_low' after checking an element on the stack, we always include the element below this within the region to erase. As this is included within the region to erase, this causes us to redundantly overwrite an additional unsigned long with poison.

Note that this is not a functional problem. As the loop condition checks 'erase_low > task_stack_low', we'll never clobber the STACK_END_MAGIC. As we always decrement 'erase_low' after this, we'll never fail to erase the element immediately above the STACK_END_MAGIC. In total, this can cause us to erase `128 + 2 * sizeof(unsigned long)` bytes more than necessary, which is unfortunate. This patch reworks the logic to find the address immediately above the poisoned region, by finding the lowest non-poisoned address. This is factored into a stackleak_find_top_of_poison() helper both for clarity and so that this can be shared with the LKDTM test in subsequent patches. Signed-off-by: Mark Rutland <[email protected]> Cc: Alexander Popov <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Kees Cook <[email protected]> Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-05-08 | stackleak: rework stack high bound handling | Mark Rutland | 1 | -5/+14
Prior to returning to userspace, we reset current->lowest_stack to a reasonable high bound. Currently we do this by subtracting the arbitrary value `THREAD_SIZE/64` from the top of the stack, for reasons lost to history. Looking at configurations today:

* On i386 where THREAD_SIZE is 8K, the bound will be 128 bytes. The pt_regs at the top of the stack is 68 bytes (with 0 to 16 bytes of padding above), and so this covers an additional portion of 44 to 60 bytes.
* On x86_64 where THREAD_SIZE is at least 16K (up to 32K with KASAN) the bound will be at least 256 bytes (up to 512 with KASAN). The pt_regs at the top of the stack is 168 bytes, and so this covers an additional 88 bytes of stack (up to 344 with KASAN).
* On arm64 where THREAD_SIZE is at least 16K (up to 64K with 64K pages and VMAP_STACK), the bound will be at least 256 bytes (up to 1024 with KASAN). The pt_regs at the top of the stack is 336 bytes, so this can fall within the pt_regs, or can cover an additional 688 bytes of stack.

Clearly the `THREAD_SIZE/64` value doesn't make much sense -- in the worst case, this will cause more than 600 bytes of stack to be erased for every syscall, even if actual stack usage were substantially smaller. This patch makes this slightly less nonsensical by consistently resetting current->lowest_stack to the base of the task pt_regs. For clarity and for consistency with the handling of the low bound, the generation of the high bound is split into a helper with commentary explaining why. Since the pt_regs at the top of the stack will be clobbered upon the next exception entry, we don't need to poison these at exception exit. By using task_pt_regs() as the high stack boundary instead of current_top_of_stack() we avoid some redundant poisoning, and the compiler can share the address generation between the poisoning and resetting of `current->lowest_stack`, making the generated code more optimal. It's not clear to me whether the existing `THREAD_SIZE/64` offset was a dodgy heuristic to skip the pt_regs, or whether it was attempting to minimize the number of times stackleak_check_stack() would have to update `current->lowest_stack` when stack usage was shallow at the cost of unconditionally poisoning a small portion of the stack for every exit to userspace. For now I've simply removed the offset, and if we need/want to minimize updates for shallow stack usage it should be easy to add a better heuristic atop, with appropriate commentary so we know what's going on. Signed-off-by: Mark Rutland <[email protected]> Cc: Alexander Popov <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Kees Cook <[email protected]> Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-05-08 | stackleak: clarify variable names | Mark Rutland | 1 | -16/+14
The logic within __stackleak_erase() can be a little hard to follow, as `boundary` switches from being the low bound to the high bound mid way through the function, and `kstack_ptr` is used to represent the start of the region to erase while `boundary` represents the end of the region to erase. Make this a little clearer by consistently using clearer variable names. The `boundary` variable is removed, the bounds of the region to erase are described by `erase_low` and `erase_high`, and bounds of the task stack are described by `task_stack_low` and `task_stack_high`. As the same time, remove the comment above the variables, since it is unclear whether it's intended as rationale, a complaint, or a TODO, and is more confusing than helpful. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> Cc: Alexander Popov <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Kees Cook <[email protected]> Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-05-08 | stackleak: rework stack low bound handling | Mark Rutland | 1 | -10/+4
In stackleak_task_init(), stackleak_track_stack(), and __stackleak_erase(), we open-code skipping the STACK_END_MAGIC at the bottom of the stack. Each case is implemented slightly differently, and only the __stackleak_erase() case is commented. In stackleak_task_init() and stackleak_track_stack() we unconditionally add sizeof(unsigned long) to the lowest stack address. In stackleak_task_init() we use end_of_stack() for this, and in stackleak_track_stack() we use task_stack_page(). In __stackleak_erase() we handle this by detecting if `kstack_ptr` has hit the stack end boundary, and if so, conditionally moving it above the magic. This patch adds a new stackleak_task_low_bound() helper which is used in all three cases, which unconditionally adds sizeof(unsigned long) to the lowest address on the task stack, with commentary as to why. This uses end_of_stack() as stackleak_task_init() did prior to this patch, as this is consistent with the code in kernel/fork.c which initializes the STACK_END_MAGIC value. In __stackleak_erase() we no longer need to check whether we've spilled into the STACK_END_MAGIC value, as stackleak_track_stack() ensures that `current->lowest_stack` stops immediately above this, and similarly the poison scan will stop immediately above this. For stackleak_task_init() and stackleak_track_stack() this results in no change to code generation. For __stackleak_erase() the generated assembly is slightly simpler and shorter. Signed-off-by: Mark Rutland <[email protected]> Cc: Alexander Popov <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Kees Cook <[email protected]> Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
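From the description above, the shared helper is essentially the following sketch:

    static __always_inline unsigned long
    stackleak_task_low_bound(const struct task_struct *tsk)
    {
            /*
             * The lowest unsigned long on the task stack holds
             * STACK_END_MAGIC (written in kernel/fork.c) and must not be
             * erased or tracked.
             */
            return (unsigned long)end_of_stack(tsk) + sizeof(unsigned long);
    }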
2022-05-08 | stackleak: remove redundant check | Mark Rutland | 1 | -4/+0
In __stackleak_erase() we check that the `erase_low` value derived from `current->lowest_stack` is above the lowest legitimate stack pointer value, but this is already enforced by stackleak_track_stack() when recording the lowest stack value. Remove the redundant check. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> Cc: Alexander Popov <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Kees Cook <[email protected]> Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]