path: root/kernel
Age        Commit message        Author        Files        Lines
2021-05-12ptrace: make ptrace() fail if the tracee changed its pid unexpectedlyOleg Nesterov1-1/+17
Suppose we have 2 threads, the group-leader L and a sub-thread T, both parked in ptrace_stop(). Debugger tries to resume both threads and does

	ptrace(PTRACE_CONT, T);
	ptrace(PTRACE_CONT, L);

If the sub-thread T execs in between, the 2nd PTRACE_CONT does not resume the old leader L, it resumes the post-exec thread T which was actually now stopped in PTRACE_EVENT_EXEC. In this case the PTRACE_EVENT_EXEC event is lost, and the tracer can't know that the tracee changed its pid.

This patch makes ptrace() fail in this case until debugger does wait() and consumes PTRACE_EVENT_EXEC which reports old_pid. This affects all ptrace requests except the "asynchronous" PTRACE_INTERRUPT/KILL.

The patch doesn't add the new PTRACE_ option to not complicate the API, and I _hope_ this won't cause any noticeable regression:

	- If debugger uses PTRACE_O_TRACEEXEC and the thread did an exec
	  and the tracer does a ptrace request without having consumed
	  the exec event, it's 100% sure that the thread the ptracer
	  thinks it is targeting does not exist anymore, or isn't the
	  same as the one it thinks it is targeting.

	- To some degree this patch adds nothing new. In the scenario
	  above ptrace(L) can fail with -ESRCH if it is called after the
	  execing sub-thread wakes the leader up and before it "steals"
	  the leader's pid.

Test-case:

	#include <stdio.h>
	#include <unistd.h>
	#include <signal.h>
	#include <sys/ptrace.h>
	#include <sys/wait.h>
	#include <errno.h>
	#include <pthread.h>
	#include <assert.h>

	void *tf(void *arg)
	{
		execve("/usr/bin/true", NULL, NULL);
		assert(0);
		return NULL;
	}

	int main(void)
	{
		int leader = fork();
		if (!leader) {
			kill(getpid(), SIGSTOP);
			pthread_t th;
			pthread_create(&th, NULL, tf, NULL);
			for (;;)
				pause();
			return 0;
		}

		waitpid(leader, NULL, WSTOPPED);

		ptrace(PTRACE_SEIZE, leader, 0,
				PTRACE_O_TRACECLONE | PTRACE_O_TRACEEXEC);
		waitpid(leader, NULL, 0);

		ptrace(PTRACE_CONT, leader, 0,0);
		waitpid(leader, NULL, 0);

		int status, thread = waitpid(-1, &status, 0);
		assert(thread > 0 && thread != leader);
		assert(status == 0x80137f);

		ptrace(PTRACE_CONT, thread, 0,0);
		/*
		 * waitid() because waitpid(leader, &status, WNOWAIT) does not
		 * report status. Why ????
		 *
		 * Why WEXITED? because we have another kernel problem connected
		 * to mt-exec.
		 */
		siginfo_t info;
		assert(waitid(P_PID, leader, &info, WSTOPPED|WEXITED|WNOWAIT) == 0);
		assert(info.si_pid == leader && info.si_status == 0x0405);

		/* OK, it sleeps in ptrace(PTRACE_EVENT_EXEC == 0x04) */
		assert(ptrace(PTRACE_CONT, leader, 0,0) == -1);
		assert(errno == ESRCH);

		assert(leader == waitpid(leader, &status, WNOHANG));
		assert(status == 0x04057f);

		assert(ptrace(PTRACE_CONT, leader, 0,0) == 0);

		return 0;
	}

Signed-off-by: Oleg Nesterov <[email protected]>
Reported-by: Simon Marchi <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Acked-by: Pedro Alves <[email protected]>
Acked-by: Simon Marchi <[email protected]>
Acked-by: Jan Kratochvil <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2021-05-12jump_label: Free jump_entry::key bit1 for build usePeter Zijlstra1-4/+6
Have jump_label_init() set jump_entry::key bit1 to either 0 or 1 unconditionally. This makes it available for build-time games. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-12jump_label, x86: Introduce jump_entry_size()Peter Zijlstra1-1/+1
This allows architectures to have variable sized jumps. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-12sched/core: Initialize the idle task with preemption disabledValentin Schneider3-3/+2
As pointed out by commit

  de9b8f5dcbd9 ("sched: Fix crash trying to dequeue/enqueue the idle thread")

init_idle() can and will be invoked more than once on the same idle task. At boot time, it is invoked for the boot CPU thread by sched_init(). Then smp_init() creates the threads for all the secondary CPUs and invokes init_idle() on them. As the hotplug machinery brings the secondaries to life, it will issue calls to idle_thread_get(), which itself invokes init_idle() yet again. In this case it's invoked twice more per secondary: at _cpu_up(), and at bringup_cpu().

Given smp_init() already initializes the idle tasks for all *possible* CPUs, no further initialization should be required. Now, removing init_idle() from idle_thread_get() exposes some interesting expectations with regards to the idle task's preempt_count: the secondary startup always issues a preempt_disable(), requiring some reset of the preempt count to 0 between hot-unplug and hotplug, which is currently served by idle_thread_get() -> init_idle().

Given the idle task is supposed to have preemption disabled once and never see it re-enabled, it seems that what we actually want is to initialize its preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove init_idle() from idle_thread_get().

Secondary startups were patched via coccinelle:

  @begone@
  @@

  -preempt_disable();
  ...
  cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);

Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2021-05-12sched: prctl() core-scheduling interfaceChris Hyser2-0/+119
This patch provides support for setting and copying core scheduling 'task cookies' between threads (PID), processes (TGID), and process groups (PGID).

The value of core scheduling isn't that tasks don't share a core, 'nosmt' can do that. The value lies in exploiting all the sharing opportunities that exist to recover possible lost performance and that requires a degree of flexibility in the API.

From a security perspective (and there are others), the thread, process and process group distinction is an existent hierarchical categorization of tasks that reflects many of the security concerns about 'data sharing'. For example, protecting against cache-snooping by a thread that can just read the memory directly isn't all that useful.

With this in mind, subcommands to CREATE/SHARE (TO/FROM) provide a mechanism to create and share cookies. CREATE/SHARE_TO specify a target pid with enum pidtype used to specify the scope of the targeted tasks. For example, PIDTYPE_TGID will share the cookie with the process and all of its threads as typically desired in a security scenario.

API:

  prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tgtpid, pidtype, &cookie)
  prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, NULL)
  prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, NULL)
  prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, srcpid, pidtype, NULL)

where 'tgtpid/srcpid == 0' implies the current process and pidtype is kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.

For return values, EINVAL, ENOMEM are what they say. ESRCH means the tgtpid/srcpid was not found. EPERM indicates lack of PTRACE permission access to tgtpid/srcpid. ENODEV indicates your machine lacks SMT.

[peterz: complete rewrite]
Signed-off-by: Chris Hyser <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Don Hiatt <[email protected]>
Tested-by: Hongyu Ning <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
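As a quick illustration of the API above, here is a minimal user-space sketch (not taken from the patch itself). The fallback numeric defines are assumptions for headers that predate the feature; on a kernel that has it, the values come from <linux/prctl.h>, and PR_SCHED_CORE_GET is expected to be queried with PIDTYPE_PID scope:

	#include <stdio.h>
	#include <sys/prctl.h>

	#ifndef PR_SCHED_CORE
	#define PR_SCHED_CORE		62	/* assumed value, see <linux/prctl.h> */
	#define PR_SCHED_CORE_GET	0
	#define PR_SCHED_CORE_CREATE	1
	#define PR_SCHED_CORE_SHARE_TO	2
	#define PR_SCHED_CORE_SHARE_FROM 3
	#endif

	#define PIDTYPE_PID	0	/* single thread */
	#define PIDTYPE_TGID	1	/* whole process */

	int main(void)
	{
		unsigned long cookie = 0;

		/* Create a fresh cookie covering this whole process (TGID scope). */
		if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, PIDTYPE_TGID, 0))
			perror("PR_SCHED_CORE_CREATE");

		/* Read the calling thread's cookie back. */
		if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0, PIDTYPE_PID, &cookie))
			perror("PR_SCHED_CORE_GET");

		printf("core cookie: %#lx\n", cookie);
		return 0;
	}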
2021-05-12sched: Inherit task cookie on fork()Peter Zijlstra2-0/+9
Note that sched_core_fork() is called from under tasklist_lock, and not from sched_fork() earlier. This avoids a few races later. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Don Hiatt <[email protected]> Tested-by: Hongyu Ning <[email protected]> Tested-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched: Trivial core scheduling cookie managementPeter Zijlstra5-3/+131
In order to not have to use pid_struct, create a new, smaller, structure to manage task cookies for core scheduling. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Don Hiatt <[email protected]> Tested-by: Hongyu Ning <[email protected]> Tested-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched: Migration changes for core schedulingAubrey Li2-6/+96
- Don't migrate if there is a cookie mismatch

     Load balance tries to move task from busiest CPU to the destination
     CPU. When core scheduling is enabled, if the task's cookie does not
     match with the destination CPU's core cookie, this task may be
     skipped by this CPU. This mitigates the forced idle time on the
     destination CPU.

- Select cookie matched idle CPU

     In the fast path of task wakeup, select the first cookie matched
     idle CPU instead of the first idle CPU.

- Find cookie matched idlest CPU

     In the slow path of task wakeup, find the idlest CPU whose core
     cookie matches with task's cookie

Signed-off-by: Aubrey Li <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Don Hiatt <[email protected]>
Tested-by: Hongyu Ning <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
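The cookie check referred to above boils down to comparing the task's cookie with the cookie currently assigned to the destination core. A self-contained model of that test (field and function names are illustrative, not the kernel's):

	#include <stdbool.h>
	#include <stdint.h>

	struct core_rq {
		bool		core_sched_enabled;
		uint64_t	core_cookie;	/* cookie of the group owning the core */
	};

	struct task {
		uint64_t	core_cookie;
	};

	/* A CPU is a usable wakeup/migration target only if core scheduling
	 * is off, or the task's cookie matches the core's cookie. */
	static bool cpu_cookie_match(const struct core_rq *core, const struct task *p)
	{
		if (!core->core_sched_enabled)
			return true;
		return core->core_cookie == p->core_cookie;
	}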
2021-05-12sched: Trivial forced-newidle balancerPeter Zijlstra3-1/+136
When a sibling is forced-idle to match the core-cookie; search for matching tasks to fill the core. rcu_read_unlock() can incur an infrequent deadlock in sched_core_balance(). Fix this by using the RCU-sched flavor instead. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Don Hiatt <[email protected]> Tested-by: Hongyu Ning <[email protected]> Tested-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched/fair: Snapshot the min_vruntime of CPUs on force idleJoel Fernandes (Google)3-25/+117
During force-idle, we end up doing cross-cpu comparison of vruntimes during pick_next_task. If we simply compare (vruntime-min_vruntime) across CPUs, and if the CPUs only have 1 task each, we will always end up comparing 0 with 0 and pick just one of the tasks all the time. This starves the task that was not picked. To fix this, take a snapshot of the min_vruntime when entering force idle and use it for comparison. This min_vruntime snapshot will only be used for cross-CPU vruntime comparison, and nothing else.

A note about the min_vruntime snapshot and force idling. During selection:

  When we're not fi, we need to update snapshot.
  When we're fi and we were not fi, we must update snapshot.
  When we're fi and we were already fi, we must not update snapshot.

Which gives:

  fib  fi  update
   0    0     1
   0    1     1
   1    0     1
   1    1     0

Where:

  fi:  force-idled now
  fib: force-idled before

So the min_vruntime snapshot needs to be updated when: !(fib && fi).

Also, the cfs_prio_less() function needs to be aware of whether the core is in force idle or not, since it will use this information to know whether to advance a cfs_rq's min_vruntime_fi in the hierarchy. So pass this information along via pick_task() -> prio_less().

Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Don Hiatt <[email protected]>
Tested-by: Hongyu Ning <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched: Fix priority inversion of cookied task with siblingJoel Fernandes (Google)1-39/+26
The rationale is as follows. In the core-wide pick logic, even if need_sync == false, we need to go look at other CPUs (non-local CPUs) to see if they could be running RT.

Say the RQs in a particular core look like this. Let CFS1 and CFS2 be 2 tagged CFS tasks and RT1 be an untagged RT task.

	rq0            rq1
	CFS1 (tagged)  RT1 (no tag)
	CFS2 (tagged)

Say schedule() runs on rq0. Now, it will enter the above loop and pick_task(RT) will return NULL for 'p'. It will enter the above if() block and see that need_sync == false and will skip RT entirely.

The end result of the selection will be (say prio(CFS1) > prio(CFS2)):

	rq0   rq1
	CFS1  IDLE

When it should have selected:

	rq0   rq1
	IDLE  RT

Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Don Hiatt <[email protected]>
Tested-by: Hongyu Ning <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched/fair: Fix forced idle sibling starvation corner caseVineeth Pillai3-8/+49
If there is only one long running local task and the sibling is forced idle, it might not get a chance to run until a schedule event happens on any cpu in the core. So we check for this condition during a tick to see if a sibling is starved and then give it a chance to schedule. Signed-off-by: Vineeth Pillai <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Don Hiatt <[email protected]> Tested-by: Hongyu Ning <[email protected]> Tested-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched: Add core wide task selection and schedulingPeter Zijlstra2-2/+305
Instead of only selecting a local task, select a task for all SMT siblings for every reschedule on the core (irrespective which logical CPU does the reschedule). Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Don Hiatt <[email protected]> Tested-by: Hongyu Ning <[email protected]> Tested-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched: Basic tracking of matching tasksPeter Zijlstra3-50/+203
Introduce task_struct::core_cookie as an opaque identifier for core scheduling. When enabled, core scheduling will only allow matching tasks to be on the core; where idle matches everything.

When task_struct::core_cookie is set (and core scheduling is enabled) these tasks are indexed in a second RB-tree, first on cookie value then on scheduling function, such that matching task selection always finds the most eligible match.

NOTE: *shudder* at the overhead...

NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative is per class tracking of cookies and that just duplicates a lot of stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).

[Joel: folded fixes]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Don Hiatt <[email protected]>
Tested-by: Hongyu Ning <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
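Ordering in that second RB-tree can be pictured as a two-level comparison: cookie first, then the usual scheduling preference. A simplified, self-contained model (the kernel's actual comparator and its prio_less() logic differ in detail):

	#include <stdbool.h>
	#include <stdint.h>

	struct task {
		uint64_t	core_cookie;	/* opaque core-scheduling identifier */
		int		prio;		/* stand-in for the class-specific comparison */
	};

	/* Sort by cookie value first so all tasks sharing a cookie are
	 * adjacent; within a cookie, fall back to scheduling preference
	 * (the real code compares across scheduling classes). */
	static bool core_tree_less(const struct task *a, const struct task *b)
	{
		if (a->core_cookie != b->core_cookie)
			return a->core_cookie < b->core_cookie;
		return a->prio < b->prio;
	}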
2021-05-12sched: Introduce sched_class::pick_task()Peter Zijlstra6-9/+87
Because sched_class::pick_next_task() also implies sched_class::set_next_task() (and possibly put_prev_task() and newidle_balance) it is not state invariant. This makes it unsuitable for remote task selection. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> [Vineeth: folded fixes] Signed-off-by: Vineeth Remanan Pillai <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Don Hiatt <[email protected]> Tested-by: Hongyu Ning <[email protected]> Tested-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched: Allow sched_core_put() from atomic contextPeter Zijlstra1-6/+27
Stuff the meat of sched_core_put() into a work such that we can use sched_core_put() from atomic context. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Don Hiatt <[email protected]> Tested-by: Hongyu Ning <[email protected]> Tested-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched: Optimize rq_lockp() usagePeter Zijlstra4-19/+36
rq_lockp() includes a static_branch(), which is asm-goto, which is asm volatile which defeats regular CSE. This means that:

	if (!static_branch(&foo))
		return simple;

	if (static_branch(&foo) && cond)
		return complex;

Doesn't fold and we get horrible code. Introduce __rq_lockp() without the static_branch() on.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Don Hiatt <[email protected]>
Tested-by: Hongyu Ning <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched: Core-wide rq->lockPeter Zijlstra3-4/+224
Introduce the basic infrastructure to have a core wide rq->lock. This relies on the rq->__lock order being in increasing CPU number (inside a core). It is also constrained to SMT8 per lockdep (and SMT256 per preempt_count). Luckily SMT8 is the max supported SMT count for Linux (Mips, Sparc and Power are known to have this). Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Don Hiatt <[email protected]> Tested-by: Hongyu Ning <[email protected]> Tested-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched: Prepare for Core-wide rq->lockPeter Zijlstra2-33/+63
When switching on core-sched, CPUs need to agree which lock to use for their RQ. The new rule will be that rq->core_enabled will be toggled while holding all rq->__locks that belong to a core. This means we need to double check the rq->core_enabled value after each lock acquire and retry if it changed. This also has implications for those sites that take multiple RQ locks, they need to be careful that the second lock doesn't end up being the first lock. Verify the lock pointer after acquiring the first lock, because if they're on the same core, holding any of the rq->__lock instances will pin the core state. While there, change the rq->__lock order to CPU number, instead of rq address, this greatly simplifies the next patch. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Don Hiatt <[email protected]> Tested-by: Hongyu Ning <[email protected]> Tested-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/YJUNY0dmrJMD/[email protected]
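The "verify the lock pointer after acquiring the first lock" rule described above is the classic acquire/re-check/retry pattern. A self-contained illustration using pthread mutexes in place of rq->__lock (names and types here are illustrative only, not the kernel's):

	#include <pthread.h>
	#include <stdatomic.h>

	struct rq {
		pthread_mutex_t			__lock;
		_Atomic(pthread_mutex_t *)	lock_ptr;	/* may be switched to a core-wide lock */
	};

	static void rq_lock(struct rq *rq)
	{
		for (;;) {
			pthread_mutex_t *lock = atomic_load(&rq->lock_ptr);

			pthread_mutex_lock(lock);
			/* Still the right lock?  Holding it pins the choice. */
			if (lock == atomic_load(&rq->lock_ptr))
				return;
			/* The rq switched locks while we waited; drop and retry. */
			pthread_mutex_unlock(lock);
		}
	}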
2021-05-12sched: Wrap rq::lock accessPeter Zijlstra10-138/+136
In preparation of playing games with rq->lock, abstract the thing using an accessor. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Don Hiatt <[email protected]> Tested-by: Hongyu Ning <[email protected]> Tested-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched: Provide raw_spin_rq_*lock*() helpersPeter Zijlstra2-0/+65
In preparation for playing games with rq->lock, add some rq_lock wrappers. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Don Hiatt <[email protected]> Tested-by: Hongyu Ning <[email protected]> Tested-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched/fair: Add a few assertionsPeter Zijlstra1-2/+9
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Don Hiatt <[email protected]> Tested-by: Hongyu Ning <[email protected]> Tested-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12delayacct: Add sysctl to enable at runtimePeter Zijlstra2-2/+46
Just like sched_schedstats, allow runtime enabling (and disabling) of delayacct. This is useful if one forgot to add the delayacct boot time option. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
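Enabling it at runtime then amounts to writing the new sysctl. A tiny sketch; the path /proc/sys/kernel/task_delayacct is assumed here, so check the kernel's sysctl documentation for the exact knob name:

	#include <stdio.h>

	int main(void)
	{
		/* Assumed path for the runtime knob added by this patch. */
		FILE *f = fopen("/proc/sys/kernel/task_delayacct", "w");

		if (!f) {
			perror("task_delayacct");
			return 1;
		}
		fputs("1\n", f);	/* 1 = enable delay accounting, 0 = disable */
		fclose(f);
		return 0;
	}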
2021-05-12delayacct: Default disabledPeter Zijlstra1-8/+11
Assuming this stuff isn't actually used much; disable it by default and avoid allocating and tracking the task_delay_info structure. taskstats is changed to still report the regular sched and sched_info and only skip the missing task_delay_info fields instead of not reporting anything. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: Ingo Molnar <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12delayacct: Add static_branch in scheduler hooksPeter Zijlstra1-0/+3
Cheaper when delayacct is disabled. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: Ingo Molnar <[email protected]> Acked-by: Balbir Singh <[email protected]> Acked-by: Johannes Weiner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched: Simplify sched_info_on()Peter Zijlstra2-28/+10
The situation around sched_info is somewhat complicated, it is used by sched_stats and delayacct and, indirectly, kvm. If SCHEDSTATS=Y (but disabled by default) sched_info_on() is unconditionally true -- this is the case for all distro kernel configs I checked. If for some reason SCHEDSTATS=N, but TASK_DELAY_ACCT=Y, then sched_info_on() can return false when delayacct is disabled, presumably because there would be no other users left; except kvm is. Instead of complicating matters further by accurately accounting sched_stat and kvm state, simply unconditionally enable when SCHED_INFO=Y, matching the common distro case. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: Ingo Molnar <[email protected]> Acked-by: Johannes Weiner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched: Rename sched_info_{queued,dequeued}Peter Zijlstra2-12/+12
For consistency, rename {queued,dequeued} to {enqueue,dequeue}. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: Ingo Molnar <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Balbir Singh <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12delayacct: Use sched_clock()Peter Zijlstra1-12/+10
Like all scheduler statistics, use sched_clock() based time. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: Ingo Molnar <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Balbir Singh <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched/fair: Fix negative energy delta in find_energy_efficient_cpu()Pierre Gondois1-12/+15
find_energy_efficient_cpu() (feec()) searches the best energy CPU to place a task on. To do so, compute_energy() estimates the energy impact of placing the task on a CPU, based on CPU and task utilization signals. Utilization signals can be concurrently updated while evaluating a performance domain (pd). In some cases, this leads to having a 'negative delta', i.e. placing the task in the pd is seen as an energy gain. Thus, any further energy comparison is biased.

In case of a 'negative delta', return prev_cpu since:

1. a 'negative delta' happens in less than 0.5% of feec() calls, on a Juno with 6 CPUs (4 little, 2 big)

2. it is unlikely to have two consecutive 'negative delta' for a task, so if the first call fails, feec() will correctly place the task in the next feec() call

3. EAS current behavior tends to select prev_cpu if the task doesn't raise the OPP of its current pd. prev_cpu is EAS's generic decision

4. prev_cpu should be preferred to returning an error code. In the latter case, select_idle_sibling() would do the placement, selecting a big (and not energy efficient) CPU. As 3., the task would potentially reside on the big CPU for a long time

Reported-by: Xuewen Yan <[email protected]>
Suggested-by: Xuewen Yan <[email protected]>
Signed-off-by: Pierre Gondois <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Lukasz Luba <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Reviewed-by: Vincent Donnefort <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched/fair: Only compute base_energy_pd if necessaryPierre Gondois1-17/+24
find_energy_efficient_cpu() searches the best energy CPU to place a task on. To do so, the energy of each performance domain (pd) is computed w/ and w/o the task placed on it. The energy of a pd w/o the task (base_energy_pd) is computed prior to knowing whether a CPU is available in the pd. Move the base_energy_pd computation after looping through the CPUs of a pd and only compute it if at least one CPU is available. Suggested-by: Xuewen Yan <[email protected]> Signed-off-by: Pierre Gondois <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Lukasz Luba <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Reviewed-by: Vincent Donnefort <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched,fair: Skip newidle_balance if a wakeup is pendingRik van Riel1-1/+10
The try_to_wake_up function has an optimization where it can queue a task for wakeup on its previous CPU, if the task is still in the middle of going to sleep inside schedule(). Once schedule() re-enables IRQs, the task will be woken up with an IPI, and placed back on the runqueue. If we have such a wakeup pending, there is no need to search other CPUs for runnable tasks. Just skip (or bail out early from) newidle balancing, and run the just woken up task. For a memcache like workload test, this reduces total CPU use by about 2%, proportionally split between user and system time, and p99 and p95 application response time by 10% on average. The schedstats run_delay number shows a similar improvement. Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Acked-by: Mel Gorman <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-12sched/core: Remove the pointless BUG_ON(!task) from wake_up_q()Oleg Nesterov1-1/+0
container_of() can never return NULL - so don't check for it pointlessly. [ mingo: Twiddled the changelog. ] Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-12sched/fair: Fix clearing of has_idle_cores flag in select_idle_cpu()Gautham R. Shenoy1-1/+1
In commit: 9fe1f127b913 ("sched/fair: Merge select_idle_core/cpu()") in select_idle_cpu(), we check if an idle core is present in the LLC of the target CPU via the flag "has_idle_cores". We look for the idle core in select_idle_cores(). If select_idle_cores() isn't able to find an idle core/CPU, we need to unset the has_idle_cores flag in the LLC of the target to prevent other CPUs from going down this route. However, the current code is unsetting it in the LLC of the current CPU instead of the target CPU. This patch fixes this issue. Fixes: 9fe1f127b913 ("sched/fair: Merge select_idle_core/cpu()") Signed-off-by: Gautham R. Shenoy <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Reviewed-by: Srikar Dronamraju <[email protected]> Acked-by: Mel Gorman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller7-45/+174
Daniel Borkmann says:

====================
pull-request: bpf 2021-05-11

The following pull-request contains BPF updates for your *net* tree.

We've added 13 non-merge commits during the last 8 day(s) which contain a total of 21 files changed, 817 insertions(+), 382 deletions(-).

The main changes are:

1) Fix multiple ringbuf bugs in particular to prevent writable mmap of read-only pages, from Andrii Nakryiko & Thadeu Lima de Souza Cascardo.

2) Fix verifier alu32 known-const subregister bound tracking for bitwise operations and/or/xor, from Daniel Borkmann.

3) Reject trampoline attachment for functions with variable arguments, and also add a deny list of other forbidden functions, from Jiri Olsa.

4) Fix nested bpf_bprintf_prepare() calls used by various helpers by switching to per-CPU buffers, from Florent Revest.

5) Fix kernel compilation with BTF debug info on ppc64 due to pahole missing TCP-CC functions like cubictcp_init, from Martin KaFai Lau.

6) Add a kconfig entry to provide an option to disallow unprivileged BPF by default, from Daniel Borkmann.

7) Fix libbpf compilation for older libelf when GELF_ST_VISIBILITY() macro is not available, from Arnaldo Carvalho de Melo.

8) Migrate test_tc_redirect to test_progs framework as prep work for upcoming skb_change_head() fix & selftest, from Jussi Maki.

9) Fix a libbpf segfault in add_dummy_ksym_var() if BTF is not present, from Ian Rogers.

10) Fix tx_only micro-benchmark in xdpsock BPF sample with proper frame size, from Magnus Karlsson.
====================

Signed-off-by: David S. Miller <[email protected]>
2021-05-11bpf: Fix nested bpf_bprintf_prepare with more per-cpu buffersFlorent Revest1-13/+14
The bpf_seq_printf, bpf_trace_printk and bpf_snprintf helpers share one per-cpu buffer that they use to store temporary data (arguments to bprintf). They "get" that buffer with try_get_fmt_tmp_buf and "put" it by the end of their scope with bpf_bprintf_cleanup. If one of these helpers gets called within the scope of one of these helpers, for example: a first bpf program gets called, uses bpf_trace_printk which calls raw_spin_lock_irqsave which is traced by another bpf program that calls bpf_snprintf, then the second "get" fails. Essentially, these helpers are not re-entrant. They would return -EBUSY and print a warning message once. This patch triples the number of bprintf buffers to allow three levels of nesting. This is very similar to what was done for tracepoints in "9594dc3c7e7 bpf: fix nested bpf tracepoints with per-cpu data" Fixes: d9c9e4db186a ("bpf: Factorize bpf_trace_printk and bpf_seq_printf") Reported-by: [email protected] Signed-off-by: Florent Revest <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
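The nesting scheme described above can be modeled as one buffer per nesting level plus a depth counter. Here is a minimal single-CPU model; the kernel's version is per-CPU and uses its own names (e.g. the try_get_fmt_tmp_buf/bpf_bprintf_cleanup pair quoted in the message):

	#include <stddef.h>

	#define BUF_LEN		512	/* illustrative buffer size */
	#define MAX_NEST_LEVEL	3	/* three levels, as in the fix */

	static char bufs[MAX_NEST_LEVEL][BUF_LEN];
	static int nest_level;

	/* "get": claim the buffer for the current nesting depth, or fail
	 * (the kernel returns -EBUSY) when all levels are in use. */
	static char *get_tmp_buf(void)
	{
		if (nest_level >= MAX_NEST_LEVEL)
			return NULL;
		return bufs[nest_level++];
	}

	/* "put": release the most recently claimed level. */
	static void put_tmp_buf(void)
	{
		if (nest_level > 0)
			nest_level--;
	}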
2021-05-11bpf: Add deny list of btf ids check for tracing programsJiri Olsa1-0/+14
The recursion check in __bpf_prog_enter and __bpf_prog_exit leaves some (not inlined) functions unprotected:

In __bpf_prog_enter:
  - migrate_disable is called before prog->active is checked

In __bpf_prog_exit:
  - migrate_enable,rcu_read_unlock_strict are called after prog->active is decreased

When attaching trampoline to them we get panic like:

  traps: PANIC: double fault, error_code: 0x0
  double fault: 0000 [#1] SMP PTI
  RIP: 0010:__bpf_prog_enter+0x4/0x50
  ...
  Call Trace:
   <IRQ>
   bpf_trampoline_6442466513_0+0x18/0x1000
   migrate_disable+0x5/0x50
   __bpf_prog_enter+0x9/0x50
   bpf_trampoline_6442466513_0+0x18/0x1000
   migrate_disable+0x5/0x50
   __bpf_prog_enter+0x9/0x50
   bpf_trampoline_6442466513_0+0x18/0x1000
   migrate_disable+0x5/0x50
   __bpf_prog_enter+0x9/0x50
   bpf_trampoline_6442466513_0+0x18/0x1000
   migrate_disable+0x5/0x50
   ...

Fixing this by adding deny list of btf ids for tracing programs and checking btf id during program verification. Adding above functions to this list.

Suggested-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
2021-05-11bpf: Add kconfig knob for disabling unpriv bpf by defaultDaniel Borkmann3-6/+36
Add a kconfig knob which allows for unprivileged bpf to be disabled by default. If set, the knob sets /proc/sys/kernel/unprivileged_bpf_disabled to value of 2. This still allows a transition of 2 -> {0,1} through an admin. Similarly, this also still keeps 1 -> {1} behavior intact, so that once set to permanently disabled, it cannot be undone aside from a reboot. We've also added extra2 with max of 2 for the procfs handler, so that an admin still has a chance to toggle between 0 <-> 2. Either way, as an additional alternative, applications can make use of CAP_BPF that we added a while ago. Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/74ec548079189e4e4dffaeb42b8987bb3c852eee.1620765074.git.daniel@iogearbox.net
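From user space the resulting state can be inspected through the sysctl mentioned above; a small reader, interpreting the values as described (0 = unprivileged BPF allowed, 1 = disabled until reboot, 2 = disabled but still switchable by an admin):

	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/proc/sys/kernel/unprivileged_bpf_disabled", "r");
		int val;

		if (!f) {
			perror("unprivileged_bpf_disabled");
			return 1;
		}
		if (fscanf(f, "%d", &val) != 1)
			val = -1;
		fclose(f);

		printf("unprivileged_bpf_disabled = %d\n", val);
		return 0;
	}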
2021-05-11bpf, kconfig: Add consolidated menu entry for bpf with core optionsDaniel Borkmann1-0/+78
Right now, all core BPF related options are scattered in different Kconfig locations mainly due to historic reasons. Moving forward, let's add a proper subsystem entry under

  ...
  General setup --->
    BPF subsystem --->
  ...

in order to have all knobs in a single location and thus ease BPF related configuration. Networking related bits such as sockmap are out of scope for the general setup and therefore better suited to remain in net/Kconfig.

Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/f23f58765a4d59244ebd8037da7b6a6b2fb58446.1620765074.git.daniel@iogearbox.net
2021-05-11alarmtimer: Check RTC features instead of opsAlexandre Belloni1-1/+1
RTC drivers used to leave .set_alarm() NULL in order to signal the RTC device doesn't support alarms. The drivers are now clearing the RTC_FEATURE_ALARM bit for that purpose in order to keep the rtc_class_ops structure const. So now, .set_alarm() is set unconditionally and this possibly causes the alarmtimer code to select an RTC device that doesn't support alarms. Test RTC_FEATURE_ALARM instead of relying on ops->set_alarm to determine whether alarms are available. Fixes: 7ae41220ef58 ("rtc: introduce features bitfield") Signed-off-by: Alexandre Belloni <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2021-05-11bpf: Prevent writable memory-mapping of read-only ringbuf pagesAndrii Nakryiko1-13/+8
Only the very first page of BPF ringbuf that contains consumer position counter is supposed to be mapped as writeable by user-space. Producer position is read-only and can be modified only by the kernel code. BPF ringbuf data pages are read-only as well and are not meant to be modified by user-code to maintain integrity of per-record headers. This patch allows to map only consumer position page as writeable and everything else is restricted to be read-only. remap_vmalloc_range() internally adds VM_DONTEXPAND, so all the established memory mappings can't be extended, which prevents any future violations through mremap()'ing. Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Reported-by: Ryota Shiga (Flatt Security) Reported-by: Thadeu Lima de Souza Cascardo <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]>
2021-05-11bpf, ringbuf: Deny reserve of buffers larger than ringbufThadeu Lima de Souza Cascardo1-0/+3
A BPF program might try to reserve a buffer larger than the ringbuf size. If the consumer pointer is way ahead of the producer, that would be successfully reserved, allowing the BPF program to read or write out of the ringbuf allocated area. Reported-by: Ryota Shiga (Flatt Security) Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Signed-off-by: Thadeu Lima de Souza Cascardo <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Alexei Starovoitov <[email protected]>
2021-05-11bpf: Fix alu32 const subreg bound tracking on bitwise operationsDaniel Borkmann1-13/+9
Fix a bug in the verifier's scalar32_min_max_*() functions which leads to incorrect tracking of 32 bit bounds for the simulation of and/or/xor bitops. When both the src & dst subreg is a known constant, then the assumption is that scalar_min_max_*() will take care to update bounds correctly. However, this is not the case, for example, consider a register R2 which has a tnum of 0xffffffff00000000, meaning, lower 32 bits are known constant and in this case of value 0x00000001. R2 is then and'ed with a register R3 which is a 64 bit known constant, here, 0x100000002.

What can be seen in line '10:' is that 32 bit bounds reach an invalid state where {u,s}32_min_value > {u,s}32_max_value. The reason is scalar32_min_max_*() delegates 32 bit bounds updates to scalar_min_max_*(), however, that really only takes place when both the 64 bit src & dst register is a known constant. Given scalar32_min_max_*() is intended to be designed as closely as possible to scalar_min_max_*(), update the 32 bit bounds in this situation through __mark_reg32_known() which will set all {u,s}32_{min,max}_value to the correct constant, which is 0x00000000 after the fix (given 0x00000001 & 0x00000002 in 32 bit space). This is possible given var32_off already holds the final value as dst_reg->var_off is updated before calling scalar32_min_max_*().

Before fix, invalid tracking of R2:

  [...]
  9: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0,smin_value=-9223372036854775807 (0x8000000000000001),smax_value=9223372032559808513 (0x7fffffff00000001),umin_value=1,umax_value=0xffffffff00000001,var_off=(0x1; 0xffffffff00000000),s32_min_value=1,s32_max_value=1,u32_min_value=1,u32_max_value=1) R3_w=inv4294967298 R10=fp0
  9: (5f) r2 &= r3
  10: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0,smin_value=0,smax_value=4294967296 (0x100000000),umin_value=0,umax_value=0x100000000,var_off=(0x0; 0x100000000),s32_min_value=1,s32_max_value=0,u32_min_value=1,u32_max_value=0) R3_w=inv4294967298 R10=fp0
  [...]

After fix, correct tracking of R2:

  [...]
  9: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0,smin_value=-9223372036854775807 (0x8000000000000001),smax_value=9223372032559808513 (0x7fffffff00000001),umin_value=1,umax_value=0xffffffff00000001,var_off=(0x1; 0xffffffff00000000),s32_min_value=1,s32_max_value=1,u32_min_value=1,u32_max_value=1) R3_w=inv4294967298 R10=fp0
  9: (5f) r2 &= r3
  10: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0,smin_value=0,smax_value=4294967296 (0x100000000),umin_value=0,umax_value=0x100000000,var_off=(0x0; 0x100000000),s32_min_value=0,s32_max_value=0,u32_min_value=0,u32_max_value=0) R3_w=inv4294967298 R10=fp0
  [...]

Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Fixes: 2921c90d4718 ("bpf: Fix a verifier failure with xor")
Reported-by: Manfred Paul (@_manfp)
Reported-by: Thadeu Lima de Souza Cascardo <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
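The constant fold the fix relies on is easy to double-check from user space: the known low 32 bits of the two values AND to zero, matching the corrected {u,s}32 bounds of 0 in the "after" dump above.

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint32_t lo2 = 0x00000001;	/* known low 32 bits of R2 */
		uint32_t lo3 = 0x00000002;	/* low 32 bits of the constant in R3 */

		/* Prints 0x00000000, the value __mark_reg32_known() records. */
		printf("0x%08x\n", lo2 & lo3);
		return 0;
	}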
2021-05-10cgroup: inline cgroup_task_freeze()Roman Gushchin1-1/+2
After the introduction of the cgroup.kill there is only one call site of cgroup_task_freeze() left: cgroup_exit(). cgroup_task_freeze() is currently taking rcu_read_lock() to read task's cgroup flags, but because it's always called with css_set_lock locked, the rcu protection is excessive. Simplify the code by inlining cgroup_task_freeze(). v2: fix build Signed-off-by: Roman Gushchin <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2021-05-10rcu: Don't penalize priority boosting when there is nothing to boostPaul E. McKenney1-3/+14
RCU priority boosting cannot do anything unless there is at least one task blocking the current RCU grace period that was preempted within the RCU read-side critical section that it still resides in. However, the current rcu_torture_boost_failed() code will count this as an RCU priority-boosting failure if there were no CPUs blocking the current grace period. This situation can happen (for example) if the last CPU blocking the current grace period was subjected to vCPU preemption, which is always a risk for rcutorture guest OSes. This commit therefore causes rcu_torture_boost_failed() to refrain from reporting failure unless there is at least one task blocking the current RCU grace period that was preempted within the RCU read-side critical section that it still resides in. Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-10rcu: Point to documentation of ordering guaranteesPaul E. McKenney2-2/+21
Add comments to synchronize_rcu() and friends that point to Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst. Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-10rcu: Make rcu_gp_cleanup() be noinline for tracingPaul E. McKenney1-1/+1
Although there are trace events for RCU grace periods, these are only enabled in CONFIG_RCU_TRACE=y kernels. This commit therefore marks rcu_gp_cleanup() noinline in order to provide a function that can be traced that is invoked near the end of each grace period. Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-10rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUsPaul E. McKenney1-1/+1
Kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y can experience significant lock contention due to RCU's resulting focus on ending grace periods as soon as possible. This is OK, but only if there are not very many CPUs. This commit therefore puts this Kconfig option off-limits to systems with more than four CPUs. Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-10rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GPPaul E. McKenney1-2/+3
Currently, show_rcu_gp_kthreads() only dumps rcu_node structures that have outdated ideas of the current grace-period number. This commit also dumps those that are in any way blocking the current grace period. This helps diagnose RCU priority boosting failures. Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-10rcu: Make RCU priority boosting work on single-CPU rcu_node structuresPaul E. McKenney3-24/+9
When any CPU comes online, it checks to see if an RCU-boost kthread has already been created for that CPU's leaf rcu_node structure, and if not, it creates one. Unfortunately, it also verifies that this leaf rcu_node structure actually has at least one online CPU, and if not, it declines to create the kthread. Although this behavior makes sense during early boot, especially on systems that claim far more CPUs than they actually have, it makes no sense for the first CPU to come online for a given rcu_node structure. There is no point in checking because we know there is a CPU on its way in. The problem is that timing differences can cause this incoming CPU to not yet be reflected in the various bit masks even at rcutree_online_cpu() time, and there is no chance at rcutree_prepare_cpu() time. Plus it would be better to create the RCU-boost kthread at rcutree_prepare_cpu() to handle the case where the CPU is involved in an RCU priority inversion very shortly after it comes online. This commit therefore moves the checking to rcu_prepare_kthreads(), which is called only at early boot, when the check is appropriate. In addition, it makes rcutree_prepare_cpu() invoke rcu_spawn_one_boost_kthread(), which no longer does any checking for online CPUs. With this change, RCU priority boosting tests now pass for short rcutorture runs, even with single-CPU leaf rcu_node structures. Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Scott Wood <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-10rcu: Add quiescent states and boost states to show_rcu_gp_kthreads() outputPaul E. McKenney3-3/+11
This commit adds each rcu_node structure's ->qsmask and "bBEG" output indicating whether: (1) There is a boost kthread, (2) A reader needs to be (or is in the process of being) boosted, (3) A reader is blocking an expedited grace period, and (4) A reader is blocking a normal grace period. This helps diagnose RCU priority boosting failures. Signed-off-by: Paul E. McKenney <[email protected]>