Age | Commit message (Collapse) | Author | Files | Lines |
|
`__bpf_ops_sched_ext_ops` was missing the initialization of some struct
attributes. With
https://lore.kernel.org/all/[email protected]/
every single attributes need to be initialized programs (like scx_layered)
will fail to load.
05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_init not found in kernel, skipping it as it's set to zero
05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_exit not found in kernel, skipping it as it's set to zero
05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_prep_move not found in kernel, skipping it as it's set to zero
05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_move not found in kernel, skipping it as it's set to zero
05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_cancel_move not found in kernel, skipping it as it's set to zero
05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_set_weight not found in kernel, skipping it as it's set to zero
05:26:48 [WARN] libbpf: prog 'layered_dump': BPF program load failed: unknown error (-524)
05:26:48 [WARN] libbpf: prog 'layered_dump': -- BEGIN PROG LOAD LOG --
attach to unsupported member dump of struct sched_ext_ops
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
-- END PROG LOAD LOG --
05:26:48 [WARN] libbpf: prog 'layered_dump': failed to load: -524
05:26:48 [WARN] libbpf: failed to load object 'bpf_bpf'
05:26:48 [WARN] libbpf: failed to load BPF skeleton 'bpf_bpf': -524
Error: Failed to load BPF program
Signed-off-by: Manu Bretelle <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
sched_ext currently doesn't generate messages when the BPF scheduler is
enabled and disabled unless there are errors. It is useful to have paper
trail. Improve logging around enable/disable:
- Generate info messages on enable and non-error disable.
- Update error exit message formatting so that it's consistent with
non-error message. Also, prefix ei->msg with the BPF scheduler's name to
make it clear where the message is coming from.
- Shorten scx_exit_reason() strings for SCX_EXIT_UNREG* for brevity and
consistency.
v2: Use pr_*() instead of KERN_* consistently. (David)
Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Phil Auld <[email protected]>
Reviewed-by: Phil Auld <[email protected]>
Acked-by: David Vernet <[email protected]>
|
|
SCX_RQ_ONLINE
scx_rq_online() currently only tests SCX_RQ_ONLINE. This isn't fully correct
- e.g. consume_dispatch_q() uses task_run_on_remote_rq() which tests
scx_rq_online() to see whether the current rq can run the task, and, if so,
calls consume_remote_task() to migrate the task to @rq. While the test
itself was done while locking @rq, @rq can be temporarily unlocked by
consume_remote_task() and nothing prevents SCX_RQ_ONLINE from going offline
before the migration takes place.
To address the issue, add cpu_active() test to scx_rq_online(). There is a
synchronize_rcu() between cpu_active() being cleared and the rq going
offline, so if an on-going scheduling operation sees cpu_active(), the
associated rq is guaranteed to not go offline until the scheduling operation
is complete.
Signed-off-by: Tejun Heo <[email protected]>
Fixes: 60c27fb59f6c ("sched_ext: Implement sched_ext_ops.cpu_online/offline()")
Acked-by: David Vernet <[email protected]>
|
|
process_ddsp_deferred_locals() executes deferred direct dispatches to the
local DSQs of remote CPUs. It iterates the tasks on
rq->scx.ddsp_deferred_locals list, removing and calling
dispatch_to_local_dsq() on each. However, the list is protected by the rq
lock that can be dropped by dispatch_to_local_dsq() temporarily, so the list
can be modified during the iteration, which can lead to oopses and other
failures.
Fix it by popping from the head of the list instead of iterating the list.
Signed-off-by: Tejun Heo <[email protected]>
Fixes: 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches")
Acked-by: David Vernet <[email protected]>
|
|
task_can_run_on_remote_rq() is similar to is_cpu_allowed() but there are
subtle differences. It currently open codes all the tests. This is
cumbersome to understand and error-prone in case the intersecting tests need
to be updated.
Factor out the common part - testing whether the task is allowed on the CPU
at all regardless of the CPU state - into task_allowed_on_cpu() and make
both is_cpu_allowed() and SCX's task_can_run_on_remote_rq() use it. As the
code is now linked between the two and each contains only the extra tests
that differ between them, it's less error-prone when the conditions need to
be updated. Also, improve the comment to explain why they are different.
v2: Replace accidental "extern inline" with "static inline" (Peter).
Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Acked-by: David Vernet <[email protected]>
|
|
scx_task_iter_next_locked()
scx_task_iter_next_locked() skips tasks whose sched_class is
idle_sched_class. While it has a short comment explaining why it's testing
the sched_class directly isntead of using is_idle_task(), the comment
doesn't sufficiently explain what's going on and why. Improve the comment.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Acked-by: David Vernet <[email protected]>
|
|
On SMP, SCX performs dispatch from sched_class->balance(). As balance() was
not available in UP, it instead called the internal balance function from
put_prev_task_scx() and pick_next_task_scx() to emulate the effect, which is
rather nasty.
Enabling sched_class->balance() on UP shouldn't cause any meaningful
overhead. Enable balance() on UP and drop the ugly workaround.
Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Acked-by: David Vernet <[email protected]>
|
|
update_curr_scx() is open coding runtime updates. Use update_curr_common()
instead and avoid unnecessary deviations.
Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Acked-by: David Vernet <[email protected]>
|
|
put_prev_task_balance()
SCX needs its balance() invoked even when waking up from a lower priority
sched class (idle) and put_prev_task_balance() thus has the logic to promote
@start_class if it's lower than ext_sched_class. This is only needed when
SCX is enabled. Add scx_enabled() test to avoid unnecessary overhead when
SCX is disabled.
Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Acked-by: David Vernet <[email protected]>
|
|
The way sched_can_stop_tick() used scx_can_stop_tick() was rather confusing
and the behavior wasn't ideal when SCX is enabled in partial mode. Simplify
it so that:
- scx_can_stop_tick() can say no if scx_enabled().
- CFS tests rq->cfs.nr_running > 1 instead of rq->nr_running.
This is easier to follow and leads to the correct answer whether SCX is
disabled, enabled in partial mode or all tasks are switched to SCX.
Peter, note that this is a bit different from your suggestion where
sched_can_stop_tick() unconditionally returns scx_can_stop_tick() iff
scx_switched_all(). The problem is that in partial mode, tick can be stopped
when there is only one SCX task even if the BPF scheduler didn't ask and
isn't ready for it.
Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Acked-by: David Vernet <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.12
Pull tip/sched/core to resolve the following four conflicts. While 2-4 are
simple context conflicts, 1 is a bit subtle and easy to resolve incorrectly.
1. 2c8d046d5d51 ("sched: Add normal_policy()")
vs.
faa42d29419d ("sched/fair: Make SCHED_IDLE entity be preempted in strict hierarchy")
The former converts direct test on p->policy to use the helper
normal_policy(). The latter moves the p->policy test to a different
location. Resolve by converting the test on p->plicy in the new location to
use normal_policy().
2. a7a9fc549293 ("sched_ext: Add boilerplate for extensible scheduler class")
vs.
a110a81c52a9 ("sched/deadline: Deferrable dl server")
Both add calls to put_prev_task_idle() and set_next_task_idle(). Simple
context conflict. Resolve by taking changes from both.
3. a7a9fc549293 ("sched_ext: Add boilerplate for extensible scheduler class")
vs.
c245910049d0 ("sched/core: Add clearing of ->dl_server in put_prev_task_balance()")
The former changes for_each_class() itertion to use for_each_active_class().
The latter moves away the adjacent dl_server handling code. Simple context
conflict. Resolve by taking changes from both.
4. 60c27fb59f6c ("sched_ext: Implement sched_ext_ops.cpu_online/offline()")
vs.
31b164e2e4af ("sched/smt: Introduce sched_smt_present_inc/dec() helper")
2f027354122f ("sched/core: Introduce sched_set_rq_on/offline() helper")
The former adds scx_rq_deactivate() call. The latter two change code around
it. Simple context conflict. Resolve by taking changes from both.
Signed-off-by: Tejun Heo <[email protected]>
|
|
From 1232da7eced620537a78f19c8cf3d4a3508e2419 Mon Sep 17 00:00:00 2001
From: Tejun Heo <[email protected]>
Date: Wed, 31 Jul 2024 09:14:52 -1000
p->scx.disallow provides a way for the BPF scheduler to reject certain tasks
from attaching. It's currently allowed for both the load and fork paths;
however, the latter doesn't actually work as p->sched_class is already set
by the time scx_ops_init_task() is called during fork.
This is a convenience feature which is mostly useful from the load path
anyway. Allow it only from the load path.
v2: Trigger scx_ops_error() iff @p->policy == SCHED_EXT to make it a bit
easier for the BPF scheduler (David).
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: "Zhangqiao (2012 lab)" <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 7bb6f0810ecf ("sched_ext: Allow BPF schedulers to disallow specific tasks from joining SCHED_EXT")
Acked-by: David Vernet <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
scx_dump_task() uses stack_trace_save_tsk() which is only available when
CONFIG_STACKTRACE. Make CONFIG_SCHED_CLASS_EXT select CONFIG_STACKTRACE if
the support is available and skip capturing stack trace if
!CONFIG_STACKTRACE.
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Acked-by: David Vernet <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
We already have some testcases verifying that we can call
BPF_PROG_TYPE_SYSCALL progs and invoke scx_bpf_exit(). Let's extend that to
also call scx_bpf_create_dsq() so we get coverage for that as well.
Signed-off-by: David Vernet <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
We currently only allow calling sleepable scx kfuncs (i.e.
scx_bpf_create_dsq()) from BPF_PROG_TYPE_STRUCT_OPS progs. The idea here
was that we'd never have to call scx_bpf_create_dsq() outside of a
sched_ext struct_ops callback, but that might not actually be true. For
example, a scheduler could do something like the following:
1. Open and load (not yet attach) a scheduler skel
2. Synchronously call into a BPF_PROG_TYPE_SYSCALL prog from user space.
For example, to initialize an LLC domain, or some other global,
read-only state.
3. Attach the skel, which actually enables the scheduler
The advantage of doing this is that it can preclude having to do pretty
ugly boilerplate like initializing a read-only, statically sized array of
u64[]'s which the kernel consumes literally once at init time to then
create struct bpf_cpumask objects which are actually queried at runtime.
Doing the above is already possible given that we can invoke core BPF
kfuncs, such as bpf_cpumask_create(), from BPF_PROG_TYPE_SYSCALL progs. We
already allow many scx kfuncs to be called from BPF_PROG_TYPE_SYSCALL progs
(e.g. scx_bpf_kick_cpu()). Let's allow the sleepable kfuncs as well.
Signed-off-by: David Vernet <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
Linux 6.11-rc1
|
|
The throttle interaction made my brain hurt, make it consistently
about 0 transitions of h_nr_running.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
|
|
Now that fair_server exists, we no longer need RT bandwidth control
unless RT_GROUP_SCHED.
Enable fair_server with parameters equivalent to RT throttling.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: "Peter Zijlstra (Intel)" <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: "Vineeth Pillai (Google)" <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Juri Lelli <[email protected]>
Link: https://lore.kernel.org/r/14d562db55df5c3c780d91940743acb166895ef7.1716811044.git.bristot@kernel.org
|
|
* Use simple CFS pick_task for DL pick_task
DL server's pick_task calls CFS's pick_next_task_fair(), this is wrong
because core scheduling's pick_task only calls CFS's pick_task() for
evaluation / checking of the CFS task (comparing across CPUs), not for
actually affirmatively picking the next task. This causes RB tree
corruption issues in CFS that were found by syzbot.
* Make pick_task_fair clear DL server
A DL task pick might set ->dl_server, but it is possible the task will
never run (say the other HT has a stop task). If the CFS task is picked
in the future directly (say without DL server), ->dl_server will be
set. So clear it in pick_task_fair().
This fixes the KASAN issue reported by syzbot in set_next_entity().
(DL refactoring suggestions by Vineeth Pillai).
Reported-by: Suleiman Souhlal <[email protected]>
Signed-off-by: "Joel Fernandes (Google)" <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vineeth Pillai <[email protected]>
Tested-by: Juri Lelli <[email protected]>
Link: https://lore.kernel.org/r/b10489ab1f03d23e08e6097acea47442e7d6466f.1716811044.git.bristot@kernel.org
|
|
In core scheduling, a DL server pick (which is CFS task) should be
given higher priority than tasks in other classes.
Not doing so causes CFS starvation. A kselftest is added later to
demonstrate this. A CFS task that is competing with RT tasks can
be completely starved without this and the DL server's boosting
completely ignored.
Fix these problems.
Reported-by: Suleiman Souhlal <[email protected]>
Signed-off-by: "Joel Fernandes (Google)" <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vineeth Pillai <[email protected]>
Tested-by: Juri Lelli <[email protected]>
Link: https://lore.kernel.org/r/48b78521d86f3b33c24994d843c1aad6b987dda9.1716811044.git.bristot@kernel.org
|
|
Add an interface for fair server setup on debugfs.
Each CPU has two files under /debug/sched/fair_server/cpu{ID}:
- runtime: set runtime in ns
- period: set period in ns
This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to set
bounds on admission control.
The interface also add the server to the dl bandwidth accounting.
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Juri Lelli <[email protected]>
Link: https://lore.kernel.org/r/a9ef9fc69bcedb44bddc9bc34f2b313296052819.1716811044.git.bristot@kernel.org
|
|
Among the motivations for the DL servers is the real-time throttling
mechanism. This mechanism works by throttling the rt_rq after
running for a long period without leaving space for fair tasks.
The base dl server avoids this problem by boosting fair tasks instead
of throttling the rt_rq. The point is that it boosts without waiting
for potential starvation, causing some non-intuitive cases.
For example, an IRQ dispatches two tasks on an idle system, a fair
and an RT. The DL server will be activated, running the fair task
before the RT one. This problem can be avoided by deferring the
dl server activation.
By setting the defer option, the dl_server will dispatch an
SCHED_DEADLINE reservation with replenished runtime, but throttled.
The dl_timer will be set for the defer time at (period - runtime) ns
from start time. Thus boosting the fair rq at defer time.
If the fair scheduler has the opportunity to run while waiting
for defer time, the dl server runtime will be consumed. If
the runtime is completely consumed before the defer time, the
server will be replenished while still in a throttled state. Then,
the dl_timer will be reset to the new defer time
If the fair server reaches the defer time without consuming
its runtime, the server will start running, following CBS rules
(thus without breaking SCHED_DEADLINE). Then the server will
continue the running state (without deferring) until it fair
tasks are able to execute as regular fair scheduler (end of
the starvation).
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Juri Lelli <[email protected]>
Link: https://lore.kernel.org/r/dd175943c72533cd9f0b87767c6499204879cc38.1716811044.git.bristot@kernel.org
|
|
Use deadline servers to service fair tasks.
This patch adds a fair_server deadline entity which acts as a container
for fair entities and can be used to fix starvation when higher priority
(wrt fair) tasks are monopolizing CPU(s).
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Juri Lelli <[email protected]>
Link: https://lore.kernel.org/r/b6b0bcefaf25391bcf5b6ecdb9f1218de402d42e.1716811044.git.bristot@kernel.org
|
|
In case the previous pick was a DL server pick, ->dl_server might be
set. Clear it in the fast path as well.
Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Youssef Esmat <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Juri Lelli <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/7f7381ccba09efcb4a1c1ff808ed58385eccc222.1716811044.git.bristot@kernel.org
|
|
Paths using put_prev_task_balance() need to do a pick shortly
after. Make sure they also clear the ->dl_server on prev as a
part of that.
Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
Signed-off-by: "Joel Fernandes (Google)" <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Juri Lelli <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/d184d554434bedbad0581cb34656582d78655150.1716811044.git.bristot@kernel.org
|
|
Add an explanation for the newly added variable.
Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Juri Lelli <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/147f7aa8cb8fd925f36aa8059af6a35aad08b45a.1716811044.git.bristot@kernel.org
|
|
Consider the following cgroup:
root
|
------------------------
| |
normal_cgroup idle_cgroup
| |
SCHED_IDLE task_A SCHED_NORMAL task_B
According to the cgroup hierarchy, A should preempt B. But current
check_preempt_wakeup_fair() treats cgroup se and task separately, so B
will preempt A unexpectedly.
Unify the wakeup logic by {c,p}se_is_idle only. This makes SCHED_IDLE of
a task a relative policy that is effective only within its own cgroup,
similar to the behavior of NICE.
Also fix se_is_idle() definition when !CONFIG_FAIR_GROUP_SCHED.
Fixes: 304000390f88 ("sched: Cgroup SCHED_IDLE support")
Signed-off-by: Tianchen Ding <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Josh Don <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
As a hedge against unexpected user issues commit 88c56cfeaec4
("sched/fair: Block nohz tick_stop when cfs bandwidth in use")
included a scheduler feature to disable the new functionality.
It's been a few releases (v6.6) and no screams, so remove it.
Signed-off-by: Phil Auld <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
nr_spread_over tracks the number of instances where the difference
between a scheduling entity's virtual runtime and the minimum virtual
runtime in the runqueue exceeds three times the scheduler latency,
indicating significant disparity in task scheduling.
Commit that removed its usage: 5e963f2bd: sched/fair: Commit to EEVDF
cfs_rq->exec_clock was used to account for time spent executing tasks.
Commit that removed its usage: 5d69eca542ee1 sched: Unify runtime
accounting across classes
cfs_rq::nr_spread_over and cfs_rq::exec_clock are not used anymore in
eevdf. Remove them from struct cfs_rq.
Signed-off-by: Chuyi Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Reviewed-by: K Prateek Nayak <[email protected]>
Acked-by: Vishal Chourasia <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Background
==========
When repeated migrate_disable() calls are made with missing the
corresponding migrate_enable() calls, there is a risk of
'migration_disabled' going upper overflow because
'migration_disabled' is a type of unsigned short whose max value is
65535.
In PREEMPT_RT kernel, if 'migration_disabled' goes upper overflow, it may
make the migrate_disable() ineffective within local_lock_irqsave(). This
is because, during the scheduling procedure, the value of
'migration_disabled' will be checked, which can trigger CPU migration.
Consequently, the count of 'rcu_read_lock_nesting' may leak due to
local_lock_irqsave() and local_unlock_irqrestore() occurring on different
CPUs.
Usecase
========
For example, When I developed a driver, I encountered a warning like
"WARNING: CPU: 4 PID: 260 at kernel/rcu/tree_plugin.h:315
rcu_note_context_switch+0xa8/0x4e8" warning. It took me half a month
to locate this issue. Ultimately, I discovered that the lack of upper
overflow detection mechanism in migrate_disable() was the root cause,
leading to a significant amount of time spent on problem localization.
If the upper overflow detection mechanism was added to migrate_disable(),
the root cause could be very quickly and easily identified.
Effect
======
Using WARN_ON_ONCE() to check if 'migration_disabled' is upper overflow
can help developers identify the issue quickly.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Peilin He<[email protected]>
Signed-off-by: xu xin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Yunkai Zhang <[email protected]>
Reviewed-by: Qiang Tu <[email protected]>
Reviewed-by: Kun Jiang <[email protected]>
Reviewed-by: Fan Yu <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
When creating a new task, we initialize vruntime of the newly task at
sched_cgroup_fork(). However, the timing of executing this action is too
early and may not be accurate.
Because it uses current CPU to init the vruntime, but the new task
actually runs on the cpu which be assigned at wake_up_new_task().
To optimize this case, we pass ENQUEUE_INITIAL flag to activate_task()
in wake_up_new_task(), in this way, when place_entity is called in
enqueue_entity(), the vruntime of the new task will be initialized.
In addition, place_entity() in task_fork_fair() was introduced for two
reasons:
1. Previously, the __enqueue_entity() was in task_new_fair(),
in order to provide vruntime for enqueueing the newly task, the
vruntime assignment equation "se->vruntime = cfs_rq->min_vruntime" was
introduced by commit e9acbff6484d ("sched: introduce se->vruntime").
This is the initial state of place_entity().
2. commit 4d78e7b656aa ("sched: new task placement for vruntime") added
child_runs_first task placement feature which based on vruntime, this
also requires the new task's vruntime value.
After removing the child_runs_first and enqueue_entity() from
task_fork_fair(), this place_entity() no longer makes sense, so remove
it also.
Signed-off-by: Zhang Qiao <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
If cpuset_cpu_inactive() fails, set_rq_online() need be called to rollback.
Fixes: 120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control")
Cc: [email protected]
Signed-off-by: Yang Yingliang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Introduce sched_set_rq_on/offline() helper, so it can be called
in normal or error path simply. No functional changed.
Cc: [email protected]
Signed-off-by: Yang Yingliang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
I got the following warn report while doing stress test:
jump label: negative count!
WARNING: CPU: 3 PID: 38 at kernel/jump_label.c:263 static_key_slow_try_dec+0x9d/0xb0
Call Trace:
<TASK>
__static_key_slow_dec_cpuslocked+0x16/0x70
sched_cpu_deactivate+0x26e/0x2a0
cpuhp_invoke_callback+0x3ad/0x10d0
cpuhp_thread_fun+0x3f5/0x680
smpboot_thread_fn+0x56d/0x8d0
kthread+0x309/0x400
ret_from_fork+0x41/0x70
ret_from_fork_asm+0x1b/0x30
</TASK>
Because when cpuset_cpu_inactive() fails in sched_cpu_deactivate(),
the cpu offline failed, but sched_smt_present is decremented before
calling sched_cpu_deactivate(), it leads to unbalanced dec/inc, so
fix it by incrementing sched_smt_present in the error path.
Fixes: c5511d03ec09 ("sched/smt: Make sched_smt_present track topology")
Cc: [email protected]
Signed-off-by: Yang Yingliang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Chen Yu <[email protected]>
Reviewed-by: Tim Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Introduce sched_smt_present_inc/dec() helper, so it can be called
in normal or error path simply. No functional changed.
Cc: [email protected]
Signed-off-by: Yang Yingliang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
In extreme test scenarios:
the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime,
utime = 18446744073709518790 ns, rtime = 135989749728000 ns
In cputime_adjust() process, stime is greater than rtime due to
mul_u64_u64_div_u64() precision problem.
before call mul_u64_u64_div_u64(),
stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
after call mul_u64_u64_div_u64(),
stime = 135989949653530
unsigned reversion occurs because rtime is less than stime.
utime = rtime - stime = 135989749728000 - 135989949653530
= -199925530
= (u64)18446744073709518790
Trigger condition:
1). User task run in kernel mode most of time
2). ARM64 architecture
3). TICK_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set
Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime
Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
Signed-off-by: Zheng Zucheng <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix RPM package build error caused by an incorrect locale setup
- Mark modules.weakdep as ghost in RPM package
- Fix the odd combination of -S and -c in stack protector scripts,
which is an error with the latest Clang
* tag 'kbuild-fixes-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: Fix '-S -c' in x86 stack protector scripts
kbuild: rpm-pkg: ghost modules.weakdep file
kbuild: rpm-pkg: Fix C locale setup
|
|
This simplifies the min_t() and max_t() macros by no longer making them
work in the context of a C constant expression.
That means that you can no longer use them for static initializers or
for array sizes in type definitions, but there were only a couple of
such uses, and all of them were converted (famous last words) to use
MIN_T/MAX_T instead.
Cc: David Laight <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit 3a7e02c040b1 ("minmax: avoid overly complicated constant
expressions in VM code") added the simpler MIN_T/MAX_T macros in order
to avoid some excessive expansion from the rather complicated regular
min/max macros.
The complexity of those macros stems from two issues:
(a) trying to use them in situations that require a C constant
expression (in static initializers and for array sizes)
(b) the type sanity checking
and MIN_T/MAX_T avoids both of these issues.
Now, in the whole (long) discussion about all this, it was pointed out
that the whole type sanity checking is entirely unnecessary for
min_t/max_t which get a fixed type that the comparison is done in.
But that still leaves min_t/max_t unnecessarily complicated due to
worries about the C constant expression case.
However, it turns out that there really aren't very many cases that use
min_t/max_t for this, and we can just force-convert those.
This does exactly that.
Which in turn will then allow for much simpler implementations of
min_t()/max_t(). All the usual "macros in all upper case will evaluate
the arguments multiple times" rules apply.
We should do all the same things for the regular min/max() vs MIN/MAX()
cases, but that has the added complexity of various drivers defining
their own local versions of MIN/MAX, so that needs another level of
fixes first.
Link: https://lore.kernel.org/all/[email protected]/
Cc: David Laight <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBI and UBIFS updates from Richard Weinberger:
- Many fixes for power-cut issues by Zhihao Cheng
- Another ubiblock error path fix
- ubiblock section mismatch fix
- Misc fixes all over the place
* tag 'ubifs-for-linus-6.11-rc1-take2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubi: Fix ubi_init() ubiblock_exit() section mismatch
ubifs: add check for crypto_shash_tfm_digest
ubifs: Fix inconsistent inode size when powercut happens during appendant writing
ubi: block: fix null-pointer-dereference in ubiblock_create()
ubifs: fix kernel-doc warnings
ubifs: correct UBIFS_DFS_DIR_LEN macro definition and improve code clarity
mtd: ubi: Restore missing cleanup on ubi_init() failure path
ubifs: dbg_orphan_check: Fix missed key type checking
ubifs: Fix unattached inode when powercut happens in creating
ubifs: Fix space leak when powercut happens in linking tmpfile
ubifs: Move ui->data initialization after initializing security
ubifs: Fix adding orphan entry twice for the same inode
ubifs: Remove insert_dead_orphan from replaying orphan process
Revert "ubifs: ubifs_symlink: Fix memleak of inode->i_link in error path"
ubifs: Don't add xattr inode into orphan area
ubifs: Fix unattached xattr inode if powercut happens after deleting
mtd: ubi: avoid expensive do_div() on 32-bit machines
mtd: ubi: make ubi_class constant
ubi: eba: properly rollback inside self_check_eba
|
|
After a recent change in clang to stop consuming all instances of '-S'
and '-c' [1], the stack protector scripts break due to the kernel's use
of -Werror=unused-command-line-argument to catch cases where flags are
not being properly consumed by the compiler driver:
$ echo | clang -o - -x c - -S -c -Werror=unused-command-line-argument
clang: error: argument unused during compilation: '-c' [-Werror,-Wunused-command-line-argument]
This results in CONFIG_STACKPROTECTOR getting disabled because
CONFIG_CC_HAS_SANE_STACKPROTECTOR is no longer set.
'-c' and '-S' both instruct the compiler to stop at different stages of
the pipeline ('-S' after compiling, '-c' after assembling), so having
them present together in the same command makes little sense. In this
case, the test wants to stop before assembling because it is looking at
the textual assembly output of the compiler for either '%fs' or '%gs',
so remove '-c' from the list of arguments to resolve the error.
All versions of GCC continue to work after this change, along with
versions of clang that do or do not contain the change mentioned above.
Cc: [email protected]
Fixes: 4f7fd4d7a791 ("[PATCH] Add the -fstack-protector option to the CFLAGS")
Fixes: 60a5317ff0f4 ("x86: implement x86_32 stack protector")
Link: https://github.com/llvm/llvm-project/commit/6461e537815f7fa68cef06842505353cf5600e9c [1]
Signed-off-by: Nathan Chancellor <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
|
|
Since ubiblock_exit() is now called from an init function,
the __exit section no longer makes sense.
Cc: Ben Hutchings <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Richard Weinberger <[email protected]>
Reviewed-by: Zhihao Cheng <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
Pull turbostat updates from Len Brown:
- Enable turbostat extensions to add both perf and PMT (Intel
Platform Monitoring Technology) counters via the cmdline
- Demonstrate PMT access with built-in support for Meteor Lake's
Die C6 counter
* tag 'v6.11-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
tools/power turbostat: version 2024.07.26
tools/power turbostat: Include umask=%x in perf counter's config
tools/power turbostat: Document PMT in turbostat.8
tools/power turbostat: Add MTL's PMT DC6 builtin counter
tools/power turbostat: Add early support for PMT counters
tools/power turbostat: Add selftests for added perf counters
tools/power turbostat: Add selftests for SMI, APERF and MPERF counters
tools/power turbostat: Move verbose counter messages to level 2
tools/power turbostat: Move debug prints from stdout to stderr
tools/power turbostat: Fix typo in turbostat.8
tools/power turbostat: Add perf added counter example to turbostat.8
tools/power turbostat: Fix formatting in turbostat.8
tools/power turbostat: Extend --add option with perf counters
tools/power turbostat: Group SMI counter with APERF and MPERF
tools/power turbostat: Add ZERO_ARRAY for zero initializing builtin array
tools/power turbostat: Replace enum rapl_source and cstate_source with counter_source
tools/power turbostat: Remove anonymous union from rapl_counter_info_t
tools/power/turbostat: Switch to new Intel CPU model defines
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull CXL updates from Dave Jiang:
"Core:
- A CXL maturity map has been added to the documentation to detail
the current state of CXL enabling.
It provides the status of the current state of various CXL features
to inform current and future contributors of where things are and
which areas need contribution.
- A notifier handler has been added in order for a newly created CXL
memory region to trigger the abstract distance metrics calculation.
This should bring parity for CXL memory to the same level vs
hotplugged DRAM for NUMA abstract distance calculation. The
abstract distance reflects relative performance used for memory
tiering handling.
- An addition for XOR math has been added to address the CXL DPA to
SPA translation.
CXL address translation did not support address interleave math
with XOR prior to this change.
Fixes:
- Fix to address race condition in the CXL memory hotplug notifier
- Add missing MODULE_DESCRIPTION() for CXL modules
- Fix incorrect vendor debug UUID define
Misc:
- A warning has been added to inform users of an unsupported
configuration when mixing CXL VH and RCH/RCD hierarchies
- The ENXIO error code has been replaced with EBUSY for inject poison
limit reached via debugfs and cxl-test support
- Moving the PCI config read in cxl_dvsec_rr_decode() to avoid
unnecessary PCI config reads
- A refactor to a common struct for DRAM and general media CXL
events"
* tag 'cxl-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/core/pci: Move reading of control register to immediately before usage
cxl: Remove defunct code calculating host bridge target positions
cxl/region: Verify target positions using the ordered target list
cxl: Restore XOR'd position bits during address translation
cxl/core: Fold cxl_trace_hpa() into cxl_dpa_to_hpa()
cxl/test: Replace ENXIO with EBUSY for inject poison limit reached
cxl/memdev: Replace ENXIO with EBUSY for inject poison limit reached
cxl/acpi: Warn on mixed CXL VH and RCH/RCD Hierarchy
cxl/core: Fix incorrect vendor debug UUID define
Documentation: CXL Maturity Map
cxl/region: Simplify cxl_region_nid()
cxl/region: Support to calculate memory tier abstract distance
cxl/region: Fix a race condition in memory hotplug notifier
cxl: add missing MODULE_DESCRIPTION() macros
cxl/events: Use a common struct for DRAM and General Media events
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode
Pull unicode update from Gabriel Krisman Bertazi:
"Two small fixes to silence the compiler and static analyzers tools
from Ben Dooks and Jeff Johnson"
* tag 'unicode-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode:
unicode: add MODULE_DESCRIPTION() macros
unicode: make utf8 test count static
|
|
In the same way as for other similar files, mark as ghost the new file
generated by depmod for configured weak dependencies for modules,
modules.weakdep, so that although it is not included in the package,
claim the ownership on it.
Signed-off-by: Jose Ignacio Tornos Martinez <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
|
|
git://git.samba.org/sfrench/cifs-2.6
Pull more smb client updates from Steve French:
- fix for potential null pointer use in init cifs
- additional dynamic trace points to improve debugging of some common
scenarios
- two SMB1 fixes (one addressing reconnect with POSIX extensions, one a
mount parsing error)
* tag '6.11-rc-smb-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
smb3: add dynamic trace point for session setup key expired failures
smb3: add four dynamic tracepoints for copy_file_range and reflink
smb3: add dynamic tracepoint for reflink errors
cifs: mount with "unix" mount option for SMB1 incorrectly handled
cifs: fix reconnect with SMB1 UNIX Extensions
cifs: fix potential null pointer use in destroy_workqueue in init_cifs error path
|
|
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- Fix request without payloads cleanup (Leon)
- Use new protection information format (Francis)
- Improved debug message for lost pci link (Bart)
- Another apst quirk (Wang)
- Use appropriate sysfs api for printing chars (Markus)
- ublk async device deletion fix (Ming)
- drbd kerneldoc fixups (Simon)
- Fix deadlock between sd removal and release (Yang)
* tag 'block-6.11-20240726' of git://git.kernel.dk/linux:
nvme-pci: add missing condition check for existence of mapped data
ublk: fix UBLK_CMD_DEL_DEV_ASYNC handling
block: fix deadlock between sd_remove & sd_release
drbd: Add peer_device to Kernel doc
nvme-core: choose PIF from QPIF if QPIFS supports and PIF is QTYPE
nvme-pci: Fix the instructions for disabling power management
nvme: remove redundant bdev local variable
nvme-fabrics: Use seq_putc() in __nvmf_concat_opt_tokens()
nvme/pci: Add APST quirk for Lenovo N60z laptop
|
|
Pull io_uring fixes from Jens Axboe:
- Fix a syzbot issue for the msg ring cache added in this release. No
ill effects from this one, but it did make KMSAN unhappy (me)
- Sanitize the NAPI timeout handling, by unifying the value handling
into all ktime_t rather than converting back and forth (Pavel)
- Fail NAPI registration for IOPOLL rings, it's not supported (Pavel)
- Fix a theoretical issue with ring polling and cancelations (Pavel)
- Various little cleanups and fixes (Pavel)
* tag 'io_uring-6.11-20240726' of git://git.kernel.dk/linux:
io_uring/napi: pass ktime to io_napi_adjust_timeout
io_uring/napi: use ktime in busy polling
io_uring/msg_ring: fix uninitialized use of target_req->flags
io_uring: align iowq and task request error handling
io_uring: kill REQ_F_CANCEL_SEQ
io_uring: simplify io_uring_cmd return
io_uring: fix io_match_task must_hold
io_uring: don't allow netpolling with SETUP_IOPOLL
io_uring: tighten task exit cancellations
|