|
VMAs are skipped if there is no recent fault activity but this represents
a chicken-and-egg problem as there may be no fault activity if the PTEs
are never updated to trap NUMA hints. There is an indirect reliance on
scanning to be forced early in the lifetime of a task but this may fail
to detect changes in phase behaviour. Force inactive VMAs to be scanned
when all other eligible VMAs have been updated within the same scan
sequence.
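As an illustration of the selection policy (a user-space sketch only; struct and
field names such as vma_model, pids_active and scan_seq are hypothetical
stand-ins for the kernel's per-VMA NUMA state):
#include <stdbool.h>
#include <stdio.h>

struct vma_model {
	bool pids_active;       /* current task trapped a fault here recently */
	unsigned long scan_seq; /* scan sequence that last covered this VMA   */
};

/*
 * Scan VMAs with recent fault activity as usual. Once every active VMA
 * has been covered within the current scan sequence, stop skipping the
 * inactive ones and force them to be scanned too.
 */
static bool should_scan(struct vma_model *v, struct vma_model *vmas,
			int nr, unsigned long cur_seq)
{
	int i;

	if (v->pids_active)
		return true;                    /* normal case */

	for (i = 0; i < nr; i++) {
		if (vmas[i].pids_active && vmas[i].scan_seq != cur_seq)
			return false;           /* an active VMA is still pending */
	}
	return true;                            /* all active VMAs done: force scan */
}

int main(void)
{
	struct vma_model vmas[] = {
		{ .pids_active = true,  .scan_seq = 2 },
		{ .pids_active = false, .scan_seq = 0 },
	};

	/* The active VMA was already scanned in sequence 2, so the
	 * inactive VMA is no longer skipped. */
	printf("force scan inactive VMA: %d\n", should_scan(&vmas[1], vmas, 2, 2));
	return 0;
}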
Test results in general look good with some changes in performance, both
negative and positive, depending on whether the additional scanning and
faulting was beneficial or not to the workload. The autonuma benchmark
workload NUMA01_THREADLOCAL was picked for closer examination. The workload
creates two processes with numerous threads and thread-local storage that
is zero-filled in a loop. It exercises the corner case where unrelated
threads may skip VMAs that are thread-local to another thread and still
has some VMAs that are inactive while the workload executes.
The VMA skipping activity frequency with and without the patch:
6.6.0-rc2-sched-numabtrace-v1
=============================
649 reason=scan_delay
9,094 reason=unsuitable
48,915 reason=shared_ro
143,919 reason=inaccessible
193,050 reason=pid_inactive
6.6.0-rc2-sched-numabselective-v1
=============================
146 reason=seq_completed
622 reason=ignore_pid_inactive
624 reason=scan_delay
6,570 reason=unsuitable
16,101 reason=shared_ro
27,608 reason=inaccessible
41,939 reason=pid_inactive
Note that with the patch applied, the PID activity is ignored
(ignore_pid_inactive) to ensure a VMA with some activity is completely
scanned. In addition, a small number of VMAs are scanned when no other
eligible VMA is available during a single scan window (seq_completed).
The number of times a VMA is skipped due to no PID activity from the
scanning task (pid_inactive) drops dramatically. It is expected that
this will increase the number of PTEs updated for NUMA hinting faults
as well as hinting faults but these represent PTEs that would otherwise
have been missed. The tradeoff is scan+fault overhead versus improving
locality due to migration.
On a 2-socket Cascade Lake test machine, the time to complete the
workload is as follows;
6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1
Min elsp-NUMA01_THREADLOCAL 174.22 ( 0.00%) 117.64 ( 32.48%)
Amean elsp-NUMA01_THREADLOCAL 175.68 ( 0.00%) 123.34 * 29.79%*
Stddev elsp-NUMA01_THREADLOCAL 1.20 ( 0.00%) 4.06 (-238.20%)
CoeffVar elsp-NUMA01_THREADLOCAL 0.68 ( 0.00%) 3.29 (-381.70%)
Max elsp-NUMA01_THREADLOCAL 177.18 ( 0.00%) 128.03 ( 27.74%)
The time to complete the workload is reduced by almost 30%:
6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1
Duration User 91201.80 63506.64
Duration System 2015.53 1819.78
Duration Elapsed 1234.77 868.37
In this specific case, system CPU time was not increased, but that is not
universally true.
From vmstat, the NUMA scanning and fault activity is as follows;
6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1
Ops NUMA base-page range updates 64272.00 26374386.00
Ops NUMA PTE updates 36624.00 55538.00
Ops NUMA PMD updates 54.00 51404.00
Ops NUMA hint faults 15504.00 75786.00
Ops NUMA hint local faults % 14860.00 56763.00
Ops NUMA hint local percent 95.85 74.90
Ops NUMA pages migrated 1629.00 6469222.00
Both the number of PTE updates and the number of hint faults are dramatically
increased. While this is superficially unfortunate, it represents
ranges that were simply skipped without the patch. As a result
of the scanning and hinting faults, many more pages were also
migrated but as the time to completion is reduced, the overhead
is offset by the gain.
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Raghavendra K T <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
NUMA Balancing skips VMAs when the current task has not trapped a NUMA
fault within the VMA. If the VMA is skipped then mm->numa_scan_offset
advances and a task that is trapping faults within the VMA may never
fully update PTEs within the VMA.
Force tasks to update PTEs for partially scanned VMAs. The VMA will
be tagged for NUMA hints by some task but this removes some of the
benefit of tracking PID activity within a VMA. A follow-on patch
will mitigate this problem.
The test cases and machines evaluated did not trigger the corner case so
the performance results are neutral with only small changes within the
noise from normal test-to-test variance. However, the next patch makes
the corner case easier to trigger.
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Raghavendra K T <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Fix the various warnings from kernel-doc in kernel/fork.c
Signed-off-by: "Matthew Wilcox (Oracle)" <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Tested-by: Randy Dunlap <[email protected]>
Signed-off-by: Jonathan Corbet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
SRCU callback acceleration might fail if the preceding callback
advance also fails. This can happen when the following steps are met:
1) The RCU_WAIT_TAIL segment has callbacks (say for gp_num 8) and the
RCU_NEXT_READY_TAIL also has callbacks (say for gp_num 12).
2) The grace period for RCU_WAIT_TAIL is observed as started but not yet
completed so rcu_seq_current() returns 4 + SRCU_STATE_SCAN1 = 5.
3) This value is passed to rcu_segcblist_advance() which can't move
any segment forward and fails.
4) srcu_gp_start_if_needed() still proceeds with callback acceleration.
But then the call to rcu_seq_snap() observes the grace period for the
RCU_WAIT_TAIL segment (gp_num 8) as completed and the subsequent one
for the RCU_NEXT_READY_TAIL segment as started
(ie: 8 + SRCU_STATE_SCAN1 = 9) so it returns a snapshot of the
next grace period, which is 16.
5) The value of 16 is passed to rcu_segcblist_accelerate() but the
freshly enqueued callback in RCU_NEXT_TAIL can't move to
RCU_NEXT_READY_TAIL which already has callbacks for a previous grace
period (gp_num = 12). So acceleration fails.
6) Note that in all these steps, srcu_invoke_callbacks() hadn't had a chance
to run.
A very bad outcome may then happen if the following occurs:
7) Some other CPU races and starts the grace period number 16 before the
CPU handling previous steps had a chance. Therefore srcu_gp_start()
isn't called on the latter sdp to fix the acceleration leak from
previous steps with a new pair of calls to advance/accelerate.
8) The grace period 16 completes and srcu_invoke_callbacks() is finally
called. All the callbacks from previous grace periods (8 and 12) are
correctly advanced and executed but callbacks in RCU_NEXT_READY_TAIL
still remain. Then rcu_segcblist_accelerate() is called with a
snapshot of 20.
9) Since nothing started the grace period number 20, callbacks stay
unhandled.
This has been reported in real load:
[3144162.608392] INFO: task kworker/136:12:252684 blocked for more
than 122 seconds.
[3144162.615986] Tainted: G O K 5.4.203-1-tlinux4-0011.1 #1
[3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[3144162.631162] kworker/136:12 D 0 252684 2 0x90004000
[3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm]
[3144162.631192] Call Trace:
[3144162.631202] __schedule+0x2ee/0x660
[3144162.631206] schedule+0x33/0xa0
[3144162.631209] schedule_timeout+0x1c4/0x340
[3144162.631214] ? update_load_avg+0x82/0x660
[3144162.631217] ? raw_spin_rq_lock_nested+0x1f/0x30
[3144162.631218] wait_for_completion+0x119/0x180
[3144162.631220] ? wake_up_q+0x80/0x80
[3144162.631224] __synchronize_srcu.part.19+0x81/0xb0
[3144162.631226] ? __bpf_trace_rcu_utilization+0x10/0x10
[3144162.631227] synchronize_srcu+0x5f/0xc0
[3144162.631236] irqfd_shutdown+0x3c/0xb0 [kvm]
[3144162.631239] ? __schedule+0x2f6/0x660
[3144162.631243] process_one_work+0x19a/0x3a0
[3144162.631244] worker_thread+0x37/0x3a0
[3144162.631247] kthread+0x117/0x140
[3144162.631247] ? process_one_work+0x3a0/0x3a0
[3144162.631248] ? __kthread_cancel_work+0x40/0x40
[3144162.631250] ret_from_fork+0x1f/0x30
Fix this by taking the snapshot for acceleration _before_ the read
of the current grace period number.
The only side effect of this solution is that callback advancing then happens
_after_ the full barrier in rcu_seq_snap(). This is not a problem
because that barrier only cares about:
1) Ordering accesses of the update side before call_srcu() so they don't
bleed.
2) Seeing all the accesses prior to the grace period of the current gp_num.
The only things callback advancing needs to be ordered against are
carried by snp locking.
Reported-by: Yong He <[email protected]>
Co-developed-by: Yong He <[email protected]>
Signed-off-by: Yong He <[email protected]>
Co-developed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Co-developed-by: Neeraj upadhyay <[email protected]>
Signed-off-by: Neeraj upadhyay <[email protected]>
Link: http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com
Fixes: da915ad5cf25 ("srcu: Parallelize callback handling")
Signed-off-by: Frederic Weisbecker <[email protected]>
|
|
The SMT control mechanism got added as speculation attack vector
mitigation. The implemented logic relies on the primary thread mask to
be set up properly.
This turns out to be an issue with XEN/PV guests because their CPU hotplug
mechanics do not enumerate APICs and therefore the mask is never correctly
populated.
This went unnoticed so far because by chance XEN/PV ends up with
smp_num_siblings == 2. So smt_hotplug_control stays at its default value
CPU_SMT_ENABLED and the primary thread mask is never evaluated in the
context of CPU hotplug.
This stopped "working" with the upcoming overhaul of the topology
evaluation which legitimately provides a fake topology for XEN/PV. That
sets smp_num_siblings to 1, which causes the CPU hotplug core to
refuse to bring up the APs.
This happens because smt_hotplug_control is set to CPU_SMT_NOT_SUPPORTED
which causes cpu_smt_allowed() to evaluate the unpopulated primary thread
mask with the conclusion that all non-boot CPUs are not valid to be
plugged.
Make cpu_smt_allowed() more robust and take CPU_SMT_NOT_SUPPORTED and
CPU_SMT_NOT_IMPLEMENTED into account. Rename it to cpu_bootable() while at
it as that makes it more clear what the function is about.
The primary mask issue on x86 XEN/PV needs to be addressed separately as
there are users outside of the CPU hotplug code too.
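A user-space model of the more robust check (the SMT states below mirror the
ones named above; topology_is_primary_thread() is a stand-in for the
architecture query, and the real cpu_bootable() in the hotplug core may differ
in detail):
#include <stdbool.h>
#include <stdio.h>

enum cpuhp_smt_control {
	CPU_SMT_ENABLED,
	CPU_SMT_DISABLED,
	CPU_SMT_FORCE_DISABLED,
	CPU_SMT_NOT_SUPPORTED,
	CPU_SMT_NOT_IMPLEMENTED,
};

static enum cpuhp_smt_control cpu_smt_control = CPU_SMT_NOT_SUPPORTED;

/* Stand-in for the architecture's primary-thread query. */
static bool topology_is_primary_thread(unsigned int cpu)
{
	return cpu == 0;
}

/*
 * Model of cpu_bootable(): when SMT control is enabled, not supported or
 * not implemented, every CPU is bootable and the (possibly unpopulated)
 * primary thread mask is never consulted. Only when SMT is disabled do
 * secondary sibling threads get rejected.
 */
static bool cpu_bootable(unsigned int cpu)
{
	if (cpu_smt_control == CPU_SMT_ENABLED ||
	    cpu_smt_control == CPU_SMT_NOT_SUPPORTED ||
	    cpu_smt_control == CPU_SMT_NOT_IMPLEMENTED)
		return true;

	return topology_is_primary_thread(cpu);
}

int main(void)
{
	/* With CPU_SMT_NOT_SUPPORTED (the XEN/PV case), AP bring-up is allowed. */
	printf("cpu 1 bootable: %d\n", cpu_bootable(1));
	return 0;
}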
Fixes: 05736e4ac13c ("cpu/hotplug: Provide knobs to control SMT")
Reported-by: Juergen Gross <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Juergen Gross <[email protected]>
Tested-by: Sohil Mehta <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Tested-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Zhang Rui <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Recent NUMA hinting faulting activity is reset approximately every
VMA_PID_RESET_PERIOD milliseconds. However, if the current task has not
accessed a VMA then the reset check is missed and the reset is potentially
deferred forever. Check if the PID activity information should be reset
before checking if the current task recently trapped a NUMA hinting fault.
[ [email protected]: Rewrite changelog ]
Suggested-by: Mel Gorman <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
NUMA balancing skips or scans VMAs for a variety of reasons. In preparation
for completing scans of VMAs regardless of PID access, trace the reasons
why a VMA was skipped. In a later patch, the tracing will be used to track
if a VMA was forcibly scanned.
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
::next_pid_reset => ::pids_active_reset
The access_pids[] field name is somewhat ambiguous as no PIDs are accessed.
Similarly, it's not clear that next_pid_reset is related to access_pids[].
Rename the fields to more accurately reflect their purpose.
[ mingo: Rename in the comments too. ]
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The verifier, as part of check_return_code(), verifies that async
callbacks, such as those from e.g. timers, will return 0. It does this by
correctly checking that R0->var_off is in tnum_const(0), which
effectively checks that it's in a range of 0. If this condition fails,
however, it prints an error message which says that the value should
have been in (0x0; 0x1). This results in possibly confusing output such
as the following in which an async callback returns 1:
At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x1)
The fix is easy -- we should just pass the tnum_const(0) as the correct
range to verbose_invalid_scalar(), which will then print the following:
At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x0)
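For reference, the "(value; mask)" pairs come from the verifier's tristate
number (tnum) representation; a minimal user-space model of tnum_const() and
of the two printed ranges (illustrative only, not the kernel's own code):
#include <stdint.h>
#include <stdio.h>

/* Tristate number: bits set in @mask are unknown, the rest come from @value. */
struct tnum {
	uint64_t value;
	uint64_t mask;
};

static struct tnum tnum_const(uint64_t v)
{
	return (struct tnum){ .value = v, .mask = 0 };
}

int main(void)
{
	struct tnum old_range = { .value = 0, .mask = 1 };	/* "(0x0; 0x1)": 0 or 1 */
	struct tnum new_range = tnum_const(0);			/* "(0x0; 0x0)": exactly 0 */

	printf("old: (0x%llx; 0x%llx)\n",
	       (unsigned long long)old_range.value, (unsigned long long)old_range.mask);
	printf("new: (0x%llx; 0x%llx)\n",
	       (unsigned long long)new_range.value, (unsigned long long)new_range.mask);
	return 0;
}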
Fixes: bfc6bb74e4f1 ("bpf: Implement verifier support for validation of async callbacks.")
Signed-off-by: David Vernet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Controller names from cgroup v1, v2, or both can be passed as arguments to
the 'cgroup_no_v1' kernel parameter, and most controller names are the same
in both cgroup versions. This can be confusing when both naming schemes are
used interchangeably, e.g. when passing cgroup_no_v1=io:
$ sudo dmesg |grep cgroup
...
cgroup: Disabling io control group subsystem in v1 mounts
cgroup: Disabled controller 'blkio'
Make it consistent across the pr_info()s by using ss->legacy_name as
the subsystem name when printing the cgroup v1 controller disabling
information in cgroup_init().
Signed-off-by: Kamalesh Babulal <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
One PID may appear multiple times in a preloaded pidlist.
(Possibly due to PID recycling but we have reports of the same
task_struct appearing with different PIDs, thus possibly involving
transfer of PID via de_thread().)
Because the v1 seq_file iterator uses PIDs as the position, it leads to
a message:
> seq_file: buggy .next function kernfs_seq_next did not update position index
A conservative and quick fix consists of removing duplicates from the `tasks`
file (as opposed to removing pidlists altogether). It doesn't affect
correctness (it's sufficient to show a PID once), and the performance impact
would be hidden by the unconditional sorting of the pidlist already in place
(asymptotically).
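Since the pidlist is already sorted, deduplication is a single linear pass; a
user-space sketch of the idea (the helper below is illustrative, not the exact
kernel code):
#include <stdio.h>
#include <sys/types.h>

/* Drop adjacent duplicates from a sorted PID array in place and return
 * the new length. */
static int pidlist_dedup(pid_t *list, int length)
{
	int src, dest = 1;

	if (length <= 1)
		return length;

	for (src = 1; src < length; src++) {
		if (list[src] != list[src - 1])
			list[dest++] = list[src];
	}
	return dest;
}

int main(void)
{
	pid_t pids[] = { 100, 100, 203, 203, 203, 512 };
	int i, n = pidlist_dedup(pids, 6);

	for (i = 0; i < n; i++)
		printf("%d\n", (int)pids[i]);	/* 100, 203, 512 */
	return 0;
}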
Link: https://lore.kernel.org/r/[email protected]/
Suggested-by: Firo Yang <[email protected]>
Signed-off-by: Michal Koutný <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Cc: [email protected]
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Drop break after return.
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Julia Lawall <[email protected]>
Acked-by: Masami Hiramatsu (Google) <[email protected]>
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
|
|
Move it out of the .c file into the shared scheduler-internal header file,
to gain type-checking.
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Shrikanth Hegde <[email protected]>
Cc: Valentin Schneider <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
on the platform
The 'sched_energy_aware' sysctl is available for the admin to disable/enable
energy aware scheduling (EAS). EAS is enabled only if a few conditions are
met by the platform: asymmetric CPU capacity, no SMT, the schedutil cpufreq
governor, frequency-invariant load tracking, etc.
A platform may boot without EAS capability but could gain such capability at
runtime, for example by changing/registering the cpufreq governor to
schedutil.
At present, even though the platform doesn't support EAS, this sysctl returns
1, and writing 1 ends up calling build_perf_domains() while writing 0 is a
NOP. That is confusing and unnecessary.
The desired behavior is for this sysctl to enable/disable EAS on supported
platforms. On unsupported platforms, a write to the sysctl returns a "not
supported" error and a read of the sysctl returns empty:
- sched_energy_aware returns empty - EAS is not possible at this moment.
  This includes EAS-capable platforms which have at least one EAS condition
  false during startup, e.g. not using the schedutil cpufreq governor.
- sched_energy_aware returns 0 - EAS is supported but disabled by the admin.
- sched_energy_aware returns 1 - EAS is supported and enabled.
Users can find out why EAS is not possible by checking the info messages.
sched_is_eas_possible() returns true if the platform can do EAS at this
moment.
Signed-off-by: Shrikanth Hegde <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Pierre Gondois <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
BPF supports creating high resolution timers using bpf_timer_* helper
functions. Currently, only the BPF_F_TIMER_ABS flag is supported, which
specifies that the timeout should be interpreted as absolute time. It
would also be useful to be able to pin that timer to a core. For
example, if you wanted to make a subset of cores run without timer
interrupts, and only have the timer be invoked on a single core.
This patch adds support for this with a new BPF_F_TIMER_CPU_PIN flag.
When specified, the HRTIMER_MODE_PINNED flag is passed to
hrtimer_start(). A subsequent patch will update selftests to validate.
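A minimal BPF-side usage sketch (selftest-style boilerplate assumed; the map
layout, attach point and timeout value are illustrative, and
BPF_F_TIMER_CPU_PIN requires new-enough UAPI headers):
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct elem {
	struct bpf_timer t;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct elem);
} timer_map SEC(".maps");

static int timer_cb(void *map, int *key, struct bpf_timer *timer)
{
	return 0;	/* async callbacks must return 0 */
}

SEC("fentry/bpf_fentry_test1")
int arm_pinned_timer(void *ctx)
{
	int key = 0;
	struct elem *val = bpf_map_lookup_elem(&timer_map, &key);

	if (val) {
		bpf_timer_init(&val->t, &timer_map, 1 /* CLOCK_MONOTONIC */);
		bpf_timer_set_callback(&val->t, timer_cb);
		/* Fire on the CPU that armed the timer. */
		bpf_timer_start(&val->t, 1000000 /* 1 ms */, BPF_F_TIMER_CPU_PIN);
	}
	return 0;
}

char _license[] SEC("license") = "GPL";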
Signed-off-by: David Vernet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Song Liu <[email protected]>
Acked-by: Hou Tao <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Per-package perf events are typically registered with a single CPU only;
however, they can be read across all the CPUs within the package.
Currently perf_event_read() maps the event CPU according to the topology
information to avoid an unnecessary SMP call, but
perf_event_read_local() deals with hard values and rejects a read with a
failure if the CPU is not exactly the one registered. Allow a similar
mapping within perf_event_read_local() if the perf event in question
supports it.
This allows users like BPF code to read the package perf events properly
across different CPUs within a package.
Signed-off-by: Tero Kristo <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Some userspace applications use timerfd_create() to request wakeups after
a long period of time. For example, a backup application may request a
wakeup once per week. This is perfectly fine as long as the system does
not try to suspend. However, if the system tries to suspend and the
system's RTC does not support the required alarm timeout, the suspend
operation will fail with an error such as
rtc_cmos 00:01: Alarms can be up to one day in the future
PM: dpm_run_callback(): platform_pm_suspend+0x0/0x4a returns -22
alarmtimer alarmtimer.4.auto: platform_pm_suspend+0x0/0x4a returned -22 after 117 usecs
PM: Device alarmtimer.4.auto failed to suspend: error -22
This results in a refusal to suspend the system, causing substantial
battery drain on affected systems.
To fix the problem, use the maximum alarm time offset as reported by RTC
drivers to set the maximum alarm time. While this may result in early
wakeups from suspend, it is still much better than not suspending at all.
Standardize system behavior if the requested alarm timeout is larger than
the alarm timeout supported by the rtc chip. Currently, in this situation,
the RTC driver will do one of the following:
- It may return an error.
- It may limit the alarm timeout to the maximum supported by the rtc chip.
- It may mask the timeout by the maximum alarm timeout supported by the RTC
chip (i.e. a requested timeout of 1 day + 1 minute may result in a 1
minute timeout).
With this in place, if the RTC driver reports the maximum alarm timeout
supported by the RTC chip, the system will always limit the alarm timeout
to the maximum supported by the RTC chip.
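The resulting behavior amounts to clamping the requested expiry; a small
user-space sketch (names hypothetical):
#include <stdint.h>
#include <stdio.h>

/*
 * Clamp a requested alarm expiry so that it never exceeds the maximum
 * offset the RTC hardware can program. Waking up early is acceptable;
 * failing to suspend is not.
 */
static int64_t clamp_alarm_expiry(int64_t now, int64_t requested_expiry,
				  int64_t rtc_max_offset)
{
	if (requested_expiry - now > rtc_max_offset)
		return now + rtc_max_offset;
	return requested_expiry;
}

int main(void)
{
	int64_t now = 0;
	int64_t one_day = 24LL * 60 * 60;
	int64_t one_week = 7 * one_day;

	/* A weekly wakeup on an RTC that only supports one-day alarms is
	 * programmed one day out instead of failing the suspend. */
	printf("%lld\n", (long long)clamp_alarm_expiry(now, one_week, one_day));
	return 0;
}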
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: John Stultz <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
update_triggers() always returns now + group->rtpoll_min_period, and the
return value is only used by psi_rtpoll_work(), so change update_triggers()
to a void function and set group->rtpoll_next_update = now +
group->rtpoll_min_period directly.
This avoids unnecessary function return value passing and simplifies
the function.
[ mingo: Updated changelog ]
Suggested-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Yang Yang <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The Energy Aware Scheduler (EAS) estimates the energy consumption
of placing a task on different CPUs. The goal is to minimize this
energy consumption. Estimating the energy of different task placements
is increasingly complex with the size of the platform.
To avoid having a slow wake-up path, EAS is only enabled if this
complexity is low enough.
The current complexity limit was set in:
b68a4c0dba3b1 ("sched/topology: Disable EAS on inappropriate platforms")
... based on the first implementation of EAS, which was re-computing
the power of the whole platform for each task placement scenario, see:
390031e4c309 ("sched/fair: Introduce an energy estimation helper function")
... but the complexity of EAS was reduced in:
eb92692b2544d ("sched/fair: Speed-up energy-aware wake-ups")
... and find_energy_efficient_cpu() (feec) algorithm was updated in:
3e8c6c9aac42 ("sched/fair: Remove task_util from effective utilization in feec()")
find_energy_efficient_cpu() (feec) is now doing:
feec()
\_ for_each_pd(pd) [0]
// get max_spare_cap_cpu and compute_prev_delta
\_ for_each_cpu(pd) [1]
\_ eenv_pd_busy_time(pd) [2]
\_ for_each_cpu(pd)
// compute_energy(pd) without the task
\_ eenv_pd_max_util(pd, -1) [3.0]
\_ for_each_cpu(pd)
\_ em_cpu_energy(pd, -1)
\_ for_each_ps(pd)
// compute_energy(pd) with the task on prev_cpu
\_ eenv_pd_max_util(pd, prev_cpu) [3.1]
\_ for_each_cpu(pd)
\_ em_cpu_energy(pd, prev_cpu)
\_ for_each_ps(pd)
// compute_energy(pd) with the task on max_spare_cap_cpu
\_ eenv_pd_max_util(pd, max_spare_cap_cpu) [3.2]
\_ for_each_cpu(pd)
\_ em_cpu_energy(pd, max_spare_cap_cpu)
\_ for_each_ps(pd)
[3.1] happens only once since prev_cpu is unique. With the same
definitions for nr_pd, nr_cpus and nr_ps, the complexity is:
nr_pd * (2 * [nr_cpus in pd] + 2 * ([nr_cpus in pd] + [nr_ps in pd]))
+ ([nr_cpus in pd] + [nr_ps in pd])
[0] * ( [1] + [2] + [3.0] + [3.2] )
+ [3.1]
= nr_pd * (4 * [nr_cpus in pd] + 2 * [nr_ps in pd])
+ [nr_cpus in prev pd] + nr_ps
The complexity limit was set to 2048 in:
b68a4c0dba3b1 ("sched/topology: Disable EAS on inappropriate platforms")
... to make "EAS usable up to 16 CPUs with per-CPU DVFS and less than 8
performance states each". For the same platform, the complexity would
actually be:
16 * (4 + 2 * 7) + 1 + 7 = 296
Since the EAS complexity was greatly reduced since the limit was
introduced, bigger platforms can handle EAS.
For instance, a platform with 112 CPUs with 7 performance states
each would not reach it:
112 * (4 + 2 * 7) + 1 + 7 = 2024
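The arithmetic above is easy to double-check (per-CPU DVFS assumed, i.e. one
CPU per performance domain, as in both examples):
#include <stdio.h>

/* Complexity of find_energy_efficient_cpu() as derived above:
 * nr_pd * (4 * nr_cpus_per_pd + 2 * nr_ps_per_pd)
 *       + nr_cpus_per_pd + nr_ps_per_pd
 */
static long feec_complexity(long nr_pd, long nr_cpus_per_pd, long nr_ps_per_pd)
{
	return nr_pd * (4 * nr_cpus_per_pd + 2 * nr_ps_per_pd)
	       + nr_cpus_per_pd + nr_ps_per_pd;
}

int main(void)
{
	printf("%ld\n", feec_complexity(16, 1, 7));	/* 296  */
	printf("%ld\n", feec_complexity(112, 1, 7));	/* 2024 */
	return 0;
}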
To reflect this improvement in the underlying EAS code, remove
the EAS complexity check.
Note that a limit on the number of CPUs still holds against
EM_MAX_NUM_CPUS to avoid overflows during the energy estimation.
[ mingo: Updates to the changelog. ]
Signed-off-by: Pierre Gondois <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Lukasz Luba <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Remove the rq::cpu_capacity_orig field and use arch_scale_cpu_capacity()
instead.
The scheduler uses 3 methods to get access to a CPU's max compute capacity:
- arch_scale_cpu_capacity(cpu) which is the default way to get a CPU's capacity.
- cpu_capacity_orig field which is periodically updated with
arch_scale_cpu_capacity().
- capacity_orig_of(cpu) which encapsulates rq->cpu_capacity_orig.
There is no real need to save the value returned by arch_scale_cpu_capacity()
in struct rq. arch_scale_cpu_capacity() returns:
- either a per_cpu variable.
- or a const value for systems which have only one capacity.
Remove rq::cpu_capacity_orig and use arch_scale_cpu_capacity() everywhere.
No functional changes.
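As a hedged before/after sketch of the conversion (the call sites vary and are
not reproduced exactly here):
unsigned long cap;

cap = capacity_orig_of(cpu);            /* before: cached copy in rq->cpu_capacity_orig */
cap = arch_scale_cpu_capacity(cpu);     /* after: read the source of truth directly     */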
Some performance tests on Arm64:
- small SMP device (hikey): no noticeable changes
- HMP device (RB5): hackbench shows minor improvement (1-2%)
- large smp (thx2): hackbench and tbench shows minor improvement (1%)
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
'int'
Doing this matches the natural type of 'int' based calculus
in sched_rt_handler(), and also enables adding a correct
upper-bounds check on the sysctl interface.
[ mingo: Rewrote the changelog. ]
Signed-off-by: Yajun Deng <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
find_new_ilb()
find_new_ilb() returns nr_cpu_ids on failure - which is the usual
cpumask bitops return pattern, but is weird & unnecessary in this
context: not only is it a global variable, it is a +1 out-of-bounds
CPU index and also has different signedness ...
Its only user, kick_ilb(), then checks the return against nr_cpu_ids
to decide whether to return early. There's no other use.
So instead of this, use a standard -1 return on failure to find an
idle CPU, as the argument is signed already.
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Joel Fernandes (Google) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Use 'ilb_cpu' consistently in both functions.
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Joel Fernandes (Google) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
- Fix incorrect/misleading comments,
- clarify some others,
- fix typos & grammar,
- and use more consistent style throughout.
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Joel Fernandes (Google) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Commit 9e70a5e109a4 ("printk: Add per-console suspended state")
removed console lock usage during resume and replaced it with
the clearly defined console_list_lock and srcu mechanisms.
However, the console lock usage had an important side-effect
of flushing the consoles. After its removal, consoles were no
longer flushed before checking their progress.
Add the console_lock/console_unlock dance to the beginning
of __pr_flush() to actually flush the consoles before checking
their progress. Also add comments to clarify this additional
usage of the console lock.
Note that console_unlock() does not guarantee flushing all messages
since the commit dbdda842fe96f89 ("printk: Add console owner and waiter
logic to load balance console writes").
Reported-by: Todd Brandt <[email protected]>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217955
Fixes: 9e70a5e109a4 ("printk: Add per-console suspended state")
Co-developed-by: Petr Mladek <[email protected]>
Signed-off-by: Petr Mladek <[email protected]>
Signed-off-by: John Ogness <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The old pick_eevdf() could fail to find the actual earliest eligible
deadline when it descended to the right looking for min_deadline, but
it turned out that that min_deadline wasn't actually eligible. In that
case we need to go back and search through any left branches we
skipped looking for the actual best _eligible_ min_deadline.
This is more expensive, but still O(log n), and at worst should only
involve descending two branches of the rbtree.
I've run this through a userspace stress test (thank you
tools/lib/rbtree.c), so hopefully this implementation doesn't miss any
corner cases.
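The property the fix must restore can be stated with a linear reference model
(user-space sketch only; the kernel does this in O(log n) via the augmented
rbtree):
#include <stddef.h>
#include <stdio.h>

struct entity {
	long long vruntime;	/* virtual runtime  */
	long long deadline;	/* virtual deadline */
};

/* Among eligible entities (vruntime <= avg_vruntime), pick the earliest
 * virtual deadline. The rbtree walk must return the same entity. */
static struct entity *pick_eevdf_ref(struct entity *se, size_t n,
				     long long avg_vruntime)
{
	struct entity *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (se[i].vruntime > avg_vruntime)
			continue;			/* not eligible */
		if (!best || se[i].deadline < best->deadline)
			best = &se[i];
	}
	return best;
}

int main(void)
{
	struct entity se[] = {
		{ .vruntime = -10, .deadline =  50 },
		{ .vruntime =   5, .deadline = -20 },	/* earliest deadline, but ineligible */
		{ .vruntime =  -5, .deadline =  30 },
	};
	struct entity *p = pick_eevdf_ref(se, 3, 0);

	printf("picked deadline %lld\n", p->deadline);	/* 30 */
	return 0;
}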
Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Ben Segall <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Marek and Biju reported instances of:
"EEVDF scheduling fail, picking leftmost"
which Mike correlated with cgroup scheduling and the min_deadline heap
getting corrupted; some trace output confirms:
> And yeah, min_deadline is hosed somehow:
>
> validate_cfs_rq: --- /
> __print_se: ffff88845cf48080 w: 1024 ve: -58857638 lag: 870381 vd: -55861854 vmd: -66302085 E (11372/tr)
> __print_se: ffff88810d165800 w: 25 ve: -80323686 lag: 22336429 vd: -41496434 vmd: -66302085 E (-1//autogroup-31)
> __print_se: ffff888108379000 w: 25 ve: 0 lag: -57987257 vd: 114632828 vmd: 114632828 N (-1//autogroup-33)
> validate_cfs_rq: min_deadline: -55861854 avg_vruntime: -62278313462 / 1074 = -57987256
Turns out that reweight_entity(), which tries really hard to be fast,
does not do the normal dequeue+update+enqueue pattern but *does* scale
the deadline.
However, it then fails to propagate the updated deadline value up the
heap.
Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Reported-by: Marek Szyprowski <[email protected]>
Reported-by: Biju Das <[email protected]>
Reported-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Marek Szyprowski <[email protected]>
Tested-by: Biju Das <[email protected]>
Tested-by: Mike Galbraith <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Currently, there is no overflow-check with memdup_user().
Use the new function memdup_array_user() instead of memdup_user() for
duplicating the user-space array safely.
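The replacement pattern looks roughly like this (hypothetical pointer and
count names, not the exact call site):
/* before: the size multiplication is done by the caller and can overflow */
objs = memdup_user(user_ptr, num * sizeof(*objs));

/* after: memdup_array_user() performs an overflow-checked num * size */
objs = memdup_array_user(user_ptr, num, sizeof(*objs));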
Suggested-by: David Airlie <[email protected]>
Signed-off-by: Philipp Stanner <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Zack Rusin <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Currently, there is no overflow-check with memdup_user().
Use the new function memdup_array_user() instead of memdup_user() for
duplicating the user-space array safely.
Suggested-by: David Airlie <[email protected]>
Signed-off-by: Philipp Stanner <[email protected]>
Acked-by: Baoquan He <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Zack Rusin <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc scheduler fixes from Ingo Molnar:
- Two EEVDF fixes: one to fix sysctl_sched_base_slice propagation, and
one to fix an avg_vruntime() corner-case.
- A cpufreq frequency scaling fix
* tag 'sched-urgent-2023-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpufreq: schedutil: Update next_freq when cpufreq_limits change
sched/eevdf: Fix avg_vruntime()
sched/eevdf: Also update slice on placement
|
|
Multiple blocked tasks are printed when the system hangs. They may have
the same parent pid, but belong to different task groups.
Printing the tgid lets users more easily tell whether these tasks are from
the same task group or not.
Signed-off-by: Yajun Deng <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The following commit:
9b3c4ab3045e ("sched,rcu: Rework try_invoke_on_locked_down_task()")
... renamed try_invoke_on_locked_down_task() to task_call_func(),
but forgot to update the comment in try_to_wake_up().
But it turns out that the smp_rmb() doesn't live in task_call_func()
either, it was moved to __task_needs_rq_lock() in:
91dabf33ae5d ("sched: Fix race in task_call_func()")
Fix that now.
Also fix the s/smb/smp typo while at it.
Reported-by: Zhang Qiao <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
the branch
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The recently added tcx attachment extended the BPF UAPI for attaching and
detaching by a couple of fields. Those fields are currently only supported
for tcx; other types like cgroups and flow dissector silently ignore the
new fields except for the new flags.
This is problematic once we extend bpf_mprog to older attachment types, since
it's hard to figure out whether the syscall really was successful if the
kernel silently ignores non-zero values.
Explicitly reject non-zero fields relevant to bpf_mprog for attachment types
which don't use the latter yet.
Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support")
Signed-off-by: Lorenz Bauer <[email protected]>
Co-developed-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
|
|
Improve consistency for bpf_mprog_query() API and let the latter also handle
a NULL entry as can be the case for tcx. Instead of returning -ENOENT, we
copy a count of 0 and revision of 1 to user space, so that this can be fed
into a subsequent bpf_mprog_attach() call as expected_revision. A BPF self-
test as part of this series has been added to assert this case.
Suggested-by: Lorenz Bauer <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
|
|
While working on the ebpf-go [0] library integration for bpf_mprog and tcx,
Lorenz noticed that two subsequent BPF_PROG_QUERY requests currently fail. A
typical workflow is to first gather the bpf_mprog count without passing program/
link arrays, followed by the second request which contains the actual array
pointers.
The initial call populates count and revision fields. The second call gets
rejected due to a BPF_PROG_QUERY_LAST_FIELD bug which should point to
query.revision instead of query.link_attach_flags since the former is really
the last member.
It was not noticed in libbpf as bpf_prog_query_opts() always calls bpf(2) with
an on-stack bpf_attr that is memset() each time (and therefore query.revision
was reset to zero).
[0] https://ebpf-go.dev
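The fix itself is the one-line change of the last-field marker implied above
(sketched from the description):
/* was: #define BPF_PROG_QUERY_LAST_FIELD query.link_attach_flags */
#define BPF_PROG_QUERY_LAST_FIELD query.revision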
Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: Lorenz Bauer <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Fix a recently introduced hibernation crash (Pavankumar Kondeti)"
* tag 'pm-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: hibernate: Fix copying the zero bitmap to safe pages
|
|
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle [1], add __counted_by for struct bpf_stack_map.
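For illustration, the annotation ties a flexible array's bound to a sibling
member (a generic example, not the exact bpf_stack_map layout, and it needs a
compiler/kernel with __counted_by support):
struct bucket;

struct example_map {
	unsigned int n_buckets;		/* number of valid entries */
	struct bucket *buckets[] __counted_by(n_buckets); /* accesses checked against n_buckets */
};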
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1]
Link: https://lore.kernel.org/bpf/[email protected]
|
|
This extends the current PR_SET_MDWE prctl arg with a bit to indicate that
the process doesn't want MDWE protection to propagate to children.
To implement this no-inherit mode, the tag in current->mm->flags must be
absent from MMF_INIT_MASK. This means that the encoding for "MDWE but
without inherit" is different in the prctl than in the mm flags. This
leads to a bit of bit-mangling in the prctl implementation.
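A user-space model of that bit-mangling (every constant and name below is a
hypothetical placeholder; the real prctl flags and mm flags are defined in the
uapi headers and mm code):
#include <stdio.h>

/* Hypothetical prctl-side bits. */
#define MDWE_REFUSE_EXEC_GAIN	0x1UL
#define MDWE_NO_INHERIT		0x2UL

/* Hypothetical mm-flag bits; the no-inherit tag must stay outside the mask
 * of flags copied to children on fork (MMF_INIT_MASK in the kernel). */
#define MM_HAS_MDWE		0x1UL
#define MM_HAS_MDWE_NO_INHERIT	0x2UL

/* The prctl encoding and the mm-flags encoding differ, so the prctl
 * handler has to translate between them. */
static unsigned long mdwe_prctl_to_mm_bits(unsigned long prctl_bits)
{
	unsigned long mm_bits = 0;

	if (prctl_bits & MDWE_REFUSE_EXEC_GAIN)
		mm_bits |= MM_HAS_MDWE;
	if (prctl_bits & MDWE_NO_INHERIT)
		mm_bits |= MM_HAS_MDWE_NO_INHERIT;

	return mm_bits;
}

int main(void)
{
	/* "Enable MDWE, but do not propagate it to children." */
	printf("mm bits: 0x%lx\n",
	       mdwe_prctl_to_mm_bits(MDWE_REFUSE_EXEC_GAIN | MDWE_NO_INHERIT));
	return 0;
}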
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Florent Revest <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Alexey Izbyshev <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Ayush Jain <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Joey Gouly <[email protected]>
Cc: KP Singh <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Szabolcs Nagy <[email protected]>
Cc: Topi Miettinen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The Energy Aware Scheduler (EAS) relies on the schedutil governor.
When moving to/from the schedutil governor, sched domains must be
rebuilt to allow re-evaluating the enablement conditions of EAS.
This is done through sched_cpufreq_governor_change().
Having a cpufreq governor assumes a cpufreq driver is running.
Inserting/removing a cpufreq driver should trigger a re-evaluation
of EAS's enablement conditions, so that EAS is not left enabled when
a running cpufreq driver is removed.
Rebuild the sched domains in schedutil's sugov_init()/sugov_exit(),
so that EAS's enablement conditions are checked whenever the schedutil
governor is initialized or exited from.
Move relevant code up in schedutil.c to avoid a split and conditional
function declaration.
Rename sched_cpufreq_governor_change() to sugov_eas_rebuild_sd().
Signed-off-by: Pierre Gondois <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
The initialization code of the per-cpu sg_cpu struct is currently split
into two for-loop blocks. This can be simplified by merging the two
blocks into a single loop. This will make the code more maintainable.
Signed-off-by: Liao Chang <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts (or adjacent changes of note).
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
When cpufreq's policy is 'single', there is a scenario that leaves
sg_policy's next_freq unable to update.
When the CPU's utilization is always at the maximum, cpufreq runs at the
maximum frequency. If we then change the policy's scaling_max_freq to a
lower frequency, sg_policy's next_freq needs to change to that lower
frequency; however, because the CPU is considered busy, next_freq keeps
the old maximum frequency.
For example:
The cpu7 is a single CPU:
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # while true;do done& [1] 4737
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # taskset -p 80 4737
pid 4737's current affinity mask: ff
pid 4737's new affinity mask: 80
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
2301000
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_cur_freq
2301000
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # echo 2171000 > scaling_max_freq
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
2171000
At this time, the sg_policy's next_freq would stay at 2301000, which
is wrong.
To fix this, add a check for the ->need_freq_update flag.
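A sketch of where the check lands (hedged; the helpers named here follow the
existing schedutil code, but the exact context in sugov_update_single_freq()
may differ):
/* Busy-CPU shortcut: keep the previous (higher) frequency, but only if
 * the policy limits have not changed in the meantime. */
if (sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq &&
    !sg_policy->need_freq_update)
	next_f = sg_policy->next_freq;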
[ mingo: Clarified the changelog. ]
Co-developed-by: Guohua Yan <[email protected]>
Signed-off-by: Xuewen Yan <[email protected]>
Signed-off-by: Guohua Yan <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: "Rafael J. Wysocki" <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from Bluetooth, netfilter, BPF and WiFi.
I didn't collect precise data but feels like we've got a lot of 6.5
fixes here. WiFi fixes are most user-awaited.
Current release - regressions:
- Bluetooth: fix hci_link_tx_to RCU lock usage
Current release - new code bugs:
- bpf: mprog: fix maximum program check on mprog attachment
- eth: ti: icssg-prueth: fix signedness bug in prueth_init_tx_chns()
Previous releases - regressions:
- ipv6: tcp: add a missing nf_reset_ct() in 3WHS handling
- vringh: don't use vringh_kiov_advance() in vringh_iov_xfer(), it
doesn't handle zero length like we expected
- wifi:
- cfg80211: fix cqm_config access race, fix crashes with brcmfmac
- iwlwifi: mvm: handle PS changes in vif_cfg_changed
- mac80211: fix mesh id corruption on 32 bit systems
- mt76: mt76x02: fix MT76x0 external LNA gain handling
- Bluetooth: fix handling of HCI_QUIRK_STRICT_DUPLICATE_FILTER
- l2tp: fix handling of transhdrlen in __ip{,6}_append_data()
- dsa: mv88e6xxx: avoid EEPROM timeout when EEPROM is absent
- eth: stmmac: fix the incorrect parameter after refactoring
Previous releases - always broken:
- net: replace calls to sock->ops->connect() with kernel_connect(),
prevent address rewrite in kernel_bind(); otherwise BPF hooks may
modify arguments, unexpectedly to the caller
- tcp: fix delayed ACKs when reads and writes align with MSS
- bpf:
- verifier: unconditionally reset backtrack_state masks on global
func exit
- s390: let arch_prepare_bpf_trampoline return program size, fix
struct_ops offsets
- sockmap: fix accounting of available bytes in presence of PEEKs
- sockmap: reject sk_msg egress redirects to non-TCP sockets
- ipv4/fib: send netlink notify when delete source address routes
- ethtool: plca: fix width of reads when parsing netlink commands
- netfilter: nft_payload: rebuild vlan header on h_proto access
- Bluetooth: hci_codec: fix leaking memory of local_codecs
- eth: intel: ice: always add legacy 32byte RXDID in supported_rxdids
- eth: stmmac:
- dwmac-stm32: fix resume on STM32 MCU
- remove buggy and unneeded stmmac_poll_controller, depend on NAPI
- ibmveth: always recompute TCP pseudo-header checksum, fix use of
the driver with Open vSwitch
- wifi:
- rtw88: rtw8723d: fix MAC address offset in EEPROM
- mt76: fix lock dependency problem for wed_lock
- mwifiex: sanity check data reported by the device
- iwlwifi: ensure ack flag is properly cleared
- iwlwifi: mvm: fix a memory corruption due to bad pointer arithm
- iwlwifi: mvm: fix incorrect usage of scan API
Misc:
- wifi: mac80211: work around Cisco AP 9115 VHT MPDU length"
* tag 'net-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
MAINTAINERS: update Matthieu's email address
mptcp: userspace pm allow creating id 0 subflow
mptcp: fix delegated action races
net: stmmac: remove unneeded stmmac_poll_controller
net: lan743x: also select PHYLIB
net: ethernet: mediatek: disable irq before schedule napi
net: mana: Fix oversized sge0 for GSO packets
net: mana: Fix the tso_bytes calculation
net: mana: Fix TX CQE error handling
netlink: annotate data-races around sk->sk_err
sctp: update hb timer immediately after users change hb_interval
sctp: update transport state when processing a dupcook packet
tcp: fix delayed ACKs for MSS boundary condition
tcp: fix quick-ack counting to count actual ACKs of new data
page_pool: fix documentation typos
tipc: fix a potential deadlock on &tx->lock
net: stmmac: dwmac-stm32: fix resume on STM32 MCU
ipv4: Set offload_failed flag in fibmatch results
netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure
netfilter: nf_tables: Deduplicate nft_register_obj audit logs
...
|
|
The system_callback() function in trace_events.c is only used within that
file. The "static" annotation was missed.
Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
The update to removing the eventfs_file changed the way the events top
level directory was handled. Instead of returning a dentry, it now returns
the eventfs_inode. With this change, removing the events top level
directory is not much different from removing any of the other
directories. Because of this, the removal just called eventfs_remove_dir()
instead of eventfs_remove_events_dir().
Although eventfs_remove_dir() does the clean up, it misses out on the
dget() of the ei->dentry done in eventfs_create_events_dir(). It makes
more sense to match eventfs_create_events_dir() with a specific function
eventfs_remove_events_dir() and this specific function can then perform
the dput() on the dentry that had the dget() when it was created.
Fixes: 5790b1fb3d67 ("eventfs: Remove eventfs_file and just use eventfs_inode")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
next_resource() is used only once, and next_resource_skip_children() has no
users outside of for_each_resource().
Unify them by adding a skip_children parameter to next_resource().
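The unified helper is essentially a pre-order walk of the resource tree with
an optional descent into children; roughly (a sketch, details may differ from
the final kernel code):
static struct resource *next_resource(struct resource *p, bool skip_children)
{
	if (!skip_children && p->child)
		return p->child;

	while (!p->sibling && p->parent)
		p = p->parent;

	return p->sibling;
}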
Signed-off-by: Andy Shevchenko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
We have a few places where for_each_resource() is open coded.
Replace them with the macro. This makes the code easier to read and
understand.
With this, compile r_next() only for CONFIG_PROC_FS=y.
Signed-off-by: Andy Shevchenko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|