|
Currently, BPF kfuncs which accept trusted pointer arguments, i.e.
those flagged as KF_TRUSTED_ARGS, KF_RCU, or KF_RELEASE, all require an
original/unmodified trusted pointer argument to be supplied to them. By
original/unmodified, we mean that the backing register holding the
trusted pointer argument supplied to the BPF kfunc must have its fixed
offset set to zero, or else the BPF verifier will outright reject the
BPF program load.
offset constraint that is currently enforced by the BPF verifier onto
BPF kfuncs specifically flagged to accept KF_TRUSTED_ARGS or KF_RCU
trusted pointer arguments is rather unnecessary, and can limit their
usability in practice. Specifically, it completely eliminates the
possibility of constructing a derived trusted pointer from an original
trusted pointer. To put it simply, a derived pointer is a pointer
which points to one of the nested member fields of the object being
pointed to by the original trusted pointer.
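As a purely illustrative sketch (the kfuncs and program hook below are
made up for this example and are not part of this series), the
distinction looks like this from a BPF program's point of view:
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    /* hypothetical KF_TRUSTED_ARGS kfuncs, for illustration only */
    extern void bpf_file_inspect(struct file *file) __ksym;
    extern void bpf_path_inspect(struct path *path) __ksym;

    SEC("lsm.s/file_open")
    int BPF_PROG(demo, struct file *file)
    {
        /* original trusted pointer: backing register has fixed offset 0 */
        bpf_file_inspect(file);

        /* derived trusted pointer: points at the nested f_path member,
         * so the backing register carries a non-zero fixed offset and
         * would previously have been rejected by the verifier */
        bpf_path_inspect(&file->f_path);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";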
This patch relaxes the zero fixed offset constraint that is enforced
upon BPF kfuncs which specifically accept KF_TRUSTED_ARGS, or KF_RCU
arguments. Although the zero fixed offset constraint technically also
applies to BPF kfuncs accepting KF_RELEASE arguments, relaxing this
constraint for such BPF kfuncs has subtle and unwanted
side-effects. This was discovered by experimenting a little further
with an initial version of this patch series [0]. The primary issue
with relaxing the zero fixed offset constraint on BPF kfuncs accepting
KF_RELEASE arguments is that it would open up the opportunity for
BPF programs to supply both trusted pointers and derived trusted
pointers to them. For KF_RELEASE BPF kfuncs specifically, this could
be problematic as resources associated with the backing pointer could
be released by the backing BPF kfunc and cause instabilities for the
rest of the kernel.
With this new fixed offset semantic in-place for BPF kfuncs accepting
KF_TRUSTED_ARGS and KF_RCU arguments, we now have more flexibility
when it comes to the BPF kfuncs that we're able to introduce moving
forward.
Early discussions covering the possibility of relaxing the zero fixed
offset constraint can be found using the link below. This will provide
more context on where all this has stemmed from [1].
Notably, pre-existing tests have been updated such that they provide
coverage for the updated zero fixed offset
functionality. Specifically, the nested offset test was converted from
a negative to a positive test as it was already designed to assert the
zero fixed offset semantics of a KF_TRUSTED_ARGS BPF kfunc.
[0] https://lore.kernel.org/bpf/[email protected]/
[1] https://lore.kernel.org/bpf/[email protected]/
Signed-off-by: Matt Bobrowski <[email protected]>
Acked-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
The kernel test robot reported that find_module() is not available
if CONFIG_MODULES=n.
Fix this error by hiding find_module() and the related RCU locking
inside #ifdef CONFIG_MODULES, as is done for try_module_get_by_name().
Link: https://lore.kernel.org/all/172056819167.201571.250053007194508038.stgit@devnote2/
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
|
|
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-07-08
The following pull-request contains BPF updates for your *net-next* tree.
We've added 102 non-merge commits during the last 28 day(s) which contain
a total of 127 files changed, 4606 insertions(+), 980 deletions(-).
The main changes are:
1) Support resilient split BTF which cuts down on duplication and makes BTF
as compact as possible wrt BTF from modules, from Alan Maguire & Eduard Zingerman.
2) Add support for dumping kfunc prototypes from BTF which enables both detecting
as well as dumping compilable prototypes for kfuncs, from Daniel Xu.
3) Batch of s390x BPF JIT improvements to add support for BPF arena and to implement
support for BPF exceptions, from Ilya Leoshkevich.
4) Batch of riscv64 BPF JIT improvements in particular to add 12-argument support
for BPF trampolines and to utilize bpf_prog_pack for the latter, from Pu Lehui.
5) Extend BPF test infrastructure to add a CHECKSUM_COMPLETE validation option
for skbs and add coverage along with it, from Vadim Fedorenko.
6) Inline bpf_get_current_task/_btf() helpers in the arm64 BPF JIT which gives
a small 1% performance improvement in micro-benchmarks, from Puranjay Mohan.
7) Extend the BPF verifier to track the delta between linked registers in order
to better deal with recent LLVM code optimizations, from Alexei Starovoitov.
8) Fix bpf_wq_set_callback_impl() kfunc signature where the third argument should
have been a pointer to the map value, from Benjamin Tissoires.
9) Extend BPF selftests to add regular expression support for test output matching
and adjust some of the selftest when compiled under gcc, from Cupertino Miranda.
10) Simplify task_file_seq_get_next() and remove an unnecessary loop which always
iterates exactly once anyway, from Dan Carpenter.
11) Add the capability to offload the netfilter flowtable in XDP layer through
kfuncs, from Florian Westphal & Lorenzo Bianconi.
12) Various cleanups in networking helpers in BPF selftests to shave off a few
lines of open-coded functions on client/server handling, from Geliang Tang.
13) Properly propagate prog->aux->tail_call_reachable out of BPF verifier, so
that x86 JIT does not need to implement detection, from Leon Hwang.
14) Fix BPF verifier to add a missing check_func_arg_reg_off() to prevent an
out-of-bounds memory access for dynpointers, from Matt Bobrowski.
15) Fix bpf_session_cookie() kfunc to return __u64 instead of long pointer as
it might lead to problems on 32-bit archs, from Jiri Olsa.
16) Enhance traffic validation and dynamic batch size support in xsk selftests,
from Tushar Vyavahare.
bpf-next-for-netdev
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (102 commits)
selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep
selftests/bpf: amend for wrong bpf_wq_set_callback_impl signature
bpf: helpers: fix bpf_wq_set_callback_impl signature
libbpf: Add NULL checks to bpf_object__{prev_map,next_map}
selftests/bpf: Remove exceptions tests from DENYLIST.s390x
s390/bpf: Implement exceptions
s390/bpf: Change seen_reg to a mask
bpf: Remove unnecessary loop in task_file_seq_get_next()
riscv, bpf: Optimize stack usage of trampoline
bpf, devmap: Add .map_alloc_check
selftests/bpf: Remove arena tests from DENYLIST.s390x
selftests/bpf: Add UAF tests for arena atomics
selftests/bpf: Introduce __arena_global
s390/bpf: Support arena atomics
s390/bpf: Enable arena
s390/bpf: Support address space cast instruction
s390/bpf: Support BPF_PROBE_MEM32
s390/bpf: Land on the next JITed instruction after exception
s390/bpf: Introduce pre- and post- probe functions
s390/bpf: Get rid of get_probe_mem_regno()
...
====================
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
|
|
perf_pending_irq() invokes perf_event_wakeup() and __perf_pending_irq().
The former is in charge of waking any tasks which wait to be woken up
while the latter disables perf-events.
perf_pending_irq() is an irq_work, but on PREEMPT_RT its callback is
invoked in thread context. This is needed because all the waking
functions (wake_up_all(), kill_fasync()) acquire sleeping locks which
must not be used with disabled interrupts.
Disabling events, as done by __perf_pending_irq(), expects a hardirq
context and disabled interrupts. This requirement is not fulfilled on
PREEMPT_RT.
Split functionality based on perf_event::pending_disable into irq_work
named `pending_disable_irq' and invoke it in hardirq context on
PREEMPT_RT. Rename the split out callback to perf_pending_disable().
Reported-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Marco Elver <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
perf_pending_task() is invoked in task context and disables preemption
because perf_swevent_get_recursion_context() used to access per-CPU
variables. The other reason is to create an RCU read section while
accessing the perf_event.
The recursion counter is no longer a per-CPU counter so disabling
preemption is no longer required. The RCU section is still needed and
must now be created explicitly.
Replace the preemption-disable section with an explicit RCU read
section.
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Marco Elver <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The swevent_htable::recursion counter is used to avoid creating a
swevent while an event is being processed, i.e. to avoid recursion. The
counter is per-CPU and preemption must be disabled to have a stable
counter.
perf_pending_task() disables preemption to access the counter and then
sends the signal. This is problematic on PREEMPT_RT because sending a
signal uses a spinlock_t which must not be acquired in atomic context on
PREEMPT_RT because it becomes a sleeping lock.
The atomic context can be avoided by moving the counter into the
task_struct. There is a 4 byte hole between futex_state (usually always
on) and the following perf pointer (perf_event_ctxp). After the
recursion counter lost some weight it fits perfectly.
Move swevent_htable::recursion into task_struct.
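In terms of layout, the idea is roughly the following (a sketch based on
the description above, not the exact task_struct definition):
    struct task_struct {
        /* ... */
    #ifdef CONFIG_FUTEX
        unsigned int                    futex_state;
    #endif
        /* the 4 byte hole mentioned above now holds the four
         * one-byte recursion counters, one per context */
    #ifdef CONFIG_PERF_EVENTS
        u8                              perf_recursion[PERF_NR_CONTEXTS];
        struct perf_event_context       *perf_event_ctxp;
        /* ... */
    #endif
        /* ... */
    };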
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Marco Elver <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
There are four recursion counters, one for each context. The type of
the counter is `int' but the counter is used as `bool' since it is only
incremented if zero.
The main goal here is to shrink the whole struct to a 32-bit value which
can later be added to task_struct in an existing hole.
Reduce the type of the recursion counter to an unsigned char, keeping
the increment/decrement operations.
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Marco Elver <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
A signal is delivered by raising an irq_work, which works from any
context including NMI. The irq_work can be delayed if the architecture
does not provide an interrupt vector. In order not to lose a signal, the
signal is injected via task_work during event_sched_out().
Instead of going via irq_work, the signal could be added directly via
task_work. The signal is sent to current and can be enqueued on its
return path to userland.
Queue the signal via task_work and consider possible NMI context. Remove
perf_event::pending_sigtrap and use perf_event::pending_work
instead.
Reported-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Marco Elver <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Adding task_work from NMI context requires the following:
- The kasan_record_aux_stack() is not NMI safe and must be avoided.
- Using TWA_RESUME is NMI safe. If the NMI occurs while the CPU is in
userland then it will continue in userland and not invoke the `work'
callback.
Add TWA_NMI_CURRENT as an additional notify mode. In this mode, skip
kasan and use an irq_work in hardirq mode to raise the needed interrupt.
Set TIF_NOTIFY_RESUME from within the irq_work callback because the
k[ac]san instrumentation in test_and_set_bit() does not look NMI safe in
case of a report.
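As a rough usage sketch (the surrounding perf details are assumed for
illustration; only task_work_add() and TWA_NMI_CURRENT are taken from
this change):
    /* from NMI context, e.g. a PMU overflow handler */
    if (!event->pending_work) {
        event->pending_work = pending_id;
        /* TWA_NMI_CURRENT queues the work on current: it skips
         * kasan_record_aux_stack() and defers setting
         * TIF_NOTIFY_RESUME to an irq_work */
        task_work_add(current, &event->pending_task, TWA_NMI_CURRENT);
    }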
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Only if perf_event::pending_sigtrap is zero is the irq_work accounted
for, by incrementing perf_event::nr_pending. The member
perf_event::pending_addr might be overwritten by a subsequent event if
the signal was not yet delivered and is expected. The irq_work will not
be enqueued again because it has a check to be only enqueued once.
Move irq_work_queue() to where the counter is incremented and
perf_event::pending_sigtrap is set to make it more obvious that the
irq_work is scheduled once.
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Marco Elver <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The perf pending task work is never waited for upon the matching event
release. In the case of a child event, released via free_event()
directly, this can potentially result in a leaked event, such as in the
following scenario that doesn't even require a weak IRQ work
implementation to trigger:
schedule()
prepare_task_switch()
=======> <NMI>
perf_event_overflow()
event->pending_sigtrap = ...
irq_work_queue(&event->pending_irq)
<======= </NMI>
perf_event_task_sched_out()
event_sched_out()
event->pending_sigtrap = 0;
atomic_long_inc_not_zero(&event->refcount)
task_work_add(&event->pending_task)
finish_lock_switch()
=======> <IRQ>
perf_pending_irq()
//do nothing, rely on pending task work
<======= </IRQ>
begin_new_exec()
perf_event_exit_task()
perf_event_exit_event()
// If is child event
free_event()
WARN(atomic_long_cmpxchg(&event->refcount, 1, 0) != 1)
// event is leaked
Similar scenarios can also happen with perf_event_remove_on_exec() or
simply against concurrent perf_event_release().
Fix this by synchronizing against the possibly remaining pending task
work while freeing the event, just like is done with the remaining
pending IRQ work. This means that the pending task callback neither
needs nor should hold a reference to the event, preventing it from ever
being freed.
Fixes: 517e6a301f34 ("perf: Fix perf_pending_task() UaF")
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
|
|
When a task is scheduled out, pending sigtrap deliveries are deferred
to the target task upon resume to userspace via task_work.
However, failures while adding an event's callback to the task_work
engine are ignored. And since the last call for events exit happens
after task work is eventually closed, there is a small window during
which a pending sigtrap can be queued though ignored, leaking the event
refcount addition such as in the following scenario:
TASK A
-----
do_exit()
exit_task_work(tsk);
<IRQ>
perf_event_overflow()
event->pending_sigtrap = pending_id;
irq_work_queue(&event->pending_irq);
</IRQ>
=========> PREEMPTION: TASK A -> TASK B
event_sched_out()
event->pending_sigtrap = 0;
atomic_long_inc_not_zero(&event->refcount)
// FAILS: task work has exited
task_work_add(&event->pending_task)
[...]
<IRQ WORK>
perf_pending_irq()
// early return: event->oncpu = -1
</IRQ WORK>
[...]
=========> TASK B -> TASK A
perf_event_exit_task(tsk)
perf_event_exit_event()
free_event()
WARN(atomic_long_cmpxchg(&event->refcount, 1, 0) != 1)
// leak event due to unexpected refcount == 2
As a result the event is never released while the task exits.
Fix this with appropriate task_work_add()'s error handling.
Fixes: 517e6a301f34 ("perf: Fix perf_pending_task() UaF")
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
|
|
Re-introduce task_work_cancel(), this time to cancel an actual callback
and not *any* callback pointing to a given function. This is going to be
needed for perf events event freeing.
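For context, the resulting split can be sketched as follows (prototypes
as assumed from this series, not authoritative):
    /* cancels *any* queued work whose callback matches @func
     * (the previously existing behaviour, now under a new name) */
    struct callback_head *task_work_cancel_func(struct task_struct *task,
                                                task_work_func_t func);

    /* cancels one specific, already-known callback_head instance */
    bool task_work_cancel(struct task_struct *task,
                          struct callback_head *cb);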
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
|
|
A proper task_work_cancel() API that actually cancels a callback and not
*any* callback pointing to a given function is going to be needed for
perf events event freeing. Do the appropriate rename to prepare for
that.
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
|
|
Apparently, despite it being marked inline, the compiler may not inline
__down_write_common(), which makes it difficult to identify the cause of
lock contention, as the wchan of the blocked task will always be listed
as __down_write_common().
So add __always_inline annotation to the common function (as
well as the inlined helper callers) to force it to be inlined
so a more useful blocking function will be listed (via wchan).
This mirrors commit 92cc5d00a431 ("locking/rwsem: Add
__always_inline annotation to __down_read_common() and inlined
callers") which did the same for __down_read_common.
I sort of worry that I'm playing whack-a-mole here, and talking
with compiler people, they tell me inline means nothing, which
makes me want to cry a little. So I'm wondering if we need to
replace all the inlines with __always_inline, or remove them
because either we mean something by it, or not.
Fixes: c995e638ccbb ("locking/rwsem: Fold __down_{read,write}*()")
Reported-by: Tim Murray <[email protected]>
Signed-off-by: John Stultz <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Waiman Long <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The test thread will start N benchmark kthreads and then schedule out
until the test time has elapsed, at which point it notifies the
benchmark kthreads to stop.
The benchmark kthreads will keep running until notified to stop.
There's a problem with the current implementation when the number of
benchmark kthreads is equal to the number of CPUs on a non-preemptible
kernel: the scheduler will balance the kthreads across the CPUs, and
when the test time is up the test thread won't get a chance to be
scheduled on any CPU and thus cannot notify the benchmark kthreads to
stop.
This can be easily reproduced on a VM (simulated with 16 CPUs) with
PREEMPT_VOLUNTARY:
estuary:/mnt$ ./dma_map_benchmark -t 16 -s 1
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 10-...!: (5221 ticks this GP) idle=ed24/1/0x4000000000000000 softirq=142/142 fqs=0
rcu: (t=5254 jiffies g=-559 q=45 ncpus=16)
rcu: rcu_sched kthread starved for 5255 jiffies! g-559 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=12
rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_sched state:R running task stack:0 pid:16 tgid:16 ppid:2 flags:0x00000008
Call trace
__switch_to+0xec/0x138
__schedule+0x2f8/0x1080
schedule+0x30/0x130
schedule_timeout+0xa0/0x188
rcu_gp_fqs_loop+0x128/0x528
rcu_gp_kthread+0x1c8/0x208
kthread+0xec/0xf8
ret_from_fork+0x10/0x20
Sending NMI from CPU 10 to CPUs 0:
NMI backtrace for cpu 0
CPU: 0 PID: 332 Comm: dma-map-benchma Not tainted 6.10.0-rc1-vanilla-LSE #8
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : arm_smmu_cmdq_issue_cmdlist+0x218/0x730
lr : arm_smmu_cmdq_issue_cmdlist+0x488/0x730
sp : ffff80008748b630
x29: ffff80008748b630 x28: 0000000000000000 x27: ffff80008748b780
x26: 0000000000000000 x25: 000000000000bc70 x24: 000000000001bc70
x23: ffff0000c12af080 x22: 0000000000010000 x21: 000000000000ffff
x20: ffff80008748b700 x19: ffff0000c12af0c0 x18: 0000000000010000
x17: 0000000000000001 x16: 0000000000000040 x15: ffffffffffffffff
x14: 0001ffffffffffff x13: 000000000000ffff x12: 00000000000002f1
x11: 000000000001ffff x10: 0000000000000031 x9 : ffff800080b6b0b8
x8 : ffff0000c2a48000 x7 : 000000000001bc71 x6 : 0001800000000000
x5 : 00000000000002f1 x4 : 01ffffffffffffff x3 : 000000000009aaf1
x2 : 0000000000000018 x1 : 000000000000000f x0 : ffff0000c12af18c
Call trace:
arm_smmu_cmdq_issue_cmdlist+0x218/0x730
__arm_smmu_tlb_inv_range+0xe0/0x1a8
arm_smmu_iotlb_sync+0xc0/0x128
__iommu_dma_unmap+0x248/0x320
iommu_dma_unmap_page+0x5c/0xe8
dma_unmap_page_attrs+0x38/0x1d0
map_benchmark_thread+0x118/0x2c0
kthread+0xec/0xf8
ret_from_fork+0x10/0x20
Solve this by adding a scheduling point in the kthread loop, so that
if there are other threads in the system they may have a chance to
run, especially the thread that notifies the test end. However, this
may degrade the test concurrency so it's recommended to run this on
an idle system.
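A minimal sketch of the idea (not the actual driver code; the loop body
is elided):
    while (!kthread_should_stop()) {
        /* one dma_map_page_attrs()/dma_unmap_page_attrs() round,
         * timings recorded as before */

        /* scheduling point: on a non-preemptible kernel this lets
         * other threads, in particular the test thread that signals
         * the end of the test, get CPU time */
        cond_resched();
    }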
Signed-off-by: Yicong Yang <[email protected]>
Acked-by: Barry Song <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
I realized this while having a map containing both a struct bpf_timer and
a struct bpf_wq: the third argument provided to the bpf_wq callback is
not the struct bpf_wq pointer itself, but the pointer to the value in
the map.
Which means that the users need to double cast the provided "value" as
this is not a struct bpf_wq *.
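In BPF program terms, the callback therefore looks roughly like this
(the map value type is hypothetical):
    /* hypothetical map value embedding the workqueue */
    struct elem {
        struct bpf_wq wq;
        int payload;
    };

    static int wq_cb(void *map, int *key, void *value)
    {
        /* the third argument is the map value, not a struct bpf_wq * */
        struct elem *e = value;

        return e->payload;
    }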
This is a change of API, but there don't seem to be many users of
bpf_wq right now, so we should be able to go with this change.
Fixes: 81f1d7a583fa ("bpf: wq: add bpf_wq_set_callback_impl")
Signed-off-by: Benjamin Tissoires <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
After commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU") this
loop always iterates exactly one time. Delete the for statement and pull
the code in a tab.
Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Currently, kprobe event checks whether the target symbol name is unique
or not, so that it does not put a probe in an unexpected place. But this
check is skipped if the target is in a module because the module may not
be loaded yet.
To fix this issue, this patch checks the number of probe target symbols
in a target module when the module is loaded. If the probe is not on a
uniquely named symbol in the module, it will be rejected at that point.
Note that a symbol which has a unique name in the target module will be
accepted even if there are same-name symbols in the kernel or in other
modules.
Link: https://lore.kernel.org/all/172016348553.99543.2834679315611882137.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
|
|
Make it consistent with apply_wqattrs_commit().
Link: https://lore.kernel.org/lkml/[email protected]/
Cc: Juri Lelli <[email protected]>
Cc: Waiman Long <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
The PWQ allocation and WQ enlistment are not within the same lock-held
critical section; therefore, their states can become out of sync when
the user modifies the unbound mask or if CPU hotplug events occur in
the interim since those operations only update the WQs that are already
in the list.
Make the PWQ allocation and WQ enlistment atomic.
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
kthread_flush_worker() can't be called with wq_pool_mutex held.
Prepare for moving wq_pool_mutex and cpu hotplug lock out of
alloc_and_link_pwqs().
Cc: Zqiang <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
For early wq allocation, rescuer initialization is the last step of the
creation of a new wq. Make the behavior the same for all allocations.
Prepare for initializing rescuer's affinities with the default pwq's
affinities.
Prepare for moving the whole workqueue initializing procedure into
wq_pool_mutex and cpu hotplug locks.
Cc: Juri Lelli <[email protected]>
Cc: Waiman Long <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
workqueue creation includes adding it to the workqueue list.
Prepare for moving the whole workqueue initializing procedure into
wq_pool_mutex and cpu hotplug locks.
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
In the function cgroup_base_stat_cputime_show, there are five
instances of #ifdef, which make the code hard to read.
To address this, add a helper function, cgroup_force_idle_show,
to make the code more succinct.
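The shape of the helper is roughly the following (a sketch; field names
and output formatting are assumed, not taken from the patch):
    static void cgroup_force_idle_show(struct seq_file *seq,
                                       struct cgroup_base_stat *bstat)
    {
    #ifdef CONFIG_SCHED_CORE
        u64 forceidle_time = bstat->forceidle_sum;

        do_div(forceidle_time, NSEC_PER_USEC);
        seq_printf(seq, " core_sched.force_idle_usec %llu\n", forceidle_time);
    #endif
    }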
Signed-off-by: Chen Ridong <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
Nowadays PF_KTHREAD is sticky and it was never protected by ->alloc_lock.
Move the PF_KTHREAD check outside of task_lock() section to make this code
more understandable.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
mm_update_next_owner() checks the children / real_parent->children to
avoid the "everything else" loop in the likely case, but this won't work
if a child/sibling has a zombie leader with ->mm == NULL.
Move the for_each_thread() logic into try_to_set_owner(), if nothing else
this makes the children/siblings/everything searches more consistent.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jinliang Zheng <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Tycho Andersen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add the new helper, try_to_set_owner(), which tries to update mm->owner
once we see c->mm == mm. This way mm_update_next_owner() doesn't need to
restart the list_for_each_entry/for_each_process loops from the very
beginning if it races with exit/exec; it can just continue.
Unlike the current code, try_to_set_owner() re-checks tsk->mm == mm before
it drops tasklist_lock, so it doesn't need get/put_task_struct().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jinliang Zheng <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Tycho Andersen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/phy/aquantia/aquantia.h
219343755eae ("net: phy: aquantia: add missing include guards")
61578f679378 ("net: phy: aquantia: add support for PHY LEDs")
drivers/net/ethernet/wangxun/libwx/wx_hw.c
bd07a9817846 ("net: txgbe: remove separate irq request for MSI and INTx")
b501d261a5b3 ("net: txgbe: add FDIR ATR support")
https://lore.kernel.org/all/[email protected]/
include/linux/mlx5/mlx5_ifc.h
048a403648fc ("net/mlx5: IFC updates for changing max EQs")
99be56171fa9 ("net/mlx5e: SHAMPO, Re-enable HW-GRO")
https://lore.kernel.org/all/[email protected]/
Adjacent changes:
drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c
4130c67cd123 ("wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference")
3f3126515fbe ("wifi: iwlwifi: mvm: add mvm-specific guard")
include/net/mac80211.h
816c6bec09ed ("wifi: mac80211: fix BSS_CHANGED_UNSOL_BCAST_PROBE_RESP")
5a009b42e041 ("wifi: mac80211: track changes in AP's TPE")
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
doc.2024.06.06a: Documentation updates.
fixes.2024.07.04a: Miscellaneous fixes.
mb.2024.06.28a: Grace-period memory-barrier redundancy removal.
nocb.2024.06.03a: No-CB CPU updates.
rcu-tasks.2024.06.06a: RCU-Tasks updates.
rcutorture.2024.06.06a: Torture-test updates.
srcu.2024.06.18a: SRCU polled-grace-period updates.
|
|
When rcu_barrier() calls rcu_rdp_cpu_online() and observes a CPU off
rnp->qsmaskinitnext, it means that all accesses from the offline CPU
preceding the CPUHP_TEARDOWN_CPU are visible to RCU barrier, including
callbacks expiration and counter updates.
However interrupts can still fire after stop_machine() re-enables
interrupts and before rcutree_report_cpu_dead(). The related accesses
happening between CPUHP_TEARDOWN_CPU and rnp->qsmaskinitnext clearing
are _NOT_ guaranteed to be seen by rcu_barrier() without proper
ordering, especially when callbacks are invoked there to completion,
making rcutree_migrate_callbacks() bypass the barrier_lock.
The following theoretical race example can make rcu_barrier() hang:
CPU 0 CPU 1
----- -----
//cpu_down()
smpboot_park_threads()
//ksoftirqd is parked now
<IRQ>
rcu_sched_clock_irq()
invoke_rcu_core()
do_softirq()
rcu_core()
rcu_do_batch()
// callback storm
// rcu_do_batch() returns
// before completing all
// of them
// do_softirq also returns early because of
// timeout. It defers to ksoftirqd but
// it's parked
</IRQ>
stop_machine()
take_cpu_down()
rcu_barrier()
spin_lock(barrier_lock)
// observes rcu_segcblist_n_cbs(&rdp->cblist) != 0
<IRQ>
do_softirq()
rcu_core()
rcu_do_batch()
//completes all pending callbacks
//smp_mb() implied _after_ callback number dec
</IRQ>
rcutree_report_cpu_dead()
rnp->qsmaskinitnext &= ~rdp->grpmask;
rcutree_migrate_callback()
// no callback, early return without locking
// barrier_lock
//observes !rcu_rdp_cpu_online(rdp)
rcu_barrier_entrain()
rcu_segcblist_entrain()
// Observe rcu_segcblist_n_cbs(rsclp) == 0
// because no barrier between reading
// rnp->qsmaskinitnext and rsclp->len
rcu_segcblist_add_len()
smp_mb__before_atomic()
// will now observe the 0 count and empty
// list, but too late, we enqueue regardless
WRITE_ONCE(rsclp->len, rsclp->len + v);
// ignored barrier callback
// rcu barrier stall...
This could be solved with a read memory barrier, enforcing the message
passing between rnp->qsmaskinitnext and rsclp->len, matching the full
memory barrier after rsclp->len addition in rcu_segcblist_add_len()
performed at the end of rcu_do_batch().
However, rcu_barrier() is complicated enough and probably doesn't
need too many more subtleties. CPU down is a slowpath and the
barrier_lock is seldom contended. Solve the issue by unconditionally
taking the barrier_lock in rcutree_migrate_callbacks(). This makes sure
that either rcu_barrier() sees the empty queue or its entrained
callback will be migrated.
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
The rcu_sync structure's ->gp_count field is always accessed under the
protection of that same structure's ->rss_lock field, with the exception
of a pair of WARN_ON_ONCE() calls just prior to acquiring that lock in
functions rcu_sync_exit() and rcu_sync_dtor(). These lockless accesses
are unnecessary and impair KCSAN's ability to catch bugs that might be
inserted via other lockless accesses.
This commit therefore moves those WARN_ON_ONCE() calls under the lock.
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
If a CPU is running either a userspace application or a guest OS in
nohz_full mode, it is possible for a system call to occur just as an
RCU grace period is starting. If that CPU also has the scheduling-clock
tick enabled for any reason (such as a second runnable task), and if the
system was booted with rcutree.use_softirq=0, then RCU can add insult to
injury by awakening that CPU's rcuc kthread, resulting in yet another
task and yet more OS jitter due to switching to that task, running it,
and switching back.
In addition, in the common case where that system call is not of
excessively long duration, awakening the rcuc task is pointless.
This pointlessness is due to the fact that the CPU will enter an extended
quiescent state upon returning to the userspace application or guest OS.
In this case, the rcuc kthread cannot do anything that the main RCU
grace-period kthread cannot do on its behalf, at least if it is given
a few additional milliseconds (for example, given the time duration
specified by rcutree.jiffies_till_first_fqs, give or take scheduling
delays).
This commit therefore adds a rcutree.nohz_full_patience_delay kernel
boot parameter that specifies the grace period age (in milliseconds,
rounded to jiffies) before which RCU will refrain from awakening the
rcuc kthread. Preliminary experimentation suggests a value of 1000,
that is, one second. Increasing rcutree.nohz_full_patience_delay will
increase grace-period latency and in turn increase memory footprint,
so systems with constrained memory might choose a smaller value.
Systems with less-aggressive OS-jitter requirements might choose the
default value of zero, which keeps the traditional immediate-wakeup
behavior, thus avoiding increases in grace-period latency.
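For example, a kernel command line fragment choosing the suggested
one-second patience might look like this (the CPU list is illustrative):
    nohz_full=1-7 rcutree.use_softirq=0 rcutree.nohz_full_patience_delay=1000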
[ paulmck: Apply Leonardo Bras feedback. ]
Link: https://lore.kernel.org/all/[email protected]/
Reported-by: Leonardo Bras <[email protected]>
Suggested-by: Leonardo Bras <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Reviewed-by: Leonardo Bras <[email protected]>
|
|
The Lunarlake patches rely on the new VFM stuff.
Signed-off-by: Peter Zijlstra <[email protected]>
|
|
rb_alloc_aux() should not be called with nr_pages <= 0. Make it more robust
and readable by returning an error immediately in that case.
Signed-off-by: Adrian Hunter <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The default aux_watermark is half the AUX area buffer size. In general,
on a 64-bit architecture, the AUX area buffer size could be bigger than
fits in a 32-bit type, but the calculation does not allow for that
possibility.
However the aux_watermark value is recorded in a u32, so should not be
more than U32_MAX either.
Fix by doing the calculation in a correctly sized type, and limiting the
result to U32_MAX.
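The shape of the fix is roughly the following (a sketch, not the exact
diff):
    /* default watermark is half the AUX buffer; compute in a 64-bit
     * capable type and clamp to what the u32 aux_watermark can hold */
    if (!watermark)
        watermark = min_t(unsigned long, U32_MAX,
                          (unsigned long)nr_pages << (PAGE_SHIFT - 1));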
Fixes: d68e6799a5c8 ("perf: Cap allocation order at aux_watermark")
Signed-off-by: Adrian Hunter <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
nr_pages is unsigned long but gets passed to rb_alloc_aux() as an int,
and is stored as an int.
Only power-of-2 values are accepted, so if nr_pages is a 64-bit value,
it will be passed to rb_alloc_aux() as zero.
That is not ideal because:
1. the value is incorrect
2. rb_alloc_aux() is at risk of misbehaving: although it manages to
return -ENOMEM in that case, that is only a side effect of passing zero
to get_order(), whose result is documented to be undefined in that
case.
Fix by simply validating the maximum supported value in the first place.
Use -ENOMEM error code for consistency with the current error code that
is returned in that case.
Fixes: 45bfb2e50471 ("perf: Add AUX area to ring buffer for raw data streams")
Signed-off-by: Adrian Hunter <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
perf_buffer->aux_nr_pages uses a 32-bit type, so a cast is needed to
calculate a 64-bit size.
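That is, conceptually (sketch only):
    /* without the cast the shift is performed in 32 bits and can
     * overflow for large AUX buffers */
    unsigned long size = (unsigned long)rb->aux_nr_pages << PAGE_SHIFT;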
Fixes: 45bfb2e50471 ("perf: Add AUX area to ring buffer for raw data streams")
Signed-off-by: Adrian Hunter <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When a task's weight is being changed, set_load_weight() is called with
@update_load set. As weight changes aren't trivial for the fair class,
set_load_weight() calls fair.c::reweight_task() for fair class tasks.
However, set_load_weight() first tests task_has_idle_policy() on entry and
skips calling reweight_task() for SCHED_IDLE tasks. This is buggy as
SCHED_IDLE tasks are just fair tasks with a very low weight and they would
incorrectly skip load, vlag and position updates.
Fix it by updating reweight_task() to take struct load_weight, as the
idle weight can't be expressed with prio, and by making set_load_weight()
call reweight_task() for SCHED_IDLE tasks too when @update_load is set.
Fixes: 9059393e4ec1 ("sched/fair: Use reweight_entity() for set_user_nice()")
Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected] # v4.15+
Link: http://lkml.kernel.org/r/[email protected]
|
|
The current code loops over the psi_states only to call a helper which
then resolves back to the action needed for each state using a switch
statement. That is effectively creating a double indirection of a kind
which, given how all the states need to be explicitly listed and handled
anyway, we can simply remove: both the for loop and the switch
statement.
The benefit is both in the code size and CPU time spent in this function.
YMMV but on my Steam Deck, while in a game, the patch makes the CPU usage
go from ~2.4% down to ~1.2%. Text size at the same time went from 0x323 to
0x2c1.
Signed-off-by: Tvrtko Ursulin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Let's add match_devname_and_update_preferred_console() for driver
subsystems to call during init when the console is ready and its
character device name is known. For now, we use it only for the serial
layer to allow console=DEVNAME:0.0 style hardware based addressing for
consoles.
The earlier attempt on doing this caused a regression with the kernel
command line console order as it added calling __add_preferred_console()
again later on during init. A better approach was suggested by Petr where
we add the deferred console to the console_cmdline[] and update it later
on when the console is ready.
Suggested-by: Petr Mladek <[email protected]>
Co-developed-by: Petr Mladek <[email protected]>
Signed-off-by: Tony Lindgren <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Currently users of the interrupt simulator don't have any way of being
notified about interrupts from the simulated domain being requested or
released. This causes a problem for one of the users - the GPIO
simulator - which is unable to lock the pins as interrupts.
Define a structure containing callbacks to be executed on various
irq_sim-related events (for now: irq request and release) and provide an
extended function for creating simulated interrupt domains that takes it
and a pointer to custom user data (to be passed to said callbacks) as
arguments.
Reviewed-by: Linus Walleij <[email protected]>
Acked-by: Linus Walleij <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bartosz Golaszewski <[email protected]>
|
|
Patch series "kmsan: Enable on s390", v7.
Architectures use assembly code to initialize ftrace_regs and call
ftrace_ops_list_func(). Therefore, from KMSAN's point of view,
ftrace_regs is poisoned on ftrace_ops_list_func() entry. This causes
KMSAN warnings when running the ftrace testsuite.
Fix by trusting the architecture-specific assembly code and always
unpoisoning ftrace_regs in ftrace_ops_list_func.
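Conceptually, the fix amounts to the following (a sketch; exact
placement per the patch):
    /* the architecture's assembly entry code initialized fregs, so
     * tell KMSAN its contents are defined */
    kmsan_unpoison_memory(fregs, sizeof(*fregs));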
The issue was not encountered on x86_64 so far only by accident:
assembly-allocated ftrace_regs was overlapping a stale partially
unpoisoned stack frame. Poisoning stack frames before returns [1] makes
the issue appear on x86_64 as well.
[1] https://github.com/iii-i/llvm-project/commits/msan-poison-allocas-before-returning-2024-06-12/
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ilya Leoshkevich <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Acked-by: Steven Rostedt (Google) <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Masami Hiramatsu (Google) <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm: clarify folio_add_new_anon_rmap() and
__folio_add_anon_rmap()", v2.
This patchset is preparatory work for mTHP swapin.
folio_add_new_anon_rmap() assumes that new anon rmaps are always
exclusive. However, this assumption doesn’t hold true for cases like
do_swap_page(), where a new anon might be added to the swapcache and is
not necessarily exclusive.
The patchset extends the rmap flags to allow folio_add_new_anon_rmap() to
handle both exclusive and non-exclusive new anon folios. The
do_swap_page() function is updated to use this extended API with rmap
flags. Consequently, all new anon folios now consistently use
folio_add_new_anon_rmap(). The special case for !folio_test_anon() in
__folio_add_anon_rmap() can be safely removed.
In conclusion, new anon folios always use folio_add_new_anon_rmap(),
regardless of exclusivity. Old anon folios continue to use
__folio_add_anon_rmap() via folio_add_anon_rmap_pmd() and
folio_add_anon_rmap_ptes().
This patch (of 3):
In the case of a swap-in, a new anonymous folio is not necessarily
exclusive. This patch updates the rmap flags to allow a new anonymous
folio to be treated as either exclusive or non-exclusive. To maintain the
existing behavior, we always use EXCLUSIVE as the default setting.
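A call site then looks roughly like this (flag names per rmap_t; the
call site itself is simplified and illustrative):
    /* the exclusive case keeps the existing behaviour; a non-exclusive
     * swapcache folio can now use the same helper */
    folio_add_new_anon_rmap(folio, vma, address,
                            exclusive ? RMAP_EXCLUSIVE : RMAP_NONE);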
[[email protected]: cleanup and constifications per David and akpm]
[[email protected]: fix missing doc for flags of folio_add_new_anon_rmap()]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: enhance doc for extend rmap flags arguments for folio_add_new_anon_rmap]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Barry Song <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Tested-by: Shuai Yuan <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When mm_update_owner_next() is racing with swapoff (try_to_unuse()) or
/proc or ptrace or page migration (get_task_mm()), it is impossible to
find an appropriate task_struct in the loop whose mm_struct is the same as
the target mm_struct.
If the above race condition is combined with the stress-ng-zombie and
stress-ng-dup tests, such a long loop can easily cause a Hard Lockup in
write_lock_irq() for tasklist_lock.
Recognize this situation in advance and exit early.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jinliang Zheng <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
To streamline maintenance efforts, we propose removing the implementation
of swap_free(). Instead, we can simply invoke swap_free_nr() with nr set
to 1. swap_free_nr() is designed with a bitmap consisting of only one
long, resulting in overhead that can be ignored for cases where nr equals
1.
A prime candidate for leveraging swap_free_nr() lies within
kernel/power/swap.c. Implementing this change facilitates the adoption of
batch processing for hibernation.
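One way to express the removal (a sketch of the wrapper form described
above, not necessarily the exact diff):
    static inline void swap_free(swp_entry_t entry)
    {
        swap_free_nr(entry, 1);
    }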
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Barry Song <[email protected]>
Suggested-by: "Huang, Ying" <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Acked-by: Chris Li <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Chuanhua Han <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Gao Xiang <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When the WARN_ON_ONCE() triggers, the printk() of the additional
information related to the warning will not happen in print level
"warn". When reading dmesg with a restriction to level "warn", the
information published by the printk_once() will not show up there.
Transform WARN_ON_ONCE() and printk_once() into a WARN_ONCE().
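The transformation is of the following shape (condition and message are
placeholders):
    /* before: the extra information is printed separately, and only
     * via printk_once() */
    if (WARN_ON_ONCE(condition))
        printk_once("additional information\n");

    /* after: warning and additional information share the same
     * print level */
    WARN_ONCE(condition, "additional information\n");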
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The writing of css->cgroup associated with the cgroup root in
rebind_subsystems() is currently protected only by cgroup_mutex.
However, the reading of css->cgroup in both proc_cpuset_show() and
proc_cgroup_show() is protected just by css_set_lock. That makes the
readers susceptible to racing problems like data tearing or caching.
It is also a problem that can be reported by KCSAN.
This can be fixed by using READ_ONCE() and WRITE_ONCE() to access
css->cgroup. Alternatively, the writing of css->cgroup can be moved
under css_set_lock as well which is done by this patch.
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
Introduce misc.peak to record the historical maximum usage of the
resource, as in some scenarios the value of misc.max could be
adjusted based on the peak usage of the resource.
Signed-off-by: Xiu Jianfeng <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|