path: root/kernel
2024-07-09  bpf: relax zero fixed offset constraint on KF_TRUSTED_ARGS/KF_RCU  [Matt Bobrowski, 1 file, -6/+3]
Currently, BPF kfuncs which accept trusted pointer arguments, i.e. those flagged as KF_TRUSTED_ARGS, KF_RCU, or KF_RELEASE, all require an original/unmodified trusted pointer argument to be supplied to them. By original/unmodified, we mean that the backing register holding the trusted pointer argument that is to be supplied to the BPF kfunc must have its fixed offset set to zero, or else the BPF verifier will outright reject the BPF program load. However, this zero fixed offset constraint that is currently enforced by the BPF verifier onto BPF kfuncs specifically flagged to accept KF_TRUSTED_ARGS or KF_RCU trusted pointer arguments is rather unnecessary, and can limit their usability in practice. Specifically, it completely eliminates the possibility of constructing a derived trusted pointer from an original trusted pointer. To put it simply, a derived pointer is a pointer which points to one of the nested member fields of the object being pointed to by the original trusted pointer.

This patch relaxes the zero fixed offset constraint that is enforced upon BPF kfuncs which specifically accept KF_TRUSTED_ARGS or KF_RCU arguments. Although the zero fixed offset constraint technically also applies to BPF kfuncs accepting KF_RELEASE arguments, relaxing this constraint for such BPF kfuncs has subtle and unwanted side-effects. This was discovered by experimenting a little further with an initial version of this patch series [0]. The primary issue with relaxing the zero fixed offset constraint on BPF kfuncs accepting KF_RELEASE arguments is that it would open up the opportunity for BPF programs to supply both trusted pointers and derived trusted pointers to them. For KF_RELEASE BPF kfuncs specifically, this could be problematic as resources associated with the backing pointer could be released by the backing BPF kfunc and cause instabilities for the rest of the kernel.

With this new fixed offset semantic in place for BPF kfuncs accepting KF_TRUSTED_ARGS and KF_RCU arguments, we now have more flexibility when it comes to the BPF kfuncs that we're able to introduce moving forward.

Early discussions covering the possibility of relaxing the zero fixed offset constraint can be found using the link below. This will provide more context on where all this has stemmed from [1].

Notably, pre-existing tests have been updated such that they provide coverage for the updated zero fixed offset functionality. Specifically, the nested offset test was converted from a negative to a positive test as it was already designed to assert zero fixed offset semantics of a KF_TRUSTED_ARGS BPF kfunc.

[0] https://lore.kernel.org/bpf/[email protected]/
[1] https://lore.kernel.org/bpf/[email protected]/

Signed-off-by: Matt Bobrowski <[email protected]>
Acked-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
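As a minimal sketch (not from the patch itself; the kfunc name and prototype below are hypothetical), a BPF program can now hand a derived trusted pointer, that is, a pointer to a nested member of a trusted object, to a KF_TRUSTED_ARGS kfunc:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    /* Hypothetical kfunc flagged KF_TRUSTED_ARGS; illustrative only. */
    extern int hypothetical_path_kfunc(struct path *path) __ksym;

    SEC("lsm.s/file_open")
    int BPF_PROG(probe_file_open, struct file *file)
    {
            /* &file->f_path has a non-zero fixed offset relative to the
             * trusted 'file' pointer; with this change the verifier no
             * longer rejects passing it to a KF_TRUSTED_ARGS kfunc. */
            return hypothetical_path_kfunc(&file->f_path);
    }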
2024-07-10  tracing/kprobes: Fix build error when find_module() is not available  [Masami Hiramatsu (Google), 1 file, -6/+19]
The kernel test robot reported that find_module() is not available if CONFIG_MODULES=n. Fix this error by hiding find_module() in #ifdef CONFIG_MODULES together with the related RCU locks, as try_module_get_by_name().

Link: https://lore.kernel.org/all/172056819167.201571.250053007194508038.stgit@devnote2/

Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2024-07-09  Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  [Paolo Abeni, 13 files, -145/+352]
Daniel Borkmann says:

====================
pull-request: bpf-next 2024-07-08

The following pull-request contains BPF updates for your *net-next* tree.

We've added 102 non-merge commits during the last 28 day(s) which contain a total of 127 files changed, 4606 insertions(+), 980 deletions(-).

The main changes are:

1) Support resilient split BTF which cuts down on duplication and makes BTF as compact as possible wrt BTF from modules, from Alan Maguire & Eduard Zingerman.

2) Add support for dumping kfunc prototypes from BTF which enables both detecting as well as dumping compilable prototypes for kfuncs, from Daniel Xu.

3) Batch of s390x BPF JIT improvements to add support for BPF arena and to implement support for BPF exceptions, from Ilya Leoshkevich.

4) Batch of riscv64 BPF JIT improvements in particular to add 12-argument support for BPF trampolines and to utilize bpf_prog_pack for the latter, from Pu Lehui.

5) Extend BPF test infrastructure to add a CHECKSUM_COMPLETE validation option for skbs and add coverage along with it, from Vadim Fedorenko.

6) Inline bpf_get_current_task/_btf() helpers in the arm64 BPF JIT which gives a small 1% performance improvement in micro-benchmarks, from Puranjay Mohan.

7) Extend the BPF verifier to track the delta between linked registers in order to better deal with recent LLVM code optimizations, from Alexei Starovoitov.

8) Fix bpf_wq_set_callback_impl() kfunc signature where the third argument should have been a pointer to the map value, from Benjamin Tissoires.

9) Extend BPF selftests to add regular expression support for test output matching and adjust some of the selftests when compiled under gcc, from Cupertino Miranda.

10) Simplify task_file_seq_get_next() and remove an unnecessary loop which always iterates exactly once anyway, from Dan Carpenter.

11) Add the capability to offload the netfilter flowtable in the XDP layer through kfuncs, from Florian Westphal & Lorenzo Bianconi.

12) Various cleanups in networking helpers in BPF selftests to shave off a few lines of open-coded functions on client/server handling, from Geliang Tang.

13) Properly propagate prog->aux->tail_call_reachable out of the BPF verifier, so that the x86 JIT does not need to implement detection, from Leon Hwang.

14) Fix the BPF verifier to add a missing check_func_arg_reg_off() to prevent an out-of-bounds memory access for dynpointers, from Matt Bobrowski.

15) Fix the bpf_session_cookie() kfunc to return __u64 instead of long pointer as it might lead to problems on 32-bit archs, from Jiri Olsa.

16) Enhance traffic validation and dynamic batch size support in xsk selftests, from Tushar Vyavahare.
bpf-next-for-netdev

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (102 commits)
  selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep
  selftests/bpf: amend for wrong bpf_wq_set_callback_impl signature
  bpf: helpers: fix bpf_wq_set_callback_impl signature
  libbpf: Add NULL checks to bpf_object__{prev_map,next_map}
  selftests/bpf: Remove exceptions tests from DENYLIST.s390x
  s390/bpf: Implement exceptions
  s390/bpf: Change seen_reg to a mask
  bpf: Remove unnecessary loop in task_file_seq_get_next()
  riscv, bpf: Optimize stack usage of trampoline
  bpf, devmap: Add .map_alloc_check
  selftests/bpf: Remove arena tests from DENYLIST.s390x
  selftests/bpf: Add UAF tests for arena atomics
  selftests/bpf: Introduce __arena_global
  s390/bpf: Support arena atomics
  s390/bpf: Enable arena
  s390/bpf: Support address space cast instruction
  s390/bpf: Support BPF_PROBE_MEM32
  s390/bpf: Land on the next JITed instruction after exception
  s390/bpf: Introduce pre- and post- probe functions
  s390/bpf: Get rid of get_probe_mem_regno()
  ...
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
2024-07-09  perf: Split __perf_pending_irq() out of perf_pending_irq()  [Sebastian Andrzej Siewior, 1 file, -7/+22]
perf_pending_irq() invokes perf_event_wakeup() and __perf_pending_irq(). The former is in charge of waking any tasks which wait to be woken up while the latter disables perf-events.

Although perf_pending_irq() is an irq_work, its callback is invoked in thread context on PREEMPT_RT. This is needed because all the waking functions (wake_up_all(), kill_fasync()) acquire sleeping locks which must not be used with disabled interrupts. Disabling events, as done by __perf_pending_irq(), expects a hardirq context and disabled interrupts. This requirement is not fulfilled on PREEMPT_RT.

Split the functionality based on perf_event::pending_disable into an irq_work named `pending_disable_irq' and invoke it in hardirq context on PREEMPT_RT. Rename the split out callback to perf_pending_disable().

Reported-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Marco Elver <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-07-09  perf: Don't disable preemption in perf_pending_task().  [Sebastian Andrzej Siewior, 1 file, -6/+5]
perf_pending_task() is invoked in task context and disables preemption because perf_swevent_get_recursion_context() used to access per-CPU variables. The other reason is to create an RCU read section while accessing the perf_event.

The recursion counter is no longer a per-CPU counter so disabling preemption is no longer required. The RCU section is still needed and must be created explicitly.

Replace the preemption-disable section with an explicit RCU-read section.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Marco Elver <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-07-09  perf: Move swevent_htable::recursion into task_struct.  [Sebastian Andrzej Siewior, 2 files, -11/+4]
The swevent_htable::recursion counter is used to avoid creating a swevent while an event is processed, to avoid recursion. The counter is per-CPU and preemption must be disabled to have a stable counter. perf_pending_task() disables preemption to access the counter and then send a signal. This is problematic on PREEMPT_RT because sending a signal uses a spinlock_t which must not be acquired in atomic context on PREEMPT_RT because it becomes a sleeping lock.

The atomic context can be avoided by moving the counter into the task_struct. There is a 4 byte hole between futex_state (usually always on) and the following perf pointer (perf_event_ctxp). After the recursion counter lost some weight it fits perfectly.

Move swevent_htable::recursion into task_struct.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Marco Elver <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-07-09  perf: Shrink the size of the recursion counter.  [Sebastian Andrzej Siewior, 3 files, -4/+4]
There are four recursion counters, one for each context. The type of the counter is `int' but the counter is used as `bool' since it is only incremented if zero. The main goal here is to shrink the whole struct into a 32-bit int which can later be added to task_struct in an existing hole.

Reduce the type of the recursion counter to an unsigned char and keep the increment/decrement operations.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Marco Elver <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-07-09  perf: Enqueue SIGTRAP always via task_work.  [Sebastian Andrzej Siewior, 1 file, -21/+10]
A signal is delivered by raising irq_work() which works from any context including NMI. irq_work() can be delayed if the architecture does not provide an interrupt vector. In order not to lose a signal, the signal is injected via task_work during event_sched_out().

Instead of going via irq_work, the signal could be added directly via task_work. The signal is sent to current and can be enqueued on its return path to userland.

Queue the signal via task_work and consider possible NMI context. Remove perf_event::pending_sigtrap and use perf_event::pending_work instead.

Reported-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Marco Elver <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-07-09  task_work: Add TWA_NMI_CURRENT as an additional notify mode.  [Sebastian Andrzej Siewior, 1 file, -3/+21]
Adding task_work from NMI context requires the following:
- kasan_record_aux_stack() is not NMI safe and must be avoided.
- Using TWA_RESUME is NMI safe. If the NMI occurs while the CPU is in userland then it will continue in userland and not invoke the `work' callback.

Add TWA_NMI_CURRENT as an additional notify mode. In this mode, skip KASAN and use irq_work in hardirq-mode for the needed interrupt. Set TIF_NOTIFY_RESUME within the irq_work callback due to k[ac]san instrumentation in test_and_set_bit() which does not look NMI safe in case of a report.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
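A hedged usage sketch of the new notify mode from kernel code (the call site below is illustrative, not lifted from the perf patches that build on it):

    #include <linux/sched.h>
    #include <linux/task_work.h>

    static void my_callback(struct callback_head *head)
    {
            /* Runs in task context on the return path to userland. */
    }

    static struct callback_head my_work = { .func = my_callback };

    /* Called from an NMI handler: */
    static void queue_work_from_nmi(void)
    {
            /* TWA_NMI_CURRENT is only valid for 'current'; it skips KASAN and
             * defers setting TIF_NOTIFY_RESUME to an irq_work. */
            task_work_add(current, &my_work, TWA_NMI_CURRENT);
    }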
2024-07-09  perf: Move irq_work_queue() where the event is prepared.  [Sebastian Andrzej Siewior, 1 file, -5/+5]
Only if perf_event::pending_sigtrap is zero is the irq_work accounted for by incrementing perf_event::nr_pending. The member perf_event::pending_addr might be overwritten by a subsequent event if the signal was not yet delivered and is expected. The irq_work will not be enqueued again because it has a check to be only enqueued once.

Move irq_work_queue() to where the counter is incremented and perf_event::pending_sigtrap is set to make it more obvious that the irq_work is scheduled once.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Marco Elver <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-07-09  perf: Fix event leak upon exec and file release  [Frederic Weisbecker, 1 file, -4/+34]
The perf pending task work is never waited upon at the matching event release. In the case of a child event, released via free_event() directly, this can potentially result in a leaked event, such as in the following scenario that doesn't even require a weak IRQ work implementation to trigger:

    schedule()
       prepare_task_switch()
    =======> <NMI>
          perf_event_overflow()
             event->pending_sigtrap = ...
             irq_work_queue(&event->pending_irq)
    <======= </NMI>
       perf_event_task_sched_out()
          event_sched_out()
             event->pending_sigtrap = 0;
             atomic_long_inc_not_zero(&event->refcount)
             task_work_add(&event->pending_task)
       finish_lock_switch()
    =======> <IRQ>
       perf_pending_irq()
          // do nothing, rely on pending task work
    <======= </IRQ>

    begin_new_exec()
       perf_event_exit_task()
          perf_event_exit_event()
             // If is child event
             free_event()
                WARN(atomic_long_cmpxchg(&event->refcount, 1, 0) != 1)
                // event is leaked

Similar scenarios can also happen with perf_event_remove_on_exec() or simply against concurrent perf_event_release().

Fix this by synchronizing against the possibly remaining pending task work while freeing the event, just like is done with remaining pending IRQ work. This means that the pending task callback neither needs nor should hold a reference to the event, preventing it from ever being freed.

Fixes: 517e6a301f34 ("perf: Fix perf_pending_task() UaF")
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
2024-07-09  perf: Fix event leak upon exit  [Frederic Weisbecker, 1 file, -8/+5]
When a task is scheduled out, pending sigtrap deliveries are deferred to the target task upon resume to userspace via task_work. However, failures while adding an event's callback to the task_work engine are ignored. And since the last call for events exit happens after task work is eventually closed, there is a small window during which a pending sigtrap can be queued though ignored, leaking the event refcount addition such as in the following scenario:

    TASK A
    -----
    do_exit()
       exit_task_work(tsk);

       <IRQ>
       perf_event_overflow()
          event->pending_sigtrap = pending_id;
          irq_work_queue(&event->pending_irq);
       </IRQ>

    =========> PREEMPTION: TASK A -> TASK B

       event_sched_out()
          event->pending_sigtrap = 0;
          atomic_long_inc_not_zero(&event->refcount)
          // FAILS: task work has exited
          task_work_add(&event->pending_task)

       [...]

       <IRQ WORK>
       perf_pending_irq()
          // early return: event->oncpu = -1
       </IRQ WORK>

       [...]

    =========> TASK B -> TASK A

    perf_event_exit_task(tsk)
       perf_event_exit_event()
          free_event()
             WARN(atomic_long_cmpxchg(&event->refcount, 1, 0) != 1)
             // leak event due to unexpected refcount == 2

As a result the event is never released while the task exits.

Fix this with appropriate task_work_add() error handling.

Fixes: 517e6a301f34 ("perf: Fix perf_pending_task() UaF")
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
2024-07-09  task_work: Introduce task_work_cancel() again  [Frederic Weisbecker, 1 file, -0/+24]
Re-introduce task_work_cancel(), this time to cancel an actual callback and not *any* callback pointing to a given function. This is going to be needed for perf events event freeing. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
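A hedged sketch of how the two cancellation flavours are meant to be used after this change and the rename in the entry below (exact return types are treated as assumptions here and therefore ignored):

    #include <linux/task_work.h>

    static void my_callback(struct callback_head *head) { }
    static struct callback_head my_work = { .func = my_callback };

    static void cancel_examples(struct task_struct *task)
    {
            /* Cancel any queued work whose callback is my_callback
             * (the renamed task_work_cancel_func(), see the next entry): */
            task_work_cancel_func(task, my_callback);

            /* Cancel one specific, known callback_head
             * (the re-introduced task_work_cancel()): */
            task_work_cancel(task, &my_work);
    }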
2024-07-09  task_work: s/task_work_cancel()/task_work_cancel_func()/  [Frederic Weisbecker, 2 files, -6/+6]
A proper task_work_cancel() API that actually cancels a callback and not *any* callback pointing to a given function is going to be needed for perf events event freeing. Do the appropriate rename to prepare for that. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2024-07-09  locking/rwsem: Add __always_inline annotation to __down_write_common() and inlined callers  [John Stultz, 1 file, -3/+3]
Apparently despite it being marked inline, the compiler may not inline __down_write_common() which makes it difficult to identify the cause of lock contention, as the wchan of the blocked function will always be listed as __down_write_common().

So add an __always_inline annotation to the common function (as well as the inlined helper callers) to force it to be inlined so a more useful blocking function will be listed (via wchan).

This mirrors commit 92cc5d00a431 ("locking/rwsem: Add __always_inline annotation to __down_read_common() and inlined callers") which did the same for __down_read_common().

I sort of worry that I'm playing whack-a-mole here, and talking with compiler people, they tell me inline means nothing, which makes me want to cry a little. So I'm wondering if we need to replace all the inlines with __always_inline, or remove them because either we mean something by it, or not.

Fixes: c995e638ccbb ("locking/rwsem: Fold __down_{read,write}*()")
Reported-by: Tim Murray <[email protected]>
Signed-off-by: John Stultz <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Waiman Long <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
2024-07-09  dma-mapping: benchmark: Don't starve others when doing the test  [Yicong Yang, 1 file, -0/+16]
The test thread will start N benchmark kthreads and then schedule out until the test time finished and notify the benchmark kthreads to stop. The benchmark kthreads will keep running until notified to stop. There's a problem with current implementation when the benchmark kthreads number is equal to the CPUs on a non-preemptible kernel: since the scheduler will balance the kthreads across the CPUs and when the test time's out the test thread won't get a chance to be scheduled on any CPU then cannot notify the benchmark kthreads to stop. This can be easily reproduced on a VM (simulated with 16 CPUs) with PREEMPT_VOLUNTARY: estuary:/mnt$ ./dma_map_benchmark -t 16 -s 1 rcu: INFO: rcu_sched self-detected stall on CPU rcu: 10-...!: (5221 ticks this GP) idle=ed24/1/0x4000000000000000 softirq=142/142 fqs=0 rcu: (t=5254 jiffies g=-559 q=45 ncpus=16) rcu: rcu_sched kthread starved for 5255 jiffies! g-559 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=12 rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior. rcu: RCU grace-period kthread stack dump: task:rcu_sched state:R running task stack:0 pid:16 tgid:16 ppid:2 flags:0x00000008 Call trace __switch_to+0xec/0x138 __schedule+0x2f8/0x1080 schedule+0x30/0x130 schedule_timeout+0xa0/0x188 rcu_gp_fqs_loop+0x128/0x528 rcu_gp_kthread+0x1c8/0x208 kthread+0xec/0xf8 ret_from_fork+0x10/0x20 Sending NMI from CPU 10 to CPUs 0: NMI backtrace for cpu 0 CPU: 0 PID: 332 Comm: dma-map-benchma Not tainted 6.10.0-rc1-vanilla-LSE #8 Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : arm_smmu_cmdq_issue_cmdlist+0x218/0x730 lr : arm_smmu_cmdq_issue_cmdlist+0x488/0x730 sp : ffff80008748b630 x29: ffff80008748b630 x28: 0000000000000000 x27: ffff80008748b780 x26: 0000000000000000 x25: 000000000000bc70 x24: 000000000001bc70 x23: ffff0000c12af080 x22: 0000000000010000 x21: 000000000000ffff x20: ffff80008748b700 x19: ffff0000c12af0c0 x18: 0000000000010000 x17: 0000000000000001 x16: 0000000000000040 x15: ffffffffffffffff x14: 0001ffffffffffff x13: 000000000000ffff x12: 00000000000002f1 x11: 000000000001ffff x10: 0000000000000031 x9 : ffff800080b6b0b8 x8 : ffff0000c2a48000 x7 : 000000000001bc71 x6 : 0001800000000000 x5 : 00000000000002f1 x4 : 01ffffffffffffff x3 : 000000000009aaf1 x2 : 0000000000000018 x1 : 000000000000000f x0 : ffff0000c12af18c Call trace: arm_smmu_cmdq_issue_cmdlist+0x218/0x730 __arm_smmu_tlb_inv_range+0xe0/0x1a8 arm_smmu_iotlb_sync+0xc0/0x128 __iommu_dma_unmap+0x248/0x320 iommu_dma_unmap_page+0x5c/0xe8 dma_unmap_page_attrs+0x38/0x1d0 map_benchmark_thread+0x118/0x2c0 kthread+0xec/0xf8 ret_from_fork+0x10/0x20 Solve this by adding scheduling point in the kthread loop, so if there're other threads in the system they may have a chance to run, especially the thread to notify the test end. However this may degrade the test concurrency so it's recommended to run this on an idle system. Signed-off-by: Yicong Yang <[email protected]> Acked-by: Barry Song <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
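The shape of the fix, as a hedged sketch of the benchmark kthread loop (the surrounding code is paraphrased; only the added scheduling point is the point of the change):

    #include <linux/kthread.h>
    #include <linux/sched.h>

    static int map_benchmark_thread(void *data)
    {
            while (!kthread_should_stop()) {
                    /* ... dma_map_page_attrs()/dma_unmap_page_attrs() and
                     * timing measurements happen here ... */

                    /* Added scheduling point: on a non-preemptible kernel this
                     * lets the control thread run and signal the test end. */
                    cond_resched();
            }
            return 0;
    }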
2024-07-08  bpf: helpers: fix bpf_wq_set_callback_impl signature  [Benjamin Tissoires, 1 file, -1/+1]
I realized this while having a map containing both a struct bpf_timer and a struct bpf_wq: the third argument provided to the bpf_wq callback is not the struct bpf_wq pointer itself, but the pointer to the value in the map. Which means that users need to double cast the provided "value" as this is not a struct bpf_wq *.

This is a change of API, but there don't seem to be many users of bpf_wq right now, so we should be able to go with this right now.

Fixes: 81f1d7a583fa ("bpf: wq: add bpf_wq_set_callback_impl")
Signed-off-by: Benjamin Tissoires <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
2024-07-08  bpf: Remove unnecessary loop in task_file_seq_get_next()  [Dan Carpenter, 1 file, -6/+3]
After commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU") this loop always iterates exactly one time. Delete the for statement and pull the code in a tab. Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Acked-by: Jiri Olsa <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2024-07-06  tracing/kprobes: Add symbol counting check when module loads  [Masami Hiramatsu (Google), 1 file, -44/+81]
Currently, the kprobe event checks whether the target symbol name is unique or not, so that it does not put a probe on an unexpected place. But this skips the check if the target is on a module because the module may not be loaded.

To fix this issue, this patch checks the number of probe target symbols in a target module when the module is loaded. If the probe is not on a uniquely named symbol in the module, it will be rejected at that point.

Note that a symbol which has a unique name in the target module will be accepted even if there are same-name symbols in the kernel or other modules.

Link: https://lore.kernel.org/all/172016348553.99543.2834679315611882137.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
2024-07-05  workqueue: Init rescuer's affinities as the wq's effective cpumask  [Lai Jiangshan, 1 file, -4/+8]
Make it consistent with apply_wqattrs_commit(). Link: https://lore.kernel.org/lkml/[email protected]/ Cc: Juri Lelli <[email protected]> Cc: Waiman Long <[email protected]> Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-07-05  workqueue: Put PWQ allocation and WQ enlistment in the same lock C.S.  [Lai Jiangshan, 1 file, -26/+28]
The PWQ allocation and WQ enlistment are not within the same lock-held critical section; therefore, their states can become out of sync when the user modifies the unbound mask or if CPU hotplug events occur in the interim since those operations only update the WQs that are already in the list. Make the PWQ allocation and WQ enlistment atomic. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-07-05  workqueue: Move kthread_flush_worker() out of alloc_and_link_pwqs()  [Lai Jiangshan, 1 file, -7/+8]
kthread_flush_worker() can't be called with wq_pool_mutex held. Prepare for moving wq_pool_mutex and cpu hotplug lock out of alloc_and_link_pwqs(). Cc: Zqiang <[email protected]> Link: https://lore.kernel.org/lkml/[email protected]/ Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-07-05  workqueue: Make rescuer initialization as the last step of the creation of a new wq  [Lai Jiangshan, 1 file, -3/+3]
For early wq allocation, rescuer initialization is the last step of the creation of a new wq. Make the behavior the same for all allocations.

Prepare for initializing rescuer's affinities with the default pwq's affinities. Prepare for moving the whole workqueue initializing procedure into wq_pool_mutex and cpu hotplug locks.

Cc: Juri Lelli <[email protected]>
Cc: Waiman Long <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
2024-07-05  workqueue: Register sysfs after the whole creation of the new wq  [Lai Jiangshan, 1 file, -3/+3]
workqueue creation includes adding it to the workqueue list. Prepare for moving the whole workqueue initializing procedure into wq_pool_mutex and cpu hotplug locks. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-07-05  cgroup/rstat: add force idle show helper  [Chen Ridong, 1 file, -20/+17]
In the function cgroup_base_stat_cputime_show, there are five instances of #ifdef, which makes the code not concise. To address this, add the function cgroup_force_idle_show to make the code more succinct. Signed-off-by: Chen Ridong <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-07-04  get_task_mm: check PF_KTHREAD lockless  [Oleg Nesterov, 1 file, -6/+5]
Nowadays PF_KTHREAD is sticky and it was never protected by ->alloc_lock. Move the PF_KTHREAD check outside of task_lock() section to make this code more understandable. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Eric W. Biederman <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-04  memcg: mm_update_next_owner: move for_each_thread() into try_to_set_owner()  [Oleg Nesterov, 1 file, -16/+24]
mm_update_next_owner() checks the children / real_parent->children to avoid the "everything else" loop in the likely case, but this won't work if a child/sibling has a zombie leader with ->mm == NULL. Move the for_each_thread() logic into try_to_set_owner(), if nothing else this makes the children/siblings/everything searches more consistent. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Jinliang Zheng <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Tycho Andersen <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-04  memcg: mm_update_next_owner: kill the "retry" logic  [Oleg Nesterov, 1 file, -30/+27]
Add the new helper, try_to_set_owner(), which tries to update mm->owner once we see c->mm == mm. This way mm_update_next_owner() doesn't need to restart the list_for_each_entry/for_each_process loops from the very beginning if it races with exit/exec, it can just continue. Unlike the current code, try_to_set_owner() re-checks tsk->mm == mm before it drops tasklist_lock, so it doesn't need get/put_task_struct(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Jinliang Zheng <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Tycho Andersen <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-04  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  [Jakub Kicinski, 12 files, -213/+49]
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/phy/aquantia/aquantia.h 219343755eae ("net: phy: aquantia: add missing include guards") 61578f679378 ("net: phy: aquantia: add support for PHY LEDs") drivers/net/ethernet/wangxun/libwx/wx_hw.c bd07a9817846 ("net: txgbe: remove separate irq request for MSI and INTx") b501d261a5b3 ("net: txgbe: add FDIR ATR support") https://lore.kernel.org/all/[email protected]/ include/linux/mlx5/mlx5_ifc.h 048a403648fc ("net/mlx5: IFC updates for changing max EQs") 99be56171fa9 ("net/mlx5e: SHAMPO, Re-enable HW-GRO") https://lore.kernel.org/all/[email protected]/ Adjacent changes: drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c 4130c67cd123 ("wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference") 3f3126515fbe ("wifi: iwlwifi: mvm: add mvm-specific guard") include/net/mac80211.h 816c6bec09ed ("wifi: mac80211: fix BSS_CHANGED_UNSOL_BCAST_PROBE_RESP") 5a009b42e041 ("wifi: mac80211: track changes in AP's TPE") Signed-off-by: Jakub Kicinski <[email protected]>
2024-07-04  Merge branches 'doc.2024.06.06a', 'fixes.2024.07.04a', 'mb.2024.06.28a', 'nocb.2024.06.03a', 'rcu-tasks.2024.06.06a', 'rcutorture.2024.06.06a' and 'srcu.2024.06.18a' into HEAD  [Paul E. McKenney, 15 files, -212/+207]
doc.2024.06.06a: Documentation updates.
fixes.2024.07.04a: Miscellaneous fixes.
mb.2024.06.28a: Grace-period memory-barrier redundancy removal.
nocb.2024.06.03a: No-CB CPU updates.
rcu-tasks.2024.06.06a: RCU-Tasks updates.
rcutorture.2024.06.06a: Torture-test updates.
srcu.2024.06.18a: SRCU polled-grace-period updates.
2024-07-04  rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation  [Frederic Weisbecker, 1 file, -3/+7]
When rcu_barrier() calls rcu_rdp_cpu_online() and observes a CPU off rnp->qsmaskinitnext, it means that all accesses from the offline CPU preceding the CPUHP_TEARDOWN_CPU are visible to RCU barrier, including callbacks expiration and counter updates. However interrupts can still fire after stop_machine() re-enables interrupts and before rcutree_report_cpu_dead(). The related accesses happening between CPUHP_TEARDOWN_CPU and rnp->qsmaskinitnext clearing are _NOT_ guaranteed to be seen by rcu_barrier() without proper ordering, especially when callbacks are invoked there to the end, making rcutree_migrate_callback() bypass barrier_lock. The following theoretical race example can make rcu_barrier() hang: CPU 0 CPU 1 ----- ----- //cpu_down() smpboot_park_threads() //ksoftirqd is parked now <IRQ> rcu_sched_clock_irq() invoke_rcu_core() do_softirq() rcu_core() rcu_do_batch() // callback storm // rcu_do_batch() returns // before completing all // of them // do_softirq also returns early because of // timeout. It defers to ksoftirqd but // it's parked </IRQ> stop_machine() take_cpu_down() rcu_barrier() spin_lock(barrier_lock) // observes rcu_segcblist_n_cbs(&rdp->cblist) != 0 <IRQ> do_softirq() rcu_core() rcu_do_batch() //completes all pending callbacks //smp_mb() implied _after_ callback number dec </IRQ> rcutree_report_cpu_dead() rnp->qsmaskinitnext &= ~rdp->grpmask; rcutree_migrate_callback() // no callback, early return without locking // barrier_lock //observes !rcu_rdp_cpu_online(rdp) rcu_barrier_entrain() rcu_segcblist_entrain() // Observe rcu_segcblist_n_cbs(rsclp) == 0 // because no barrier between reading // rnp->qsmaskinitnext and rsclp->len rcu_segcblist_add_len() smp_mb__before_atomic() // will now observe the 0 count and empty // list, but too late, we enqueue regardless WRITE_ONCE(rsclp->len, rsclp->len + v); // ignored barrier callback // rcu barrier stall... This could be solved with a read memory barrier, enforcing the message passing between rnp->qsmaskinitnext and rsclp->len, matching the full memory barrier after rsclp->len addition in rcu_segcblist_add_len() performed at the end of rcu_do_batch(). However the rcu_barrier() is complicated enough and probably doesn't need too many more subtleties. CPU down is a slowpath and the barrier_lock seldom contended. Solve the issue with unconditionally locking the barrier_lock on rcutree_migrate_callbacks(). This makes sure that either rcu_barrier() sees the empty queue or its entrained callback will be migrated. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2024-07-04  rcu: Eliminate lockless accesses to rcu_sync->gp_count  [Oleg Nesterov, 1 file, -8/+4]
The rcu_sync structure's ->gp_count field is always accessed under the protection of that same structure's ->rss_lock field, with the exception of a pair of WARN_ON_ONCE() calls just prior to acquiring that lock in functions rcu_sync_exit() and rcu_sync_dtor(). These lockless accesses are unnecessary and impair KCSAN's ability to catch bugs that might be inserted via other lockless accesses. This commit therefore moves those WARN_ON_ONCE() calls under the lock. Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2024-07-04  rcu: Add rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter  [Paul E. McKenney, 2 files, -2/+19]
If a CPU is running either a userspace application or a guest OS in nohz_full mode, it is possible for a system call to occur just as an RCU grace period is starting. If that CPU also has the scheduling-clock tick enabled for any reason (such as a second runnable task), and if the system was booted with rcutree.use_softirq=0, then RCU can add insult to injury by awakening that CPU's rcuc kthread, resulting in yet another task and yet more OS jitter due to switching to that task, running it, and switching back. In addition, in the common case where that system call is not of excessively long duration, awakening the rcuc task is pointless. This pointlessness is due to the fact that the CPU will enter an extended quiescent state upon returning to the userspace application or guest OS. In this case, the rcuc kthread cannot do anything that the main RCU grace-period kthread cannot do on its behalf, at least if it is given a few additional milliseconds (for example, given the time duration specified by rcutree.jiffies_till_first_fqs, give or take scheduling delays). This commit therefore adds a rcutree.nohz_full_patience_delay kernel boot parameter that specifies the grace period age (in milliseconds, rounded to jiffies) before which RCU will refrain from awakening the rcuc kthread. Preliminary experimentation suggests a value of 1000, that is, one second. Increasing rcutree.nohz_full_patience_delay will increase grace-period latency and in turn increase memory footprint, so systems with constrained memory might choose a smaller value. Systems with less-aggressive OS-jitter requirements might choose the default value of zero, which keeps the traditional immediate-wakeup behavior, thus avoiding increases in grace-period latency. [ paulmck: Apply Leonardo Bras feedback. ] Link: https://lore.kernel.org/all/[email protected]/ Reported-by: Leonardo Bras <[email protected]> Suggested-by: Leonardo Bras <[email protected]> Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Leonardo Bras <[email protected]>
2024-07-04  Merge branch 'tip/x86/cpu'  [Peter Zijlstra, 168 files, -1997/+5285]
The Lunarlake patches rely on the new VFM stuff. Signed-off-by: Peter Zijlstra <[email protected]>
2024-07-04  perf: Make rb_alloc_aux() return an error immediately if nr_pages <= 0  [Adrian Hunter, 1 file, -0/+3]
rb_alloc_aux() should not be called with nr_pages <= 0. Make it more robust and readable by returning an error immediately in that case. Signed-off-by: Adrian Hunter <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-04  perf: Fix default aux_watermark calculation  [Adrian Hunter, 1 file, -1/+3]
The default aux_watermark is half the AUX area buffer size. In general, on a 64-bit architecture, the AUX area buffer size could be bigger than fits in a 32-bit type, but the calculation does not allow for that possibility. However, the aux_watermark value is recorded in a u32, so should not be more than U32_MAX either.

Fix by doing the calculation in a correctly sized type, and limiting the result to U32_MAX.

Fixes: d68e6799a5c8 ("perf: Cap allocation order at aux_watermark")
Signed-off-by: Adrian Hunter <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
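A hedged sketch of the calculation's intent (not the literal diff): halve the buffer size in a wide type and clamp it to what the u32 aux_watermark field can hold:

    #include <linux/limits.h>
    #include <linux/minmax.h>
    #include <linux/mm.h>

    static u32 default_aux_watermark(unsigned long nr_pages)
    {
            /* Half the AUX buffer size in bytes, computed in unsigned long so
             * buffers larger than 4 GiB don't overflow, then capped at U32_MAX
             * because aux_watermark is a u32. */
            return min_t(unsigned long, U32_MAX,
                         nr_pages << (PAGE_SHIFT - 1));
    }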
2024-07-04  perf: Prevent passing zero nr_pages to rb_alloc_aux()  [Adrian Hunter, 1 file, -0/+2]
nr_pages is unsigned long but gets passed to rb_alloc_aux() as an int, and is stored as an int. Only power-of-2 values are accepted, so if nr_pages is a 64-bit value, it will be passed to rb_alloc_aux() as zero. That is not ideal because:
1. the value is incorrect
2. rb_alloc_aux() is at risk of misbehaving; although it manages to return -ENOMEM in that case, it is a result of passing zero to get_order() even though the get_order() result is documented to be undefined in that case.

Fix by simply validating the maximum supported value in the first place. Use the -ENOMEM error code for consistency with the current error code that is returned in that case.

Fixes: 45bfb2e50471 ("perf: Add AUX area to ring buffer for raw data streams")
Signed-off-by: Adrian Hunter <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-07-04  perf: Fix perf_aux_size() for greater-than 32-bit size  [Adrian Hunter, 1 file, -1/+1]
perf_buffer->aux_nr_pages uses a 32-bit type, so a cast is needed to calculate a 64-bit size. Fixes: 45bfb2e50471 ("perf: Add AUX area to ring buffer for raw data streams") Signed-off-by: Adrian Hunter <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
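The idea of the one-liner, sketched (field and function names per the commit message; the exact cast type in the upstream patch is assumed):

    static unsigned long perf_aux_size(struct perf_buffer *rb)
    {
            /* aux_nr_pages is a 32-bit field; widen it before shifting so an
             * AUX area larger than 4 GiB doesn't overflow the result. */
            return (unsigned long)rb->aux_nr_pages << PAGE_SHIFT;
    }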
2024-07-04  sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks  [Tejun Heo, 3 files, -18/+14]
When a task's weight is being changed, set_load_weight() is called with @update_load set. As weight changes aren't trivial for the fair class, set_load_weight() calls fair.c::reweight_task() for fair class tasks.

However, set_load_weight() first tests task_has_idle_policy() on entry and skips calling reweight_task() for SCHED_IDLE tasks. This is buggy as SCHED_IDLE tasks are just fair tasks with a very low weight and they would incorrectly skip load, vlag and position updates.

Fix it by updating reweight_task() to take struct load_weight as idle weight can't be expressed with prio and making set_load_weight() call reweight_task() for SCHED_IDLE tasks too when @update_load is set.

Fixes: 9059393e4ec1 ("sched/fair: Use reweight_entity() for set_user_nice()")
Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected] # v4.15+
Link: http://lkml.kernel.org/r/[email protected]
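A hedged sketch of the resulting interface change (the new prototype is inferred from the commit text, not copied from the patch):

    #include <linux/sched.h>

    /* New shape (assumed): the weight is passed directly, since a SCHED_IDLE
     * weight cannot be expressed via prio; set_load_weight() now calls this
     * for SCHED_IDLE tasks as well when @update_load is set. */
    void reweight_task(struct task_struct *p, const struct load_weight *lw);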
2024-07-04  sched/psi: Optimise psi_group_change a bit  [Tvrtko Ursulin, 1 file, -27/+27]
The current code loops over the psi_states only to call a helper which then resolves back to the action needed for each state using a switch statement. That is effectively creating a double indirection of a kind which, given how all the states need to be explicitly listed and handled anyway, we can simply remove. Both the for loop and the switch statement that is. The benefit is both in the code size and CPU time spent in this function. YMMV but on my Steam Deck, while in a game, the patch makes the CPU usage go from ~2.4% down to ~1.2%. Text size at the same time went from 0x323 to 0x2c1. Signed-off-by: Tvrtko Ursulin <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Acked-by: Johannes Weiner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-07-04  printk: Add match_devname_and_update_preferred_console()  [Tony Lindgren, 2 files, -15/+89]
Let's add match_devname_and_update_preferred_console() for driver subsystems to call during init when the console is ready and its character device name is known. For now, we use it only for the serial layer to allow console=DEVNAME:0.0 style hardware based addressing for consoles.

The earlier attempt at doing this caused a regression with the kernel command line console order as it added calling __add_preferred_console() again later on during init. A better approach was suggested by Petr where we add the deferred console to the console_cmdline[] and update it later on when the console is ready.

Suggested-by: Petr Mladek <[email protected]>
Co-developed-by: Petr Mladek <[email protected]>
Signed-off-by: Tony Lindgren <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
2024-07-04  genirq/irq_sim: add an extended irq_sim initializer  [Bartosz Golaszewski, 1 file, -3/+57]
Currently users of the interrupt simulator don't have any way of being notified about interrupts from the simulated domain being requested or released. This causes a problem for one of the users - the GPIO simulator - which is unable to lock the pins as interrupts. Define a structure containing callbacks to be executed on various irq_sim-related events (for now: irq request and release) and provide an extended function for creating simulated interrupt domains that takes it and a pointer to custom user data (to be passed to said callbacks) as arguments. Reviewed-by: Linus Walleij <[email protected]> Acked-by: Linus Walleij <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Bartosz Golaszewski <[email protected]>
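A hedged sketch of what the extended initializer looks like to a user such as the GPIO simulator (the ops struct, callback prototypes, and constructor name follow the commit's description but should be treated as assumptions):

    #include <linux/err.h>
    #include <linux/irq_sim.h>
    #include <linux/irqdomain.h>

    /* Invoked when a simulated interrupt is requested or released. */
    static int my_irq_requested(struct irq_domain *d, irq_hw_number_t hwirq,
                                void *data)
    {
            return 0; /* e.g. lock the matching GPIO as used for an IRQ */
    }

    static void my_irq_released(struct irq_domain *d, irq_hw_number_t hwirq,
                                void *data)
    {
    }

    static const struct irq_sim_ops sim_ops = {
            .irq_sim_irq_requested = my_irq_requested,
            .irq_sim_irq_released  = my_irq_released,
    };

    static int create_sim_domain(struct fwnode_handle *fwnode, void *user_data)
    {
            struct irq_domain *domain;

            /* Extended initializer: callbacks plus opaque user data. */
            domain = irq_domain_create_sim_full(fwnode, 8, &sim_ops, user_data);
            return PTR_ERR_OR_ZERO(domain);
    }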
2024-07-03  ftrace: unpoison ftrace_regs in ftrace_ops_list_func()  [Ilya Leoshkevich, 1 file, -0/+1]
Patch series "kmsan: Enable on s390", v7. Architectures use assembly code to initialize ftrace_regs and call ftrace_ops_list_func(). Therefore, from the KMSAN's point of view, ftrace_regs is poisoned on ftrace_ops_list_func entry(). This causes KMSAN warnings when running the ftrace testsuite. Fix by trusting the architecture-specific assembly code and always unpoisoning ftrace_regs in ftrace_ops_list_func. The issue was not encountered on x86_64 so far only by accident: assembly-allocated ftrace_regs was overlapping a stale partially unpoisoned stack frame. Poisoning stack frames before returns [1] makes the issue appear on x86_64 as well. [1] https://github.com/iii-i/llvm-project/commits/msan-poison-allocas-before-returning-2024-06-12/ Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ilya Leoshkevich <[email protected]> Reviewed-by: Alexander Potapenko <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Hyeonggon Yoo <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: <[email protected]> Cc: Marco Elver <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masami Hiramatsu (Google) <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-03  mm: extend rmap flags arguments for folio_add_new_anon_rmap  [Barry Song, 1 file, -1/+1]
Patch series "mm: clarify folio_add_new_anon_rmap() and __folio_add_anon_rmap()", v2. This patchset is preparatory work for mTHP swapin. folio_add_new_anon_rmap() assumes that new anon rmaps are always exclusive. However, this assumption doesn’t hold true for cases like do_swap_page(), where a new anon might be added to the swapcache and is not necessarily exclusive. The patchset extends the rmap flags to allow folio_add_new_anon_rmap() to handle both exclusive and non-exclusive new anon folios. The do_swap_page() function is updated to use this extended API with rmap flags. Consequently, all new anon folios now consistently use folio_add_new_anon_rmap(). The special case for !folio_test_anon() in __folio_add_anon_rmap() can be safely removed. In conclusion, new anon folios always use folio_add_new_anon_rmap(), regardless of exclusivity. Old anon folios continue to use __folio_add_anon_rmap() via folio_add_anon_rmap_pmd() and folio_add_anon_rmap_ptes(). This patch (of 3): In the case of a swap-in, a new anonymous folio is not necessarily exclusive. This patch updates the rmap flags to allow a new anonymous folio to be treated as either exclusive or non-exclusive. To maintain the existing behavior, we always use EXCLUSIVE as the default setting. [[email protected]: cleanup and constifications per David and akpm] [[email protected]: fix missing doc for flags of folio_add_new_anon_rmap()] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: enhance doc for extend rmap flags arguments for folio_add_new_anon_rmap] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Barry Song <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Tested-by: Shuai Yuan <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Chris Li <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yosry Ahmed <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-03  mm: optimize the redundant loop of mm_update_owner_next()  [Jinliang Zheng, 1 file, -0/+2]
When mm_update_owner_next() is racing with swapoff (try_to_unuse()) or /proc or ptrace or page migration (get_task_mm()), it is impossible to find an appropriate task_struct in the loop whose mm_struct is the same as the target mm_struct. If the above race condition is combined with the stress-ng-zombie and stress-ng-dup tests, such a long loop can easily cause a Hard Lockup in write_lock_irq() for tasklist_lock. Recognize this situation in advance and exit early. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jinliang Zheng <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tycho Andersen <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
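The early-exit idea, as a hedged sketch (the condition is paraphrased from the commit message rather than copied from the hunk, and the helper name below is purely illustrative):

    #include <linux/atomic.h>
    #include <linux/mm_types.h>

    /* Sketch of a check near the top of mm_update_next_owner() (CONFIG_MEMCG):
     * if nobody else uses this mm, searching the task lists for a new owner
     * under tasklist_lock is pointless, so bail out early. */
    static bool mm_owner_search_needed(struct mm_struct *mm)
    {
            if (atomic_read(&mm->mm_users) <= 1) {
                    WRITE_ONCE(mm->owner, NULL);
                    return false;
            }
            return true;
    }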
2024-07-03  mm: remove the implementation of swap_free() and always use swap_free_nr()  [Barry Song, 1 file, -3/+2]
To streamline maintenance efforts, we propose removing the implementation of swap_free(). Instead, we can simply invoke swap_free_nr() with nr set to 1. swap_free_nr() is designed with a bitmap consisting of only one long, resulting in overhead that can be ignored for cases where nr equals 1. A prime candidate for leveraging swap_free_nr() lies within kernel/power/swap.c. Implementing this change facilitates the adoption of batch processing for hibernation. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Barry Song <[email protected]> Suggested-by: "Huang, Ying" <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Acked-by: Chris Li <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Len Brown <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Chuanhua Han <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kairui Song <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Yosry Ahmed <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
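The conversion at call sites is mechanical; a hedged sketch:

    #include <linux/swap.h>

    static void free_one_swap_entry(swp_entry_t entry)
    {
            /* swap_free(entry) is gone; the single-entry case becomes: */
            swap_free_nr(entry, 1);
    }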
2024-07-03  tick/sched: Combine WARN_ON_ONCE and print_once  [Anna-Maria Behnsen, 1 file, -4/+4]
When the WARN_ON_ONCE() triggers, the printk() of the additional information related to the warning will not happen in print level "warn". When reading dmesg with a restriction to level "warn", the information published by the printk_once() will not show up there. Transform WARN_ON_ONCE() and printk_once() into a WARN_ONCE(). Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-03  cgroup: Protect css->cgroup write under css_set_lock  [Waiman Long, 1 file, -1/+1]
The writing of css->cgroup associated with the cgroup root in rebind_subsystems() is currently protected only by cgroup_mutex. However, the reading of css->cgroup in both proc_cpuset_show() and proc_cgroup_show() is protected just by css_set_lock. That makes the readers susceptible to racing problems like data tearing or caching. It is also a problem that can be reported by KCSAN. This can be fixed by using READ_ONCE() and WRITE_ONCE() to access css->cgroup. Alternatively, the writing of css->cgroup can be moved under css_set_lock as well which is done by this patch. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-07-03  cgroup/misc: Introduce misc.peak  [Xiu Jianfeng, 1 file, -0/+41]
Introduce misc.peak to record the historical maximum usage of the resource, as in some scenarios the value of misc.max could be adjusted based on the peak usage of the resource. Signed-off-by: Xiu Jianfeng <[email protected]> Signed-off-by: Tejun Heo <[email protected]>