path: root/kernel
Age  Commit message  (Author, files changed, lines -/+)
2024-08-30  cgroup/cpuset: Account for boot time isolated CPUs  (Waiman Long, 1 file, -5/+18)
With the "isolcpus" boot command line parameter, we are able to create isolated CPUs at boot time. These isolated CPUs aren't fully accounted for in the cpuset code. For instance, the root cgroup's "cpuset.cpus.isolated" control file does not include the boot time isolated CPUs. Fix that by looking for pre-isolated CPUs at init time. The prstate_housekeeping_conflict() function does check the HK_TYPE_DOMAIN housekeeping cpumask to make sure that CPUs outside of it can only be used in isolated partition. Given the fact that we are going to make housekeeping cpumasks dynamic, the current check may not be right anymore. Save the boot time HK_TYPE_DOMAIN cpumask and check against it instead of the upcoming dynamic HK_TYPE_DOMAIN housekeeping cpumask. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-08-30  bpf: Fix a crash when btf_parse_base() returns an error pointer  (Martin KaFai Lau, 1 file, -1/+1)
The pointer returned by btf_parse_base could be an error pointer. An IS_ERR() check is needed before calling btf_free(base_btf). Fixes: 8646db238997 ("libbpf,bpf: Share BTF relocate-related code with kernel") Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Reviewed-by: Alan Maguire <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
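A minimal sketch of the error-pointer pattern this fix relies on; the surrounding code and arguments are illustrative, not the exact flow in kernel/bpf/btf.c:

        base_btf = btf_parse_base(...);
        if (IS_ERR(base_btf))
                return PTR_ERR(base_btf);   /* never fall through to btf_free() with an error pointer */
        ...
        btf_free(base_btf);                 /* only valid for a real (non-IS_ERR) allocation */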
2024-08-30  bpf: Use sockfd_put() helper  (Jinjie Ruan, 1 file, -1/+1)
Replace fput() with sockfd_put() in bpf_fd_reuseport_array_update_elem(). Signed-off-by: Jinjie Ruan <[email protected]> Acked-by: Stanislav Fomichev <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
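The idea, as a hedged sketch (variable names are illustrative, not the exact reuseport_array code): sockfd_put() is the natural counterpart of sockfd_lookup() and simply releases the file reference held by the socket.

        struct socket *sock = sockfd_lookup(map_fd, &err);  /* takes a reference on sock->file */
        if (!sock)
                return err;
        /* ... use the socket ... */
        sockfd_put(sock);                                    /* instead of open-coding fput(sock->file) */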
2024-08-30  bpf: Remove custom build rule  (Alexey Gladkov, 4 files, -6/+6)
According to the documentation, when building a kernel with the C=2 parameter, all source files should be checked. But this does not happen for the kernel/bpf/ directory. $ touch kernel/bpf/core.o $ make C=2 CHECK=true kernel/bpf/core.o Outputs: CHECK scripts/mod/empty.c CALL scripts/checksyscalls.sh DESCEND objtool INSTALL libsubcmd_headers CC kernel/bpf/core.o As can be seen the compilation is done, but CHECK is not executed. This happens because kernel/bpf/Makefile has defined its own rule for compilation and forgotten the macro that does the check. There is no need to duplicate the build code, and this rule can be removed to use generic rules. Acked-by: Masahiro Yamada <[email protected]> Tested-by: Oleg Nesterov <[email protected]> Tested-by: Alan Maguire <[email protected]> Signed-off-by: Alexey Gladkov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-30  drivers/perf: arm_spe: Use perf_allow_kernel() for permissions  (James Clark, 1 file, -0/+9)
Use perf_allow_kernel() for 'pa_enable' (physical addresses), 'pct_enable' (physical timestamps) and context IDs. This means that perf_event_paranoid is now taken into account and LSM hooks can be used, which is more consistent with other perf_event_open calls. For example PERF_SAMPLE_PHYS_ADDR uses perf_allow_kernel() rather than just perfmon_capable(). This also indirectly fixes the following error message which is misleading because perf_event_paranoid is not taken into account by perfmon_capable(): $ perf record -e arm_spe/pa_enable/ Error: Access to performance monitoring and observability operations is limited. Consider adjusting /proc/sys/kernel/perf_event_paranoid setting ... Suggested-by: Al Grant <[email protected]> Signed-off-by: James Clark <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Will Deacon <[email protected]>
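The general shape of the change, hedged and simplified (this is not the literal arm_spe diff; the event variable is illustrative):

        /* before: capability check only */
        if (!perfmon_capable())
                return -EACCES;

        /* after: also honors perf_event_paranoid and the LSM perf hooks */
        err = perf_allow_kernel(&event->attr);
        if (err)
                return err;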
2024-08-30  padata: Honor the caller's alignment in case of chunk_size 0  (Kamlesh Gurudasani, 1 file, -0/+3)
In the case where we are forcing the ps.chunk_size to be at least 1, we are ignoring the caller's alignment. Move the forcing of ps.chunk_size to be at least 1 before rounding it up to caller's alignment, so that caller's alignment is honored. While at it, use max() to force the ps.chunk_size to be at least 1 to improve readability. Fixes: 6d45e1c948a8 ("padata: Fix possible divide-by-0 panic in padata_mt_helper()") Signed-off-by: Kamlesh Gurudasani <[email protected]> Acked-by:  Waiman Long <[email protected]> Acked-by: Daniel Jordan <[email protected]> Signed-off-by: Herbert Xu <[email protected]>
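A sketch of why the ordering matters (field names such as job->align are assumptions for illustration): clamping after the roundup could yield a chunk size of 1 that ignores the requested alignment, whereas clamping first keeps the alignment intact.

        /* force at least 1 *before* honoring the caller's alignment */
        ps.chunk_size = max(ps.chunk_size, 1ul);
        ps.chunk_size = roundup(ps.chunk_size, job->align);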
2024-08-29  bpf: Make the pointer returned by iter next method valid  (Juntong Deng, 1 file, -4/+22)
Currently we cannot pass the pointer returned by iter next method as argument to KF_TRUSTED_ARGS or KF_RCU kfuncs, because the pointer returned by iter next method is not "valid". This patch sets the pointer returned by iter next method to be valid. This is based on the fact that if the iterator is implemented correctly, then the pointer returned from the iter next method should be valid. This does not make NULL pointer valid. If the iter next method has KF_RET_NULL flag, then the verifier will ask the ebpf program to check NULL pointer. KF_RCU_PROTECTED iterator is a special case, the pointer returned by iter next method should only be valid within RCU critical section, so it should be with MEM_RCU, not PTR_TRUSTED. Another special case is bpf_iter_num_next, which returns a pointer with base type PTR_TO_MEM. PTR_TO_MEM should not be combined with type flag PTR_TRUSTED (PTR_TO_MEM already means the pointer is valid). The pointer returned by iter next method of other types of iterators is with PTR_TRUSTED. In addition, this patch adds get_iter_from_state to help us get the current iterator from the current state. Signed-off-by: Juntong Deng <[email protected]> Link: https://lore.kernel.org/r/AM6PR03MB584869F8B448EA1C87B7CDA399962@AM6PR03MB5848.eurprd03.prod.outlook.com Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-29  bpf: Export bpf_base_func_proto  (Martin KaFai Lau, 1 file, -0/+1)
The bpf_testmod needs to use the bpf_tail_call helper in a later selftest patch. This patch exports bpf_base_func_proto with EXPORT_SYMBOL_GPL. Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-29  bpf: Add gen_epilogue to bpf_verifier_ops  (Martin KaFai Lau, 1 file, -1/+45)
This patch adds a .gen_epilogue to the bpf_verifier_ops. It is similar to the existing .gen_prologue. Instead of allowing a subsystem to run code at the beginning of a bpf prog, it allows the subsystem to run code just before the bpf prog exit. One of the use cases is to allow the upcoming bpf qdisc to ensure that the skb->dev is the same as the qdisc->dev_queue->dev. The bpf qdisc struct_ops implementation could either fix it up or drop the skb. Another use case could be in bpf_tcp_ca.c to enforce that snd_cwnd has a sane value (e.g. non-zero). The epilogue can do the useful thing (like checking skb->dev) if it can access the bpf prog's ctx. Unlike the prologue, r1 may not hold the ctx pointer. This patch saves r1 on the stack if the .gen_epilogue has returned some instructions in the "epilogue_buf". The existing .gen_prologue is done in convert_ctx_accesses(). The new .gen_epilogue is also done in convert_ctx_accesses(). When it sees the (BPF_JMP | BPF_EXIT) instruction, it will be patched with the earlier generated "epilogue_buf". The epilogue patching is only done for the main prog. Only one epilogue will be patched to the main program. When the bpf prog has multiple BPF_EXIT instructions, a BPF_JA is used to goto the earlier patched epilogue. The majority of the archs support (BPF_JMP32 | BPF_JA): x86, arm, s390, riscv64, loongarch, powerpc and arc. This patch keeps it simple and always uses (BPF_JMP32 | BPF_JA). A new macro BPF_JMP32_A is added to generate the (BPF_JMP32 | BPF_JA) insn. Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-29  bpf: Adjust BPF_JMP that jumps to the 1st insn of the prologue  (Martin KaFai Lau, 1 file, -0/+6)
The next patch will add a ctx ptr saving instruction "r1 = *(u64 *)(r10 - 8)" at the beginning of the main prog when there is an epilogue patch (by the .gen_epilogue() verifier ops added in the next patch). There is one corner case if the bpf prog has a BPF_JMP that jumps to the 1st instruction. It needs an adjustment such that those BPF_JMP instructions won't jump to the newly added ctx saving instruction. The commit 5337ac4c9b80 ("bpf: Fix the corner case with may_goto and jump to the 1st insn.") has the details on this case. Note that the jump back to the 1st instruction is not limited to the ctx ptr saving instruction. The same also applies to the prologue. A later test, pro_epilogue_goto_start.c, has a test for the prologue-only case. Thus, this patch does one adjustment after gen_prologue and the future ctx ptr saving. It is done by adjust_jmp_off(env->prog, 0, delta) where delta has the total number of instructions in the prologue and the future ctx ptr saving instruction. The adjust_jmp_off(env->prog, 0, delta) assumes that the prologue does not have a goto to the 1st instruction itself. To accommodate the case where the prologue itself has a goto to the 1st insn, this patch changes adjust_jmp_off() to skip considering the instructions between [tgt_idx, tgt_idx + delta). Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-29  bpf: Move insn_buf[16] to bpf_verifier_env  (Martin KaFai Lau, 1 file, -7/+8)
This patch moves the 'struct bpf_insn insn_buf[16]' stack usage to the bpf_verifier_env. A '#define INSN_BUF_SIZE 16' is also added to replace the ARRAY_SIZE(insn_buf) usages. Both convert_ctx_accesses() and do_misc_fixup() are changed to use the env->insn_buf. It is a refactoring work for adding the epilogue_buf[16] in a later patch. With this patch, the stack size usage decreased. Before: ./kernel/bpf/verifier.c:22133:5: warning: stack frame size (2584) After: ./kernel/bpf/verifier.c:22184:5: warning: stack frame size (2264) Reviewed-by: Eduard Zingerman <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-29  bpf: Use kvmemdup to simplify the code  (Hongbo Li, 1 file, -2/+1)
Use kvmemdup instead of kvmalloc() + memcpy() to simplify the code. No functional change intended. Acked-by: Yonghong Song <[email protected]> Signed-off-by: Hongbo Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
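A hedged before/after sketch of the simplification (variable names are illustrative):

        /* before */
        new = kvmalloc(size, GFP_KERNEL);
        if (!new)
                return -ENOMEM;
        memcpy(new, old, size);

        /* after: one call with the same semantics */
        new = kvmemdup(old, size, GFP_KERNEL);
        if (!new)
                return -ENOMEM;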
2024-08-29  irqdomain: Use IS_ERR_OR_NULL() in irq_domain_trim_hierarchy()  (Hongbo Li, 1 file, -1/+1)
Use IS_ERR_OR_NULL() instead of open-coding a NULL and an error pointer check. Signed-off-by: Hongbo Li <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
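The pattern in general terms, with an illustrative pointer name rather than the exact irqdomain code:

        /* open-coded */
        if (!irq_data || IS_ERR(irq_data))
                goto out;

        /* helper */
        if (IS_ERR_OR_NULL(irq_data))
                goto out;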
2024-08-29  genirq/msi: Use kmemdup_array() instead of kmemdup()  (Jinjie Ruan, 1 file, -1/+1)
Let kmemdup_array() take care of the sizing instead of open-coding it. Signed-off-by: Jinjie Ruan <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-29  genirq/proc: Change the return value for set affinity permission error  (Jeff Xie, 1 file, -1/+1)
Currently, when the affinity of an irq cannot be set due to lack of permission, write_irq_affinity() returns the error code -EIO. Change the return value to -EPERM as that reflects the cause of the error correctly. Signed-off-by: Jeff Xie <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-29  genirq/proc: Use irq_move_pending() in show_irq_affinity()  (Jinjie Ruan, 1 file, -4/+2)
irq_move_pending() encapsulates irqd_is_setaffinity_pending() depending on CONFIG_GENERIC_PENDING_IRQ. Replace the open coded #ifdeffery with it. Signed-off-by: Jinjie Ruan <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-29  genirq/proc: Correctly set file permissions for affinity control files  (Jeff Xie, 1 file, -2/+7)
The kernel already knows at the time of interrupt allocation whether affinity of an interrupt can be controlled by userspace or not. It still creates all related procfs control files with read/write permissions. That's inconsistent and non-intuitive for system administrators and tools. Therefore set the file permissions to read-only for such interrupts. [ tglx: Massage change log, fixed UP build ] Signed-off-by: Jeff Xie <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-29  timers: Remove historical extra jiffie for timeout in msleep()  (Anna-Maria Behnsen, 1 file, -2/+2)
msleep() and msleep_interruptible() add a jiffie to the requested timeout. This extra jiffie was introduced to ensure that the timeout will not happen earlier than specified. Since the rework of the timer wheel, the enqueue path already takes care of this. So the extra jiffie added by msleep*() is pointless now. Remove this extra jiffie in msleep() and msleep_interruptible(). Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Link: https://lore.kernel.org/all/[email protected]
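Roughly, the change amounts to dropping the "+ 1" padding; a hedged sketch rather than the literal diff:

        /* before: pad the timeout by one extra jiffie */
        timeout = msecs_to_jiffies(msecs) + 1;

        /* after: the timer-wheel enqueue path already guarantees
         * "never earlier than requested", so no padding is needed */
        timeout = msecs_to_jiffies(msecs);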
2024-08-28  bpf: Relax KF_ACQUIRE kfuncs strict type matching constraint  (Juntong Deng, 1 file, -2/+1)
Currently we cannot pass zero offset (implicit cast) or non-zero offset pointers to KF_ACQUIRE kfuncs. This is because KF_ACQUIRE kfuncs require strict type matching, but a zero or non-zero offset does not change the type of the pointer, which causes the ebpf program to be rejected by the verifier. This can cause problems; one example is that the bpf_skb_peek_tail kfunc [0] cannot be implemented by just passing in non-zero offset pointers. We cannot pass pointers like &sk->sk_write_queue (non-zero offset) or &sk->__sk_common (zero offset) to KF_ACQUIRE kfuncs. This patch makes KF_ACQUIRE kfuncs not require strict type matching. [0]: https://lore.kernel.org/bpf/AM6PR03MB5848CA39CB4B7A4397D380B099B12@AM6PR03MB5848.eurprd03.prod.outlook.com/ Signed-off-by: Juntong Deng <[email protected]> Link: https://lore.kernel.org/r/AM6PR03MB5848FD2BD89BF0B6B5AA3B4C99952@AM6PR03MB5848.eurprd03.prod.outlook.com Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-28  audit: use task_tgid_nr() instead of task_pid_nr()  (Ricardo Robaina, 3 files, -3/+3)
In a few audit records, PIDs were being recorded with task_pid_nr() instead of task_tgid_nr(). $ grep "task_pid_nr" kernel/audit*.c audit.c: task_pid_nr(current), auditfilter.c: pid = task_pid_nr(current); auditsc.c: audit_log_format(ab, " pid=%u", task_pid_nr(current)); For single-thread applications, the process id (pid) and the thread group id (tgid) are the same. However, on multi-thread applications, task_pid_nr() returns the current thread id (user-space's TID), while task_tgid_nr() returns the main thread id (user-space's PID). Since the users are more interested in the process id (pid), rather than the thread id (tid), this patch converts these callers to the correct method. Link: https://github.com/linux-audit/audit-kernel/issues/126 Reviewed-by: Richard Guy Briggs <[email protected]> Signed-off-by: Ricardo Robaina <[email protected]> Signed-off-by: Paul Moore <[email protected]>
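The auditsc.c call quoted above, before and after, shows the distinction in one line:

        /* before: logs the thread id (user-space TID) of the calling thread */
        audit_log_format(ab, " pid=%u", task_pid_nr(current));

        /* after: logs the thread-group id (user-space PID), which is what
         * audit consumers actually expect in the pid= field */
        audit_log_format(ab, " pid=%u", task_tgid_nr(current));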
2024-08-27  sched_ext: Add missing cfi stub for ops.tick  (Tejun Heo, 1 file, -0/+2)
The cfi stub for ops.tick was missing which will fail scheduler loading after pending BPF changes. Add it. Signed-off-by: Tejun Heo <[email protected]>
2024-08-27  rcu/kvfree: Add kvfree_rcu_barrier() API  (Uladzislau Rezki (Sony), 1 file, -8/+101)
Add a kvfree_rcu_barrier() function. It waits until all in-flight pointers are freed over the RCU machinery. It does not wait for any GP completion and is within its rights to return immediately if there are no outstanding pointers. This function is useful when there is a need to guarantee that memory is fully freed before destroying memory caches, for example, when unloading a kernel module. Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]>
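A sketch of the module-unload use case described above; my_module_exit and my_cache are hypothetical names:

        static void __exit my_module_exit(void)
        {
                /* make sure every object handed to kvfree_rcu() has actually
                 * been freed before the backing cache goes away */
                kvfree_rcu_barrier();
                kmem_cache_destroy(my_cache);
        }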
2024-08-27  Merge v6.11-rc5 into drm-next  (Daniel Vetter, 29 files, -181/+230)
amdgpu PR conflicts due to patches cherry-picked to -fixes; I might as well catch up with a backmerge and handle them all. Plus both misc and intel maintainers asked for a backmerge anyway. Signed-off-by: Daniel Vetter <[email protected]>
2024-08-27  genirq: Get rid of global lock in irq_do_set_affinity()  (Marc Zyngier, 1 file, -12/+9)
Kunkun Jiang reports that for a workload involving the simultaneous startup of a large number of VMs (for a total of about 200 vcpus), a lot of CPU time gets spent on spinning on the tmp_mask_lock that exists as a static raw spinlock in irq_do_set_affinity(). This lock protects a global cpumask (tmp_mask) that is used as a temporary variable to compute the resulting affinity. While this is triggered by KVM issuing a irq_set_affinity() call each time a vcpu is about to execute, it is obvious that having a single global resource is not very scalable. Since a cpumask can be a fairly large structure on systems with a high core count, a stack allocation is not really appropriate. Instead, turn the global cpumask into a per-CPU variable, removing the need for locking altogether as the code is executed with preemption and interrupts disabled. [ tglx: Moved the per CPU variable declaration outside of the function ] Reported-by: Kunkun Jiang <[email protected]> Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Kunkun Jiang <[email protected]> Link: https://lore.kernel.org/all/[email protected] Link: https://lore.kernel.org/all/[email protected]
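A hedged sketch of the before/after structure (identifiers are illustrative; the real code lives in kernel/irq/manage.c):

        /* before: one global scratch mask, serialized by a raw spinlock */
        static DEFINE_RAW_SPINLOCK(tmp_mask_lock);
        static struct cpumask tmp_mask;

        /* after: a per-CPU scratch mask; irq_do_set_affinity() runs with
         * preemption and interrupts disabled, so no lock is required */
        static DEFINE_PER_CPU(struct cpumask, __tmp_mask);
        ...
        struct cpumask *tmp_mask = this_cpu_ptr(&__tmp_mask);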
2024-08-27  Merge tag 'vfs-6.11-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs  (Linus Torvalds, 1 file, -22/+3)
Pull vfs fixes from Christian Brauner: "VFS: - Ensure that backing files use file->f_ops->splice_write() for splice netfs: - Revert the removal of PG_private_2 from netfs_release_folio() as cephfs still relies on this - When AS_RELEASE_ALWAYS is set on a mapping the folio needs to always be invalidated during truncation - Fix losing untruncated data in a folio by letting netfs_release_folio() return false if the folio is dirty - Fix trimming of streaming-write folios in netfs_inval_folio() - Reset iterator before retrying a short read - Fix interaction of streaming writes with zero-point tracker afs: - During truncation afs currently calls truncate_setsize() which sets i_size, expands the pagecache and truncates it. The first two operations aren't needed because they will have already been done. So call truncate_pagecache() instead and skip the redundant parts overlayfs: - Fix checking of the number of allowed lower layers so 500 layers can actually be used instead of just 499 - Add missing '\n' to pr_err() output - Pass string to ovl_parse_layer() and thus allow it to be used for Opt_lowerdir as well pidfd: - Revert blocking the creation of pidfds for kthreads as apparently userspace relies on this. Specifically, it breaks systemd during shutdown romfs: - Fix romfs_read_folio() to use the correct offset with folio_zero_tail()" * tag 'vfs-6.11-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: netfs: Fix interaction of streaming writes with zero-point tracker netfs: Fix missing iterator reset on retry of short read netfs: Fix trimming of streaming-write folios in netfs_inval_folio() netfs: Fix netfs_release_folio() to say no if folio dirty afs: Fix post-setattr file edit to do truncation correctly mm: Fix missing folio invalidation calls during truncation ovl: ovl_parse_param_lowerdir: Add missed '\n' for pr_err ovl: fix wrong lowerdir number check for parameter Opt_lowerdir ovl: pass string to ovl_parse_layer() backing-file: convert to using fops->splice_write Revert "pidfd: prevent creation of pidfds for kthreads" romfs: fix romfs_read_folio() netfs, ceph: Partially revert "netfs: Replace PG_fscache by setting folio->private and marking dirty"
2024-08-26  tracing: Add option to set an instance to be the trace_printk destination  (Steven Rostedt, 2 files, -5/+36)
Add a option "trace_printk_dest" that will make the tracing instance the location that trace_printk() will go to. This is useful if the trace_printk or one of the top level tracers is too noisy and there's a need to separate the two. Then an instance can be created, the trace_printk can be set to go there instead, where it will not be lost in the noise of the top level tracer. Note, only one instance can be the destination of trace_printk at a time. If an instance sets this flag, the instance that had it set will have it cleared. There is always one instance that has this set. By default, that is the top instance. This flag cannot be cleared from the top instance. Doing so will result in an -EINVAL. The only way this flag can be cleared from the top instance is by another instance setting it. Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Vincent Donnefort <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vineeth Pillai <[email protected]> Cc: Beau Belgrave <[email protected]> Cc: Alexander Graf <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: David Howells <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Tony Luck <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Kees Cook <[email protected]> Cc: Alexander Aring <[email protected]> Cc: "Luis Claudio R. Goncalves" <[email protected]> Cc: Tomas Glozar <[email protected]> Cc: John Kacur <[email protected]> Cc: Clark Williams <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: "Jonathan Corbet" <[email protected]> Link: https://lore.kernel.org/[email protected] Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-08-26  tracing: Have trace_printk not use binary prints if boot buffer  (Steven Rostedt, 3 files, -17/+35)
If the persistent boot mapped ring buffer is used for trace_printk(), force it to not use the binary versions. trace_printk() by default uses bin_printf() that only saves the pointer to the format and not the format itself inside the ring buffer. But for a persistent buffer that is read after reboot, the pointers to the format strings may not be the same, or worse, not even exist! Instead, just force the more robust, but slower, version that does the formatting before saving into the ring buffer. The boot mapped buffer can now be used for trace_printk and friends! Using the trace_printk() and the persistent buffer was used to debug the issue with the osnoise tracer: Link: https://lore.kernel.org/all/[email protected]/ Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Vincent Donnefort <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vineeth Pillai <[email protected]> Cc: Beau Belgrave <[email protected]> Cc: Alexander Graf <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: David Howells <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Tony Luck <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Kees Cook <[email protected]> Cc: Alexander Aring <[email protected]> Cc: "Luis Claudio R. Goncalves" <[email protected]> Cc: Tomas Glozar <[email protected]> Cc: John Kacur <[email protected]> Cc: Clark Williams <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: "Jonathan Corbet" <[email protected]> Link: https://lore.kernel.org/[email protected] Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-08-26  tracing: Allow trace_printk() to go to other instance buffers  (Steven Rostedt, 1 file, -11/+35)
Currently, trace_printk() just goes to the top level ring buffer. But there may be times that it should go to one of the instances created by the kernel command line. Add a new trace_instance flag: traceprintk (also can use "printk" or "trace_printk" as people tend to forget the actual flag name). trace_instance=foo^traceprintk Will assign the trace_printk to this buffer at boot up. Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Vincent Donnefort <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vineeth Pillai <[email protected]> Cc: Beau Belgrave <[email protected]> Cc: Alexander Graf <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: David Howells <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Tony Luck <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Kees Cook <[email protected]> Cc: Alexander Aring <[email protected]> Cc: "Luis Claudio R. Goncalves" <[email protected]> Cc: Tomas Glozar <[email protected]> Cc: John Kacur <[email protected]> Cc: Clark Williams <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: "Jonathan Corbet" <[email protected]> Link: https://lore.kernel.org/[email protected] Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-08-26tracing: Add "traceoff" flag to boot time tracing instancesSteven Rostedt1-1/+30
Add a "flags" delimiter (^) to the "trace_instance" kernel command line parameter, and add the "traceoff" flag. The format is: trace_instance=<name>[^<flag1>[^<flag2>]][@<memory>][,<events>] The code allows for more than one flag to be added, but currently only "traceoff" is done so. The motivation for this change came from debugging with the persistent ring buffer and having trace_printk() writing to it. The trace_printk calls are always enabled, and the boot after the crash was having the unwanted trace_printks from the current boot inject into the ring buffer with the trace_printks of the crash kernel, making the output very confusing. Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Vincent Donnefort <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vineeth Pillai <[email protected]> Cc: Beau Belgrave <[email protected]> Cc: Alexander Graf <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: David Howells <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Tony Luck <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Kees Cook <[email protected]> Cc: Alexander Aring <[email protected]> Cc: "Luis Claudio R. Goncalves" <[email protected]> Cc: Tomas Glozar <[email protected]> Cc: John Kacur <[email protected]> Cc: Clark Williams <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: "Jonathan Corbet" <[email protected]> Link: https://lore.kernel.org/[email protected] Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-08-26  ring-buffer: Align meta-page to sub-buffers for improved TLB usage  (Vincent Donnefort, 1 file, -13/+20)
Previously, the mapped ring-buffer layout caused misalignment between the meta-page and sub-buffers when the sub-buffer size was not a multiple of PAGE_SIZE. This prevented hardware with larger TLB entries from utilizing them effectively. Add padding with the zero-page between the meta-page and sub-buffers. Also update the ring-buffer map_test to verify that padding. Link: https://lore.kernel.org/[email protected] Signed-off-by: Vincent Donnefort <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-08-26  ring-buffer: Add magic and struct size to boot up meta data  (Steven Rostedt, 1 file, -0/+14)
Add a magic number and save the struct size of the ring_buffer_meta structure in the meta data to use for validation. Updating the magic number could be used to force an invalidation between kernel versions, and saving the structure size is also a good method to make sure the content is what is expected. Cc: Masami Hiramatsu <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Vincent Donnefort <[email protected]> Link: https://lore.kernel.org/[email protected] Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-08-26  ring-buffer: Don't reset persistent ring-buffer meta saved addresses  (Steven Rostedt, 1 file, -8/+24)
The text and data address is saved in the meta data so that it can be used to know the delta of the text and data addresses of the last boot compared to the text and data addresses of the current boot. The delta is used to convert function pointer entries in the ring buffer to something that can be used by kallsyms (note this only works for built-in functions). But the saved addresses get reset on boot up. If the buffer is not used and there's another reboot, then the saved text and data addresses will be of the last boot and not that of the boot that created the content in the ring buffer. To get an idea of the issue: # trace-cmd start -B boot_mapped -p function # reboot # trace-cmd show -B boot_mapped | tail <...>-1 [000] d..1. 461.983243: native_apic_msr_write <-native_kick_ap <...>-1 [000] d..1. 461.983244: __pfx_native_apic_msr_eoi <-native_kick_ap <...>-1 [000] d..1. 461.983244: reserve_irq_vector_locked <-native_kick_ap <...>-1 [000] d..1. 461.983262: branch_emulate_op <-native_kick_ap <...>-1 [000] d..1. 461.983262: __ia32_sys_ia32_pread64 <-native_kick_ap <...>-1 [000] d..1. 461.983263: native_kick_ap <-__smpboot_create_thread <...>-1 [000] d..1. 461.983263: store_cache_disable <-native_kick_ap <...>-1 [000] d..1. 461.983279: acpi_power_off_prepare <-native_kick_ap <...>-1 [000] d..1. 461.983280: __pfx_acpi_ns_delete_node <-acpi_suspend_enter <...>-1 [000] d..1. 461.983280: __pfx_acpi_os_release_lock <-acpi_suspend_enter # reboot # trace-cmd show -B boot_mapped |tail <...>-1 [000] d..1. 461.983243: 0xffffffffa9669220 <-0xffffffffa965f3db <...>-1 [000] d..1. 461.983244: 0xffffffffa96690f0 <-0xffffffffa965f3db <...>-1 [000] d..1. 461.983244: 0xffffffffa9663fa0 <-0xffffffffa965f3db <...>-1 [000] d..1. 461.983262: 0xffffffffa9672e80 <-0xffffffffa965f3e0 <...>-1 [000] d..1. 461.983262: 0xffffffffa962b940 <-0xffffffffa965f3ec <...>-1 [000] d..1. 461.983263: 0xffffffffa965f540 <-0xffffffffa96e1362 <...>-1 [000] d..1. 461.983263: 0xffffffffa963c940 <-0xffffffffa965f55b <...>-1 [000] d..1. 461.983279: 0xffffffffa9ee30c0 <-0xffffffffa965f59b <...>-1 [000] d..1. 461.983280: 0xffffffffa9f16c10 <-0xffffffffa9ee3157 <...>-1 [000] d..1. 461.983280: 0xffffffffa9ee02e0 <-0xffffffffa9ee3157 By not updating the saved text and data addresses in the meta data at every boot up and only updating them when the buffer is reset, it allows multiple boots to see the same data. Cc: Masami Hiramatsu <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Vincent Donnefort <[email protected]> Link: https://lore.kernel.org/[email protected] Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-08-24  Merge tag 'cgroup-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  (Linus Torvalds, 1 file, -17/+21)
Pull cgroup fixes from Tejun Heo: "Three patches addressing cpuset corner cases" * tag 'cgroup-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Eliminate unncessary sched domains rebuilds in hotplug cgroup/cpuset: Clear effective_xcpus on cpus_allowed clearing only if cpus.exclusive not set cgroup/cpuset: fix panic caused by partcmd_update
2024-08-24  Merge tag 'wq-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq  (Linus Torvalds, 1 file, -23/+27)
Pull workqueue fixes from Tejun Heo: "Nothing too interesting. One patch to remove spurious warning and others to address static checker warnings" * tag 'wq-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Correct declaration of cpu_pwq in struct workqueue_struct workqueue: Fix spruious data race in __flush_work() workqueue: Remove incorrect "WARN_ON_ONCE(!list_empty(&worker->entry));" from dying worker workqueue: Fix UBSAN 'subtraction overflow' error in shift_and_mask() workqueue: doc: Fix function name, remove markers
2024-08-23  bpf: Add bpf_copy_from_user_str kfunc  (Jordan Rome, 1 file, -0/+42)
This adds a kfunc wrapper around strncpy_from_user, which can be called from sleepable BPF programs. This matches the non-sleepable 'bpf_probe_read_user_str' helper except it includes an additional 'flags' param, which allows consumers to clear the entire destination buffer on success or failure. Signed-off-by: Jordan Rome <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
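A hedged usage sketch from a sleepable BPF program, assuming a (destination buffer, buffer size, user pointer, flags) argument order as described above; user_ptr stands for whatever user-space pointer is available in the program's context:

        char buf[64];
        int n;

        /* sleepable context required; flags can request clearing of the buffer */
        n = bpf_copy_from_user_str(buf, sizeof(buf), user_ptr, 0);
        if (n < 0)
                return 0;       /* copy failed */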
2024-08-23  bpf: Support bpf_kptr_xchg into local kptr  (Dave Marchevsky, 2 files, -16/+32)
Currently, users can only stash kptr into map values with bpf_kptr_xchg(). This patch further supports stashing kptr into local kptr by adding local kptr as a valid destination type. When stashing into local kptr, btf_record in program BTF is used instead of btf_record in map to search for the btf_field of the local kptr. The local kptr specific checks in check_reg_type() only apply when the source argument of bpf_kptr_xchg() is local kptr. Therefore, we make the scope of the check explicit as the destination now can also be local kptr. Acked-by: Martin KaFai Lau <[email protected]> Signed-off-by: Dave Marchevsky <[email protected]> Signed-off-by: Amery Hung <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-23  bpf: Rename ARG_PTR_TO_KPTR -> ARG_KPTR_XCHG_DEST  (Dave Marchevsky, 2 files, -4/+4)
ARG_PTR_TO_KPTR is currently only used by the bpf_kptr_xchg helper. Although it limits reg types for that helper's first arg to PTR_TO_MAP_VALUE, any arbitrary mapval won't do: further custom verification logic ensures that the mapval reg being xchgd-into is pointing to a kptr field. If this is not the case, it's not safe to xchg into that reg's pointee. Let's rename the bpf_arg_type to more accurately describe the fairly specific expectations that this arg type encodes. This is a nonfunctional change. Acked-by: Martin KaFai Lau <[email protected]> Signed-off-by: Dave Marchevsky <[email protected]> Signed-off-by: Amery Hung <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-23  bpf: Search for kptrs in prog BTF structs  (Dave Marchevsky, 1 file, -18/+52)
Currently btf_parse_fields is used in two places to create struct btf_record's for structs: when looking at mapval type, and when looking at any struct in program BTF. The former looks for kptr fields while the latter does not. This patch modifies the btf_parse_fields call made when looking at prog BTF struct types to search for kptrs as well. Before this series there was no reason to search for kptrs in non-mapval types: a referenced kptr needs some owner to guarantee resource cleanup, and map values were the only owner that supported this. If a struct with a kptr field were to have some non-kptr-aware owner, the kptr field might not be properly cleaned up and result in resources leaking. Only searching for kptr fields in mapval was a simple way to avoid this problem. In practice, though, searching for BPF_KPTR when populating struct_meta_tab does not expose us to this risk, as struct_meta_tab is only accessed through btf_find_struct_meta helper, and that helper is only called in contexts where recognizing the kptr field is safe: * PTR_TO_BTF_ID reg w/ MEM_ALLOC flag * Such a reg is a local kptr and must be free'd via bpf_obj_drop, which will correctly handle kptr field * When handling specific kfuncs which either expect MEM_ALLOC input or return MEM_ALLOC output (obj_{new,drop}, percpu_obj_{new,drop}, list+rbtree funcs, refcount_acquire) * Will correctly handle kptr field for same reasons as above * When looking at kptr pointee type * Called by functions which implement "correct kptr resource handling" * In btf_check_and_fixup_fields * Helper that ensures no ownership loops for lists and rbtrees, doesn't care about kptr field existence So we should be able to find BPF_KPTR fields in all prog BTF structs without leaking resources. Further patches in the series will build on this change to support kptr_xchg into non-mapval local kptr. Without this change there would be no kptr field found in such a type. Acked-by: Martin KaFai Lau <[email protected]> Acked-by: Hou Tao <[email protected]> Signed-off-by: Dave Marchevsky <[email protected]> Signed-off-by: Amery Hung <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-23  bpf: Let callers of btf_parse_kptr() track life cycle of prog btf  (Amery Hung, 2 files, -3/+5)
btf_parse_kptr() and btf_record_free() do btf_get() and btf_put() respectively when working on btf_record in program and map if there are kptr fields. If the kptr is from program BTF, since both callers have already tracked the life cycle of the program BTF, it is safe to remove the btf_get() and btf_put(). This change prevents a memory leak of program BTF later when we start searching for kptr fields when building btf_record for the program. It can happen when the btf fd is closed. The btf_put() corresponding to the btf_get() in btf_parse_kptr() was supposed to be called by btf_record_free() in btf_free_struct_meta_tab() in btf_free(). However, it will never happen since the invocation of btf_free() depends on the refcount of the btf becoming 0 in the first place. Acked-by: Martin KaFai Lau <[email protected]> Acked-by: Hou Tao <[email protected]> Signed-off-by: Amery Hung <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-22  hrtimer: Use and report correct timerslack values for realtime tasks  (Felix Moessbauer, 3 files, -15/+13)
The timerslack_ns setting is used to specify how much the hardware timers should be delayed, to potentially dispatch multiple timers in a single interrupt. This is a performance optimization. Timers of realtime tasks (having a realtime scheduling policy) should not be delayed. This logic was inconsistently applied to the hrtimers, leading to delays of realtime tasks which used timed waits for events (e.g. condition variables). Due to the downstream override of the slack for rt tasks, the procfs reported incorrect (non-zero) timerslack_ns values. This is changed by setting the timer_slack_ns task attribute to 0 for all tasks with an rt policy. With that, downstream users do not need to specially handle rt tasks (w.r.t. the slack), and the procfs entry shows the correct value of "0". Setting non-zero slack values (either via procfs or PR_SET_TIMERSLACK) on tasks with an rt policy is ignored, as stated in "man 2 PR_SET_TIMERSLACK": Timer slack is not applied to threads that are scheduled under a real-time scheduling policy (see sched_setscheduler(2)). The special handling of timerslack on rt tasks in downstream users is removed as well. Signed-off-by: Felix Moessbauer <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
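A minimal sketch of the idea, not the literal diff (p stands for the task whose policy is being set):

        /* when a task ends up on a realtime policy, its slack becomes 0 ... */
        if (task_is_realtime(p))
                p->timer_slack_ns = 0;

        /* ... so downstream users can use the value as-is, without
         * special-casing rt tasks */
        slack = current->timer_slack_ns;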
2024-08-22  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (Alexei Starovoitov, 40 files, -619/+476)
Cross-merge bpf fixes after downstream PR including important fixes (from bpf-next point of view): commit 41c24102af7b ("selftests/bpf: Filter out _GNU_SOURCE when compiling test_cpp") commit fdad456cbcca ("bpf: Fix updating attached freplace prog in prog_array map") No conflicts. Adjacent changes in: include/linux/bpf_verifier.h kernel/bpf/verifier.c tools/testing/selftests/bpf/Makefile Link: https://lore.kernel.org/bpf/[email protected]/ Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-22  bpf: allow bpf_fastcall for bpf_cast_to_kern_ctx and bpf_rdonly_cast  (Eduard Zingerman, 1 file, -0/+3)
do_misc_fixups() replaces bpf_cast_to_kern_ctx() and bpf_rdonly_cast() by a single instruction "r0 = r1". This follows the bpf_fastcall contract. This commit allows the bpf_fastcall pattern rewrite for these two functions in order to use them in bpf_fastcall selftests. Acked-by: Yonghong Song <[email protected]> Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-22  bpf: support bpf_fastcall patterns for kfuncs  (Eduard Zingerman, 1 file, -1/+34)
Recognize bpf_fastcall patterns around kfunc calls. For example, suppose bpf_cast_to_kern_ctx() follows bpf_fastcall contract (which it does), in such a case allow verifier to rewrite BPF program below: r2 = 1; *(u64 *)(r10 - 32) = r2; call %[bpf_cast_to_kern_ctx]; r2 = *(u64 *)(r10 - 32); r0 = r2; By removing the spill/fill pair: r2 = 1; call %[bpf_cast_to_kern_ctx]; r0 = r2; Acked-by: Yonghong Song <[email protected]> Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-22  bpf: rename nocsr -> bpf_fastcall in verifier  (Eduard Zingerman, 2 files, -73/+72)
Attribute used by LLVM implementation of the feature had been changed from no_caller_saved_registers to bpf_fastcall (see [1]). This commit replaces references to nocsr by references to bpf_fastcall to keep LLVM and Kernel parts in sync. [1] https://github.com/llvm/llvm-project/pull/105417 Acked-by: Yonghong Song <[email protected]> Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-22  bpf: Fix percpu address space issues  (Uros Bizjak, 4 files, -16/+17)
In arraymap.c: In bpf_array_map_seq_start() and bpf_array_map_seq_next() cast return values from the __percpu address space to the generic address space via uintptr_t [1]. Correct the declaration of pptr pointer in __bpf_array_map_seq_show() to void __percpu * and cast the value from the generic address space to the __percpu address space via uintptr_t [1]. In hashtab.c: Assign the return value from bpf_mem_cache_alloc() to void pointer and cast the value to void __percpu ** (void pointer to percpu void pointer) before dereferencing. In memalloc.c: Explicitly declare __percpu variables. Cast obj to void __percpu **. In helpers.c: Cast ptr in BPF_CALL_1 and BPF_CALL_2 from generic address space to __percpu address space via const uintptr_t [1]. Found by GCC's named address space checks. There were no changes in the resulting object files. [1] https://sparse.docs.kernel.org/en/latest/annotations.html#address-space-name Signed-off-by: Uros Bizjak <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Andrii Nakryiko <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Eduard Zingerman <[email protected]> Cc: Song Liu <[email protected]> Cc: Yonghong Song <[email protected]> Cc: John Fastabend <[email protected]> Cc: KP Singh <[email protected]> Cc: Stanislav Fomichev <[email protected]> Cc: Hao Luo <[email protected]> Cc: Jiri Olsa <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
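The core trick, as a hedged sketch: laundering a pointer through uintptr_t makes the address-space change explicit, so sparse and GCC's named-address-space checks no longer see generic and __percpu pointers being mixed.

        void *ptr;               /* generic pointer, e.g. returned from an allocator */
        void __percpu *pptr;

        /* generic -> __percpu (the reverse direction works the same way) */
        pptr = (void __percpu *)(uintptr_t)ptr;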
2024-08-22  bpf: correctly handle malformed BPF_CORE_TYPE_ID_LOCAL relos  (Eduard Zingerman, 1 file, -0/+8)
In case of a malformed relocation record of kind BPF_CORE_TYPE_ID_LOCAL referencing a non-existing BTF type, the function bpf_core_calc_relo_insn would cause a null pointer dereference. Fix this by adding a proper check higher up in the call stack, as malformed relocation records could be passed from user space. The simplest reproducer is a program: r0 = 0 exit With a single relocation record: .insn_off = 0, /* patch first instruction */ .type_id = 100500, /* this type id does not exist */ .access_str_off = 6, /* offset of string "0" */ .kind = BPF_CORE_TYPE_ID_LOCAL, See the link for the original reproducer or the next commit for a test case. Fixes: 74753e1462e7 ("libbpf: Replace btf__type_by_id() with btf_type_by_id().") Reported-by: Liu RuiTong <[email protected]> Closes: https://lore.kernel.org/bpf/CAK55_s6do7C+DVwbwY_7nKfUz0YLDoiA1v6X3Y9+p0sWzipFSA@mail.gmail.com/ Acked-by: Andrii Nakryiko <[email protected]> Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
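The check is roughly of this shape, sketched with illustrative names (the exact placement in the verifier's CO-RE path may differ):

        /* reject the record up front instead of dereferencing a NULL type later */
        type = btf_type_by_id(local_btf, relo->type_id);
        if (!type)
                return -EINVAL;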
2024-08-22  dma-mapping: direct calls for dma-iommu  (Leon Romanovsky, 3 files, -10/+79)
Directly call into dma-iommu just like we have been doing for dma-direct for a while. This avoids the indirect call overhead for IOMMU ops and removes the need to have DMA ops entirely for many common configurations. Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Acked-by: Robin Murphy <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2024-08-22  dma-mapping: call ->unmap_page and ->unmap_sg unconditionally  (Leon Romanovsky, 2 files, -2/+23)
Almost all instances of the dma_map_ops ->map_page()/map_sg() methods implement ->unmap_page()/unmap_sg() too. The one instance which doesn't is dma_dummy_ops, which is used to fail the DMA mapping and thus there won't be any calls to ->unmap_page()/unmap_sg(). Remove the checks for ->unmap_page()/unmap_sg() and call them directly to create an interface that is symmetrical to ->map_page()/map_sg(). Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Reviewed-by: Robin Murphy <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
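A hedged before/after sketch of the call site change (argument names are illustrative):

        /* before */
        if (ops->unmap_page)
                ops->unmap_page(dev, addr, size, dir, attrs);

        /* after: every ops that can map can also unmap, so call it directly,
         * mirroring how ->map_page() is already invoked */
        ops->unmap_page(dev, addr, size, dir, attrs);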
2024-08-22  dma-mapping: replace zone_dma_bits by zone_dma_limit  (Catalin Marinas, 3 files, -8/+8)
The hardware DMA limit might not be power of 2. When RAM range starts above 0, say 4GB, DMA limit of 30 bits should end at 5GB. A single high bit can not encode this limit. Use a plain address for the DMA zone limit instead. Since the DMA zone can now potentially span beyond 4GB physical limit of DMA32, make sure to use DMA zone for GFP_DMA32 allocations in that case. Signed-off-by: Catalin Marinas <[email protected]> Co-developed-by: Baruch Siach <[email protected]> Signed-off-by: Baruch Siach <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Reviewed-by: Petr Tesarik <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2024-08-22  dma-mapping: use bit masking to check VM_DMA_COHERENT  (Yosry Ahmed, 1 file, -2/+4)
In dma_common_find_pages(), area->flags are compared directly with VM_DMA_COHERENT. This works because VM_DMA_COHERENT is the only set flag. During development of a new feature (ASI [1]), a new VM flag is introduced, and that flag can be injected into VM_DMA_COHERENT mappings (among others). The presence of that flag caused dma_common_find_pages() to return NULL for VM_DMA_COHERENT addresses, leading to a lot of problems ending in crashing during boot. It took a bit of time to figure this problem out. It was a mistake to inject a VM flag to begin with, but it took a significant amount of debugging to figure out the problem. Most users of area->flags use bitmasking rather than equivalency to check for flags. Update dma_common_find_pages() and dma_common_free_remap() to do the same, which would have avoided the boot crashing. Instead, add a warning in dma_common_find_pages() if any extra VM flags are set to catch such problems more easily during development. No functional change intended. [1]https://lore.kernel.org/lkml/[email protected]/ Signed-off-by: Yosry Ahmed <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
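A sketch of the idea described above (simplified; the warning condition in the actual patch may account for additional expected flags):

        /* before: breaks as soon as any extra VM_* flag is set on the area */
        if (!area || area->flags != VM_DMA_COHERENT)
                return NULL;

        /* after: test the bit, and flag unexpected extras during development */
        if (!area || !(area->flags & VM_DMA_COHERENT))
                return NULL;
        WARN_ON_ONCE(area->flags & ~VM_DMA_COHERENT);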