path: root/kernel
Age | Commit message | Author | Files | Lines
2024-08-15 | rcuscale: NULL out top-level pointers to heap memory | Paul E. McKenney | 1 | -0/+5
Currently, if someone modprobes and rmmods rcuscale successfully, but the next run errors out during the modprobe, non-NULL pointers to freed memory will remain. If the run after that also errors out during the modprobe, there will be double-free bugs. This commit therefore NULLs out top-level pointers to memory that has just been freed. Signed-off-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
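The free-then-NULL pattern described above can be sketched in userspace C. This is only an illustration: `writer_durations`, `demo_init()` and `demo_cleanup()` are hypothetical names, and `calloc()`/`free()` stand in for the kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical top-level pointer, standing in for a module-global buffer. */
static int *writer_durations;

/* Allocate the buffer; returns 0 on success, -1 on failure. */
static int demo_init(size_t n)
{
    assert(n > 0);
    writer_durations = calloc(n, sizeof(*writer_durations));
    return writer_durations ? 0 : -1;
}

/* Free the buffer and NULL the pointer so a repeated cleanup is harmless. */
static void demo_cleanup(void)
{
    free(writer_durations);   /* free(NULL) is a no-op */
    writer_durations = NULL;  /* no stale pointer survives the cleanup */
}
```

Because the pointer is reset, a second `demo_cleanup()` after a failed initialization sees NULL rather than the address of memory that was already freed.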
2024-08-15 | rcuscale: Use special allocator for rcu_scale_writer() | Paul E. McKenney | 1 | -10/+113
The rcu_scale_writer() function needs only a fixed number of rcu_head structures per kthread, which means that a trivial allocator suffices. This commit therefore uses an llist-based allocator using a fixed array of structures per kthread. This allows aggressive testing of RCU performance without stressing the slab allocators. Signed-off-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
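A fixed-array allocator of this general shape might look as follows. This is a single-threaded userspace sketch, not the code from the commit: the kernel's llist is lock-free, whereas the list here is a plain singly linked one, and `POOL_SIZE`, `struct pool_item` and the `pool_*()` functions are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4  /* hypothetical per-thread limit */

/* One pool element; in the kernel this would carry an rcu_head. */
struct pool_item {
    struct pool_item *next;  /* free-list link */
    int payload;
};

struct pool {
    struct pool_item items[POOL_SIZE];  /* fixed backing array */
    struct pool_item *free_list;        /* head of the free list */
};

/* Thread every array element onto the free list. */
static void pool_init(struct pool *p)
{
    p->free_list = NULL;
    for (size_t i = 0; i < POOL_SIZE; i++) {
        p->items[i].next = p->free_list;
        p->free_list = &p->items[i];
    }
}

/* Pop one element, or return NULL when the pool is exhausted. */
static struct pool_item *pool_alloc(struct pool *p)
{
    struct pool_item *it = p->free_list;

    if (it)
        p->free_list = it->next;
    return it;
}

/* Push an element back onto the free list for reuse. */
static void pool_free(struct pool *p, struct pool_item *it)
{
    assert(it != NULL);
    it->next = p->free_list;
    p->free_list = it;
}
```

Allocation and release are both constant-time pointer swaps on memory that is reserved up front, which is why such a scheme puts no load on a general-purpose allocator.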
2024-08-15 | rcuscale: Make rcu_scale_writer() tolerate repeated GFP_KERNEL failure | Paul E. McKenney | 1 | -3/+6
Under some conditions, kmalloc(GFP_KERNEL) allocations have been observed to repeatedly fail. This situation has been observed to cause one of the rcu_scale_writer() instances to loop indefinitely retrying memory allocation for an asynchronous grace-period primitive. The problem is that if memory is short, all the other instances will allocate all available memory before the looping task is awakened from its rcu_barrier*() call. This in turn results in hangs, so that rcuscale fails to complete. This commit therefore removes the tight retry loop, so that when this condition occurs, the affected task is still passing through the full loop with its full set of termination checks. This spreads the risk of indefinite memory-allocation retry failures across all instances of rcu_scale_writer() tasks, which in turn prevents the hangs. Signed-off-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-15 | rcuscale: Make all writer tasks report upon hang | Paul E. McKenney | 1 | -0/+6
This commit causes all writer tasks to provide a brief report after a hang has been reported, spaced at one-second intervals. Signed-off-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-15 | rcuscale: Provide clear error when async specified without primitives | Paul E. McKenney | 1 | -2/+2
Currently, if the rcuscale module's async module parameter is specified for RCU implementations that do not have async primitives such as RCU Tasks Rude (which now lacks a call_rcu_tasks_rude() function), there will be a series of splats due to calls to a NULL pointer. This commit therefore warns of this situation, but switches to non-async testing. Signed-off-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-15 | rcu: Let dump_cpu_task() be used without preemption disabled | Ryo Takakura | 2 | -3/+1
Commit 2d7f00b2f0130 ("rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait()") disabled preemption around dump_cpu_task() to suppress a warning about its use in preemptible context. Calling dump_cpu_task() does not require a non-preemptible context except to suppress the smp_processor_id() warning. As smp_processor_id() is evaluated together with in_hardirq() to check whether the caller is in interrupt context, this patch removes the need to disable preemption by reordering the condition so that smp_processor_id() is evaluated only in interrupt context. Signed-off-by: Ryo Takakura <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
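The reordering relies on the short-circuit behaviour of `&&` in C. A minimal userspace sketch, with `demo_in_hardirq()` and `demo_processor_id()` as hypothetical stand-ins for in_hardirq() and smp_processor_id(), shows that the CPU-id query is skipped entirely outside interrupt context:

```c
#include <assert.h>
#include <stdbool.h>

static bool fake_in_hardirq;   /* stand-in for the result of in_hardirq() */
static int fake_cpu_id_calls;  /* how often the CPU id was queried */

static bool demo_in_hardirq(void)
{
    return fake_in_hardirq;
}

static int demo_processor_id(void)
{
    fake_cpu_id_calls++;
    return 0;
}

/* && short-circuits: the CPU id is queried only in interrupt context. */
static bool demo_is_local_irq(int cpu)
{
    assert(cpu >= 0);
    return demo_in_hardirq() && cpu == demo_processor_id();
}
```

With the operands in the opposite order, the CPU-id query would run on every call, including from preemptible context, which is the situation the warning is about.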
2024-08-15 | rcu: Summarize expedited RCU CPU stall warnings during CSD-lock stalls | Paul E. McKenney | 1 | -0/+4
During CSD-lock stalls, the additional information output by expedited RCU CPU stall warnings is usually redundant, flooding the console for no good reason. However, this has been the way things work for a few years. This commit therefore uses the rcutree.csd_lock_suppress_rcu_stall kernel boot parameter that causes expedited RCU CPU stall warnings to be abbreviated to a single line when there is at least one CPU that has been stuck waiting for CSD lock for more than five seconds. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-15 | rcu: Extract synchronize_rcu_expedited_stall() from synchronize_rcu_expedited_wait() | Paul E. McKenney | 1 | -51/+60
This commit extracts the RCU CPU stall-warning report code from synchronize_rcu_expedited_wait() and places it in a new function named synchronize_rcu_expedited_stall(). This is strictly a code-movement commit. A later commit will use this reorganization to avoid printing expedited RCU CPU stall warnings while there are ongoing CSD-lock stall reports. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-15 | rcu: Summarize RCU CPU stall warnings during CSD-lock stalls | Paul E. McKenney | 1 | -1/+7
During CSD-lock stalls, the additional information output by RCU CPU stall warnings is usually redundant, flooding the console for no good reason. However, this has been the way things work for a few years. This commit therefore adds an rcutree.csd_lock_suppress_rcu_stall kernel boot parameter that causes RCU CPU stall warnings to be abbreviated to a single line when there is at least one CPU that has been stuck waiting for CSD lock for more than five seconds. To make this abbreviated message happen with decent probability:

    tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 8 \
        --configs "2*TREE01" --kconfig "CONFIG_CSD_LOCK_WAIT_DEBUG=y" \
        --bootargs "csdlock_debug=1 rcutorture.stall_cpu=200 \
        rcutorture.stall_cpu_holdoff=120 rcutorture.stall_cpu_irqsoff=1 \
        rcutree.csd_lock_suppress_rcu_stall=1 \
        rcupdate.rcu_exp_cpu_stall_timeout=5000" --trust-make

[ paulmck: Apply kernel test robot feedback. ] Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-15 | smp: print only local CPU info when sched_clock goes backward | Rik van Riel | 1 | -0/+8
About 40% of all csd_lock warnings observed in our fleet appear to be due to sched_clock() going backward in time (usually only a little bit), resulting in ts0 being larger than ts2. When the local CPU is at fault, we should print out a message reflecting that, rather than trying to get the remote CPU's stack trace. Signed-off-by: Rik van Riel <[email protected]> Tested-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-15 | locking/csd-lock: Use backoff for repeated reports of same incident | Paul E. McKenney | 1 | -3/+7
Currently, the CSD-lock diagnostics in CONFIG_CSD_LOCK_WAIT_DEBUG=y kernels are emitted at five-second intervals. Although this has proven to be a good time interval for the first diagnostic, if the target CPU keeps interrupts disabled for way longer than five seconds, the ratio of useful new information to pointless repetition decreases considerably. Therefore, back off the time period for repeated reports of the same incident, increasing linearly with the number of reports and logarithmically with the number of online CPUs. [ paulmck: Apply Dan Carpenter feedback. ] Signed-off-by: Paul E. McKenney <[email protected]> Cc: Imran Khan <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: "Peter Zijlstra (Intel)" <[email protected]> Cc: Rik van Riel <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
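One way such a backoff could be computed is sketched below. The formula is an assumption chosen only to match the description (linear in the number of reports already issued, logarithmic in the CPU count) and is not necessarily the expression used in the commit; `demo_ilog2()` and `demo_report_threshold_ns()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Integer floor(log2(n)) for n >= 1. */
static unsigned int demo_ilog2(unsigned int n)
{
    unsigned int r = 0;

    assert(n >= 1);
    while (n >>= 1)
        r++;
    return r;
}

/*
 * Time that must elapse before the next report of the same incident.
 * Assumed formula: the first report uses the base interval unchanged;
 * later reports scale it linearly with the number of reports so far
 * and by a factor derived from log2 of the CPU count.
 */
static uint64_t demo_report_threshold_ns(uint64_t base_ns,
                                         unsigned int nreports,
                                         unsigned int ncpus)
{
    uint64_t scale = nreports ? demo_ilog2(ncpus) / 2 + 1 : 1;

    return base_ns * (nreports + 1) * scale;
}
```

Under this assumed formula, a five-second base interval on a 64-CPU system gives 5 s before the first report and 40 s before the second, so a long stall produces progressively fewer repeats.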
2024-08-15 | locking/csd_lock: Provide an indication of ongoing CSD-lock stall | Paul E. McKenney | 1 | -0/+16
If a CSD-lock stall goes on long enough, it will cause an RCU CPU stall warning. This additional warning provides much additional console-log traffic and little additional information. Therefore, provide a new csd_lock_is_stuck() function that returns true if there is an ongoing CSD-lock stall. This function will be used by the RCU CPU stall warnings to provide a one-line indication of the stall when this function returns true. [ neeraj.upadhyay: Apply Rik van Riel feedback. ] [ neeraj.upadhyay: Apply kernel test robot feedback. ] Signed-off-by: Paul E. McKenney <[email protected]> Cc: Imran Khan <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: "Peter Zijlstra (Intel)" <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-14 | Merge tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs | Linus Torvalds | 1 | -3/+22
Pull vfs fixes from Christian Brauner:

"VFS:
 - Fix the name of file lease slab cache. When file leases were split out of file locks the name of the file lock slab cache was used for the file leases slab cache as well.
 - Fix a typo in take_fd() helper.
 - Fix infinite directory iteration for stable offsets in tmpfs.
 - When the icache is pruned all reclaimable inodes are marked with I_FREEING and other processes that try to look up such inodes will block. But some filesystems like ext4 can trigger lookups in their inode evict callback causing deadlocks. Ext4 does such lookups if the ea_inode feature is used whereby a separate inode may be used to store xattrs. Introduce I_LRU_ISOLATING which pins the inode while its pages are reclaimed. This avoids inode deletion during inode_lru_isolate() avoiding the deadlock and evict is made to wait until I_LRU_ISOLATING is done.

netfs:
 - Fault in smaller chunks for non-large folio mappings for filesystems that haven't been converted to large folios yet.
 - Fix the CONFIG_NETFS_DEBUG config option. The config option was renamed a short while ago and that introduced two minor issues. First, it depended on CONFIG_NETFS whereas it wants to depend on CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter does. Second, the documentation for the config option wasn't fixed up.
 - Revert the removal of the PG_private_2 writeback flag as ceph is using it and fix how that flag is handled in netfs.
 - Fix DIO reads on 9p. A program watching a file on a 9p mount wouldn't see any changes in the size of the file being exported by the server if the file was changed directly in the source filesystem. Fix this by attempting to read the full size specified when a DIO read is requested.
 - Fix a NULL pointer dereference bug due to a data race where a cachefiles cookie was retired even though it was still in use. Check the cookie's n_accesses counter before discarding it.

nsfs:
 - Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as the kernel is writing to userspace.

pidfs:
 - Prevent the creation of pidfds for kthreads until we have a use-case for it and we know the semantics we want. It also confuses userspace why they can get pidfds for kthreads.

squashfs:
 - Fix an uninitialized value bug reported by KMSAN caused by a corrupted symbolic link size read from disk. Check that the symbolic link size is not larger than expected"

* tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  Squashfs: sanity check symbolic link size
  9p: Fix DIO read through netfs
  vfs: Don't evict inode under the inode lru traversing context
  netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
  netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
  file: fix typo in take_fd() comment
  pidfd: prevent creation of pidfds for kthreads
  netfs: clean up after renaming FSCACHE_DEBUG config
  libfs: fix infinite directory reads for offset dir
  nsfs: fix ioctl declaration
  fs/netfs/fscache_cookie: add missing "n_accesses" check
  filelock: fix name of file_lease slab cache
  netfs: Fault in smaller chunks for non-large folio mappings
2024-08-14 | rcuscale: Print detailed grace-period and barrier diagnostics | Paul E. McKenney | 1 | -0/+18
This commit uses the new rcu_tasks_torture_stats_print(), rcu_tasks_trace_torture_stats_print(), and rcu_tasks_rude_torture_stats_print() functions in order to provide detailed diagnostics on grace-period, callback, and barrier state when rcu_scale_writer() hangs. [ paulmck: Apply kernel test robot feedback. ] Signed-off-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-14 | rcu: Mark callbacks not currently participating in barrier operation | Paul E. McKenney | 1 | -0/+3
RCU keeps a count of the number of callbacks that the current rcu_barrier() is waiting on, but there is currently no easy way to work out which callback is stuck. One way to do this is to mark idle RCU-barrier callbacks by making the ->next pointer point to the callback itself, and this commit does just that. Later commits will use this for debug output. Signed-off-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-14 | rcuscale: Dump grace-period statistics when rcu_scale_writer() stalls | Paul E. McKenney | 1 | -0/+10
This commit adds a .stats function pointer to the rcu_scale_ops structure, and if this is non-NULL, it is invoked after stack traces are dumped in response to a rcu_scale_writer() stall. Signed-off-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-14 | rcuscale: Dump stacks of stalled rcu_scale_writer() instances | Paul E. McKenney | 1 | -2/+21
This commit improves debuggability by dumping the stacks of rcu_scale_writer() instances that have not completed in a reasonable timeframe. These stacks are dumped remotely, but they will be accurate in the thus-far common case where the stalled rcu_scale_writer() instances are blocked. [ paulmck: Apply kernel test robot feedback. ] Signed-off-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-14 | rcuscale: Save a few lines with whitespace-only change | Paul E. McKenney | 1 | -7/+3
This whitespace-only commit fuses a few lines of code, taking advantage of the newish 100-character-per-line limit to save a few lines of code. Signed-off-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-14 | refscale: Optimize process_durations() | Christophe JAILLET | 1 | -11/+14
process_durations() is not a hot path, but there is no good reason to iterate over and over the data already in 'buf'. Using a seq_buf saves some useless strcat() and the need of a temp buffer. Data is written directly at the correct place. Signed-off-by: Christophe JAILLET <[email protected]> Tested-by: "Paul E. McKenney" <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
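The difference from repeated strcat() is that the buffer remembers where the next write goes, so nothing is rescanned. The following userspace sketch mimics that idea with vsnprintf(); `struct demo_buf` and `demo_buf_printf()` are hypothetical and are not the kernel's seq_buf API.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Minimal stand-in for a sequential buffer: storage plus a write offset. */
struct demo_buf {
    char *data;
    size_t size;
    size_t len;  /* bytes written so far, excluding the terminator */
};

/* Append formatted text at the current offset; returns 0, or -1 if full. */
static int demo_buf_printf(struct demo_buf *b, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(b->len < b->size);
    va_start(ap, fmt);
    n = vsnprintf(b->data + b->len, b->size - b->len, fmt, ap);
    va_end(ap);
    if (n < 0 || (size_t)n >= b->size - b->len) {
        b->data[b->len] = '\0';  /* discard the truncated write */
        return -1;
    }
    b->len += (size_t)n;
    return 0;
}
```

Each append costs time proportional to the new text only, whereas strcat() has to walk the existing contents to find the end on every call, and no temporary buffer is needed for formatting.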
2024-08-14 | rcu/tasks: Add rcu_barrier_tasks*() start time to diagnostics | Paul E. McKenney | 1 | -2/+6
This commit adds the start time, in jiffies, of the most recently started rcu_barrier_tasks*() operation to the diagnostic output used by rcuscale. This information can be helpful in distinguishing a hung barrier operation from a long series of barrier operations. Signed-off-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-14 | rcu/tasks: Add detailed grace-period and barrier diagnostics | Paul E. McKenney | 1 | -2/+64
This commit adds rcu_tasks_torture_stats_print(), rcu_tasks_trace_torture_stats_print(), and rcu_tasks_rude_torture_stats_print() functions that provide detailed diagnostics on grace-period, callback, and barrier state. Signed-off-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-14 | rcu/tasks: Mark callbacks not currently participating in barrier operation | Paul E. McKenney | 1 | -0/+2
Each Tasks RCU flavor keeps a count of the number of callbacks that the current rcu_barrier_tasks*() is waiting on, but there is currently no easy way to work out which callback is stuck. One way to do this is to mark idle RCU-barrier callbacks by making the ->next pointer point to the callback itself, and this commit does just that. Later commits will use this for debug output. Signed-off-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-14 | rcu: Provide rcu_barrier_cb_is_done() to check rcu_barrier() CBs | Paul E. McKenney | 1 | -0/+5
This commit provides a rcu_barrier_cb_is_done() function that returns true if the *rcu_barrier*() callback passed in is done. This will be used when printing grace-period debugging information. Signed-off-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
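A check of this kind can build on the self-pointing ->next marker that the related barrier commits in this log describe. The sketch below illustrates the idea with a simplified structure; `struct demo_cb` and both functions are hypothetical, and whether the kernel function is implemented exactly this way is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified callback header with only the list link. */
struct demo_cb {
    struct demo_cb *next;
};

/* Mark a callback as idle by pointing ->next at the callback itself. */
static void demo_cb_mark_idle(struct demo_cb *cb)
{
    assert(cb != NULL);
    cb->next = cb;
}

/* A callback counts as done exactly when it points to itself. */
static bool demo_cb_is_done(const struct demo_cb *cb)
{
    return cb->next == cb;
}
```

A self-pointer is a convenient marker because a queued callback normally points either at another callback or at NULL when it is last in its list, so NULL alone could not distinguish an idle callback from a queued one.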
2024-08-14 | rcu/tasks: Update rtp->tasks_gp_seq comment | Paul E. McKenney | 1 | -1/+1
The rtp->tasks_gp_seq grace-period sequence number is not a strict count, but rather the usual RCU sequence number with the lower few bits tracking per-grace-period state and the upper bits the count of grace periods since boot, give or take the initial value. This commit therefore adjusts this comment. Signed-off-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
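The split between state bits and count bits can be illustrated as follows. The two-bit state width and all `demo_` names are assumptions made for this sketch; the commit message says only "the lower few bits".

```c
#include <assert.h>

#define DEMO_SEQ_CTR_SHIFT  2  /* assumed number of low-order state bits */
#define DEMO_SEQ_STATE_MASK ((1UL << DEMO_SEQ_CTR_SHIFT) - 1)

/* Combine a grace-period count and a per-grace-period state. */
static unsigned long demo_seq_make(unsigned long ctr, unsigned long state)
{
    assert(state <= DEMO_SEQ_STATE_MASK);
    return (ctr << DEMO_SEQ_CTR_SHIFT) | state;
}

/* Upper bits: count of grace periods. */
static unsigned long demo_seq_ctr(unsigned long s)
{
    return s >> DEMO_SEQ_CTR_SHIFT;
}

/* Lower bits: state within the current grace period. */
static unsigned long demo_seq_state(unsigned long s)
{
    return s & DEMO_SEQ_STATE_MASK;
}
```

For example, with this layout the raw value 21 decodes to a count of 5 and a state of 1, which is why reading the raw number as a plain grace-period count would be misleading.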
2024-08-14 | rcu/tasks: Check processor-ID assumptions | Paul E. McKenney | 1 | -0/+1
The current mapping of smp_processor_id() to a CPU processing Tasks-RCU callbacks makes some assumptions about layout. This commit therefore adds a WARN_ON() to check these assumptions. [ neeraj.upadhyay: Replace nr_cpu_ids with rcu_task_cpu_ids. ] Signed-off-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-14 | rcu-tasks: Fix access non-existent percpu rtpcp variable in rcu_tasks_need_gpcb() | Zqiang | 1 | -29/+53
For kernels built with CONFIG_FORCE_NR_CPUS=y, nr_cpu_ids is defined as NR_CPUS instead of the number of possible CPUs. This causes the following system panic:

smpboot: Allowing 4 CPUs, 0 hotplug CPUs
...
setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:512 nr_node_ids:1
...
BUG: unable to handle page fault for address: ffffffff9911c8c8
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 15 Comm: rcu_tasks_trace Tainted: G W 6.6.21 #1 5dc7acf91a5e8e9ac9dcfc35bee0245691283ea6
RIP: 0010:rcu_tasks_need_gpcb+0x25d/0x2c0
RSP: 0018:ffffa371c00a3e60 EFLAGS: 00010082
CR2: ffffffff9911c8c8 CR3: 000000040fa20005 CR4: 00000000001706f0
Call Trace:
<TASK>
? __die+0x23/0x80
? page_fault_oops+0xa4/0x180
? exc_page_fault+0x152/0x180
? asm_exc_page_fault+0x26/0x40
? rcu_tasks_need_gpcb+0x25d/0x2c0
? __pfx_rcu_tasks_kthread+0x40/0x40
rcu_tasks_one_gp+0x69/0x180
rcu_tasks_kthread+0x94/0xc0
kthread+0xe8/0x140
? __pfx_kthread+0x40/0x40
ret_from_fork+0x34/0x80
? __pfx_kthread+0x40/0x40
ret_from_fork_asm+0x1b/0x80
</TASK>

Considering that there may be holes in the CPU numbers, use the maximum possible CPU number, instead of nr_cpu_ids, for configuring enqueue and dequeue limits. [ neeraj.upadhyay: Fix htmldocs build error reported by Stephen Rothwell ] Closes: https://lore.kernel.org/linux-input/CALMA0xaTSMN+p4xUXkzrtR5r6k7hgoswcaXx7baR_z9r5jjskw@mail.gmail.com/T/#u Reported-by: Zhixu Liu <[email protected]> Signed-off-by: Zqiang <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-14 | rcu-tasks: Remove RCU Tasks Rude asynchronous APIs | Paul E. McKenney | 1 | -38/+17
The call_rcu_tasks_rude() and rcu_barrier_tasks_rude() APIs are currently unused. This commit therefore removes their definitions and boot-time self-tests. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-14 | rcuscale: Stop testing RCU Tasks Rude asynchronous APIs | Paul E. McKenney | 1 | -2/+0
The call_rcu_tasks_rude() and rcu_barrier_tasks_rude() APIs are currently unused. Furthermore, the idea is to get rid of RCU Tasks Rude entirely once all architectures have their deep-idle and entry/exit code correctly marked as inline or noinstr. As a step towards this goal, this commit therefore removes these two functions from rcuscale testing. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-14 | rcutorture: Stop testing RCU Tasks Rude asynchronous APIs | Paul E. McKenney | 1 | -8/+0
The call_rcu_tasks_rude() and rcu_barrier_tasks_rude() APIs are currently unused. Furthermore, the idea is to get rid of RCU Tasks Rude entirely once all architectures have their deep-idle and entry/exit code correctly marked as inline or noinstr. As a first step towards this goal, this commit therefore removes these two functions from rcutorture testing. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-14 | rcutorture: Add a stall_cpu_repeat module parameter | Paul E. McKenney | 1 | -16/+40
This commit adds a stall_cpu_repeat module parameter, which is also the rcutorture.stall_cpu_repeat kernel boot parameter, to test repeated CPU stalls. Note that only the first stall will pay attention to the stall_cpu_irqsoff module parameter. For the second and subsequent stalls, interrupts will be enabled. This is helpful when testing the interaction between RCU CPU stall warnings and CSD-lock stall warnings. Reported-by: Rik van Riel <[email protected]> Signed-off-by: "Paul E. McKenney" <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-08-14 | hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse. | Sebastian Andrzej Siewior | 1 | -0/+2
The two hrtimer_cpu_base_.*_expiry() functions are wrappers around the locking functions and sparse complains about the missing counterpart. Add sparse annotation to denote that this behaviour is expected. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-14 | timers: Add sparse annotation for timer_sync_wait_running(). | Sebastian Andrzej Siewior | 1 | -0/+2
timer_sync_wait_running() first releases two locks and then acquires them again. This is unexpected and sparse complains about it. Add sparse annotation for timer_sync_wait_running() to note that the locking is expected. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-14 | perf: Really fix event_function_call() locking | Namhyung Kim | 1 | -5/+8
Commit 558abc7e3f89 ("perf: Fix event_function_call() locking") lost IRQ disabling by mistake. Fixes: 558abc7e3f89 ("perf: Fix event_function_call() locking") Reported-by: Pengfei Xu <[email protected]> Reported-by: Naresh Kamboju <[email protected]> Tested-by: Pengfei Xu <[email protected]> Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
2024-08-13 | bpf: more trivial fdget() conversions | Al Viro | 1 | -15/+7
All failure exits prior to fdget() leave the scope, all matching fdput() are immediately followed by leaving the scope. Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]>
2024-08-13 | bpf: trivial conversions for fdget() | Al Viro | 3 | -21/+9
fdget() is the first thing done in scope, all matching fdput() are immediately followed by leaving the scope. Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]>
2024-08-13 | bpf: switch maps to CLASS(fd, ...) | Al Viro | 3 | -121/+42
Calling conventions for __bpf_map_get() would be more convenient if it left fdput() on failure to callers. Makes for simpler logic in the callers. Among other things, the proof of memory safety no longer has to rely upon file->private_data never being ERR_PTR(...) for bpffs files. Original calling conventions made it impossible for the caller to tell whether __bpf_map_get() has returned ERR_PTR(-EINVAL) because it has found the file not to be a bpf map one (in which case it would've done fdput()) or because it found that ERR_PTR(-EINVAL) in file->private_data of a bpf map file (in which case fdput() would _not_ have been done). Signed-off-by: Al Viro <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]>
2024-08-13 | bpf: factor out fetching bpf_map from FD and adding it to used_maps list | Andrii Nakryiko | 1 | -49/+66
Factor out the logic to extract bpf_map instances from FD embedded in bpf_insns, adding it to the list of used_maps (unless it's already there, in which case we just reuse map's index). This simplifies the logic in resolve_pseudo_ldimm64(), especially around `struct fd` handling, as all that is now neatly contained in the helper and doesn't leak into a dozen error handling paths. Signed-off-by: Andrii Nakryiko <[email protected]>
2024-08-13 | bpf: switch fdget_raw() uses to CLASS(fd_raw, ...) | Al Viro | 1 | -16/+8
Switch fdget_raw() use cases in bpf_inode_storage.c to CLASS(fd_raw). Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]>
2024-08-13 | bpf: convert __bpf_prog_get() to CLASS(fd, ...) | Al Viro | 1 | -22/+9
Irregularity here is fdput() not in the same scope as fdget(); just fold ____bpf_prog_get() into its (only) caller and that's it... Signed-off-by: Al Viro <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]>
2024-08-13 | Merge remote-tracking branch 'vfs/stable-struct_fd' | Andrii Nakryiko | 15 | -274/+284
Merge Al Viro's struct fd refactorings. Signed-off-by: Andrii Nakryiko <[email protected]>
2024-08-13 | sched_ext: Don't use double locking to migrate tasks across CPUs | Tejun Heo | 1 | -88/+46
consume_remote_task() and dispatch_to_local_dsq() use move_task_to_local_dsq() to migrate the task to the target CPU. Currently, move_task_to_local_dsq() expects the caller to lock both the source and destination rq's. While this may save a few lock operations while the rq's are not contended, under contention, the double locking can exacerbate the situation significantly (refer to the linked message below). Update the migration path so that double locking is not used. move_task_to_local_dsq() now expects the caller to be locking the source rq, drops it and then acquires the destination rq lock. Code is simpler this way and, on a 2-way NUMA machine w/ Xeon Gold 6138, 'hackbench 100 thread 5000' shows ~3% improvement with scx_simple. Signed-off-by: Tejun Heo <[email protected]> Suggested-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Acked-by: David Vernet <[email protected]>
2024-08-13 | workqueue: Add interface for user-defined workqueue lockdep map | Matthew Brost | 1 | -0/+28
Add an interface for a user-defined workqueue lockdep map, which is helpful when multiple workqueues are created for the same purpose. This also helps avoid leaking lockdep maps on each workqueue creation. v2: - Add alloc_workqueue_lockdep_map (Tejun) v3: - Drop __WQ_USER_OWNED_LOCKDEP (Tejun) - static inline alloc_ordered_workqueue_lockdep_map (Tejun) Cc: Tejun Heo <[email protected]> Cc: Lai Jiangshan <[email protected]> Signed-off-by: Matthew Brost <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-08-13 | workqueue: Change workqueue lockdep map to pointer | Matthew Brost | 1 | -7/+9
Will help enable user-defined lockdep maps for workqueues. Cc: Tejun Heo <[email protected]> Cc: Lai Jiangshan <[email protected]> Signed-off-by: Matthew Brost <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-08-13 | workqueue: Split alloc_workqueue into internal function and lockdep init | Matthew Brost | 1 | -8/+23
Will help enable user-defined lockdep maps for workqueues. Cc: Tejun Heo <[email protected]> Cc: Lai Jiangshan <[email protected]> Signed-off-by: Matthew Brost <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-08-13 | sched_ext: define missing cfi stubs for sched_ext | Manu Bretelle | 1 | -0/+6
`__bpf_ops_sched_ext_ops` was missing the initialization of some struct attributes. With https://lore.kernel.org/all/[email protected]/ every single attribute needs to be initialized, or programs (like scx_layered) will fail to load.

05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_init not found in kernel, skipping it as it's set to zero
05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_exit not found in kernel, skipping it as it's set to zero
05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_prep_move not found in kernel, skipping it as it's set to zero
05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_move not found in kernel, skipping it as it's set to zero
05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_cancel_move not found in kernel, skipping it as it's set to zero
05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_set_weight not found in kernel, skipping it as it's set to zero
05:26:48 [WARN] libbpf: prog 'layered_dump': BPF program load failed: unknown error (-524)
05:26:48 [WARN] libbpf: prog 'layered_dump': -- BEGIN PROG LOAD LOG --
attach to unsupported member dump of struct sched_ext_ops
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
-- END PROG LOAD LOG --
05:26:48 [WARN] libbpf: prog 'layered_dump': failed to load: -524
05:26:48 [WARN] libbpf: failed to load object 'bpf_bpf'
05:26:48 [WARN] libbpf: failed to load BPF skeleton 'bpf_bpf': -524
Error: Failed to load BPF program

Signed-off-by: Manu Bretelle <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-08-13 | perf/bpf: Don't call bpf_overflow_handler() for tracing events | Kyle Huey | 1 | -1/+2
The regressing commit is new in 6.10. It assumed that anytime event->prog is set bpf_overflow_handler() should be invoked to execute the attached bpf program. This assumption is false for tracing events, and as a result the regressing commit broke bpftrace by invoking the bpf handler with garbage inputs on overflow. Prior to the regression the overflow handlers formed a chain (of length 0, 1, or 2) and perf_event_set_bpf_handler() (the !tracing case) added bpf_overflow_handler() to that chain, while perf_event_attach_bpf_prog() (the tracing case) did not. Both set event->prog. The chain of overflow handlers was replaced by a single overflow handler slot and a fixed call to bpf_overflow_handler() when appropriate. This modifies the condition there to check event->prog->type == BPF_PROG_TYPE_PERF_EVENT, restoring the previous behavior and fixing bpftrace. Signed-off-by: Kyle Huey <[email protected]> Suggested-by: Andrii Nakryiko <[email protected]> Reported-by: Joe Damato <[email protected]> Closes: https://lore.kernel.org/lkml/ZpFfocvyF3KHaSzF@LQ3V64L9R2/ Fixes: f11f10bfa1ca ("perf/bpf: Call BPF handler directly, not through overflow machinery") Cc: [email protected] Tested-by: Joe Damato <[email protected]> # bpftrace Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-13 | printk/panic: Allow cpu backtraces to be written into ringbuffer during panic | Ryo Takakura | 2 | -2/+8
Commit 779dbc2e78d7 ("printk: Avoid non-panic CPUs writing to ringbuffer") stopped non-panic CPUs from writing further messages to the ringbuffer after a panic. Since that commit, non-panicked CPUs are not allowed to write to the ringbuffer after a panic, so the CPU backtrace that is triggered after a panic to sample the non-panicked CPUs' backtraces no longer serves its purpose, as it has nothing to print. Fix the issue by allowing non-panicked CPUs to write into the ringbuffer while a CPU backtrace is in flight. Fixes: 779dbc2e78d7 ("printk: Avoid non-panic CPUs writing to ringbuffer") Signed-off-by: Ryo Takakura <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
2024-08-13 | irqdomain: Remove stray '-' in the domain name | Andy Shevchenko | 1 | -2/+2
When the domain suffix is not supplied alloc_fwnode_name() unconditionally adds a separator. Fix the format strings to get rid of the stray '-' separator. Fixes: 1e7c05292531 ("irqdomain: Allow giving name suffix for domain") Signed-off-by: Andy Shevchenko <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-13 | irqdomain: Clarify checks for bus_token | Andy Shevchenko | 1 | -8/+14
The code uses if (bus_token) and if (bus_token == DOMAIN_BUS_ANY). Since bus_token is an enum, the latter is more robust against changes. Convert all !bus_token checks to explicitly check for DOMAIN_BUS_ANY. Signed-off-by: Andy Shevchenko <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-12 | struct fd: representation change | Al Viro | 1 | -1/+1
We want the compiler to see that fdput() on empty instance is a no-op. The emptiness check is that file reference is NULL, while fdput() is "fput() if FDPUT_FPUT is present in flags". The reason why fdput() on empty instance is a no-op is something compiler can't see - it's that we never generate instances with NULL file reference combined with non-zero flags. It's not that hard to deal with - the real primitives behind fdget() et al. are returning an unsigned long value, unpacked by (inlined) __to_fd() into the current struct file * + int. The lower bits are used to store flags, while the rest encodes the pointer. Linus suggested that keeping this unsigned long around with the extractions done by inlined accessors should generate sane code and that turns out to be the case. Namely, turning struct fd into a struct-wrapped unsigned long, with

    fd_empty(f) => unlikely(f.word == 0)
    fd_file(f)  => (struct file *)(f.word & ~3)
    fdput(f)    => if (f.word & 1) fput(fd_file(f))

ends up with compiler doing the right thing. The cost is the patch footprint, of course - we need to switch f.file to fd_file(f) all over the tree, and it's not doable with simple search and replace; there are false positives, etc. Note that the sole member of that structure is an opaque unsigned long - all accesses should be done via wrappers and I don't want to use a name that would invite manual casts to file pointers, etc. The value of that member is equal either to (unsigned long)p | flags, p being an address of some struct file instance, or to 0 for an empty fd. For now the new predicate (fd_empty(f)) has no users; all the existing checks have form (!fd_file(f)). We will convert to fd_empty() use later; here we only define it (and tell the compiler that it's unlikely to return true). This commit only deals with representation change; there will be followups. Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
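The packed representation can be tried out in userspace with a small model. Everything prefixed `demo_` is hypothetical: `struct demo_file` replaces the kernel's struct file, a plain counter decrement replaces fput(), and `uintptr_t` is used instead of unsigned long for portability. The three accessors follow the mapping given in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's struct file; only a reference count here. */
struct demo_file {
    int refcount;
};

#define DEMO_FDPUT_FPUT ((uintptr_t)1)  /* low bit: a reference must be dropped */

/* Pointer and flags packed into a single word. */
struct demo_fd {
    uintptr_t word;
};

/* Pack a pointer and flags; relies on the two low pointer bits being zero. */
static struct demo_fd demo_make_fd(struct demo_file *file, uintptr_t flags)
{
    struct demo_fd f;

    assert(((uintptr_t)file & 3) == 0);
    assert(flags <= 3);
    assert(file != NULL || flags == 0);  /* never NULL with non-zero flags */
    f.word = (uintptr_t)file | flags;
    return f;
}

static bool demo_fd_empty(struct demo_fd f)
{
    return f.word == 0;
}

static struct demo_file *demo_fd_file(struct demo_fd f)
{
    return (struct demo_file *)(f.word & ~(uintptr_t)3);
}

/* Drop the reference only if the flag bit says one was taken. */
static void demo_fdput(struct demo_fd f)
{
    if (f.word & DEMO_FDPUT_FPUT)
        demo_fd_file(f)->refcount--;
}
```

Since an empty instance is the all-zero word, its flag bit is necessarily clear, so `demo_fdput()` on it visibly does nothing; that is the property the commit wants the compiler to be able to see.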