|
Cross-merge networking fixes after downstream PR.
Conflicts:
net/ipv4/raw.c
3632679d9e4f ("ipv{4,6}/raw: fix output xfrm lookup wrt protocol")
c85be08fc4fa ("raw: Stop using RTO_ONLINK.")
https://lore.kernel.org/all/[email protected]/
Adjacent changes:
drivers/net/ethernet/freescale/fec_main.c
9025944fddfe ("net: fec: add dma_wmb to ensure correct descriptor values")
144470c88c5d ("net: fec: using the standard return codes when xdp xmit errors")
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
It turns out that udev under certain circumstances will concurrently try
to load the same modules over-and-over excessively. This isn't a kernel
bug, but it ends up affecting the kernel, to the point that under
certain circumstances we can fail to boot, because the kernel uses a lot
of memory to read all the module data all at once.
Note that it isn't a memory leak, it's just basically a thundering herd
problem happening at bootup with a lot of CPUs, with the worst cases
then being pretty bad.
Admittedly the worst situations are somewhat contrived: lots and lots of
CPUs, not a lot of memory, and KASAN enabled to make it all slower and
as such (unintentionally) exacerbate the problem.
Luis explains: [1]
"My best assessment of the situation is that each CPU in udev ends up
triggering a load of duplicate set of modules, not just one, but *a
lot*. Not sure what heuristics udev uses to load a set of modules per
CPU."
Petr Pavlu chimes in: [2]
"My understanding is that udev workers are forked. An initial kmod
context is created by the main udevd process but no sharing happens
after the fork. It means that the mentioned memory pool logic doesn't
really kick in.
Multiple parallel load requests come from multiple udev workers, for
instance, each handling an udev event for one CPU device and making
the exactly same requests as all others are doing at the same time.
The optimization idea would be to recognize these duplicate requests
at the udevd/kmod level and converge them"
Note that module loading has tried to mitigate this issue before, see
for example commit 064f4536d139 ("module: avoid allocation if module is
already present and ready"), which has a few ASCII graphs on memory use
due to this same issue.
However, while that noticed that the module was already loaded, and
exited with an error early before spending any more time on setting up
the module, it didn't handle the case of multiple concurrent module
loads all being active - but not complete - at the same time.
Yes, one of them will eventually win the race and finalize its copy, and
the others will then notice that the module already exists and error
out, but while this all happens, we have tons of unnecessary concurrent
work being done.
Again, the real fix is for udev to not do that (maybe it should use
threads instead of fork, and have actual shared data structures and not
cause duplicate work). That real fix is apparently not trivial.
But it turns out that the kernel already has a pretty good model for
dealing with concurrent access to the same file: the i_writecount of the
inode.
In fact, the module loading already indirectly uses 'i_writecount',
because 'kernel_read_file()' will in fact do
  ret = deny_write_access(file);
  if (ret)
          return ret;
  ...
  allow_write_access(file);
around the read of the file data. We do not allow concurrent writes to
the file, and return -ETXTBUSY if the file was open for writing at the
same time as the module data is loaded from it.
And the solution to the reader concurrency problem is to simply extend
this "no concurrent writers" logic to simply be "exclusive access".
Note that "exclusive" in this context isn't really some absolute thing:
it's only exclusion from writers and from other "special readers" that
do this writer denial. So we simply introduce a variation of that
"deny_write_access()" logic that not only denies write access, but also
requires that this is the _only_ such access that denies write access.
Which means that you can't start loading a module that is already being
loaded as a module by somebody else, or you will get the same -ETXTBSY
error that you would get if there were writers around.
[ It also means that you can't try to load a currently executing
executable as a module, for the same reason: executables do that same
"deny_write_access()" thing, and that's obviously where the whole
ETXTBSY logic traditionally came from.
This is not a problem for kernel modules, since the set of normal
executable files and kernel module files is entirely disjoint. ]
This new function is called "exclusive_deny_write_access()", and the
implementation is trivial, in that it's just an atomic decrement of
i_writecount if it was 0 before.
To use that new exclusivity check, all we then do is wrap the module
loading with that exclusive_deny_write_access() / allow_write_access()
pair. The actual patch is a bit bigger than that, because we want to
surround not just the "load file data" part, but the whole module setup,
to get maximum exclusion.
So this ends up splitting up "finit_module()" into a few helper
functions to make it all very clear and legible.
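The resulting shape is roughly the following (the helper names here are
purely illustrative, not necessarily the ones the patch introduces):
  static int load_module_exclusively(struct file *f, const char __user *uargs, int flags)
  {
          int err;

          err = exclusive_deny_write_access(f);   /* -ETXTBSY if someone else is loading it */
          if (err)
                  return err;

          err = load_module_from_file(f, uargs, flags);   /* read file data + full module setup */

          allow_write_access(f);
          return err;
  }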
In Luis' test-case (bringing up 255 vcpu's in a virtual machine [3]),
the "wasted vmalloc" space (ie module data read into a vmalloc'ed area
in order to be loaded as a module, but then discarded because somebody
else loaded the same module instead) dropped from 1.8GiB to 474kB. Yes,
that's gigabytes to kilobytes.
It doesn't drop completely to zero, because even with this change, you
can still end up having completely serial pointless module loads, where
one udev process has loaded a module fully (and thus the kernel has
released that exclusive lock on the module file), and then another udev
process tries to load the same module again.
So while we cannot fully get rid of the fundamental bug in user space,
we _can_ get rid of the excessive concurrent thundering herd effect.
A couple of final side notes on this all:
- This tweak only affects the "finit_module()" system call, which gives
the kernel a file descriptor with the module data.
You can also just feed the module data as raw data from user space
with "init_module()" (note the lack of 'f' at the beginning), and
obviously for that case we do _not_ have any "exclusive read" logic.
So if you absolutely want to do things wrong in user space, and try
to load the same module multiple times, and error out only later when
the kernel ends up saying "you can't load the same module name
twice", you can still do that.
And in fact, some distros will do exactly that, because they will
uncompress the kernel module data in user space before feeding it to
the kernel (mainly because they haven't started using the new kernel
side decompression yet).
So this is not some absolute "you can't do concurrent loads of the
same module". It's literally just a very simple heuristic that will
catch it early in case you try to load the exact same module file at
the same time, and in that case avoid a potentially nasty situation.
- There is another user of "deny_write_access()": the verity code that
enables fs-verity on a file (the FS_IOC_ENABLE_VERITY ioctl).
If you use fs-verity and you care about verifying the kernel modules
(which does make sense), you should do it *before* loading said
kernel module. That may sound obvious, but now the implementation
basically requires it. Because if you try to do it concurrently, the
kernel may refuse to load the module file that is being set up by the
fs-verity code.
- This all will obviously mean that if you insist on loading the same
module in parallel, only one module load will succeed, and the others
will return with an error.
That was true before too, but what is different is that the -ETXTBSY
error can be returned *before* the success case of another process
fully loading and instantiating the module.
Again, that might sound obvious, and it is indeed the whole point of
the whole change: we are much quicker to notice the whole "you're
already in the process of loading this module".
So it's very much intentional, but it does mean that if you just
spray the kernel with "finit_module()", and expect that the module is
immediately loaded afterwards without checking the return value, you
are doing something horribly horribly wrong.
I'd like to say that that would never happen, but the whole _reason_
for this commit is that udev is currently doing something horribly
horribly wrong, so ...
Link: https://lore.kernel.org/all/[email protected]/ [1]
Link: https://lore.kernel.org/all/[email protected]/ [2]
Link: https://lore.kernel.org/lkml/ZG%2Fa+nrt4%[email protected]/ [3]
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Lucas De Marchi <[email protected]>
Cc: Petr Pavlu <[email protected]>
Tested-by: Luis Chamberlain <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If workqueue.cpu_intensive_thresh_us is set to 0, the detection mechanism
for CPU-hogging per-cpu work item will keep triggering spuriously:
workqueue: process_srcu hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
workqueue: gc_worker hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
workqueue: gc_worker hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
workqueue: wait_rcu_exp_gp hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
workqueue: kfree_rcu_monitor hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
workqueue: kfree_rcu_monitor hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
workqueue: reg_todo hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
This commit therefore disables the CPU-hog detection mechanism when
workqueue.cpu_intensive_thresh_us is set to 0.
tj: Patch description updated and the condition check on
cpu_intensive_thresh_us separated into a separate if statement for
readability.
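The end result in wq_worker_tick() is an early return along these lines
(a simplified sketch, not the verbatim patch):
  /* A zero threshold would make every tick look like a violation, so
   * skip the CPU-hog detection entirely in that case. */
  if (!wq_cpu_intensive_thresh_us)
          return;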
Signed-off-by: Zqiang <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from bluetooth and bpf.
Current release - regressions:
- net: fix skb leak in __skb_tstamp_tx()
- eth: mtk_eth_soc: fix QoS on DSA MAC on non MTK_NETSYS_V2 SoCs
Current release - new code bugs:
- handshake:
- fix sock->file allocation
- fix handshake_dup() ref counting
- bluetooth:
- fix potential double free caused by hci_conn_unlink
- fix UAF in hci_conn_hash_flush
Previous releases - regressions:
- core: fix stack overflow when LRO is disabled for virtual
interfaces
- tls: fix strparser rx issues
- bpf:
- fix many sockmap/TCP related issues
- fix a memory leak in the LRU and LRU_PERCPU hash maps
- init the offload table earlier
- eth: mlx5e:
- do as little as possible in napi poll when budget is 0
- fix using eswitch mapping in nic mode
- fix deadlock in tc route query code
Previous releases - always broken:
- udplite: fix NULL pointer dereference in __sk_mem_raise_allocated()
- raw: fix output xfrm lookup wrt protocol
- smc: reset connection when trying to use SMCRv2 fails
- phy: mscc: enable VSC8501/2 RGMII RX clock
- eth: octeontx2-pf: fix TSOv6 offload
- eth: cdc_ncm: deal with too low values of dwNtbOutMaxSize"
* tag 'net-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits)
udplite: Fix NULL pointer dereference in __sk_mem_raise_allocated().
net: phy: mscc: enable VSC8501/2 RGMII RX clock
net: phy: mscc: remove unnecessary phydev locking
net: phy: mscc: add support for VSC8501
net: phy: mscc: add VSC8502 to MODULE_DEVICE_TABLE
net/handshake: Enable the SNI extension to work properly
net/handshake: Unpin sock->file if a handshake is cancelled
net/handshake: handshake_genl_notify() shouldn't ignore @flags
net/handshake: Fix uninitialized local variable
net/handshake: Fix handshake_dup() ref counting
net/handshake: Remove unneeded check from handshake_dup()
ipv6: Fix out-of-bounds access in ipv6_find_tlv()
net: ethernet: mtk_eth_soc: fix QoS on DSA MAC on non MTK_NETSYS_V2 SoCs
docs: netdev: document the existence of the mail bot
net: fix skb leak in __skb_tstamp_tx()
r8169: Use a raw_spinlock_t for the register locks.
page_pool: fix inconsistency for page_pool_ring_[un]lock()
bpf, sockmap: Test progs verifier error with latest clang
bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops
bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer
...
|
|
Seems like that extra bpf_capable() check in the BPF_MAP_FREEZE handler was
unintentionally left in when we switched to a model that all BPF map
operations should be allowed regardless of CAP_BPF (or any other
capabilities), as long as the process got the BPF map FD somehow.
This patch replaces the bpf_capable() check in the BPF_MAP_FREEZE handler
with a writable access check, given that conceptually freezing the map is
modifying it: the map becomes unmodifiable for subsequent updates.
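The replacement boils down to a permission check on the map fd, roughly as
follows (a sketch; it assumes the existing map_get_sys_perms() helper in
kernel/bpf/syscall.c):
  /* Freezing modifies the map, so require write access on the fd
   * instead of a blanket bpf_capable() check. */
  if (!(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
          err = -EPERM;
          goto err_put;
  }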
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-05-24
We've added 19 non-merge commits during the last 10 day(s) which contain
a total of 20 files changed, 738 insertions(+), 448 deletions(-).
The main changes are:
1) Batch of BPF sockmap fixes found when running against NGINX TCP tests,
from John Fastabend.
2) Fix a memleak in the LRU{,_PERCPU} hash map when bucket locking fails,
from Anton Protopopov.
3) Init the BPF offload table earlier than just late_initcall,
from Jakub Kicinski.
4) Fix ctx access mask generation for 32-bit narrow loads of 64-bit fields,
from Will Deacon.
5) Remove a now unsupported __fallthrough in BPF samples,
from Andrii Nakryiko.
6) Fix a typo in pkg-config call for building sign-file,
from Jeremy Sowden.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, sockmap: Test progs verifier error with latest clang
bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops
bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer
bpf, sockmap: Test shutdown() correctly exits epoll and recv()=0
bpf, sockmap: Build helper to create connected socket pair
bpf, sockmap: Pull socket helpers out of listen test for general use
bpf, sockmap: Incorrectly handling copied_seq
bpf, sockmap: Wake up polling after data copy
bpf, sockmap: TCP data stall on recv before accept
bpf, sockmap: Handle fin correctly
bpf, sockmap: Improved check for empty queue
bpf, sockmap: Reschedule is now done through backlog
bpf, sockmap: Convert schedule_work into delayed_work
bpf, sockmap: Pass skb ownership through read_skb
bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps
bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields
samples/bpf: Drop unnecessary fallthrough
bpf: netdev: init the offload table earlier
selftests/bpf: Fix pkg-config call building sign-file
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Commit 674b745e22b3 ("cgroup: remove rcu_read_lock()/rcu_read_unlock()
in critical section of spin_lock_irq()") has removed the rcu_read_lock,
which makes the comment out-of-date, so update it.
tj: Updated the comment a bit.
Signed-off-by: Xiu Jianfeng <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
Currently, pool->nr_running can be modified from the timer tick, which means
the timer tick can run nested inside a non-irq-protected section that's in
the process of modifying nr_running. Consider the following scenario:
CPU0
kworker/0:2 (events)
worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
->pool->nr_running++; (1)
process_one_work()
->worker->current_func(work);
->schedule()
->wq_worker_sleeping()
->worker->sleeping = 1;
->pool->nr_running--; (0)
....
->wq_worker_running()
....
CPU0 by interrupt:
wq_worker_tick()
->worker_set_flags(worker, WORKER_CPU_INTENSIVE);
->pool->nr_running--; (-1)
->worker->flags |= WORKER_CPU_INTENSIVE;
....
->if (!(worker->flags & WORKER_NOT_RUNNING))
->pool->nr_running++; (will not execute)
->worker->sleeping = 0;
....
->worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
->pool->nr_running++; (0)
....
worker_set_flags(worker, WORKER_PREP);
->pool->nr_running--; (-1)
....
worker_enter_idle()
->WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);
If nr_workers is equal to nr_idle, the nonzero nr_running triggers the
WARN_ON_ONCE():
[ 2.460602] WARNING: CPU: 0 PID: 63 at kernel/workqueue.c:1999 worker_enter_idle+0xb2/0xc0
[ 2.462163] Modules linked in:
[ 2.463401] CPU: 0 PID: 63 Comm: kworker/0:2 Not tainted 6.4.0-rc2-next-20230519 #1
[ 2.463771] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
[ 2.465127] Workqueue: 0x0 (events)
[ 2.465678] RIP: 0010:worker_enter_idle+0xb2/0xc0
...
[ 2.472614] Call Trace:
[ 2.473152] <TASK>
[ 2.474182] worker_thread+0x71/0x430
[ 2.474992] ? _raw_spin_unlock_irqrestore+0x28/0x50
[ 2.475263] kthread+0x103/0x120
[ 2.475493] ? __pfx_worker_thread+0x10/0x10
[ 2.476355] ? __pfx_kthread+0x10/0x10
[ 2.476635] ret_from_fork+0x2c/0x50
[ 2.477051] </TASK>
This commit therefore adds a check of worker->sleeping in wq_worker_tick();
if worker->sleeping is not zero, return directly.
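The added guard is essentially the following (simplified sketch):
  /* wq_worker_sleeping() has already dropped nr_running for this worker;
   * return so the tick path doesn't decrement it a second time. */
  if (READ_ONCE(worker->sleeping))
          return;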
tj: Updated comment and description.
Reported-by: Naresh Kamboju <[email protected]>
Reported-by: Linux Kernel Functional Testing <[email protected]>
Tested-by: Anders Roxell <[email protected]>
Closes: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230519/testrun/17078554/suite/boot/test/clang-nightly-lkftconfig/log
Signed-off-by: Zqiang <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
Fix a typo in a comment in kernel/power/snapshot.c
Signed-off-by: Wang Honghui <[email protected]>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
Commit bf5e758f02fc ("genirq/msi: Simplify sysfs handling") reworked the
creation of sysfs entries for MSI IRQs. The creation used to be in
msi_domain_alloc_irqs_descs_locked after calling ops->domain_alloc_irqs.
Then it moved into __msi_domain_alloc_irqs which is an implementation of
domain_alloc_irqs. However, Xen comes with the only other implementation
of domain_alloc_irqs and hence doesn't run the sysfs population code
anymore.
Commit 6c796996ee70 ("x86/pci/xen: Fixup fallout from the PCI/MSI
overhaul") set the flag MSI_FLAG_DEV_SYSFS for the xen msi_domain_info
but that doesn't actually have an effect because Xen uses its own
domain_alloc_irqs implementation.
Fix this by making use of the fallback functions for sysfs population.
Fixes: bf5e758f02fc ("genirq/msi: Simplify sysfs handling")
Signed-off-by: Maximilian Heyne <[email protected]>
Reviewed-by: Juergen Gross <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Juergen Gross <[email protected]>
|
|
For the splice from the trace seq buffer, just use copy_splice_read().
In the future, something better can probably be done by gifting pages from
seq->buf into the pipe, but that would require changing seq->buf into a
vmap over an array of pages.
Signed-off-by: David Howells <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: Al Viro <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Steven Rostedt <[email protected]>
cc: Masami Hiramatsu <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
The current implementation uses a static bitmap for interrupt descriptor
allocation and a radix tree to store the pointers for lookup.
However, the size of the bitmap is constrained by the build time macro
MAX_SPARSE_IRQS, which may not be sufficient to support high-end servers,
particularly those with GICv4.1 hardware, which require a large interrupt
space to cover LPIs and vSGIs.
Replace the bitmap and the radix tree with a maple tree, which not only
stores pointers for lookup, but also provides a mechanism to find free
ranges. That removes the build time hardcoded upper limit.
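As a rough sketch, finding a free descriptor range with a maple tree looks
something like this (simplified; helper names, bounds and locking in the
actual patch may differ):
  static struct maple_tree sparse_irqs = MTREE_INIT(sparse_irqs, MT_FLAGS_ALLOC_RANGE);

  /* Find 'cnt' consecutive free interrupt numbers at or above 'from'
   * (tree locking elided for brevity). */
  static int irq_find_free_area(unsigned long from, unsigned int cnt)
  {
          MA_STATE(mas, &sparse_irqs, 0, 0);

          if (mas_empty_area(&mas, from, (unsigned long)INT_MAX, cnt))
                  return -ENOSPC;
          return mas.index;
  }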
Signed-off-by: Shanker Donthineni <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Move the open coded sparse bitmap handling into helper functions as
a preparatory step for converting the sparse interrupt management
to a maple tree.
No functional change.
Signed-off-by: Shanker Donthineni <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The current implementation utilizes a bitmap for managing interrupt resend
handlers, which is allocated based on the SPARSE_IRQ/NR_IRQS macros.
However, this method may not efficiently utilize memory during runtime,
particularly when IRQ_BITMAP_BITS is large.
Address this issue by using an hlist to manage interrupt resend handlers
instead of relying on a static bitmap memory allocation. Additionally, a
new function, clear_irq_resend(), is introduced and called from
irq_shutdown to ensure a graceful teardown of the interrupt.
Signed-off-by: Shanker Donthineni <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The preempt_disable() section in module_put() was added in commit
e1783a240f491 ("module: Use this_cpu_xx to dynamically allocate counters")
while the per-CPU counters were switched to another API. The API requires
that the CPU remain the same during the RMW operation.
This counting API was later replaced with atomic_t in commit
2f35c41f58a97 ("module: Replace module_ref with atomic_t refcnt")
Since this atomic_t replacement there is no need to keep preemption
disabled while the reference counter is modified.
Remove preempt_disable() from module_put(), __module_get() and
try_module_get().
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
|
|
This is part of the general push to deprecate register_sysctl_paths and
register_sysctl_table. The old way of doing this through
register_sysctl_base and DECLARE_SYSCTL_BASE macro is replaced with a
call to register_sysctl_init. The 5 base paths affected are: "kernel",
"vm", "debug", "dev" and "fs".
We remove the register_sysctl_base function and the DECLARE_SYSCTL_BASE
macro since they are no longer needed.
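The replacement registration pattern is simply the following (the table name
here is illustrative):
  /* Replaces the old DECLARE_SYSCTL_BASE()/register_sysctl_base() pair. */
  register_sysctl_init("fs", fs_stat_sysctls);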
In order to quickly ascertain that the paths did not actually change, I
executed `find /proc/sys/ | sha1sum` and made sure that the sha was the
same before and after the commit.
We end up saving 563 bytes with this change:
./scripts/bloat-o-meter vmlinux.0.base vmlinux.1.refactor-base-paths
add/remove: 0/5 grow/shrink: 2/0 up/down: 77/-640 (-563)
Function old new delta
sysctl_init_bases 55 111 +56
init_fs_sysctls 12 33 +21
vm_base_table 128 - -128
kernel_base_table 128 - -128
fs_base_table 128 - -128
dev_base_table 128 - -128
debug_base_table 128 - -128
Total: Before=21258215, After=21257652, chg -0.00%
[mcgrof: modified to use register_sysctl_init() over register_sysctl()
and add bloat-o-meter stats]
Signed-off-by: Joel Granados <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
Tested-by: Stephen Rothwell <[email protected]>
Acked-by: Christian Brauner <[email protected]>
|
|
The histogram and synthetic events can use a pseudo event called
"stacktrace" that will create a stacktrace at the time of the event and
use it just like it was a normal field. We have other pseudo events such
as "common_cpu" and "common_timestamp". To stay consistent with that,
convert "stacktrace" to "common_stacktrace". As this was used in older
kernels, to keep backward compatibility, this will act just like
"common_cpu" did with "cpu". That is, "cpu" will be the same as
"common_cpu" unless the event has a "cpu" field. In which case, the
event's field is used. The same is true with "stacktrace".
Also update the documentation to reflect this change.
Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Mark Rutland <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Modifiers are used to change the behavior of keys. For instance, they
can be grouped into buckets, converted to syscall names (from the syscall
identifier), show task->comm of the current pid, be an array of longs
that represent a stacktrace, and more.
It was found that nothing stopped a value from taking a modifier, as
values are simple counters. If this happened, it would call code that
was not expecting a modifier and crash the kernel. This was fixed by
having the ___create_val_field() function test if a modifier was present
and fail if one was. This fixed the crash.
Now there's a problem with variables. Variables are used to pass fields
from one event to another. Variables are allowed to have some modifiers,
as the processing may need to happen at the time of the event (like
stacktraces and comm names of the current pid). The issue is that it too
uses __create_val_field(). Now that fails on modifiers, variables can no
longer use them (this is a regression).
As not all modifiers are for variables, have them use a separate check.
Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: [email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Mark Rutland <[email protected]>
Fixes: e0213434fe3e4 ("tracing: Do not let histogram values have some modifiers")
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
During 6.4 development it became clear that the one-shot list used by
the user_event_mm's next field was confusing to others. It is not clear
how this list is protected or what the next field usage is for unless
you are familiar with the code.
Add comments into the user_event_mm struct indicating lock requirement
and usage. Also document how and why this approach was used via comments
in both user_event_enabler_update() and user_event_mm_get_all() and the
rules to properly use it.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wicngggxVpbnrYHjRTwGE0WYscPRM+L2HO2BF8ia1EXgQ@mail.gmail.com/
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Beau Belgrave <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Currently most list_head fields of various structs within user_events
are simply named link. This causes folks to keep additional context in
their head when working with the code, which can be confusing.
Instead of using link, describe what the actual link is, for example:
list_del_rcu(&mm->link);
Changes into:
list_del_rcu(&mm->mms_link);
The reader now is given a hint the link is to the mms global list
instead of having to remember or spot check within the code.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wicngggxVpbnrYHjRTwGE0WYscPRM+L2HO2BF8ia1EXgQ@mail.gmail.com/
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Beau Belgrave <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
pin_user_pages_remote() can reschedule which means we cannot hold any
RCU lock while using it. Now that enablers are not exposed out to the
tracing register callbacks during fork(), there is clearly no need to
require the RCU lock as event_mutex is enough to protect changes.
Remove unneeded RCU usages when pinning pages and walking enablers with
event_mutex held. Cleanup a misleading "safe" list walk that is not
needed. During fork() duplication, remove unneeded RCU list add, since
the list is not exposed yet.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wiiBfT4zNS29jA0XEsy8EmbqTH1hAPdRJCDAJMD8Gxt5A@mail.gmail.com/
Fixes: 7235759084a4 ("tracing/user_events: Use remote writes for event enablement")
Signed-off-by: Linus Torvalds <[email protected]>
[ change log written by Beau Belgrave ]
Signed-off-by: Beau Belgrave <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
When a new mm is being created in a fork() path it currently is
allocated and then attached in one go. This leaves the mm exposed out to
the tracing register callbacks while any parent enabler locations are
copied in. This should not happen.
Split up mm alloc and attach as unique operations. When duplicating
enablers, first alloc, then duplicate, and only upon success, attach.
This prevents any timing window outside of the event_reg mutex for
enablement walking. This allows for dropping RCU requirement for
enablement walking in later patches.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=whTBvXJuoi_kACo3qi5WZUmRrhyA-_=rRFsycTytmB6qw@mail.gmail.com/
Signed-off-by: Linus Torvalds <[email protected]>
[ change log written by Beau Belgrave ]
Signed-off-by: Beau Belgrave <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
forces users to specify pinning location as a string-based absolute or
relative (to current working directory) path. This has various
implications related to security (e.g., symlink-based attacks) and forces
BPF FS to be exposed in the file system, which can cause races with
other applications.
One of the feedbacks we got from folks working with containers heavily
was that inability to use purely FD-based location specification was an
unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
commands. This patch closes this oversight, adding path_fd field to
BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
*at() syscalls for dirfd + pathname combinations.
This now allows interesting possibilities like working with detached BPF
FS mount (e.g., to perform multiple pinnings without running a risk of
someone interfering with them), and generally making pinning/getting
more secure and not prone to any races and/or security attacks.
This is demonstrated by a selftest added in subsequent patch that takes
advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
creating detached BPF FS mount, pinning, and then getting BPF map out of
it, all while never exposing this private instance of BPF FS to outside
worlds.
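From user space, pinning into a detached BPF FS mount could then look
roughly like this (a sketch; the BPF_F_PATH_FD flag name and the exact attr
layout should be checked against the UAPI header):
  union bpf_attr attr = {};

  attr.pathname   = (__u64)(unsigned long)"my_map";  /* relative to path_fd */
  attr.bpf_fd     = map_fd;                          /* object being pinned */
  attr.path_fd    = mnt_fd;                          /* fd from fsmount() of a detached BPF FS */
  attr.file_flags = BPF_F_PATH_FD;                   /* tell the kernel path_fd is in use */

  err = syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));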
Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
cpuhp_bringup_mask() iterates over a cpumask and starts all present CPUs up
to a caller provided upper limit.
The limit variable is decremented and checked for 0 before invoking
cpu_up(), which is obviously off by one and prevents the bringup of the
last CPU when the limit is equal to the number of present CPUs.
Move the decrement and check after the cpu_up() invocation.
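In other words, the loop becomes (heavily simplified, error handling elided):
  for_each_cpu(cpu, mask) {
          cpu_up(cpu, target);
          /* Decrement and check *after* the bringup so the last present
           * CPU is no longer skipped when the limit equals the CPU count. */
          if (!--ncpus)
                  break;
  }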
Fixes: 18415f33e2ac ("cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE")
Reported-by: Mark Brown <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Mark Brown <[email protected]>
Link: https://lore.kernel.org/r/87wn10ufj9.ffs@tglx
|
|
While testing rtla timerlat auto analysis, I reached a condition where
the interface was not receiving tracing data. I was able to manually
reproduce the problem with these steps:
# echo 0 > tracing_on # disable trace
# echo 1 > osnoise/stop_tracing_us # stop trace if timerlat irq > 1 us
# echo timerlat > current_tracer # enable timerlat tracer
# sleep 1 # wait... that is the time when rtla
# apply configs like prio or cgroup
# echo 1 > tracing_on # start tracing
# cat trace
# tracer: timerlat
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# ||||| ACTIVATION
# TASK-PID CPU# ||||| TIMESTAMP ID CONTEXT LATENCY
# | | | ||||| | | | |
NOTHING!
Then, trying to enable tracing again with echo 1 > tracing_on resulted
in no change: the trace was still not tracing.
This problem happens because the timerlat IRQ hits the stop tracing
condition while tracing is off, and does not wake up the timerlat thread,
so the timerlat threads are kept sleeping forever, resulting in no
trace, even after re-enabling the tracer.
Avoid this condition by always waking up the threads, even after stopping
tracing, allowing the tracer to return to its normal operation once
tracing is enabled again.
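In the timerlat IRQ handler the change boils down to something like this
(variable and helper names are illustrative, not the exact ones in the
tracer):
  /* Stop tracing when the threshold is crossed, but wake the thread
   * unconditionally so it is ready once tracing is enabled again. */
  if (stop_tracing_us && irq_latency_us >= stop_tracing_us)
          osnoise_stop_tracing();

  wake_up_process(tlat->kthread);   /* previously skipped when tracing stopped */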
Link: https://lore.kernel.org/linux-trace-kernel/1ed8f830638b20a39d535d27d908e319a9a3c4e2.1683822622.git.bristot@kernel.org
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Fixes: a955d7eac177 ("trace: Add timerlat tracer")
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Do a sanity check whether the provided file-to-be-pinned is actually a BPF
object (prog, map, btf) before calling the security_path_mknod LSM hook. If
it's not, the LSM hook doesn't have to be triggered, as the operation has no
chance of succeeding anyway.
Suggested-by: Christian Brauner <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Each event stores an int to track which bit to set/clear when enablement
changes. On big endian 64-bit configurations, it's possible this could
cause memory corruption when it's used for atomic bit operations.
Use unsigned long for enablement values to ensure any possible
corruption cannot occur. Downcast to int after mask for the bit target.
Link: https://lore.kernel.org/all/[email protected]/
Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Fixes: dcb8177c1395 ("tracing/user_events: Add ioctl for disabling addresses")
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Beau Belgrave <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
A successful call to cgroup_css_set_fork() will always have taken
a ref on kargs->cset (regardless of CLONE_INTO_CGROUP), so always
do a corresponding put in cgroup_css_set_put_fork().
Without this, a cset and its contained css structures will be
leaked for some fork failures. The following script reproduces
the leak for a fork failure due to exceeding pids.max in the
pids controller. A similar thing can happen if we jump to the
bad_fork_cancel_cgroup label in copy_process().
[ -z "$1" ] && echo "Usage $0 pids-root" && exit 1
PID_ROOT=$1
CGROUP=$PID_ROOT/foo
[ -e $CGROUP ] && rmdir $CGROUP
mkdir $CGROUP
echo 5 > $CGROUP/pids.max
echo $$ > $CGROUP/cgroup.procs
fork_bomb()
{
    set -e
    for i in $(seq 10); do
        /bin/sleep 3600 &
    done
}
(fork_bomb) &
wait
echo $$ > $PID_ROOT/cgroup.procs
kill $(cat $CGROUP/cgroup.procs)
rmdir $CGROUP
Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups")
Cc: [email protected] # v5.7+
Signed-off-by: John Sperbeck <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
Smatch warns:
kernel/module/stats.c:394 read_file_mod_stats()
warn: passing freed memory 'buf'
We are passing 'buf' to simple_read_from_buffer() after freeing it.
Fix this by changing the order of 'simple_read_from_buffer' and 'kfree'.
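The fix is just reordering the two calls (sketch, local variable names
approximate):
  ret = simple_read_from_buffer(ubuf, count, ppos, buf, len);
  kfree(buf);          /* free only after the data has been copied out */
  return ret;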
Fixes: df3e764d8e5c ("module: add debug stats to help identify memory pressure")
Signed-off-by: Harshit Mogalapalli <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
|
|
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.
[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89
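For a destination of known size the replacement is mechanical (illustrative):
  /* before: strlcpy(dst, src, sizeof(dst));  -- may read src past sizeof(dst) */
  strscpy(dst, src, sizeof(dst));  /* bounded read, always NUL-terminates dst */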
Signed-off-by: Azeem Shaikh <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
The commit 4f7e7236435c ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock()
deadlock") fixed the deadlock between cgroup_threadgroup_rwsem and
cpus_read_lock() by introducing cgroup_attach_{lock,unlock}() and removing
cpus_read_{lock,unlock}() from cpuset_attach(). But cgroup_transfer_tasks()
was missed and not handled, which will cause the following warning:
WARNING: CPU: 0 PID: 589 at kernel/cpu.c:526 lockdep_assert_cpus_held+0x32/0x40
CPU: 0 PID: 589 Comm: kworker/1:4 Not tainted 6.4.0-rc2-next-20230517 #50
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Workqueue: events cpuset_hotplug_workfn
RIP: 0010:lockdep_assert_cpus_held+0x32/0x40
<...>
Call Trace:
<TASK>
cpuset_attach+0x40/0x240
cgroup_migrate_execute+0x452/0x5e0
? _raw_spin_unlock_irq+0x28/0x40
cgroup_transfer_tasks+0x1f3/0x360
? find_held_lock+0x32/0x90
? cpuset_hotplug_workfn+0xc81/0xed0
cpuset_hotplug_workfn+0xcb1/0xed0
? process_one_work+0x248/0x5b0
process_one_work+0x2b9/0x5b0
worker_thread+0x56/0x3b0
? process_one_work+0x5b0/0x5b0
kthread+0xf1/0x120
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
So just use the cgroup_attach_{lock,unlock}() helper to fix it.
Reported-by: Zhao Gongyi <[email protected]>
Signed-off-by: Qi Zheng <[email protected]>
Acked-by: Muchun Song <[email protected]>
Fixes: 05c7b7a92cc8 ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug")
Cc: [email protected] # v5.17+
Signed-off-by: Tejun Heo <[email protected]>
|
|
Fix all kernel-doc warnings in capability.c:
kernel/capability.c:477: warning: Function parameter or member 'idmap'
not described in 'privileged_wrt_inode_uidgid'
kernel/capability.c:493: warning: Function parameter or member 'idmap'
not described in 'capable_wrt_inode_uidgid'
Signed-off-by: Gaosheng Cui <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Paul Moore <[email protected]>
|
|
The LRU and LRU_PERCPU maps allocate a new element on update before locking the
target hash table bucket. Right after that the maps try to lock the bucket.
If this fails, then maps return -EBUSY to the caller without releasing the
allocated element. This makes the element untracked: it doesn't belong to
either of free lists, and it doesn't belong to the hash table, so can't be
re-used; this eventually leads to the permanent -ENOMEM on LRU map updates,
which is unexpected. Fix this by returning the element to the local free list
if bucket locking fails.
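The error path therefore becomes roughly (sketch; helper names follow
kernel/bpf/hashtab.c but the surrounding code is condensed):
  l_new = prealloc_lru_pop(htab, key, hash);   /* element taken off the free list */
  if (!l_new)
          return -ENOMEM;

  ret = htab_lock_bucket(htab, b, hash, &flags);
  if (ret) {
          htab_lru_push_free(htab, l_new);     /* was simply leaked before the fix */
          return ret;
  }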
Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked")
Signed-off-by: Anton Protopopov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
|
|
Remove some unnecessary header files. No functional change intended.
Signed-off-by: Miaohe Lin <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
A psi_group's poll_min_period is determined by the minimum window size of
psi_trigger when creating new triggers. While destroying a psi_trigger,
there is no need to reset poll_min_period if the psi_trigger being
destroyed did not have the minimum window size, since in this condition
poll_min_period will remain the same as before.
Signed-off-by: Yang Yang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Suren Baghdasaryan <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
This commit adds the ability to filter kfuncs to certain BPF program
types. This is required to limit bpf_sock_destroy kfunc implemented in
follow-up commits to programs with attach type 'BPF_TRACE_ITER'.
The commit adds a callback filter to 'struct btf_kfunc_id_set'. The
filter has access to the `bpf_prog` construct including its properties
such as `expected_attach_type`.
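A sketch of how such a filter could be wired up (the callback signature and
return convention here are assumptions based on the description above, and
the id-set names are hypothetical):
  static int tracing_iter_filter(const struct bpf_prog *prog, u32 kfunc_id)
  {
          /* Only BPF_TRACE_ITER programs may use this kfunc set. */
          return prog->expected_attach_type == BPF_TRACE_ITER ? 0 : -EACCES;
  }

  static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
          .owner  = THIS_MODULE,
          .set    = &bpf_sock_destroy_kfunc_ids,
          .filter = tracing_iter_filter,
  };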
Signed-off-by: Aditi Ghag <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
|
|
The target_btf_id can help us understand which kernel function is
linked by a tracing prog. The target_btf_id and target_obj_id have
already been exposed to userspace, so we just need to show them.
The result is as follows:
$ cat /proc/10673/fdinfo/10
pos: 0
flags: 02000000
mnt_id: 15
ino: 2094
link_type: tracing
link_id: 2
prog_tag: a04f5eef06a7f555
prog_id: 13
attach_type: 24
target_obj_id: 1
target_btf_id: 13964
Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
A narrow load from a 64-bit context field results in a 64-bit load
followed potentially by a 64-bit right-shift and then a bitwise AND
operation to extract the relevant data.
In the case of a 32-bit access, an immediate mask of 0xffffffff is used
to construct a 64-bit BPF_AND operation which then sign-extends the mask
value and effectively acts as a glorified no-op. For example:
0: 61 10 00 00 00 00 00 00 r0 = *(u32 *)(r1 + 0)
results in the following code generation for a 64-bit field:
ldr x7, [x7] // 64-bit load
mov x10, #0xffffffffffffffff
and x7, x7, x10
Fix the mask generation so that narrow loads always perform a 32-bit AND
operation:
ldr x7, [x7] // 64-bit load
mov w10, #0xffffffff
and w7, w7, w10
Cc: Alexei Starovoitov <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: Krzesimir Nowak <[email protected]>
Cc: Andrey Ignatov <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Fixes: 31fd85816dbe ("bpf: permits narrower load from bpf program context fields")
Signed-off-by: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
This implements a new interface to lockdep, lock_set_cmp_fn(), for
defining a custom ordering when taking multiple locks of the same
class.
This is an alternative to subclasses, but cannot fully replace them
since subclasses allow lock hierarchies with other classes
intertwined, while this relies on pure class nesting.
Specifically, if A is our nesting class then:
A/0 <- B <- A/1
Would be a valid lock order with subclasses (each subclass really is a
full class from the validation PoV) but not with this annotation,
which requires all nesting to be consecutive.
Example output:
| ============================================
| WARNING: possible recursive locking detected
| 6.2.0-rc8-00003-g7d81e591ca6a-dirty #15 Not tainted
| --------------------------------------------
| kworker/14:3/938 is trying to acquire lock:
| ffff8880143218c8 (&b->lock l=0 0:2803368){++++}-{3:3}, at: bch_btree_node_get.part.0+0x81/0x2b0
|
| but task is already holding lock:
| ffff8880143de8c8 (&b->lock l=1 1048575:9223372036854775807){++++}-{3:3}, at: __bch_btree_map_nodes+0xea/0x1e0
| and the lock comparison function returns 1:
|
| other info that might help us debug this:
| Possible unsafe locking scenario:
|
| CPU0
| ----
| lock(&b->lock l=1 1048575:9223372036854775807);
| lock(&b->lock l=0 0:2803368);
|
| *** DEADLOCK ***
|
| May be due to missing lock nesting notation
|
| 3 locks held by kworker/14:3/938:
| #0: ffff888005ea9d38 ((wq_completion)bcache){+.+.}-{0:0}, at: process_one_work+0x1ec/0x530
| #1: ffff8880098c3e70 ((work_completion)(&cl->work)#3){+.+.}-{0:0}, at: process_one_work+0x1ec/0x530
| #2: ffff8880143de8c8 (&b->lock l=1 1048575:9223372036854775807){++++}-{3:3}, at: __bch_btree_map_nodes+0xea/0x1e0
[peterz: extended changelog]
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Conflicts:
drivers/net/ethernet/freescale/fec_main.c
6ead9c98cafc ("net: fec: remove the xdp_return_frame when lack of tx BDs")
144470c88c5d ("net: fec: using the standard return codes when xdp xmit errors")
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Three functions that are defined in x86 specific code to override
generic __weak implementations cause a warning because of a missing
prototype:
arch/x86/power/cpu.c:298:5: error: no previous prototype for 'hibernate_resume_nonboot_cpu_disable' [-Werror=missing-prototypes]
arch/x86/power/hibernate.c:129:5: error: no previous prototype for 'arch_hibernation_header_restore' [-Werror=missing-prototypes]
arch/x86/power/hibernate.c:91:5: error: no previous prototype for 'arch_hibernation_header_save' [-Werror=missing-prototypes]
Move the declarations into a global header so it can be included
by any file defining one of these.
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Alexander Lobakin <[email protected]>
Link: https://lore.kernel.org/all/20230516193549.544673-14-arnd%40kernel.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
"- Initialize the 'ret' local variable in fprobe_handler() to fix the
smatch warning. Without this fix, the fprobe function exit handler
randomly fails to work.
- Fix to use preempt_enable/disable_notrace for rethook handler to
prevent recursive call of fprobe exit handler (which is based on
rethook)
- Fix recursive call issue on fprobe_kprobe_handler()
- Fix to detect recursive call on fprobe_exit_handler()
- Fix to make all arch-dependent rethook code notrace (the
arch-independent code is already notrace)"
* tag 'probes-fixes-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rethook, fprobe: do not trace rethook related functions
fprobe: add recursion detection in fprobe_exit_handler
fprobe: make fprobe_kprobe_handler recursion free
rethook: use preempt_{disable, enable}_notrace in rethook_trampoline_handler
tracing: fprobe: Initialize ret valiable to fix smatch error
|
|
Now that wq_worker_tick() is there, we can easily track the rough CPU time
consumption of each workqueue by charging the whole tick whenever a tick
hits an active workqueue. While not super accurate, it provides reasonable
visibility into the workqueues that consume a lot of CPU cycles.
wq_monitor.py is updated to report the per-workqueue CPU times.
v2: wq_monitor.py was using "cputime" as the key when outputting in json
format. Use "cpu_time" instead for consistency with other fields.
Signed-off-by: Tejun Heo <[email protected]>
|
|
Workqueue now automatically marks per-cpu work items that hog CPU for too
long as CPU_INTENSIVE, which excludes them from concurrency management and
prevents stalling other concurrency-managed work items. If a work function
keeps running over the threshold, it likely needs to be switched to use an
unbound workqueue.
This patch adds a debug mechanism which tracks the work functions which
trigger the automatic CPU_INTENSIVE mechanism and report them using
pr_warn() with exponential backoff.
v3: Documentation update.
v2: Drop bouncing to kthread_worker for printing messages. It was to avoid
introducing circular locking dependency through printk but not effective
as it still had pool lock -> wci_lock -> printk -> pool lock loop. Let's
just print directly using printk_deferred().
Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
|
|
If a per-cpu work item hogs the CPU, it can prevent other work items from
starting through concurrency management. A per-cpu workqueue which intends
to host such CPU-hogging work items can choose to not participate in
concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be
error-prone and difficult to debug when missed.
This patch adds an automatic CPU usage based detection. If a
concurrency-managed work item consumes more CPU time than the threshold
(10ms by default) continuously without intervening sleeps, wq_worker_tick()
which is called from scheduler_tick() will detect the condition and
automatically mark it CPU_INTENSIVE.
The mechanism isn't foolproof:
* Detection depends on tick hitting the work item. Getting preempted at the
right timings may allow a violating work item to evade detection at least
temporarily.
* nohz_full CPUs may not be running ticks and thus can fail detection.
* Even when detection is working, the 10ms detection delays can add up if
many CPU-hogging work items are queued at the same time.
However, in vast majority of cases, this should be able to detect violations
reliably and provide reasonable protection with a small increase in code
complexity.
If some work items trigger this condition repeatedly, the bigger problem
likely is the CPU being saturated with such per-cpu work items and the
solution would be making them UNBOUND. The next patch will add a debug
mechanism to help spot such cases.
v4: Documentation for workqueue.cpu_intensive_thresh_us added to
kernel-parameters.txt.
v3: Switch to use wq_worker_tick() instead of hooking into preemptions as
suggested by Peter.
v2: Lai pointed out that wq_worker_stopping() also needs to be called from
preemption and rtlock paths and an earlier patch was updated
accordingly. This patch adds a comment describing the risk of infinite
recursions and how they're avoided.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Lai Jiangshan <[email protected]>
|
|
* Some worker fields are modified only by the worker itself while holding
pool->lock thus making them safe to read from self, IRQ context if the CPU
is running the worker or while holding pool->lock. Add 'K' locking rule
for them.
* worker->sleeping is currently marked "None" which isn't very descriptive.
It's used only by the worker itself. Add 'S' locking rule for it.
A future patch will depend on the 'K' rule to access worker->current_* from
the scheduler ticks.
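For reference, the annotated fields in struct worker end up looking roughly
like this (abridged sketch):
  /*
   * K: Modified only by the worker while holding pool->lock.  Safe to read
   *    from the worker itself, from IRQ context on the worker's CPU, or
   *    while holding pool->lock.
   * S: Used only by the worker itself.
   */
  struct worker {
          ...
          struct work_struct      *current_work;   /* K: work being processed */
          work_func_t             current_func;    /* K: current_work's fn */
          int                     sleeping;        /* S: is the worker sleeping? */
          ...
  };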
Signed-off-by: Tejun Heo <[email protected]>
|
|
They are going to be used in wq_worker_stopping(). Move them upwards.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Lai Jiangshan <[email protected]>
|
|
struct worker was laid out with the intent that all fields that are modified
for each work item execution are in the first cacheline. However, this
hasn't been true for a while with the addition of ->last_func. Let's just
collect hot fields together at the top.
Move ->sleeping into the hole after ->current_color and move ->last_func right
below. While at it, drop the cacheline comment which isn't useful anymore.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Lai Jiangshan <[email protected]>
|
|
Currently, the only way to peer into workqueue operations is through
tracing. While possible, it isn't easy or convenient to monitor
per-workqueue behaviors over time this way. Let's add pwq->stats[] that
track relevant events and a drgn monitoring script -
tools/workqueue/wq_monitor.py.
It's arguable whether this needs to be configurable. However, it currently
only has several counters and the runtime overhead shouldn't be noticeable
given that they're on pwq's which are per-cpu on per-cpu workqueues and
per-numa-node on unbound ones. Let's keep it simple for the time being.
v2: Patch reordered to earlier with fewer fields. Field will be added back
gradually. Help message improved.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Lai Jiangshan <[email protected]>
|
|
fprobe_handler and fprobe_kprobe_handler have guarded ftrace recursion
detection, but fprobe_exit_handler does not, which possibly introduces
recursive calls if the fprobe exit callback calls any traceable
functions. Checking in fprobe_handler or fprobe_kprobe_handler
is not enough and misses this case.
So add a recursion-free guard the same way as fprobe_handler does. Since
the ftrace recursion check does not employ ip(s), use entry_ip and
entry_parent_ip here, the same as fprobe_handler does.
Link: https://lore.kernel.org/all/[email protected]/
Fixes: 5b0ab78998e3 ("fprobe: Add exit_handler support")
Signed-off-by: Ze Gao <[email protected]>
Cc: [email protected]
Acked-by: Masami Hiramatsu (Google) <[email protected]>
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
|