Age | Commit message (Collapse) | Author | Files | Lines |
|
This patch fixes an incorrect assumption made in the original
bpf_refcount series [0], specifically that the BPF program calling
bpf_refcount_acquire on some node can always guarantee that the node is
alive. In that series, the patch adding failure behavior to rbtree_add
and list_push_{front, back} breaks this assumption for non-owning
references.
Consider the following program:
n = bpf_kptr_xchg(&mapval, NULL);
/* skip error checking */
bpf_spin_lock(&l);
if(bpf_rbtree_add(&t, &n->rb, less)) {
bpf_refcount_acquire(n);
/* Failed to add, do something else with the node */
}
bpf_spin_unlock(&l);
It's incorrect to assume that bpf_refcount_acquire will always succeed in this
scenario. bpf_refcount_acquire is being called in a critical section
here, but the lock being held is associated with rbtree t, which isn't
necessarily the lock associated with the tree that the node is already
in. So after bpf_rbtree_add fails to add the node and calls bpf_obj_drop
in it, the program has no ownership of the node's lifetime. Therefore
the node's refcount can be decr'd to 0 at any time after the failing
rbtree_add. If this happens before the refcount_acquire above, the node
might be free'd, and regardless refcount_acquire will be incrementing a
0 refcount.
Later patches in the series exercise this scenario, resulting in the
expected complaint from the kernel (without this patch's changes):
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 1 PID: 207 at lib/refcount.c:25 refcount_warn_saturate+0xbc/0x110
Modules linked in: bpf_testmod(O)
CPU: 1 PID: 207 Comm: test_progs Tainted: G O 6.3.0-rc7-02231-g723de1a718a2-dirty #371
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
RIP: 0010:refcount_warn_saturate+0xbc/0x110
Code: 6f 64 f6 02 01 e8 84 a3 5c ff 0f 0b eb 9d 80 3d 5e 64 f6 02 00 75 94 48 c7 c7 e0 13 d2 82 c6 05 4e 64 f6 02 01 e8 64 a3 5c ff <0f> 0b e9 7a ff ff ff 80 3d 38 64 f6 02 00 0f 85 6d ff ff ff 48 c7
RSP: 0018:ffff88810b9179b0 EFLAGS: 00010082
RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
RDX: 0000000000000202 RSI: 0000000000000008 RDI: ffffffff857c3680
RBP: ffff88810027d3c0 R08: ffffffff8125f2a4 R09: ffff88810b9176e7
R10: ffffed1021722edc R11: 746e756f63666572 R12: ffff88810027d388
R13: ffff88810027d3c0 R14: ffffc900005fe030 R15: ffffc900005fe048
FS: 00007fee0584a700(0000) GS:ffff88811b280000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005634a96f6c58 CR3: 0000000108ce9002 CR4: 0000000000770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
bpf_refcount_acquire_impl+0xb5/0xc0
(rest of output snipped)
The patch addresses this by changing bpf_refcount_acquire_impl to use
refcount_inc_not_zero instead of refcount_inc and marking
bpf_refcount_acquire KF_RET_NULL.
For owning references, though, we know the above scenario is not possible
and thus that bpf_refcount_acquire will always succeed. Some verifier
bookkeeping is added to track "is input owning ref?" for bpf_refcount_acquire
calls and return false from is_kfunc_ret_null for bpf_refcount_acquire on
owning refs despite it being marked KF_RET_NULL.
Existing selftests using bpf_refcount_acquire are modified where
necessary to NULL-check its return value.
[0]: https://lore.kernel.org/bpf/[email protected]/
Fixes: d2dcc67df910 ("bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail")
Reported-by: Kumar Kartikeya Dwivedi <[email protected]>
Signed-off-by: Dave Marchevsky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Document the threshold behavior for -M/--metrics.
Reviewed-by: Kan Liang <[email protected]>
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Ahmad Yasin <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Caleb Biggers <[email protected]>
Cc: Eduard Zingerman <[email protected]>
Cc: Edward Baker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Perry Taylor <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Samantha Alt <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Weilin Wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Currently the & and | operators are only used in metric thresholds like
(from the tma_retiring metric):
tma_retiring > 0.7 | tma_heavy_operations > 0.1
Thresholds are always computed when present, but a lack of events may
mean the threshold can't be computed. This happens with the option
--metric-no-threshold for say the metric tma_retiring on Tigerlake model
CPUs.
To fully compute the threshold tma_heavy_operations is needed and it
needs the extra events of IDQ.MS_UOPS, UOPS_DECODED.DEC0,
cpu/UOPS_DECODED.DEC0,cmask=1/ and IDQ.MITE_UOPS. So
--metric-no-threshold is a useful option to reduce the number of events
needed and potentially multiplexing of events.
Rather than just fail threshold computations like this, we may know a
result from just the left or right-hand side. So, for tma_retiring if
its value is "> 0.7" we know it is over the threshold. This allows the
metric to have the threshold coloring, when possible, without all the
counters being programmed.
Reviewed-by: Kan Liang <[email protected]>
Signed-off-by: Ian Rogers <[email protected]>
Tested-by: Kan Liang <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Ahmad Yasin <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Caleb Biggers <[email protected]>
Cc: Eduard Zingerman <[email protected]>
Cc: Edward Baker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Perry Taylor <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Samantha Alt <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Weilin Wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Add the MOPS hwcap to the hwcap kselftest and check that a SIGILL is not
generated when the feature is detected. A SIGILL is reliable when the
feature is not detected as SCTLR_EL1.MSCEn won't have been set.
Signed-off-by: Kristina Martsenko <[email protected]>
Reviewed-by: Mark Brown <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
|
|
In order to print the numerical entries of the syscall table,
there is no need to call the host compiler to build and then
run a program, this can be done directly by the shell script.
This is similar with commit 9854e7ad35fe ("perf arm64: Simplify
mksyscalltbl"). For now, the mksyscalltbl file of LoongArch is
almost same with arm64.
Reviewed-by: Huacai Chen <[email protected]>
Reviewed-by: Leo Yan <[email protected]>
Signed-off-by: Tiezhu Yang <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Like x86, powerpc, mips and s390, use max_nr which is a digital
number to define SYSCALLTBL_ARM64_MAX_ID.
Reviewed-by: Huacai Chen <[email protected]>
Signed-off-by: Tiezhu Yang <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
After commit 9854e7ad35fe ("perf arm64: Simplify mksyscalltbl"),
in the generated syscall table file syscalls.c, there exist some
__NR3264_ prefixed syscall numbers such as [__NR3264_ftruncate],
it looks like not so good, just do some small filter operations
to handle __NR3264_ prefixed syscall number as a digital number.
Without this patch:
[__NR3264_ftruncate] = "ftruncate",
With this patch:
[46] = "ftruncate",
Suggested-by: Alexander Kapshuk <[email protected]>
Reviewed-by: Huacai Chen <[email protected]>
Reviewed-by: Leo Yan <[email protected]>
Signed-off-by: Tiezhu Yang <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
After commit 9854e7ad35fecf30 ("perf arm64: Simplify mksyscalltbl") it
has been removed the temporary C program and used shell to generate
syscall table, so let us rename create_table_from_c() to
create_sc_table() to avoid confusion.
Suggested-by: Leo Yan <[email protected]>
Reviewed-by: Huacai Chen <[email protected]>
Signed-off-by: Tiezhu Yang <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
syscalltbl_*[] should never be changing, let us declare it as const.
Suggested-by: Ian Rogers <[email protected]>
Reviewed-by: Huacai Chen <[email protected]>
Signed-off-by: Tiezhu Yang <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
To align with what is done by the in-kernel PM, update userspace pm
subflow selftests, by sending the a remove_addrs command together
before the remove_subflows command. This will get a RM_ADDR in
chk_rm_nr().
Fixes: d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
Fixes: 5e986ec46874 ("selftests: mptcp: userspace pm subflow tests")
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/379
Cc: [email protected]
Reviewed-by: Matthieu Baerts <[email protected]>
Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This patch is linked to the previous commit ("mptcp: only send RM_ADDR in
nl_cmd_remove").
To align with what is done by the in-kernel PM, update userspace pm addr
selftests, by sending a remove_subflows command together after the
remove_addrs command.
Fixes: d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
Fixes: 97040cf9806e ("selftests: mptcp: userspace pm address tests")
Cc: [email protected]
Reviewed-by: Matthieu Baerts <[email protected]>
Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Without this we were not getting the thousands separator for big
numbers.
Noticed while developing 'perf bench uprobe', but the use of %' predates
that, for instance 'perf bench syscall' uses it.
Before:
# perf bench uprobe all
# Running uprobe/baseline benchmark...
# Executed 1000 usleep(1000) calls
Total time: 1054082243ns
1054082.243000 nsecs/op
#
After:
# perf bench uprobe all
# Running uprobe/baseline benchmark...
# Executed 1,000 usleep(1000) calls
Total time: 1,053,715,144ns
1,053,715.144000 nsecs/op
#
Fixes: c2a08203052f8975 ("perf bench: Add basic syscall benchmark")
Cc: Adrian Hunter <[email protected]>
Cc: Andre Fredette <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: Dave Tucker <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Derek Barbosa <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Tiezhu Yang <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
When everything is configured, VLAN membership on the bridge in this
selftest are as follows:
# bridge vlan show
port vlan-id
swp2 1 PVID Egress Untagged
555
br1 1 Egress Untagged
555 PVID Egress Untagged
Note that it is possible for untagged traffic to just flow through as VLAN
1, instead of using VLAN 555 as intended by the test. This configuration
seems too close to "works by accident", and it would be better to just shut
out VLAN 1 altogether.
To that end, configure vlan_default_pvid of 0:
# bridge vlan show
port vlan-id
swp2 555
br1 555 PVID Egress Untagged
Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Add a topology diagram to this selftest to make the configuration easier to
understand.
Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The topology diagram implies that $swp1 and $swp2 are members of the bridge
br0, when in fact only their uppers, $swp1.10 and $swp2.10 are. Adjust the
diagram.
Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The topology diagram implies that $swp1 and $swp2 are members of the bridge
br0, when in fact only their uppers, $swp1.10 and $swp2.10 are. Adjust the
diagram.
Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Amit Cohen <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
GCC 11.3.0 issues warnings in this module about wrong sizes of format
specifiers:
pcm-test.c: In function ‘test_pcm_time’:
pcm-test.c:384:68: warning: format ‘%ld’ expects argument of type ‘long int’, but argument 5 \
has type ‘unsigned int’ [-Wformat=]
384 | snprintf(msg, sizeof(msg), "rate mismatch %ld != %ld", rate, rrate);
pcm-test.c:455:53: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has \
type ‘long int’ [-Wformat=]
455 | "expected %d, wrote %li", rate, frames);
pcm-test.c:462:53: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has \
type ‘long int’ [-Wformat=]
462 | "expected %d, wrote %li", rate, frames);
pcm-test.c:467:53: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has \
type ‘long int’ [-Wformat=]
467 | "expected %d, wrote %li", rate, frames);
Simple fix according to compiler's suggestion removed the warnings.
Signed-off-by: Mirsad Goran Todorovac <[email protected]>
Reviewed-by: Mark Brown <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
|
|
We need the USB fixes in here are well.
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Address some fallout of the locking rework, this time affecting the
way the vgic is configured
- Fix an issue where the page table walker frees a subtree and then
proceeds with walking what it has just freed...
- Check that a given PA donated to the guest is actually memory (only
affecting pKVM)
- Correctly handle MTE CMOs by Set/Way
- Fix the reported address of a watchpoint forwarded to userspace
- Fix the freeing of the root of stage-2 page tables
- Stop creating spurious PMU events to perform detection of the
default PMU and use the existing PMU list instead
x86:
- Fix a memslot lookup bug in the NX recovery thread that could
theoretically let userspace bypass the NX hugepage mitigation
- Fix a s/BLOCKING/PENDING bug in SVM's vNMI support
- Account exit stats for fastpath VM-Exits that never leave the super
tight run-loop
- Fix an out-of-bounds bug in the optimized APIC map code, and add a
regression test for the race"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: selftests: Add test for race in kvm_recalculate_apic_map()
KVM: x86: Bail from kvm_recalculate_phys_map() if x2APIC ID is out-of-bounds
KVM: x86: Account fastpath-only VM-Exits in vCPU stats
KVM: SVM: vNMI pending bit is V_NMI_PENDING_MASK not V_NMI_BLOCKING_MASK
KVM: x86/mmu: Grab memslot for correct address space in NX recovery worker
KVM: arm64: Document default vPMU behavior on heterogeneous systems
KVM: arm64: Iterate arm_pmus list to probe for default PMU
KVM: arm64: Drop last page ref in kvm_pgtable_stage2_free_removed()
KVM: arm64: Populate fault info for watchpoint
KVM: arm64: Reload PTE after invoking walker callback on preorder traversal
KVM: arm64: Handle trap of tagged Set/Way CMOs
arm64: Add missing Set/Way CMO encodings
KVM: arm64: Prevent unconditional donation of unmapped regions from the host
KVM: arm64: vgic: Fix a comment
KVM: arm64: vgic: Fix locking comment
KVM: arm64: vgic: Wrap vgic_its_create() with config_lock
KVM: arm64: vgic: Fix a circular locking issue
|
|
This resolves the issue that generated binary is showing up as an untracked git file after every build on the kernel.
Signed-off-by: Weihao Gao <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- Return NULL if the trace_probe list on trace_probe_event is empty
- selftests/ftrace: Choose testing symbol name for filtering feature
from sample data instead of fixed symbol
* tag 'probes-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
selftests/ftrace: Choose target function for filter test from samples
tracing/probe: trace_probe_primary_from_call(): checked list_first_entry
|
|
Since the event-filter-function.tc expects the 'exit_mmap()' directly
calls 'kmem_cache_free()', this is vulnerable to code modifications.
Choose the target function for the filter test from the sample
event data so that it can keep test running correctly even if the caller
function name will be changed.
Link: https://lore.kernel.org/linux-trace-kernel/167919441260.1922645.18355804179347364057.stgit@mhiramat.roam.corp.google.com/
Link: https://lore.kernel.org/all/CA+G9fYtF-XEKi9YNGgR=Kf==7iRb2FrmEC7qtwAeQbfyah-UhA@mail.gmail.com/
Reported-by: Linux Kernel Functional Testing <[email protected]>
Fixes: 7f09d639b8c4 ("tracing/selftests: Add test for event filtering on function name")
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
Acked-by: Steven Rostedt (Google) <[email protected]>
|
|
Notifications may come in at any time. The family must be always
ready to parse a random incoming notification. Generate notification
table for parsing and tell YNL which request we're processing
to distinguish responses from notifications.
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
We'll want to store static info about the family soon.
Generate a struct. This changes creation from, e.g.:
ys = ynl_sock_create("netdev", &yerr);
to:
ys = ynl_sock_create(&ynl_netdev_family, &yerr);
on user's side.
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
We expect user to allocate requests with calloc(),
make things a bit more consistent and provide helpers.
Generate free calls, too.
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
We generate send() and recv() calls and all msg handling for
each operation. It's a lot of repeated code and will only grow
with notification handling. Call back to a helper YNL lib instead.
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
It's sometimes useful to print the name of an enum value,
flag or name of the op. Python can do it, add C helper
code gen for getting names of things.
Example:
static const char * const netdev_xdp_act_strmap[] = {
[0] = "basic",
[1] = "redirect",
[2] = "ndo-xmit",
[3] = "xsk-zerocopy",
[4] = "hw-offload",
[5] = "rx-sg",
[6] = "ndo-xmit-sg",
};
const char *netdev_xdp_act_str(enum netdev_xdp_act value)
{
value = ffs(value) - 1;
if (value < 0 || value >= (int)MNL_ARRAY_SIZE(netdev_xdp_act_strmap))
return NULL;
return netdev_xdp_act_strmap[value];
}
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Parsing nested types may return an error, propagate it.
Not marking as a fix, because nothing uses YNL upstream.
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Both event and notify types are always consistent. Rewrite
the condition checking if we can reuse reply types to be
less picky and let notify thru.
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
For pure structs (parsed nested attributes) we track what
forms of the struct exist in request and reply directions.
Make sure we don't overwrite the recorded struct each time,
otherwise the information is lost.
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Unused and Pad attributes don't carry information.
Unused should never exist, and be rejected.
Pad should be silently skipped.
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Make sure all relevant headers are included, we allocate memory,
use memcpy() and Linux types without including the headers.
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Keep switching between LAPIC_MODE_X2APIC and LAPIC_MODE_DISABLED during
APIC map construction to hunt for TOCTOU bugs in KVM. KVM's optimized map
recalc makes multiple passes over the list of vCPUs, and the calculations
ignore vCPU's whose APIC is hardware-disabled, i.e. there's a window where
toggling LAPIC_MODE_DISABLED is quite interesting.
Signed-off-by: Michal Luczaj <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Add a selftest that accesses a BPF_MAP_TYPE_ARRAY (at a nonzero index)
nested within a BPF_MAP_TYPE_HASH_OF_MAPS to flex a previously buggy
case.
Signed-off-by: Rhys Rustad-Elliott <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
|
|
Fixes a bunch of warnings like:
drivers/input/tests/input_test.o: warning: objtool: input_test_init+0x1cb: stack state mismatch: cfa1=4+64 cfa2=4+56
lib/kunit/kunit-test.o: warning: objtool: kunit_log_newline_test+0xfb: return with modified stack frame
...
Fixes: 260755184cbd ("kunit: Move kunit_abort() call out of kunit_do_failed_assertion()")
Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/20230602175453.swsn3ehyochtwkhy@treble
|
|
Make sure we don't generate premature POLLIN events.
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The test case shown in [1] triggers the kernel to access the null pointer.
Therefore, add related test cases to mq.
The test results are as follows:
./tdc.py -e 0531
1..1
ok 1 0531 - Replace mq with invalid parent ID
./tdc.py -c mq
1..8
ok 1 ce7d - Add mq Qdisc to multi-queue device (4 queues)
ok 2 2f82 - Add mq Qdisc to multi-queue device (256 queues)
ok 3 c525 - Add duplicate mq Qdisc
ok 4 128a - Delete nonexistent mq Qdisc
ok 5 03a9 - Delete mq Qdisc twice
ok 6 be0f - Add mq Qdisc to single-queue device
ok 7 1023 - Show mq class
ok 8 0531 - Replace mq with invalid parent ID
[1] https://lore.kernel.org/all/[email protected]/
Signed-off-by: Zhengchao Shao <[email protected]>
Reviewed-by: Pedro Tammela <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
drivers/net/ethernet/sfc/tc.c
622ab656344a ("sfc: fix error unwinds in TC offload")
b6583d5e9e94 ("sfc: support TC decap rules matching on enc_src_port")
net/mptcp/protocol.c
5b825727d087 ("mptcp: add annotations around msk->subflow accesses")
e76c8ef5cc5b ("mptcp: refactor mptcp_stream_accept()")
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Happy Wear a Dress Day.
Fairly standard-sized batch of fixes, accounting for the lack of
sub-tree submissions this week. The mlx5 IRQ fixes are notable, people
were complaining about that. No fires burning.
Current release - regressions:
- eth: mlx5e:
- multiple fixes for dynamic IRQ allocation
- prevent encap offload when neigh update is running
- eth: mana: fix perf regression: remove rx_cqes, tx_cqes counters
Current release - new code bugs:
- eth: mlx5e: DR, add missing mutex init/destroy in pattern manager
Previous releases - always broken:
- tcp: deny tcp_disconnect() when threads are waiting
- sched: prevent ingress Qdiscs from getting installed in random
locations in the hierarchy and moving around
- sched: flower: fix possible OOB write in fl_set_geneve_opt()
- netlink: fix NETLINK_LIST_MEMBERSHIPS length report
- udp6: fix race condition in udp6_sendmsg & connect
- tcp: fix mishandling when the sack compression is deferred
- rtnetlink: validate link attributes set at creation time
- mptcp: fix connect timeout handling
- eth: stmmac: fix call trace when stmmac_xdp_xmit() is invoked
- eth: amd-xgbe: fix the false linkup in xgbe_phy_status
- eth: mlx5e:
- fix corner cases in internal buffer configuration
- drain health before unregistering devlink
- usb: qmi_wwan: set DTR quirk for BroadMobi BM818
Misc:
- tcp: return user_mss for TCP_MAXSEG in CLOSE/LISTEN state if
user_mss set"
* tag 'net-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (71 commits)
mptcp: fix active subflow finalization
mptcp: add annotations around sk->sk_shutdown accesses
mptcp: fix data race around msk->first access
mptcp: consolidate passive msk socket initialization
mptcp: add annotations around msk->subflow accesses
mptcp: fix connect timeout handling
rtnetlink: add the missing IFLA_GRO_ tb check in validate_linkmsg
rtnetlink: move IFLA_GSO_ tb check to validate_linkmsg
rtnetlink: call validate_linkmsg in rtnl_create_link
ice: recycle/free all of the fragments from multi-buffer frame
net: phy: mxl-gpy: extend interrupt fix to all impacted variants
net: renesas: rswitch: Fix return value in error path of xmit
net: dsa: mv88e6xxx: Increase wait after reset deactivation
net: ipa: Use correct value for IPA_STATUS_SIZE
tcp: fix mishandling when the sack compression is deferred.
net/sched: flower: fix possible OOB write in fl_set_geneve_opt()
sfc: fix error unwinds in TC offload
net/mlx5: Read embedded cpu after init bit cleared
net/mlx5e: Fix error handling in mlx5e_refresh_tirs
net/mlx5: Ensure af_desc.mask is properly initialized
...
|
|
Verify that KVM reports the actual number of CPUID entries on success, but
doesn't touch the userspace struct on failure (which for better or worse,
is KVM's ABI).
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Add a test for page splitting during dirty logging and for hugepage
recovery after dirty logging.
Page splitting represents non-trivial behavior, which is complicated
by MANUAL_PROTECT mode, which causes pages to be split on the first
clear, instead of when dirty logging is enabled.
Add a test which makes assertions about page counts to help define the
expected behavior of page splitting and to provide needed coverage of the
behavior. This also helps ensure that a failure in eager page splitting
is not covered up by splitting in the vCPU path.
Tested by running the test on an Intel Haswell machine w/wo
MANUAL_PROTECT.
Reviewed-by: Vipin Sharma <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: let the user run without hugetlb, as suggested by Paolo]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Move some helper functions from dirty_log_perf_test.c to the memstress
library so that they can be used in a future commit which tests page
splitting during dirty logging.
Reviewed-by: Vipin Sharma <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Access the same memory addresses on each iteration of the memstress
guest code. This ensures that the state of KVM's page tables
is the same after every iteration, including the pages that host the
guest page tables for args and vcpu_args.
This difference is visible when running the proposed
dirty_log_page_splitting_test[*] on AMD, or on Intel with pml=0 and
eptad=0. The tests fail due to different semantics of dirty bits for
page-table pages on AMD (and eptad=0) and Intel. Both AMD and Intel with
eptad=0 treat page-table accesses as writes, therefore more pages are
dropped before the repopulation phase when dirty logging is disabled.
The "missing" page had been included in the population phase because it
hosts the page tables for vcpu_args, but repopulation does not need it."
Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: Vipin Sharma <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: add additional details in changelog]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
On large systems, it's common that PID/TID is bigger than 5-digit and it
makes the output unaligned. Let's increase the width to 7.
Before:
$ perf script
...
swapper 0 [006] 1540823.803935: 1369324 cycles:P: ffffffff9c755588 ktime_get+0x18 ([kernel.kallsyms])
gvfsd-dnssd 95114 [004] 1540823.804164: 1643871 cycles:P: ffffffff9cfdca5c __get_user_8+0x1c ([kernel.kallsyms])
perf-exec 1558582 [000] 1540823.804209: 1018714 cycles:P: ffffffff9c924ab9 __slab_free+0x9 ([kernel.kallsyms])
nmcli 1558589 [007] 1540823.804384: 1859212 cycles:P: 7f70537a8ad8 __strchrnul_evex+0x18 (/usr/lib/x86_64-linux-gnu/libc.so.6>
sleep 1558582 [000] 1540823.804456: 987425 cycles:P: 7fd35bb27b30 _dl_init+0x0 (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2>
dbus-daemon 3043 [003] 1540823.804575: 1564465 cycles:P: ffffffff9cb2bb70 llist_add_batch+0x0 ([kernel.kallsyms])
gdbus 1558592 [001] 1540823.804766: 1315219 cycles:P: ffffffff9c797b2e audit_filter_syscall+0x9e ([kernel.kallsyms])
NetworkManager 3452 [005] 1540823.805301: 1558782 cycles:P: 7fa957737748 g_bit_lock+0x58 (/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.7400.5>
After:
$ perf script
...
swapper 0 [006] 1540823.803935: 1369324 cycles:P: ffffffff9c755588 ktime_get+0x18 ([kernel.kallsyms])
gvfsd-dnssd 95114 [004] 1540823.804164: 1643871 cycles:P: ffffffff9cfdca5c __get_user_8+0x1c ([kernel.kallsyms])
perf-exec 1558582 [000] 1540823.804209: 1018714 cycles:P: ffffffff9c924ab9 __slab_free+0x9 ([kernel.kallsyms])
nmcli 1558589 [007] 1540823.804384: 1859212 cycles:P: 7f70537a8ad8 __strchrnul_evex+0x18 (/usr/lib/x86_64-linux-gnu/libc.so.6>
sleep 1558582 [000] 1540823.804456: 987425 cycles:P: 7fd35bb27b30 _dl_init+0x0 (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2>
dbus-daemon 3043 [003] 1540823.804575: 1564465 cycles:P: ffffffff9cb2bb70 llist_add_batch+0x0 ([kernel.kallsyms])
gdbus 1558592 [001] 1540823.804766: 1315219 cycles:P: ffffffff9c797b2e audit_filter_syscall+0x9e ([kernel.kallsyms])
NetworkManager 3452 [005] 1540823.805301: 1558782 cycles:P: 7fa957737748 g_bit_lock+0x58 (/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.7400.5>
Reviewer notes:
Adrian added:
"Might be worth noting that currently the biggest PID_MAX_LIMIT is 2^22
so pids don't get bigger than 7 digits presently"
$ echo $((2 ** 22))
4194304
$ echo -n $((2 ** 22)) | wc -c
7
$
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Adrian Hunter <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Don't just check the raw PMU type, the only core PMU on homogeneous x86,
check raw and all dynamically added PMUs. Extend the
perf_pmu__warn_invalid_config to check all 4 config values.
Rather than process the format list once per event, store the computed
masks for each config value.
Don't ignore the mask being zero, which is likely for config2 and
config3, add config_masks_present so config values can be ignored only
when no format information is present.
Signed-off-by: Ian Rogers <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Suzuki Poulouse <[email protected]>
Cc: Xing Zhengjun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Avoid scanning format list for each event parsed.
Signed-off-by: Ian Rogers <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Suzuki Poulouse <[email protected]>
Cc: Xing Zhengjun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Add additional test cases to `fib_lookup.c` prog_test.
These test cases add a new /24 network to the previously unused veth2
device, removes the directly connected route from the main routing table
and moves it to table 100.
The first test case then confirms a fib lookup for a remote address in
this directly connected network, using the main routing table fails.
The second test case ensures the same fib lookup using table 100 succeeds.
An additional pair of tests which function in the same manner are added
for IPv6.
Signed-off-by: Louis DeLosSantos <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Add ability to specify routing table ID to the `bpf_fib_lookup` BPF
helper.
A new field `tbid` is added to `struct bpf_fib_lookup` used as
parameters to the `bpf_fib_lookup` BPF helper.
When the helper is called with the `BPF_FIB_LOOKUP_DIRECT` and
`BPF_FIB_LOOKUP_TBID` flags the `tbid` field in `struct bpf_fib_lookup`
will be used as the table ID for the fib lookup.
If the `tbid` does not exist the fib lookup will fail with
`BPF_FIB_LKUP_RET_NOT_FWDED`.
The `tbid` field becomes a union over the vlan related output fields
in `struct bpf_fib_lookup` and will be zeroed immediately after usage.
This functionality is useful in containerized environments.
For instance, if a CNI wants to dictate the next-hop for traffic leaving
a container it can create a container-specific routing table and perform
a fib lookup against this table in a "host-net-namespace-side" TC program.
This functionality also allows `ip rule` like functionality at the TC
layer, allowing an eBPF program to pick a routing table based on some
aspect of the sk_buff.
As a concrete use case, this feature will be used in Cilium's SRv6 L3VPN
datapath.
When egress traffic leaves a Pod an eBPF program attached by Cilium will
determine which VRF the egress traffic should target, and then perform a
FIB lookup in a specific table representing this VRF's FIB.
Signed-off-by: Louis DeLosSantos <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
With PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events opening on
multiple PMUs, the test expectations need updating to test for
multiple events. TODOs are added to document existing hybrid perf
bugs.
Tested on hybrid alderlake and non-hybrid tigerlake.
Signed-off-by: Ian Rogers <[email protected]>
Tested-by: Kan Liang <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Thomas Richter <[email protected]>
Cc: Xing Zhengjun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Numeric events are either raw events or those with ABI defined numbers
matched by the lexer. PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events
should wildcard match on hybrid systems. So "cycles" should match each
PMU type with an extended type, not just PERF_TYPE_HARDWARE.
Change wildcard matching to add the event even if wildcard PMU
scanning fails, there will be no extended type but this best matches
previous behavior.
Only set the extended type when the event type supports it and when
perf_pmus__supports_extended_type is true. This new function returns
true if >1 core PMU and avoids potential errors on older kernels.
Modify evsel__compute_group_pmu_name using a helper
perf_pmu__is_software to determine when grouping should occur. Try to
use PMUs, and evsel__find_pmu, as being more dependable than
evsel->pmu_name.
Set a parse events error if a hardware term's PMU lookup fails, to
provide extra diagnostics.
Fixes: 8bc75f699c141420 ("perf parse-events: Support wildcards on raw events")
Reported-by: Kan Liang <[email protected]>
Signed-off-by: Ian Rogers <[email protected]>
Tested-by: Kan Liang <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Thomas Richter <[email protected]>
Cc: Xing Zhengjun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|