Age | Commit message (Collapse) | Author | Files | Lines |
|
Implement driver support for the HW feature that allows RX steering of
one device to target other device's RQs.
In SD multi-pf netdev mode, we set the secondaries into silent mode,
disconnecting them from the network. This feature is then used to steer
traffic from the primary to the secondaries.
Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Gal Pressman <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Distribute the channels between the different SD-devices to acheive
local numa node performance on multiple numas.
Each channel works against one specific mdev, creating all datapath
queues against it.
We distribute channels to mdevs in a round-robin policy.
Example for 2 mdevs and 6 channels:
+-------+---------+
| ch ix | mdev ix |
+-------+---------+
| 0 | 0 |
| 1 | 1 |
| 2 | 0 |
| 3 | 1 |
| 4 | 0 |
| 5 | 1 |
+-------+---------+
This round-robin distribution policy is preferred over another suggested
intuitive distribution, in which we first distribute one half of the
channels to mdev #0 and then the second half to mdev #1.
We prefer round-robin for a reason: it is less influenced by changes in
the number of channels. The mapping between channel index and mdev is
fixed, no matter how many channels the user configures. As the channel
stats are persistent to channels closure, changing the mapping every
single time would turn the accumulative stats less representing of the
channel's history.
Per-channel objects should stop using the primary mdev (priv->mdev)
directly, and instead move to using their own channel's mdev.
Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Gal Pressman <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Traffic queues will be created on all devices, including the
secondaries. Create the needed core layer resources for them as well.
Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Gal Pressman <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Integrate the SD library calls into the auxiliary_driver ops in
preparation for creating a single netdev for the multiple PFs belonging
to the same SD group.
SD is still disabled at this stage. It is enabled by a downstream patch
when all needed parts are implemented.
The netdev is created whenever the SD group, with all its participants,
are ready. It is later destroyed whenever any of the participating PFs
drops.
Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Gal Pressman <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Add debugfs entries that describe the Socket-Direct group.
Example:
$ grep -H . /sys/kernel/debug/mlx5/0000\:08\:00.0/multi-pf/*
/sys/kernel/debug/mlx5/0000:08:00.0/multi-pf/group_id:0x00000101
/sys/kernel/debug/mlx5/0000:08:00.0/multi-pf/primary:0000:08:00.0 vhca 0x0
/sys/kernel/debug/mlx5/0000:08:00.0/multi-pf/secondary_0:0000:09:00.0 vhca 0x2
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Print to kernel log when an SD group moves from/to ready state.
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Implement the needed SD steering adjustments for the primary and
secondaries.
While the SD multiple PFs are used to avoid cross-numa memory, when it
comes to chip level all traffic goes only through the primary device.
The secondaries are forced to silent mode, to guarantee they are not
involved in any unexpected ingress/egress traffic.
In RX, secondary devices will not have steering objects. Traffic will be
steered from the primary device to the RQs of a secondary device using
advanced cross-vhca RX steering capabilities.
In TX, the primary creates a new TX flow table, which is aliased by the
secondaries.
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Use devcom to communicate between the different devices. Add a new
devcom component type for this.
Each device registers itself to the devcom component <SD, group ID>.
Once all devices of a component are registered, the component becomes
ready, and a primary device is elected.
In principle, any of the devices can act as a primary, they are all
capable, and a random election would've worked. However, we aim to
achieve predictability and consistency, hence each group always choses
the same device, with the lowest PCI BUS number, as primary.
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Add implementation for querying the MPIR register for Socket-Direct
attributes, and instantiating a SD struct accordingly.
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Add Socket-Direct API with empty/minimal implementation.
We fill-in the implementation gradually in downstream patches.
Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Gal Pressman <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Add a cap bit in mcam_access_reg to check for MPIR support.
Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Gal Pressman <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2024-03-06
1) Clear the ECN bits flowi4_tos in decode_session4().
This was already fixed but the bug was reintroduced
when decode_session4() switched to us the flow dissector.
From Guillaume Nault.
2) Fix UDP encapsulation in the TX path with packet offload mode.
From Leon Romanovsky,
3) Avoid clang fortify warning in copy_to_user_tmpl().
From Nathan Chancellor.
4) Fix inter address family tunnel in packet offload mode.
From Mike Yu.
* tag 'ipsec-2024-03-06' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: set skb control buffer based on packet offload as well
xfrm: fix xfrm child route lookup for packet offload
xfrm: Avoid clang fortify warning in copy_to_user_tmpl()
xfrm: Pass UDP encapsulation in TX packet offload
xfrm: Clear low order bits of ->flowi4_tos in decode_session4().
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
After 292fac464b01 ("net: ethtool: eee: Remove legacy _u32 from keee")
this function has no user any longer.
Signed-off-by: Heiner Kallweit <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
This patch updates the mlxbf_gige driver to support the
"get_pause_stats()" callback, which enables display of
pause frame counters via "ethtool -I -a oob_net0".
The pause frame counters are only enabled if the "counters_en"
bit is asserted in the LLU general config register. The driver
will only report stats, and thus overwrite the default stats
state of ETHTOOL_STAT_NOT_SET, if "counters_en" is asserted.
Reviewed-by: Asmaa Mnebhi <[email protected]>
Signed-off-by: David Thompson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Kernel bot has discovered that if CONFIG_GPIOLIB is not set compilation
will fail.
Upon investigation the issue is that qca807x_gpio() is guarded by a
preprocessor check but then it is called under
if (IS_ENABLED(CONFIG_GPIOLIB)) in the probe call so the compiler will
error out since qca807x_gpio() has not been declared if CONFIG_GPIOLIB has
not been set.
Fixes: d1cb613efbd3 ("net: phy: qcom: add support for QCA807x PHY Family")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Robert Marko <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Tested-by: Simon Horman <[email protected]> # build-tested
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.
Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.
Signed-off-by: Breno Leitao <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.
With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.
Remove the allocation in the geneve driver and leverage the network
core allocation instead.
Signed-off-by: Breno Leitao <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Assign netdev to gtp->dev at setup time, so, we can get rid of
gtp_dev_init() completely.
Signed-off-by: Breno Leitao <[email protected]>
Acked-by: Pablo Neira Ayuso <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.
Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.
Signed-off-by: Breno Leitao <[email protected]>
Acked-by: Pablo Neira Ayuso <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.
With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.
Remove the allocation in the gtp driver and leverage the network
core allocation instead.
Signed-off-by: Breno Leitao <[email protected]>
Acked-by: Pablo Neira Ayuso <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.
With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.
Remove the allocation in the macsec driver and leverage the network
core allocation instead.
Signed-off-by: Breno Leitao <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: Sabrina Dubroca <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Document support for the Renesas Ethernet AVB (EtherAVB-IF) block in the
Renesas R-Car V4M (R8A779H0) SoC.
Signed-off-by: Thanh Quan <[email protected]>
Signed-off-by: Geert Uytterhoeven <[email protected]>
Reviewed-by: Niklas Söderlund <[email protected]>
Acked-by: Rob Herring <[email protected]>
Reviewed-by: Sergey Shtylyov <[email protected]>
Link: https://lore.kernel.org/r/0212b57ba1005bb9b5a922f8f25cc67a7bc15f30.1709631152.git.geert+renesas@glider.be
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Add check for usbnet_get_endpoints() and return the error if it fails
in order to transfer the error.
Signed-off-by: Chen Ni <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Fixes: 19a38d8e0aa3 ("USB2NET : SR9800 : One chip USB2.0 USB2NET SR9800 Device Driver Support")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Always run fixture setup in the grandchild process, and by default also
run the teardown in the same process. However, this change makes it
possible to run the teardown in a parent process when
_metadata->teardown_parent is set to true (e.g. in fixture setup).
Fix TEST_SIGNAL() by forwarding grandchild's signal to its parent. Fix
seccomp tests by running the test setup in the parent of the test
thread, as expected by the related test code. Fix Landlock tests by
waiting for the grandchild before processing _metadata.
Use of exit(3) in tests should be OK because the environment in which
the vfork(2) call happen is already dedicated to the running test (with
flushed stdio, setpgrp() call), see __run_test() and the call to fork(2)
just before running the setup/test/teardown. Even if the test
configures its own exit handlers, they will not be run by the parent
because it never calls exit(3), and the test function either ends with a
call to _exit(2) or a signal.
Cc: Günther Noack <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Will Drewry <[email protected]>
Fixes: 0710a1a73fb4 ("selftests/harness: Merge TEST_F_FORK() into TEST_F()")
Reviewed-by: Kees Cook <[email protected]>
Tested-by: Kees Cook <[email protected]>
Signed-off-by: Mickaël Salaün <[email protected]>
Reported-by: Mark Brown <[email protected]>
Tested-by: Mark Brown <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Matthieu Baerts says:
====================
mptcp: some clean-up patches
Here are some clean-up patches for MPTCP:
- Patch 1 drops duplicated header inclusions.
- Patch 2 updates PM 'set_flags' interface, to make it more similar to
others.
- Patch 3 adds some error messages for the PM 'set_flags' command to
help the userspace understanding what's wrong in case of error.
- Patch 4 simplifies __lookup_addr() function from pm_netlink.c.
Except for the 3rd patch, the behaviour is not supposed to be modified.
====================
Link: https://lore.kernel.org/r/20240305-upstream-net-next-20240304-mptcp-misc-cleanup-v1-0-c436ba5e569b@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
When the lookup_by_id parameter of __lookup_addr() is true, it's the same
as __lookup_addr_by_id(), it can be replaced by __lookup_addr_by_id()
directly. So drop this parameter, let __lookup_addr() only looks up address
on the local address list by comparing addresses in it, not address ids.
Signed-off-by: Geliang Tang <[email protected]>
Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://lore.kernel.org/r/20240305-upstream-net-next-20240304-mptcp-misc-cleanup-v1-4-c436ba5e569b@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
In addition to returning the error value, this patch also sets an error
messages with GENL_SET_ERR_MSG or NL_SET_ERR_MSG_ATTR both for pm_netlink.c
and pm_userspace.c. It will help the userspace to identify the issue.
Signed-off-by: Geliang Tang <[email protected]>
Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://lore.kernel.org/r/20240305-upstream-net-next-20240304-mptcp-misc-cleanup-v1-3-c436ba5e569b@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
This patch updates set_flags interfaces, make it more similar to the
interfaces of dump_addr and get_addr:
mptcp_pm_set_flags(struct sk_buff *skb, struct genl_info *info)
mptcp_pm_nl_set_flags(struct sk_buff *skb, struct genl_info *info)
mptcp_userspace_pm_set_flags(struct sk_buff *skb, struct genl_info *info)
Signed-off-by: Geliang Tang <[email protected]>
Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://lore.kernel.org/r/20240305-upstream-net-next-20240304-mptcp-misc-cleanup-v1-2-c436ba5e569b@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
The headers net/tcp.h, net/genetlink.h and uapi/linux/mptcp.h are included
in protocol.h already, no need to include them again directly. This patch
removes these duplicate header inclusions.
Signed-off-by: Geliang Tang <[email protected]>
Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://lore.kernel.org/r/20240305-upstream-net-next-20240304-mptcp-misc-cleanup-v1-1-c436ba5e569b@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2024-03-06
We've added 5 non-merge commits during the last 1 day(s) which contain
a total of 5 files changed, 77 insertions(+), 4 deletions(-).
The main changes are:
1) Fix BPF verifier to check bpf_func_state->callback_depth when pruning
states as otherwise unsafe programs could get accepted,
from Eduard Zingerman.
2) Fix to zero-initialise xdp_rxq_info struct before running XDP program in
CPU map which led to random xdp_md fields, from Toke Høiland-Jørgensen.
3) Fix bonding XDP feature flags calculation when bonding device has no
slave devices anymore, from Daniel Borkmann.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
cpumap: Zero-initialise xdp_rxq_info struct before running XDP program
selftests/bpf: Fix up xdp bonding test wrt feature flags
xdp, bonding: Fix feature flags when there are no slave devs anymore
selftests/bpf: test case for callback_depth states pruning logic
bpf: check bpf_func_state->callback_depth when pruning states
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
UBSAN load reports an exception of BRK#5515 SHIFT_ISSUE:Bitwise shifts
that are out of bounds for their data type.
vmlinux get_bitmap(b=75) + 712
<net/netfilter/nf_conntrack_h323_asn1.c:0>
vmlinux decode_seq(bs=0xFFFFFFD008037000, f=0xFFFFFFD008037018, level=134443100) + 1956
<net/netfilter/nf_conntrack_h323_asn1.c:592>
vmlinux decode_choice(base=0xFFFFFFD0080370F0, level=23843636) + 1216
<net/netfilter/nf_conntrack_h323_asn1.c:814>
vmlinux decode_seq(f=0xFFFFFFD0080371A8, level=134443500) + 812
<net/netfilter/nf_conntrack_h323_asn1.c:576>
vmlinux decode_choice(base=0xFFFFFFD008037280, level=0) + 1216
<net/netfilter/nf_conntrack_h323_asn1.c:814>
vmlinux DecodeRasMessage() + 304
<net/netfilter/nf_conntrack_h323_asn1.c:833>
vmlinux ras_help() + 684
<net/netfilter/nf_conntrack_h323_main.c:1728>
vmlinux nf_confirm() + 188
<net/netfilter/nf_conntrack_proto.c:137>
Due to abnormal data in skb->data, the extension bitmap length
exceeds 32 when decoding ras message then uses the length to make
a shift operation. It will change into negative after several loop.
UBSAN load could detect a negative shift as an undefined behaviour
and reports exception.
So we add the protection to avoid the length exceeding 32. Or else
it will return out of range error and stop decoding.
Fixes: 5e35941d9901 ("[NETFILTER]: Add H.323 conntrack/NAT helper")
Signed-off-by: Lena Wang <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
While the rhashtable set gc runs asynchronously, a race allows it to
collect elements from anonymous sets with timeouts while it is being
released from the commit path.
Mingi Cho originally reported this issue in a different path in 6.1.x
with a pipapo set with low timeouts which is not possible upstream since
7395dfacfff6 ("netfilter: nf_tables: use timestamp to check for set
element timeout").
Fix this by setting on the dead flag for anonymous sets to skip async gc
in this case.
According to 08e4c8c5919f ("netfilter: nf_tables: mark newset as dead on
transaction abort"), Florian plans to accelerate abort path by releasing
objects via workqueue, therefore, this sets on the dead flag for abort
path too.
Cc: [email protected]
Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Reported-by: Mingi Cho <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
The riscv BPF JIT doesn't emit proper kCFI prologues for BPF programs
and struct_ops trampolines when CONFIG_CFI_CLANG is enabled.
This causes CFI failures when calling BPF programs and can even crash
the kernel due to invalid memory accesses.
Example crash:
root@rv-selftester:~/bpf# ./test_progs -a dummy_st_ops
Unable to handle kernel paging request at virtual address ffffffff78204ffc
Oops [#1]
Modules linked in: bpf_testmod(OE) [....]
CPU: 3 PID: 356 Comm: test_progs Tainted: P OE 6.8.0-rc1 #1
Hardware name: riscv-virtio,qemu (DT)
epc : bpf_struct_ops_test_run+0x28c/0x5fc
ra : bpf_struct_ops_test_run+0x26c/0x5fc
epc : ffffffff82958010 ra : ffffffff82957ff0 sp : ff200000007abc80
gp : ffffffff868d6218 tp : ff6000008d87b840 t0 : 000000000000000f
t1 : 0000000000000000 t2 : 000000002005793e s0 : ff200000007abcf0
s1 : ff6000008a90fee0 a0 : 0000000000000000 a1 : 0000000000000000
a2 : 0000000000000000 a3 : 0000000000000000 a4 : 0000000000000000
a5 : ffffffff868dba26 a6 : 0000000000000001 a7 : 0000000052464e43
s2 : 00007ffffc0a95f0 s3 : ff6000008a90fe80 s4 : ff60000084c24c00
s5 : ffffffff78205000 s6 : ff60000088750648 s7 : ff20000000035008
s8 : fffffffffffffff4 s9 : ffffffff86200610 s10: 0000000000000000
s11: 0000000000000000 t3 : ffffffff8483dc30 t4 : ffffffff8483dc10
t5 : ffffffff8483dbf0 t6 : ffffffff8483dbd0
status: 0000000200000120 badaddr: ffffffff78204ffc cause: 000000000000000d
[<ffffffff82958010>] bpf_struct_ops_test_run+0x28c/0x5fc
[<ffffffff805083ee>] bpf_prog_test_run+0x170/0x548
[<ffffffff805029c8>] __sys_bpf+0x2d2/0x378
[<ffffffff804ff570>] __riscv_sys_bpf+0x5c/0x120
[<ffffffff8000e8fe>] syscall_handler+0x62/0xe4
[<ffffffff83362df6>] do_trap_ecall_u+0xc6/0x27c
[<ffffffff833822c4>] ret_from_exception+0x0/0x64
Code: b603 0109 b683 0189 b703 0209 8493 0609 157d 8d65 (a303) ffca
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Fatal exception
SMP: stopping secondary CPUs
Implement proper kCFI prologues for the BPF programs and callbacks and
drop __nocfi for riscv64. Fix the trampoline generation code to emit kCFI
prologue when a struct_ops trampoline is being prepared.
Signed-off-by: Puranjay Mohan <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Björn Töpel <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Eduard Zingerman says:
====================
libbpf: type suffixes and autocreate flag for struct_ops maps
Tweak struct_ops related APIs to allow the following features:
- specify version suffixes for stuct_ops map types;
- share same BPF program between several map definitions with
different local BTF types, assuming only maps with same
kernel BTF type would be selected for load;
- toggle autocreate flag for struct_ops maps;
- automatically toggle autoload for struct_ops programs referenced
from struct_ops maps, depending on autocreate status of the
corresponding map;
- use SEC("?.struct_ops") and SEC("?.struct_ops.link")
to define struct_ops maps with autocreate == false after object open.
This would allow loading programs like below:
SEC("struct_ops/foo") int BPF_PROG(foo) { ... }
SEC("struct_ops/bar") int BPF_PROG(bar) { ... }
struct bpf_testmod_ops___v1 {
int (*foo)(void);
};
struct bpf_testmod_ops___v2 {
int (*foo)(void);
int (*bar)(void);
};
/* Assume kernel type name to be 'test_ops' */
SEC(".struct_ops.link")
struct test_ops___v1 map_v1 = {
/* Program 'foo' shared by maps with
* different local BTF type
*/
.foo = (void *)foo
};
SEC(".struct_ops.link")
struct test_ops___v2 map_v2 = {
.foo = (void *)foo,
.bar = (void *)bar
};
Assuming the following tweaks are done before loading:
/* to load v1 */
bpf_map__set_autocreate(skel->maps.map_v1, true);
bpf_map__set_autocreate(skel->maps.map_v2, false);
/* to load v2 */
bpf_map__set_autocreate(skel->maps.map_v1, false);
bpf_map__set_autocreate(skel->maps.map_v2, true);
Patch #8 ties autocreate and autoload flags for struct_ops maps and
programs.
Changelog:
- v3 [3] -> v4:
- changes for multiple styling suggestions from Andrii;
- patch #5: libbpf log capture now happens for LIBBPF_INFO and
LIBBPF_WARN messages and does not depend on verbosity flags
(Andrii);
- patch #6: fixed runtime crash caused by conflict with newly added
test case struct_ops_multi_pages;
- patch #7: fixed free of possibly uninitialized pointer (Daniel)
- patch #8: simpler algorithm to detect which programs to autoload
(Andrii);
- patch #9: added assertions for autoload flag after object load
(Andrii);
- patch #12: DATASEC name rewrite in libbpf is now done inplace, no
new strings added to BTF (Andrii);
- patch #14: allow any printable characters in DATASEC names when
kernel validates BTF (Andrii)
- v2 [2] -> v3:
- moved patch #8 logic to be fully done on load
(requested by Andrii in offlist discussion);
- in patch #9 added test case for shadow vars and
autocreate/autoload interaction.
- v1 [1] -> v2:
- fixed memory leak in patch #1 (Kui-Feng);
- improved error messages in patch #2 (Martin, Andrii);
- in bad_struct_ops selftest from patch #6 added .test_2
map member setup (David);
- added utility functions to capture libbpf log from selftests (David)
- in selftests replaced usage of ...__open_and_load by separate
calls to ..._open() and ..._load() (Andrii);
- removed serial_... in selftest definitions (Andrii);
- improved comments in selftest struct_ops_autocreate
from patch #7 (David);
- removed autoload toggling logic incompatible with shadow variables
from bpf_map__set_autocreate(), instead struct_ops programs
autoload property is computed at struct_ops maps load phase,
see patch #8 (Kui-Feng, Martin, Andrii);
- added support for SEC("?.struct_ops") and SEC("?.struct_ops.link")
(Andrii).
[1] https://lore.kernel.org/bpf/[email protected]/
[2] https://lore.kernel.org/bpf/[email protected]/
[3] https://lore.kernel.org/bpf/[email protected]/
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>
|
|
Two test cases to verify that '?' and other printable characters are
allowed in BTF DATASEC names:
- DATASEC with name "?.foo bar:buz" should be accepted;
- type with name "?foo" should be rejected.
Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
The intent is to allow libbpf to use SEC("?.struct_ops") to identify
struct_ops maps that are optional, e.g. like in the following BPF code:
SEC("?.struct_ops")
struct test_ops optional_map = { ... };
Which yields the following BTF:
...
[13] DATASEC '?.struct_ops' size=0 vlen=...
...
To load such BTF libbpf rewrites DATASEC name before load.
After this patch the rewrite won't be necessary.
Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Check that "?.struct_ops" and "?.struct_ops.link" section names define
struct_ops maps with autocreate == false after open.
Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Optional struct_ops maps are defined using question mark at the start
of the section name, e.g.:
SEC("?.struct_ops")
struct test_ops optional_map = { ... };
This commit teaches libbpf to detect if kernel allows '?' prefix
in datasec names, and if it doesn't then to rewrite such names
by replacing '?' with '_', e.g.:
DATASEC ?.struct_ops -> DATASEC _.struct_ops
Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Allow using two new section names for struct_ops maps:
- SEC("?.struct_ops")
- SEC("?.struct_ops.link")
To specify maps that have bpf_map->autocreate == false after open.
Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
The next patch would add two new section names for struct_ops maps.
To make working with multiple struct_ops sections more convenient:
- remove fields like elf_state->st_ops_{shndx,link_shndx};
- mark section descriptions hosting struct_ops as
elf_sec_desc->sec_type == SEC_ST_OPS;
After these changes struct_ops sections could be processed uniformly
by iterating bpf_object->efile.secs entries.
Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Check that autocreate flags of struct_ops map cause autoload of
struct_ops corresponding programs:
- when struct_ops program is referenced only from a map for which
autocreate is set to false, that program should not be loaded;
- when struct_ops program with autoload == false is set to be used
from a map with autocreate == true using shadow var,
that program should be loaded;
- when struct_ops program is not referenced from any map object load
should fail.
Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Automatically select which struct_ops programs to load depending on
which struct_ops maps are selected for automatic creation.
E.g. for the BPF code below:
SEC("struct_ops/test_1") int BPF_PROG(foo) { ... }
SEC("struct_ops/test_2") int BPF_PROG(bar) { ... }
SEC(".struct_ops.link")
struct test_ops___v1 A = {
.foo = (void *)foo
};
SEC(".struct_ops.link")
struct test_ops___v2 B = {
.foo = (void *)foo,
.bar = (void *)bar,
};
And the following libbpf API calls:
bpf_map__set_autocreate(skel->maps.A, true);
bpf_map__set_autocreate(skel->maps.B, false);
The autoload would be enabled for program 'foo' and disabled for
program 'bar'.
During load, for each struct_ops program P, referenced from some
struct_ops map M:
- set P.autoload = true if M.autocreate is true for some M;
- set P.autoload = false if M.autocreate is false for all M;
- don't change P.autoload, if P is not referenced from any map.
Do this after bpf_object__init_kern_struct_ops_maps()
to make sure that shadow vars assignment is done.
Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Check that bpf_map__set_autocreate() can be used to disable automatic
creation for struct_ops maps.
Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
When loading struct_ops programs kernel requires BTF id of the
struct_ops type and member index for attachment point inside that
type. This makes impossible to use same BPF program in several
struct_ops maps that have different struct_ops type.
Check if libbpf rejects such BPF objects files.
Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Several test_progs tests already capture libbpf log in order to check
for some expected output, e.g bpf_tcp_ca.c, kfunc_dynptr_param.c,
log_buf.c and a few others.
This commit provides a, hopefully, simple API to capture libbpf log
w/o necessity to define new print callback in each test:
/* Creates a global memstream capturing INFO and WARN level output
* passed to libbpf_print_fn.
* Returns 0 on success, negative value on failure.
* On failure the description is printed using PRINT_FAIL and
* current test case is marked as fail.
*/
int start_libbpf_log_capture(void)
/* Destroys global memstream created by start_libbpf_log_capture().
* Returns a pointer to captured data which has to be freed.
* Returned buffer is null terminated.
*/
char *stop_libbpf_log_capture(void)
The intended usage is as follows:
if (start_libbpf_log_capture())
return;
use_libbpf();
char *log = stop_libbpf_log_capture();
ASSERT_HAS_SUBSTR(log, "... expected ...", "expected some message");
free(log);
As a safety measure, free(start_libbpf_log_capture()) is invoked in the
epilogue of the test_progs.c:run_one_test().
Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Extend struct_ops_module test case to check if it is possible to use
'___' suffixes for struct_ops type specification.
Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: David Vernet <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Skip load steps for struct_ops maps not marked for automatic creation.
This should allow to load bpf object in situations like below:
SEC("struct_ops/foo") int BPF_PROG(foo) { ... }
SEC("struct_ops/bar") int BPF_PROG(bar) { ... }
struct test_ops___v1 {
int (*foo)(void);
};
struct test_ops___v2 {
int (*foo)(void);
int (*does_not_exist)(void);
};
SEC(".struct_ops.link")
struct test_ops___v1 map_for_old = {
.test_1 = (void *)foo
};
SEC(".struct_ops.link")
struct test_ops___v2 map_for_new = {
.test_1 = (void *)foo,
.does_not_exist = (void *)bar
};
Suppose program is loaded on old kernel that does not have definition
for 'does_not_exist' struct_ops member. After this commit it would be
possible to load such object file after the following tweaks:
bpf_program__set_autoload(skel->progs.bar, false);
bpf_map__set_autocreate(skel->maps.map_for_new, false);
Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: David Vernet <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Enforce the following existing limitation on struct_ops programs based
on kernel BTF id instead of program-local BTF id:
struct_ops BPF prog can be re-used between multiple .struct_ops &
.struct_ops.link as long as it's the same struct_ops struct
definition and the same function pointer field
This allows reusing same BPF program for versioned struct_ops map
definitions, e.g.:
SEC("struct_ops/test")
int BPF_PROG(foo) { ... }
struct some_ops___v1 { int (*test)(void); };
struct some_ops___v2 { int (*test)(void); };
SEC(".struct_ops.link") struct some_ops___v1 a = { .test = foo }
SEC(".struct_ops.link") struct some_ops___v2 b = { .test = foo }
Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
E.g. allow the following struct_ops definitions:
struct bpf_testmod_ops___v1 { int (*test)(void); };
struct bpf_testmod_ops___v2 { int (*test)(void); };
SEC(".struct_ops.link")
struct bpf_testmod_ops___v1 a = { .test = ... }
SEC(".struct_ops.link")
struct bpf_testmod_ops___v2 b = { .test = ... }
Where both bpf_testmod_ops__v1 and bpf_testmod_ops__v2 would be
resolved as 'struct bpf_testmod_ops' from kernel BTF.
Signed-off-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: David Vernet <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Alexei Starovoitov says:
====================
bpf: Introduce may_goto and cond_break
From: Alexei Starovoitov <[email protected]>
v5 -> v6:
- Rename BPF_JMA to BPF_JCOND
- Addressed Andrii's review comments
v4 -> v5:
- rewrote patch 1 to avoid fake may_goto_reg and use 'u32 may_goto_cnt' instead.
This way may_goto handling is similar to bpf_loop() processing.
- fixed bug in patch 2 that RANGE_WITHIN should not use
rold->type == NOT_INIT as a safe signal.
- patch 3 fixed negative offset computation in cond_break macro
- using bpf_arena and cond_break recompiled lib/glob.c as bpf prog
and it works! It will be added as a selftest to arena series.
v3 -> v4:
- fix drained issue reported by John.
may_goto insn could be implemented with sticky state (once
reaches 0 it stays 0), but the verifier shouldn't assume that.
It has to explore both branches.
Arguably drained iterator state shouldn't be there at all.
bpf_iter_css_next() is not sticky. Can be fixed, but auditing all
iterators for stickiness. That's an orthogonal discussion.
- explained JMA name reasons in patch 1
- fixed test_progs-no_alu32 issue and added another test
v2 -> v3: Major change
- drop bpf_can_loop() kfunc and introduce may_goto instruction instead
kfunc is a function call while may_goto doesn't consume any registers
and LLVM can produce much better code due to less register pressure.
- instead of counting from zero to BPF_MAX_LOOPS start from it instead
and break out of the loop when count reaches zero
- use may_goto instruction in cond_break macro
- recognize that 'exact' state comparison doesn't need to be truly exact.
regsafe() should ignore precision and liveness marks, but range_within
logic is safe to use while evaluating open coded iterators.
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>
|