author		Jakub Kicinski <[email protected]>	2024-10-14 17:55:08 -0700
committer	Jakub Kicinski <[email protected]>	2024-10-14 17:55:09 -0700
commit		5bedbfc16552b2284863c7e06bc0846554743152 (patch)
tree		39da64914b8b90dcc73590ab46ab2a27b9fb790c /include/linux
parent		ff1585e971392e70ba47433413ad28d7417debb8 (diff)
parent		c9191eaa728510dab391d384359b8dbf14c25c9e (diff)
Merge branch 'add-support-for-per-napi-config-via-netlink'
Joe Damato says:
====================
Add support for per-NAPI config via netlink
Greetings:
Welcome to v6. Minor changes from v5 [1]; please see the changelog below.
There were no explicit comments from reviewers on the call outs in my
v5, so I'm retaining them from my previous cover letter just in case :)
A few important call outs for reviewers:
1. This revision seems to work (see below for a full walkthrough). I
think this is the behavior we talked about, but please let me know if
a use case is missing.
2. Re a previous point made by Stanislav regarding "taking over a NAPI
ID" when the channel count changes: mlx5 seems to call napi_disable
followed by netif_napi_del for the old queues and then calls
napi_enable for the new ones. In this RFC, the NAPI ID generation is
deferred to napi_enable. This means we won't end up with two of the
same NAPI IDs added to the hash at the same time (see the sketch
after this list of call outs).
Can we assume all drivers will napi_disable the old queues before
napi_enable the new ones?
- If yes: we might not need to worry about a NAPI ID takeover
function.
- If no: I'll need to make a change so that the NAPI ID generation
is deferred only for drivers which have opted into the config
space via calls to netif_napi_add_config
3. I removed the WARN_ON_ONCE that (I think?) Jakub previously
suggested in alloc_netdev_mqs (WARN_ON_ONCE(txqs != rxqs);),
because it was triggering on every kernel boot with my mlx5 NIC.
4. I left the "maxqs = max(txqs, rxqs);" in alloc_netdev_mqs despite
finding it a bit strange that we might end up short some number of
NAPI configs; most people seem to be in favor of this approach, so
I've kept it.
I'd appreciate thoughts from reviewers on the above items, if at all
possible.
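To make the ordering question in call out 2 concrete, here is a rough
sketch of the channel-count-change sequence as I understand it. The
old_queues/new_queues/poll_fn driver structures are hypothetical; only
the napi_*/netif_napi_* calls are real kernel symbols:

    /* Hypothetical reconfiguration path: every old NAPI is disabled
     * (removing its ID from the hash) and deleted before any new NAPI
     * is enabled, so deferring ID generation to napi_enable never
     * leaves two identical NAPI IDs in the hash at once.
     */
    for (i = 0; i < old_count; i++) {
            napi_disable(&old_queues[i].napi);   /* unhashes the NAPI ID */
            netif_napi_del(&old_queues[i].napi);
    }
    for (i = 0; i < new_count; i++) {
            netif_napi_add_config(dev, &new_queues[i].napi, poll_fn, i);
            napi_enable(&new_queues[i].napi);    /* generates or restores ID */
    }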
Now, on to the implementation.
Firstly, this implementation moves certain settings to napi_struct so that
they are "per-NAPI", while taking care to respect the existing sysfs
parameters, which are interface-wide and affect all NAPIs:
- NAPI ID
- gro_flush_timeout
- defer_hard_irqs
Furthermore:
- NAPI ID generation and addition to the hash is now deferred to
napi_enable, instead of during netif_napi_add
- NAPIs are removed from the hash during napi_disable, instead of
netif_napi_del.
- An array of "struct napi_config" is allocated in net_device.
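For the last bullet above: the array is sized in alloc_netdev_mqs using
the maxqs expression discussed in call out 4. A minimal sketch of the
assumed allocation (kvcalloc and the exact flags are my assumption, not
necessarily the literal patch):

    /* Sketch: one napi_config slot per possible queue, sized to the
     * larger of the TX and RX queue counts (see call out 4).
     */
    maxqs = max(txqs, rxqs);
    dev->napi_config = kvcalloc(maxqs, sizeof(struct napi_config),
                                GFP_KERNEL_ACCOUNT);
    if (!dev->napi_config)
            goto free_all;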
IMPORTANT: The above changes affect all network drivers.
Optionally, drivers may opt-in to using their config space by calling
netif_napi_add_config instead of netif_napi_add.
If a driver does this, the NAPI being added is linked with an allocated
"struct napi_config" and the per-NAPI settings (including NAPI ID) are
persisted even as hardware queues are destroyed and recreated.
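For example, a driver's open path might opt in like this. This is a
minimal sketch assuming hypothetical my_priv/my_poll driver types; only
netif_napi_add_config, napi_enable, and netdev_priv are real kernel
symbols:

    static int my_open(struct net_device *dev)
    {
            struct my_priv *priv = netdev_priv(dev);
            int i;

            for (i = 0; i < priv->num_queues; i++) {
                    /* Bind this NAPI to dev->napi_config[i]; its NAPI
                     * ID, gro_flush_timeout and defer_hard_irqs then
                     * persist across queue destroy/recreate cycles.
                     */
                    netif_napi_add_config(dev, &priv->queues[i].napi,
                                          my_poll, i);
                    napi_enable(&priv->queues[i].napi);
            }
            return 0;
    }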
To help illustrate how this would end up working, I've added patches for
3 drivers, of which I have access to only 1:
- mlx5, which is the basis of the examples below
- mlx4, which has TX-only NAPIs, just to highlight that case. I have
only compile tested this patch; I don't have this hardware.
- bnxt, which I have only compile tested. I don't have this
hardware.
NOTE: I only tested this on mlx5; I have no access to the other hardware
for which I provided patches. Hopefully other folks can help test :)
Here's how it works when I test it on my mlx5 system:
$ ethtool -l eth4 | grep Combined | tail -1
Combined: 2
First, output the current NAPI settings:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 0,
'gro-flush-timeout': 0,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 0,
'gro-flush-timeout': 0,
'id': 344,
'ifindex': 7,
'irq': 327}]
Now, set the global sysfs parameters:
$ sudo bash -c 'echo 20000 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 100 >/sys/class/net/eth4/napi_defer_hard_irqs'
Output current NAPI settings again:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Now set specific values on NAPI ID 345, addressing it by its NAPI ID:
$ sudo ./tools/net/ynl/cli.py \
--spec Documentation/netlink/specs/netdev.yaml \
--do napi-set \
--json='{"id": 345,
"defer-hard-irqs": 111,
"gro-flush-timeout": 11111}'
None
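Kernel-side, the napi-set handler resolves the NAPI by its ID and
updates only that instance. A rough sketch of that logic (napi_by_id is
the existing lookup in net/core/dev.c; the handler shape and name here
are my assumption):

    /* Hypothetical napi-set handler sketch: look the NAPI up by ID and
     * update both the live napi_struct and its persistent config slot,
     * so the values survive queue destroy/recreate cycles.
     */
    static int napi_set_sketch(unsigned int napi_id,
                               unsigned long timeout, u32 defer)
    {
            struct napi_struct *napi;

            napi = napi_by_id(napi_id);
            if (!napi)
                    return -ENOENT;

            WRITE_ONCE(napi->gro_flush_timeout, timeout);
            WRITE_ONCE(napi->defer_hard_irqs, defer);
            if (napi->config) {
                    napi->config->gro_flush_timeout = timeout;
                    napi->config->defer_hard_irqs = defer;
            }
            return 0;
    }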
Now output current NAPI settings again to ensure only NAPI ID 345
changed:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
'gro-flush-timeout': 11111,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Now, increase gro-flush-timeout only:
$ sudo ./tools/net/ynl/cli.py \
--spec Documentation/netlink/specs/netdev.yaml \
--do napi-set --json='{"id": 345,
"gro-flush-timeout": 44444}'
None
Now output the current NAPI settings once more:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
'gro-flush-timeout': 44444,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Now set NAPI ID 345 to have gro_flush_timeout of 0:
$ sudo ./tools/net/ynl/cli.py \
--spec Documentation/netlink/specs/netdev.yaml \
--do napi-set --json='{"id": 345,
"gro-flush-timeout": 0}'
None
Check that NAPI ID 345 now has a gro-flush-timeout of 0:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
'gro-flush-timeout': 0,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Change the queue count, ensuring that NAPI ID 345 retains its settings:
$ sudo ethtool -L eth4 combined 4
Check that the new queues have the system wide settings but that NAPI ID
345 remains unchanged:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 347,
'ifindex': 7,
'irq': 529},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 346,
'ifindex': 7,
'irq': 528},
{'defer-hard-irqs': 111,
'gro-flush-timeout': 0,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Now reduce the queue count below where NAPI ID 345 is indexed:
$ sudo ethtool -L eth4 combined 1
Check the output:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Re-increase the queue count to ensure NAPI ID 345 is re-assigned the same
values:
$ sudo ethtool -L eth4 combined 2
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
'gro-flush-timeout': 0,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
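The settings come back because they live in dev->napi_config[index],
not in the destroyed-and-recreated napi_struct. A sketch of the assumed
ID handling at napi_enable time (napi_hash_add is the existing hash
insert in dev.c; napi_hash_add_with_id is an assumed helper that hashes
a NAPI under a previously saved ID):

    /* Sketch: on enable, reuse the NAPI ID previously saved in the
     * config slot if there is one; otherwise generate a fresh ID and
     * save it for the next destroy/recreate cycle.
     */
    if (napi->config && napi->config->napi_id)
            napi_hash_add_with_id(napi, napi->config->napi_id);
    else
            napi_hash_add(napi);            /* allocates a new NAPI ID */

    if (napi->config)
            napi->config->napi_id = napi->napi_id;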
Create new queues to ensure the sysfs globals are used for the new NAPIs
but that NAPI ID 345 is unchanged:
$ sudo ethtool -L eth4 combined 8
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[...]
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 346,
'ifindex': 7,
'irq': 528},
{'defer-hard-irqs': 111,
'gro-flush-timeout': 0,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Last, but not least, let's try writing the sysfs parameters to ensure
all NAPIs are rewritten:
$ sudo bash -c 'echo 33333 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 222 >/sys/class/net/eth4/napi_defer_hard_irqs'
Check that worked:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[...]
{'defer-hard-irqs': 222,
'gro-flush-timeout': 33333,
'id': 346,
'ifindex': 7,
'irq': 528},
{'defer-hard-irqs': 222,
'gro-flush-timeout': 33333,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 222,
'gro-flush-timeout': 33333,
'id': 344,
'ifindex': 7,
'irq': 327}]
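This last step shows why the sysfs knobs stay interface wide: a write
fans the value out to every NAPI, overwriting anything set per-NAPI
earlier via netlink. A minimal sketch of those assumed semantics (the
function name is illustrative, not the literal patch):

    /* Sketch of the assumed sysfs-write semantics: store the new
     * device-wide value, then propagate it to every NAPI instance.
     */
    static void dev_set_gro_flush_timeout(struct net_device *dev,
                                          unsigned long val)
    {
            struct napi_struct *napi;

            WRITE_ONCE(dev->gro_flush_timeout, val);
            list_for_each_entry(napi, &dev->napi_list, dev_list)
                    WRITE_ONCE(napi->gro_flush_timeout, val);
    }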
[1]: https://lore.kernel.org/[email protected]
v5: https://lore.kernel.org/[email protected]
rfcv4: https://lore.kernel.org/lkml/[email protected]
rfcv3: https://lore.kernel.org/[email protected]
rfcv2: https://lore.kernel.org/[email protected]
====================
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Diffstat (limited to 'include/linux')
-rw-r--r--	include/linux/netdevice.h	42
1 file changed, 37 insertions(+), 5 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e6b93d01e631..8feaca12655e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -343,6 +343,15 @@ struct gro_list {
 #define GRO_HASH_BUCKETS	8

 /*
+ * Structure for per-NAPI config
+ */
+struct napi_config {
+	u64 gro_flush_timeout;
+	u32 defer_hard_irqs;
+	unsigned int napi_id;
+};
+
+/*
  * Structure for NAPI scheduling similar to tasklet but with weighting
  */
 struct napi_struct {
@@ -373,10 +382,14 @@ struct napi_struct {
 	unsigned int		napi_id;
 	struct hrtimer		timer;
 	struct task_struct	*thread;
+	unsigned long		gro_flush_timeout;
+	u32			defer_hard_irqs;
 	/* control-path-only fields follow */
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	int			irq;
+	int			index;
+	struct napi_config	*config;
 };

 enum {
@@ -1866,9 +1879,6 @@ enum netdev_reg_state {
 *			allocated at register_netdev() time
 *	@real_num_rx_queues:	Number of RX queues currently active in device
 *	@xdp_prog:		XDP sockets filter program pointer
- *	@gro_flush_timeout:	timeout for GRO layer in NAPI
- *	@napi_defer_hard_irqs:	If not zero, provides a counter that would
- *				allow to avoid NIC hard IRQ, on busy queues.
 *
 *	@rx_handler:		handler for received packets
 *	@rx_handler_data:	XXX: need comments on this one
@@ -2018,6 +2028,11 @@ enum netdev_reg_state {
 *		where the clock is recovered.
 *
 *	@max_pacing_offload_horizon:	max EDT offload horizon in nsec.
+ *	@napi_config:	An array of napi_config structures containing per-NAPI
+ *			settings.
+ *	@gro_flush_timeout:	timeout for GRO layer in NAPI
+ *	@napi_defer_hard_irqs:	If not zero, provides a counter that would
+ *				allow to avoid NIC hard IRQ, on busy queues.
 *
 *	FIXME: cleanup struct net_device such that network protocol info
 *	moves out.
@@ -2084,8 +2099,6 @@ struct net_device {
 	int			ifindex;
 	unsigned int		real_num_rx_queues;
 	struct netdev_rx_queue	*_rx;
-	unsigned long		gro_flush_timeout;
-	u32			napi_defer_hard_irqs;
 	unsigned int		gro_max_size;
 	unsigned int		gro_ipv4_max_size;
 	rx_handler_func_t __rcu	*rx_handler;
@@ -2413,6 +2426,9 @@ struct net_device {
 	struct dim_irq_moder	*irq_moder;

 	u64			max_pacing_offload_horizon;
+	struct napi_config	*napi_config;
+	unsigned long		gro_flush_timeout;
+	u32			napi_defer_hard_irqs;

 	/**
 	 * @lock: protects @net_shaper_hierarchy, feel free to use for other
@@ -2677,6 +2693,22 @@ netif_napi_add_tx_weight(struct net_device *dev,
 }

 /**
+ * netif_napi_add_config - initialize a NAPI context with persistent config
+ * @dev: network device
+ * @napi: NAPI context
+ * @poll: polling function
+ * @index: the NAPI index
+ */
+static inline void
+netif_napi_add_config(struct net_device *dev, struct napi_struct *napi,
+		      int (*poll)(struct napi_struct *, int), int index)
+{
+	napi->index = index;
+	napi->config = &dev->napi_config[index];
+	netif_napi_add_weight(dev, napi, poll, NAPI_POLL_WEIGHT);
+}
+
+/**
 * netif_napi_add_tx() - initialize a NAPI context to be used for Tx only
 * @dev: network device
 * @napi: NAPI context