Age | Commit message (Collapse) | Author | Files | Lines |
|
Add netlink support for reading NH group hardware stats.
Stats collection is done through a new notifier,
NEXTHOP_EVENT_HW_STATS_REPORT_DELTA. Drivers that implement HW counters for
a given NH group are thereby asked to collect the stats and report back to
core by calling nh_grp_hw_stats_report_delta(). This is similar to what
netdevice L3 stats do.
Besides exposing number of packets that passed in the HW datapath, also
include information on whether any driver actually realizes the counters.
The core can tell based on whether it got any _report_delta() reports from
the drivers. This allows enabling the statistics at the group at any time,
with drivers opting into supporting them. This is also in line with what
netdevice L3 stats are doing.
So as not to waste time and space, tie the collection and reporting of HW
stats with a new op flag, NHA_OP_FLAG_DUMP_HW_STATS.
Co-developed-by: Petr Machata <[email protected]>
Signed-off-by: Petr Machata <[email protected]>
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Kees Cook <[email protected]> # For the __counted_by bits
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Add netlink support for enabling collection of HW statistics on nexthop
groups.
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: Petr Machata <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Add hw_stats field to several notifier structures to communicate to the
drivers that HW statistics should be configured for nexthops within a given
group.
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: Petr Machata <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Add netlink support for reading NH group stats.
This data is only for statistics of the traffic in the SW datapath. HW
nexthop group statistics will be added in the following patches.
Emission of the stats is keyed to a new op_stats flag to avoid cluttering
the netlink message with stats if the user doesn't need them:
NHA_OP_FLAG_DUMP_STATS.
Co-developed-by: Petr Machata <[email protected]>
Signed-off-by: Petr Machata <[email protected]>
Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Add nexthop group entry stats to count the number of packets forwarded
via each nexthop in the group. The stats will be exposed to user space
for better data path observability in the next patch.
The per-CPU stats pointer is placed at the beginning of 'struct
nh_grp_entry', so that all the fields accessed for the data path reside
on the same cache line:
struct nh_grp_entry {
struct nexthop * nh; /* 0 8 */
struct nh_grp_entry_stats * stats; /* 8 8 */
u8 weight; /* 16 1 */
/* XXX 7 bytes hole, try to pack */
union {
struct {
atomic_t upper_bound; /* 24 4 */
} hthr; /* 24 4 */
struct {
struct list_head uw_nh_entry; /* 24 16 */
u16 count_buckets; /* 40 2 */
u16 wants_buckets; /* 42 2 */
} res; /* 24 24 */
}; /* 24 24 */
struct list_head nh_list; /* 48 16 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct nexthop * nh_parent; /* 64 8 */
/* size: 72, cachelines: 2, members: 6 */
/* sum members: 65, holes: 1, sum holes: 7 */
/* last cacheline: 8 bytes */
};
Co-developed-by: Petr Machata <[email protected]>
Signed-off-by: Petr Machata <[email protected]>
Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
In order to add per-nexthop statistics, but still not increase netlink
message size for consumers that do not care about them, there needs to be a
toggle through which the user indicates their desire to get the statistics.
To that end, add a new attribute, NHA_OP_FLAGS. The idea is to be able to
use the attribute for carrying of arbitrary operation-specific flags, i.e.
not make it specific for get / dump.
Add the new attribute to get and dump policies, but do not actually allow
any flags yet -- those will come later as the flags themselves are defined.
Add the necessary parsing code.
Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
It is useful to expose skb addr and sock addr to user in tracepoint
tcp_probe, so that we can get more information while monitoring
receiving of tcp data, by ebpf or other ways.
For example, we need to identify a packet by seq and end_seq when
calculate transmit latency between layer 2 and layer 4 by ebpf, but which is
not available in tcp_probe, so we can only use kprobe hooking
tcp_rcv_established to get them. But we can use tcp_probe directly if skb
addr and sock addr are available, which is more efficient.
Signed-off-by: fuyuanli <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
softnet_data->time_squeeze is sometimes used as a proxy for
host overload or indication of scheduling problems. In practice
this statistic is very noisy and has hard to grasp units -
e.g. is 10 squeezes a second to be expected, or high?
Delaying network (NAPI) processing leads to drops on NIC queues
but also RTT bloat, impacting pacing and CA decisions.
Stalls are a little hard to detect on the Rx side, because
there may simply have not been any packets received in given
period of time. Packet timestamps help a little bit, but
again we don't know if packets are stale because we're
not keeping up or because someone (*cough* cgroups)
disabled IRQs for a long time.
We can, however, use Tx as a proxy for Rx stalls. Most drivers
use combined Rx+Tx NAPIs so if Tx gets starved so will Rx.
On the Tx side we know exactly when packets get queued,
and completed, so there is no uncertainty.
This patch adds stall checks to BQL. Why BQL? Because
it's a convenient place to add such checks, already
called by most drivers, and it has copious free space
in its structures (this patch adds no extra cache
references or dirtying to the fast path).
The algorithm takes one parameter - max delay AKA stall
threshold and increments a counter whenever NAPI got delayed
for at least that amount of time. It also records the length
of the longest stall.
To be precise every time NAPI has not polled for at least
stall thrs we check if there were any Tx packets queued
between last NAPI run and now - stall_thrs/2.
Unlike the classic Tx watchdog this mechanism does not
ignore stalls caused by Tx being disabled, or loss of link.
I don't think the check is worth the complexity, and
stall is a stall, whether due to host overload, flow
control, link down... doesn't matter much to the application.
We have been running this detector in production at Meta
for 2 years, with the threshold of 8ms. It's the lowest
value where false positives become rare. There's still
a constant stream of reported stalls (especially without
the ksoftirqd deferral patches reverted), those who like
their stall metrics to be 0 may prefer higher value.
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Breno Leitao <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Rx alloc failures are commonly counted by drivers.
Support reporting those via netdev-genl queue stats.
Acked-by: Stanislav Fomichev <[email protected]>
Reviewed-by: Amritha Nambiar <[email protected]>
Reviewed-by: Xuan Zhuo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
The ethtool-nl family does a good job exposing various protocol
related and IEEE/IETF statistics which used to get dumped under
ethtool -S, with creative names. Queue stats don't have a netlink
API, yet, and remain a lion's share of ethtool -S output for new
drivers. Not only is that bad because the names differ driver to
driver but it's also bug-prone. Intuitively drivers try to report
only the stats for active queues, but querying ethtool stats
involves multiple system calls, and the number of stats is
read separately from the stats themselves. Worse still when user
space asks for values of the stats, it doesn't inform the kernel
how big the buffer is. If number of stats increases in the meantime
kernel will overflow user buffer.
Add a netlink API for dumping queue stats. Queue information is
exposed via the netdev-genl family, so add the stats there.
Support per-queue and sum-for-device dumps. Latter will be useful
when subsequent patches add more interesting common stats than
just bytes and packets.
The API does not currently distinguish between HW and SW stats.
The expectation is that the source of the stats will either not
matter much (good packets) or be obvious (skb alloc errors).
Acked-by: Stanislav Fomichev <[email protected]>
Reviewed-by: Amritha Nambiar <[email protected]>
Reviewed-by: Xuan Zhuo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
rps_sock_flow_table and rps_cpu_mask are used in fast path.
Move them to net_hotdata for better cache locality.
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Move RPS related structures and helpers from include/linux/netdevice.h
and include/net/sock.h to a new include file.
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Use a 32bit hole in "struct net_offload" to store
the remaining 32bit secrets used by TCPv6 and UDPv6.
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
"struct inet6_protocol" has a 32bit hole in 32bit arches.
Use it to store the 32bit secret used by UDP and TCP,
to increase cache locality in rx path.
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
"struct net_protocol" has a 32bit hole in 32bit arches.
Use it to store the 32bit secret used by UDP and TCP,
to increase cache locality in rx path.
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
These structures are read in rx path, move them to net_hotdata
for better cache locality.
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
These structures are read in rx path, move them to net_hotdata
for better cache locality.
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
These structures are used in GRO and GSO paths.
Move them to net_hodata for better cache locality.
v2: udpv6_offload definition depends on CONFIG_INET=y
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
skbuff_cache, skbuff_fclone_cache and skb_small_head_cache
are used in rx/tx fast paths.
Move them to net_hotdata for better cache locality.
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
dev_rx_weight is read from process_backlog().
Move it to net_hotdata for better cache locality.
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
dev_tx_weight is used in tx fast path.
Move it to net_hotdata for better cache locality.
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
These are used in TCP fast paths.
Move them into net_hotdata for better cache locality.
v2: tcpv6_offload definition depends on CONFIG_INET
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
These structures are used in GRO and GSO paths.
v2: ipv6_packet_offload definition depends on CONFIG_INET
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
netdev_max_backlog is used in rx fat path.
Move it to net_hodata for better cache locality.
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
ptype_all is used in rx/tx fast paths.
Move it to net_hotdata for better cache locality.
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
netdev_tstamp_prequeue is used in rx path.
Move it to net_hotdata for better cache locality.
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
netdev_budget and netdev_budget are used in rx path (net_rx_action())
Move them into net_hotdata for better cache locality.
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Instead of spreading networking critical fields
all over the places, add a custom net_hotdata
structure so that we can precisely control its layout.
In this first patch, move :
- gro_normal_batch used in rx (GRO stack)
- offload_base used in rx and tx (GRO and TSO stacks)
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
Here are some changes to AF_RXRPC:
(1) Cache the transmission serial number of ACK and DATA packets in the
rxrpc_txbuf struct and log this in the retransmit tracepoint.
(2) Don't use atomics on rxrpc_txbuf::flags[*] and cache the intended wire
header flags there too to avoid duplication.
(3) Cache the wire checksum in rxrpc_txbuf to make it easier to create
jumbo packets in future (which will require altering the wire header
to a jumbo header and restoring it back again for retransmission).
(4) Fix the protocol names in the wire ACK trailer struct.
(5) Strip all the barriers and atomics out of the call timer tracking[*].
(6) Remove atomic handling from call->tx_transmitted and
call->acks_prev_seq[*].
(7) Don't bother resetting the DF flag after UDP packet transmission. To
change it, we now call directly into UDP code, so it's quick just to
set it every time.
(8) Merge together the DF/non-DF branches of the DATA transmission to
reduce duplication in the code.
(9) Add a kvec array into rxrpc_txbuf and start moving things over to it.
This paves the way for using page frags.
(10) Split (sub)packet preparation and timestamping out of the DATA
transmission function. This helps pave the way for future jumbo
packet generation.
(11) In rxkad, don't pick values out of the wire header stored in
rxrpc_txbuf, buf rather find them elsewhere so we can remove the wire
header from there.
(12) Move rxrpc_send_ACK() to output.c so that it can be merged with
rxrpc_send_ack_packet().
(13) Use rxrpc_txbuf::kvec[0] to access the wire header for the packet
rather than directly accessing the copy in rxrpc_txbuf. This will
allow that to be removed to a page frag.
(14) Switch from keeping the transmission buffers in rxrpc_txbuf allocated
in the slab to allocating them using page fragment allocators. There
are separate allocators for DATA packets (which persist for a while)
and control packets (which are discarded immediately).
We can then turn on MSG_SPLICE_PAGES when transmitting DATA and ACK
packets.
We can also get rid of the RCU cleanup on rxrpc_txbufs, preferring
instead to release the page frags as soon as possible.
(15) Parse received packets before handling timeouts as the former may
reset the latter.
(16) Make sure we don't retransmit DATA packets after all the packets have
been ACK'd.
(17) Differentiate traces for PING ACK transmission.
(18) Switch to keeping timeouts as ktime_t rather than a number of jiffies
as the latter is too coarse a granularity. Only set the call timer at
the end of the call event function from the aggregate of all the
timeouts, thereby reducing the number of timer calls made. In future,
it might be possible to reduce the number of timers from one per call
to one per I/O thread and to use a high-precision timer.
(19) Record RTT probes after successful transmission rather than recording
it before and then cancelling it after if unsuccessful[*]. This
allows a number of calls to get the current time to be removed.
(20) Clean up the resend algorithm as there's now no need to walk the
transmission buffer under lock[*]. DATA packets can be retransmitted
as soon as they're found rather than being queued up and transmitted
when the locked is dropped.
(21) When initially parsing a received ACK packet, extract some of the
fields from the ack info to the skbuff private data. This makes it
easier to do path MTU discovery in the future when the call to which a
PING RESPONSE ACK refers has been deallocated.
[*] Possible with the move of almost all code from softirq context to the
I/O thread.
Link: https://lore.kernel.org/r/[email protected]/ # v1
Link: https://lore.kernel.org/r/[email protected]/ # v2
* tag 'rxrpc-iothread-20240305' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (21 commits)
rxrpc: Extract useful fields from a received ACK to skb priv data
rxrpc: Clean up the resend algorithm
rxrpc: Record probes after transmission and reduce number of time-gets
rxrpc: Use ktimes for call timeout tracking and set the timer lazily
rxrpc: Differentiate PING ACK transmission traces.
rxrpc: Don't permit resending after all Tx packets acked
rxrpc: Parse received packets before dealing with timeouts
rxrpc: Do zerocopy using MSG_SPLICE_PAGES and page frags
rxrpc: Use rxrpc_txbuf::kvec[0] instead of rxrpc_txbuf::wire
rxrpc: Move rxrpc_send_ACK() to output.c with rxrpc_send_ack_packet()
rxrpc: Don't pick values out of the wire header when setting up security
rxrpc: Split up the DATA packet transmission function
rxrpc: Add a kvec[] to the rxrpc_txbuf struct
rxrpc: Merge together DF/non-DF branches of data Tx function
rxrpc: Do lazy DF flag resetting
rxrpc: Remove atomic handling on some fields only used in I/O thread
rxrpc: Strip barriers and atomics off of timer tracking
rxrpc: Fix the names of the fields in the ACK trailer struct
rxrpc: Note cksum in txbuf
rxrpc: Convert rxrpc_txbuf::flags into a mask and don't use atomics
...
====================
Link: https://lore.kernel.org/r/
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
net/core/page_pool_user.c
0b11b1c5c320 ("netdev: let netlink core handle -EMSGSIZE errors")
429679dcf7d9 ("page_pool: fix netlink dump stop/resume")
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from bpf, ipsec and netfilter.
No solution yet for the stmmac issue mentioned in the last PR, but it
proved to be a lockdep false positive, not a blocker.
Current release - regressions:
- dpll: move all dpll<>netdev helpers to dpll code, fix build
regression with old compilers
Current release - new code bugs:
- page_pool: fix netlink dump stop/resume
Previous releases - regressions:
- bpf: fix verifier to check bpf_func_state->callback_depth when
pruning states as otherwise unsafe programs could get accepted
- ipv6: avoid possible UAF in ip6_route_mpath_notify()
- ice: reconfig host after changing MSI-X on VF
- mlx5:
- e-switch, change flow rule destination checking
- add a memory barrier to prevent a possible null-ptr-deref
- switch to using _bh variant of of spinlock where needed
Previous releases - always broken:
- netfilter: nf_conntrack_h323: add protection for bmp length out of
range
- bpf: fix to zero-initialise xdp_rxq_info struct before running XDP
program in CPU map which led to random xdp_md fields
- xfrm: fix UDP encapsulation in TX packet offload
- netrom: fix data-races around sysctls
- ice:
- fix potential NULL pointer dereference in ice_bridge_setlink()
- fix uninitialized dplls mutex usage
- igc: avoid returning frame twice in XDP_REDIRECT
- i40e: disable NAPI right after disabling irqs when handling
xsk_pool
- geneve: make sure to pull inner header in geneve_rx()
- sparx5: fix use after free inside sparx5_del_mact_entry
- dsa: microchip: fix register write order in ksz8_ind_write8()
Misc:
- selftests: mptcp: fixes for diag.sh"
* tag 'net-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (63 commits)
net: pds_core: Fix possible double free in error handling path
netrom: Fix data-races around sysctl_net_busy_read
netrom: Fix a data-race around sysctl_netrom_link_fails_count
netrom: Fix a data-race around sysctl_netrom_routing_control
netrom: Fix a data-race around sysctl_netrom_transport_no_activity_timeout
netrom: Fix a data-race around sysctl_netrom_transport_requested_window_size
netrom: Fix a data-race around sysctl_netrom_transport_busy_delay
netrom: Fix a data-race around sysctl_netrom_transport_acknowledge_delay
netrom: Fix a data-race around sysctl_netrom_transport_maximum_tries
netrom: Fix a data-race around sysctl_netrom_transport_timeout
netrom: Fix data-races around sysctl_netrom_network_ttl_initialiser
netrom: Fix a data-race around sysctl_netrom_obsolescence_count_initialiser
netrom: Fix a data-race around sysctl_netrom_default_path_quality
netfilter: nf_conntrack_h323: Add protection for bmp length out of range
netfilter: nf_tables: mark set as dead when unbinding anonymous set with timeout
netfilter: nft_ct: fix l3num expectations with inet pseudo family
netfilter: nf_tables: reject constant set with timeout
netfilter: nf_tables: disallow anonymous set with timeout flag
net/rds: fix WARNING in rds_conn_connect_if_down
net: dsa: microchip: fix register write order in ksz8_ind_write8()
...
|
|
Use the existing parameter and print the address of skbaddr
as other trace functions do.
Signed-off-by: Jason Xing <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
Printing the addresses can help us identify the exact skb/sk
for those system in which it's not that easy to run BPF program.
As we can see, it already fetches those, then use it directly
and it will print like below:
...tcp_retransmit_skb: skbaddr=XXX skaddr=XXX family=AF_INET...
Signed-off-by: Jason Xing <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
commit 4d72c3bb60dd ("net: phylink: strip out pre-March 2020 legacy code")
dropped the mac_pcs_get_state ops in phylink_mac_ops in favor of
dedicated PCS operation pcs_get_state. However, the documentation for
the pcs_get_state ops was incorrectly converted and now self-references.
Drop the extra comment.
Signed-off-by: Maxime Chevallier <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Get rid of copy_mc flag in iov_iter which really only makes sense for
the core dumping code so move it out of the generic iov iter code and
make it coredump's problem. See the detailed commit description.
- Revert fs/aio: Make io_cancel() generate completions again
The initial fix here was predicated on the assumption that calling
ki_cancel() didn't complete aio requests. However, that turned out to
be wrong since the two drivers that actually make use of this set a
cancellation function that performs the cancellation correctly. So
revert this change.
- Ensure that the test for IOCB_AIO_RW always happens before the read
from ki_ctx.
* tag 'vfs-6.8-release.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iov_iter: get rid of 'copy_mc' flag
fs/aio: Check IOCB_AIO_RW before the struct aio_kiocb conversion
Revert "fs/aio: Make io_cancel() generate completions again"
|
|
Currently getsockopt does not support IP_ROUTER_ALERT and
IPV6_ROUTER_ALERT, and we are unable to get the values of these two
socket options through getsockopt.
This patch adds getsockopt support for IP_ROUTER_ALERT and
IPV6_ROUTER_ALERT.
Signed-off-by: Juntong Deng <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This flag is only set by one single user: the magical core dumping code
that looks up user pages one by one, and then writes them out using
their kernel addresses (by using a BVEC_ITER).
That actually ends up being a huge problem, because while we do use
copy_mc_to_kernel() for this case and it is able to handle the possible
machine checks involved, nothing else is really ready to handle the
failures caused by the machine check.
In particular, as reported by Tong Tiangen, we don't actually support
fault_in_iov_iter_readable() on a machine check area.
As a result, the usual logic for writing things to a file under a
filesystem lock, which involves doing a copy with page faults disabled
and then if that fails trying to fault pages in without holding the
locks with fault_in_iov_iter_readable() does not work at all.
We could decide to always just make the MC copy "succeed" (and filling
the destination with zeroes), and that would then create a core dump
file that just ignores any machine checks.
But honestly, this single special case has been problematic before, and
means that all the normal iov_iter code ends up slightly more complex
and slower.
See for example commit c9eec08bac96 ("iov_iter: Don't deal with
iter->copy_mc in memcpy_from_iter_mc()") where David Howells
re-organized the code just to avoid having to check the 'copy_mc' flags
inside the inner iov_iter loops.
So considering that we have exactly one user, and that one user is a
non-critical special case that doesn't actually ever trigger in real
life (Tong found this with manual error injection), the sane solution is
to just decide that the onus on handling the machine check lines on that
user instead.
Ergo, do the copy_mc_to_kernel() in the core dump logic itself, copying
the user data to a stable kernel page before writing it out.
Fixes: f1982740f5e7 ("iov_iter: Convert iterate*() to inline funcs")
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Tong Tiangen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/[email protected]/
Tested-by: David Howells <[email protected]>
Reviewed-by: David Howells <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Reported-by: Tong Tiangen <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
In order for EEE to operate, both the MAC and the PHY need to support
it, similar to how pause works. With some exception - a number of PHYs
have SmartEEE or AutoGrEEEn support in order to provide some EEE-like
power savings with non-EEE capable MACs.
Copy the pause concept and add the call phy_support_eee() which the MAC
makes after connecting the PHY to indicate it supports EEE. phylib will
then advertise EEE when auto-neg is performed.
Signed-off-by: Andrew Lunn <[email protected]>
Signed-off-by: Oleksij Rempel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Have phylib keep track of the EEE configuration. This simplifies the
MAC drivers, in that they don't need to store it.
Future patches to phylib will also make use of this information to
further simplify the MAC drivers.
Reviewed-by: Russell King (Oracle) <[email protected]>
Signed-off-by: Andrew Lunn <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: Oleksij Rempel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
MAC drivers which support EEE need to know the results of the EEE
auto-neg in order to program the hardware to perform EEE or not. The
oddly named phy_init_eee() can be used to determine this, it returns 0
if EEE should be used, or a negative error code,
e.g. -EOPPROTONOTSUPPORT if the PHY does not support EEE or negotiate
resulted in it not being used.
However, many MAC drivers get this wrong. Add phydev->enable_tx_lpi
which indicates the result of the autoneg for EEE, including if EEE is
administratively disabled with ethtool. The MAC driver can then access
this in the same way as link speed and duplex in the adjust link
callback. If enable_tx_lpi is true, the MAC should send low power
indications and does not need to consider anything else with respect
to EEE.
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: Andrew Lunn <[email protected]>
Signed-off-by: Oleksij Rempel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Add helpers that phylib and phylink can use to manage EEE configuration
and determine whether the MAC should be permitted to use LPI based on
that configuration.
Signed-off-by: Russell King (Oracle) <[email protected]>
Signed-off-by: Andrew Lunn <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: Oleksij Rempel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Older versions of GCC really want to know the full definition
of the type involved in rcu_assign_pointer().
struct dpll_pin is defined in a local header, net/core can't
reach it. Move all the netdev <> dpll code into dpll, where
the type is known. Otherwise we'd need multiple function calls
to jump between the compilation units.
This is the same problem the commit under fixes was trying to address,
but with rcu_assign_pointer() not rcu_dereference().
Some of the exports are not needed, networking core can't
be a module, we only need exports for the helpers used by
drivers.
Reported-by: Geert Uytterhoeven <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
Fixes: 640f41ed33b5 ("dpll: fix build failure due to rcu_dereference_check() on unknown type")
Reviewed-by: Jiri Pirko <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Track the call timeouts as ktimes rather than jiffies as the latter's
granularity is too high and only set the timer at the end of the event
handling function.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: [email protected]
cc: [email protected]
|
|
There are three points that transmit PING ACKs and all of them use the same
trace string. Change two of them to use different strings.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: [email protected]
cc: [email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- Multiple fixes, cleanups and documentations for Hyper-V core code and
drivers
* tag 'hyperv-fixes-signed-20240303' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
Drivers: hv: vmbus: make hv_bus const
x86/hyperv: Allow 15-bit APIC IDs for VTL platforms
x86/hyperv: Make encrypted/decrypted changes safe for load_unaligned_zeropad()
x86/mm: Regularize set_memory_p() parameters and make non-static
x86/hyperv: Use slow_virt_to_phys() in page transition hypervisor callback
Documentation: hyperv: Add overview of PCI pass-thru device support
Drivers: hv: vmbus: Update indentation in create_gpadl_header()
Drivers: hv: vmbus: Remove duplication and cleanup code in create_gpadl_header()
fbdev/hyperv_fb: Fix logic error for Gen2 VMs in hvfb_getmem()
Drivers: hv: vmbus: Calculate ring buffer size for more efficient use of memory
hv_utils: Allow implicit ICTIMESYNCFLAG_SYNC
|
|
Since commit 43a7206b0963 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the nfc_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.
Suggested-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Ricardo B. Marliere <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Bridge driver today has no support to forward the userspace timestamp
packets and ends up resetting the timestamp. ETF qdisc checks the
packet coming from userspace and encounters to be 0 thereby dropping
time sensitive packets. These changes will allow userspace timestamps
packets to be forwarded from the bridge to NIC drivers.
Setting the same bit (mono_delivery_time) to avoid dropping of
userspace tstamp packets in the forwarding path.
Existing functionality of mono_delivery_time remains unaltered here,
instead just extended with userspace tstamp support for bridge
forwarding path.
Signed-off-by: Abhishek Chauhan <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
|
|
Currently the so-called GRO fast path is only enabled for
napi_frags_skb() callers.
After the prior patch, we no longer have to clear frag0 whenever
we pulled bytes to skb->head.
We therefore can initialize frag0 to skb->data so that GRO
fast path can be used in the following additional cases:
- Drivers using header split (populating skb->data with headers,
and having payload in one or more page fragments).
- Drivers not using any page frag (entire packet is in skb->data)
Add a likely() in skb_gro_may_pull() to help the compiler
to generate better code if possible.
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Paolo Abeni <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
Change skb_gro_network_header() to accept a const sk_buff
and to no longer check if frag0 is NULL or not.
This allows to remove skb_gro_frag0_invalidate()
which is seen in profiles when header-split is enabled.
sk_buff parameter is constified for skb_gro_header_fast(),
inet_gro_compute_pseudo() and ip6_gro_compute_pseudo().
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Paolo Abeni <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
skb_gro_header_hard() is renamed to skb_gro_may_pull() to match
the convention used by common helpers like pskb_may_pull().
This means the condition is inverted:
if (skb_gro_header_hard(skb, hlen))
slow_path();
becomes:
if (!skb_gro_may_pull(skb, hlen))
slow_path();
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Paolo Abeni <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|