aboutsummaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)AuthorFilesLines
2022-02-03ax25: fix reference count leaks of ax25_devDuoming Zhou1-3/+5
The previous commit d01ffb9eee4a ("ax25: add refcount in ax25_dev to avoid UAF bugs") introduces refcount into ax25_dev, but there are reference leak paths in ax25_ctl_ioctl(), ax25_fwd_ioctl(), ax25_rt_add(), ax25_rt_del() and ax25_rt_opt(). This patch uses ax25_dev_put() and adjusts the position of ax25_addr_ax25dev() to fix reference cout leaks of ax25_dev. Fixes: d01ffb9eee4a ("ax25: add refcount in ax25_dev to avoid UAF bugs") Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20220203150811.42256-1-duoming@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-02net, neigh: Do not trigger immediate probes on NUD_FAILED from ↵Daniel Borkmann1-5/+13
neigh_managed_work syzkaller was able to trigger a deadlock for NTF_MANAGED entries [0]: kworker/0:16/14617 is trying to acquire lock: ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652 [...] but task is already holding lock: ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: neigh_managed_work+0x35/0x250 net/core/neighbour.c:1572 The neighbor entry turned to NUD_FAILED state, where __neigh_event_send() triggered an immediate probe as per commit cd28ca0a3dd1 ("neigh: reduce arp latency") via neigh_probe() given table lock was held. One option to fix this situation is to defer the neigh_probe() back to the neigh_timer_handler() similarly as pre cd28ca0a3dd1. For the case of NTF_MANAGED, this deferral is acceptable given this only happens on actual failure state and regular / expected state is NUD_VALID with the entry already present. The fix adds a parameter to __neigh_event_send() in order to communicate whether immediate probe is allowed or disallowed. Existing call-sites of neigh_event_send() default as-is to immediate probe. However, the neigh_managed_work() disables it via use of neigh_event_send_probe(). [0] <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_deadlock_bug kernel/locking/lockdep.c:2956 [inline] check_deadlock kernel/locking/lockdep.c:2999 [inline] validate_chain kernel/locking/lockdep.c:3788 [inline] __lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5027 lock_acquire kernel/locking/lockdep.c:5639 [inline] lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5604 __raw_write_lock_bh include/linux/rwlock_api_smp.h:202 [inline] _raw_write_lock_bh+0x2f/0x40 kernel/locking/spinlock.c:334 ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652 ip6_finish_output2+0x1070/0x14f0 net/ipv6/ip6_output.c:123 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] __ip6_finish_output+0x61e/0xe90 net/ipv6/ip6_output.c:170 ip6_finish_output+0x32/0x200 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:451 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ndisc_send_skb+0xa99/0x17f0 net/ipv6/ndisc.c:508 ndisc_send_ns+0x3a9/0x840 net/ipv6/ndisc.c:650 ndisc_solicit+0x2cd/0x4f0 net/ipv6/ndisc.c:742 neigh_probe+0xc2/0x110 net/core/neighbour.c:1040 __neigh_event_send+0x37d/0x1570 net/core/neighbour.c:1201 neigh_event_send include/net/neighbour.h:470 [inline] neigh_managed_work+0x162/0x250 net/core/neighbour.c:1574 process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307 worker_thread+0x657/0x1110 kernel/workqueue.c:2454 kthread+0x2e9/0x3a0 kernel/kthread.c:377 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 </TASK> Fixes: 7482e3841d52 ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries") Reported-by: syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Roopa Prabhu <roopa@nvidia.com> Tested-by: syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220201193942.5055-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-28ax25: add refcount in ax25_dev to avoid UAF bugsDuoming Zhou1-0/+10
If we dereference ax25_dev after we call kfree(ax25_dev) in ax25_dev_device_down(), it will lead to concurrency UAF bugs. There are eight syscall functions suffer from UAF bugs, include ax25_bind(), ax25_release(), ax25_connect(), ax25_ioctl(), ax25_getname(), ax25_sendmsg(), ax25_getsockopt() and ax25_info_show(). One of the concurrency UAF can be shown as below: (USE) | (FREE) | ax25_device_event | ax25_dev_device_down ax25_bind | ... ... | kfree(ax25_dev) ax25_fillin_cb() | ... ax25_fillin_cb_from_dev() | ... | The root cause of UAF bugs is that kfree(ax25_dev) in ax25_dev_device_down() is not protected by any locks. When ax25_dev, which there are still pointers point to, is released, the concurrency UAF bug will happen. This patch introduces refcount into ax25_dev in order to guarantee that there are no pointers point to it when ax25_dev is released. Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-27ipv4: remove sparse error in ip_neigh_gw4()Eric Dumazet1-1/+1
./include/net/route.h:373:48: warning: incorrect type in argument 2 (different base types) ./include/net/route.h:373:48: expected unsigned int [usertype] key ./include/net/route.h:373:48: got restricted __be32 [usertype] daddr Fixes: 5c9f7c1dfc2e ("ipv4: Add helpers for neigh lookup for nexthop") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220127013404.1279313-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27ipv4: avoid using shared IP generator for connected socketsEric Dumazet1-11/+10
ip_select_ident_segs() has been very conservative about using the connected socket private generator only for packets with IP_DF set, claiming it was needed for some VJ compression implementations. As mentioned in this referenced document, this can be abused. (Ref: Off-Path TCP Exploits of the Mixed IPID Assignment) Before switching to pure random IPID generation and possibly hurt some workloads, lets use the private inet socket generator. Not only this will remove one vulnerability, this will also improve performance of TCP flows using pmtudisc==IP_PMTUDISC_DONT Fixes: 73f156a6e8c1 ("inetpeer: get rid of ip_id_count") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reported-by: Ray Che <xijiache@gmail.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27Revert "ipv6: Honor all IPv6 PIO Valid Lifetime values"Guillaume Nault1-0/+2
This reverts commit b75326c201242de9495ff98e5d5cff41d7fc0d9d. This commit breaks Linux compatibility with USGv6 tests. The RFC this commit was based on is actually an expired draft: no published RFC currently allows the new behaviour it introduced. Without full IETF endorsement, the flash renumbering scenario this patch was supposed to enable is never going to work, as other IPv6 equipements on the same LAN will keep the 2 hours limit. Fixes: b75326c20124 ("ipv6: Honor all IPv6 PIO Valid Lifetime values") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-24bonding: use rcu_dereference_rtnl when get bonding active slaveHangbin Liu1-1/+1
bond_option_active_slave_get_rcu() should not be used in rtnl_mutex as it use rcu_dereference(). Replace to rcu_dereference_rtnl() so we also can use this function in rtnl protected context. With this update, we can rmeove the rcu_read_lock/unlock in bonding .ndo_eth_ioctl and .get_ts_info. Reported-by: Vladimir Oltean <vladimir.oltean@nxp.com> Fixes: 94dd016ae538 ("bond: pass get_ts_info and SIOC[SG]HWTSTAMP ioctl to active device") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-20ipv6: annotate accesses to fn->fn_sernumEric Dumazet1-1/+1
struct fib6_node's fn_sernum field can be read while other threads change it. Add READ_ONCE()/WRITE_ONCE() annotations. Do not change existing smp barriers in fib6_get_cookie_safe() and __fib6_update_sernum_upto_root() syzbot reported: BUG: KCSAN: data-race in fib6_clean_node / inet6_csk_route_socket write to 0xffff88813df62e2c of 4 bytes by task 1920 on cpu 1: fib6_clean_node+0xc2/0x260 net/ipv6/ip6_fib.c:2178 fib6_walk_continue+0x38e/0x430 net/ipv6/ip6_fib.c:2112 fib6_walk net/ipv6/ip6_fib.c:2160 [inline] fib6_clean_tree net/ipv6/ip6_fib.c:2240 [inline] __fib6_clean_all+0x1a9/0x2e0 net/ipv6/ip6_fib.c:2256 fib6_flush_trees+0x6c/0x80 net/ipv6/ip6_fib.c:2281 rt_genid_bump_ipv6 include/net/net_namespace.h:488 [inline] addrconf_dad_completed+0x57f/0x870 net/ipv6/addrconf.c:4230 addrconf_dad_work+0x908/0x1170 process_one_work+0x3f6/0x960 kernel/workqueue.c:2307 worker_thread+0x616/0xa70 kernel/workqueue.c:2454 kthread+0x1bf/0x1e0 kernel/kthread.c:359 ret_from_fork+0x1f/0x30 read to 0xffff88813df62e2c of 4 bytes by task 15701 on cpu 0: fib6_get_cookie_safe include/net/ip6_fib.h:285 [inline] rt6_get_cookie include/net/ip6_fib.h:306 [inline] ip6_dst_store include/net/ip6_route.h:234 [inline] inet6_csk_route_socket+0x352/0x3c0 net/ipv6/inet6_connection_sock.c:109 inet6_csk_xmit+0x91/0x1e0 net/ipv6/inet6_connection_sock.c:121 __tcp_transmit_skb+0x1323/0x1840 net/ipv4/tcp_output.c:1402 tcp_transmit_skb net/ipv4/tcp_output.c:1420 [inline] tcp_write_xmit+0x1450/0x4460 net/ipv4/tcp_output.c:2680 __tcp_push_pending_frames+0x68/0x1c0 net/ipv4/tcp_output.c:2864 tcp_push+0x2d9/0x2f0 net/ipv4/tcp.c:725 mptcp_push_release net/mptcp/protocol.c:1491 [inline] __mptcp_push_pending+0x46c/0x490 net/mptcp/protocol.c:1578 mptcp_sendmsg+0x9ec/0xa50 net/mptcp/protocol.c:1764 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:643 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] kernel_sendmsg+0x97/0xd0 net/socket.c:745 sock_no_sendpage+0x84/0xb0 net/core/sock.c:3086 inet_sendpage+0x9d/0xc0 net/ipv4/af_inet.c:834 kernel_sendpage+0x187/0x200 net/socket.c:3492 sock_sendpage+0x5a/0x70 net/socket.c:1007 pipe_to_sendpage+0x128/0x160 fs/splice.c:364 splice_from_pipe_feed fs/splice.c:418 [inline] __splice_from_pipe+0x207/0x500 fs/splice.c:562 splice_from_pipe fs/splice.c:597 [inline] generic_splice_sendpage+0x94/0xd0 fs/splice.c:746 do_splice_from fs/splice.c:767 [inline] direct_splice_actor+0x80/0xa0 fs/splice.c:936 splice_direct_to_actor+0x345/0x650 fs/splice.c:891 do_splice_direct+0x106/0x190 fs/splice.c:979 do_sendfile+0x675/0xc40 fs/read_write.c:1245 __do_sys_sendfile64 fs/read_write.c:1310 [inline] __se_sys_sendfile64 fs/read_write.c:1296 [inline] __x64_sys_sendfile64+0x102/0x140 fs/read_write.c:1296 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000026f -> 0x00000271 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 15701 Comm: syz-executor.2 Not tainted 5.16.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 The Fixes tag I chose is probably arbitrary, I do not think we need to backport this patch to older kernels. Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220120174112.1126644-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-20tcp: Add a stub for sk_defer_free_flush()Gal Pressman1-0/+4
When compiling the kernel with CONFIG_INET disabled, the sk_defer_free_flush() should be defined as a nop. This resolves the following compilation error: ld: net/core/sock.o: in function `sk_defer_free_flush': ./include/net/tcp.h:1378: undefined reference to `__sk_defer_free_flush' Fixes: 79074a72d335 ("net: Flush deferred skb free on socket destroy") Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220120123440.9088-1-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-20Merge tag 'net-5.17-rc1' of ↵Linus Torvalds4-4/+19
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter, bpf. Quite a handful of old regression fixes but most of those are pre-5.16. Current release - regressions: - fix memory leaks in the skb free deferral scheme if upper layer protocols are used, i.e. in-kernel TCP readers like TLS Current release - new code bugs: - nf_tables: fix NULL check typo in _clone() functions - change the default to y for Vertexcom vendor Kconfig - a couple of fixes to incorrect uses of ref tracking - two fixes for constifying netdev->dev_addr Previous releases - regressions: - bpf: - various verifier fixes mainly around register offset handling when passed to helper functions - fix mount source displayed for bpffs (none -> bpffs) - bonding: - fix extraction of ports for connection hash calculation - fix bond_xmit_broadcast return value when some devices are down - phy: marvell: add Marvell specific PHY loopback - sch_api: don't skip qdisc attach on ingress, prevent ref leak - htb: restore minimal packet size handling in rate control - sfp: fix high power modules without diagnostic monitoring - mscc: ocelot: - don't let phylink re-enable TX PAUSE on the NPI port - don't dereference NULL pointers with shared tc filters - smsc95xx: correct reset handling for LAN9514 - cpsw: avoid alignment faults by taking NET_IP_ALIGN into account - phy: micrel: use kszphy_suspend/_resume for irq aware devices, avoid races with the interrupt Previous releases - always broken: - xdp: check prog type before updating BPF link - smc: resolve various races around abnormal connection termination - sit: allow encapsulated IPv6 traffic to be delivered locally - axienet: fix init/reset handling, add missing barriers, read the right status words, stop queues correctly - add missing dev_put() in sock_timestamping_bind_phc() Misc: - ipv4: prevent accidentally passing RTO_ONLINK to ip_route_output_key_hash() by sanitizing flags - ipv4: avoid quadratic behavior in netns dismantle - stmmac: dwmac-oxnas: add support for OX810SE - fsl: xgmac_mdio: add workaround for erratum A-009885" * tag 'net-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (92 commits) ipv4: add net_hash_mix() dispersion to fib_info_laddrhash keys ipv4: avoid quadratic behavior in netns dismantle net/fsl: xgmac_mdio: Fix incorrect iounmap when removing module powerpc/fsl/dts: Enable WA for erratum A-009885 on fman3l MDIO buses dt-bindings: net: Document fsl,erratum-a009885 net/fsl: xgmac_mdio: Add workaround for erratum A-009885 net: mscc: ocelot: fix using match before it is set net: phy: micrel: use kszphy_suspend()/kszphy_resume for irq aware devices net: cpsw: avoid alignment faults by taking NET_IP_ALIGN into account nfc: llcp: fix NULL error pointer dereference on sendmsg() after failed bind() net: axienet: increase default TX ring size to 128 net: axienet: fix for TX busy handling net: axienet: fix number of TX ring slots for available check net: axienet: Fix TX ring slot available check net: axienet: limit minimum TX ring size net: axienet: add missing memory barriers net: axienet: reset core on initialization prior to MDIO access net: axienet: Wait for PhyRstCmplt after core reset net: axienet: increase reset timeout bpf, selftests: Add ringbuf memory type confusion test ...
2022-01-16Merge tag '9p-for-5.17-rc1' of git://github.com/martinetd/linuxLinus Torvalds2-3/+1
Pull 9p updates from Dominique Martinet: "Fixes, split 9p_net_fd, and new reviewer: - fix possible uninitialized memory usage for setattr - fix fscache reading hole in a file just after it's been grown - split net/9p/trans_fd.c in its own module like other transports. The new transport module defaults to 9P_NET and is autoloaded if required so users should not be impacted - add Christian Schoenebeck to 9p reviewers - some more trivial cleanup" * tag '9p-for-5.17-rc1' of git://github.com/martinetd/linux: 9p: fix enodata when reading growing file net/9p: show error message if user 'msize' cannot be satisfied MAINTAINERS: 9p: add Christian Schoenebeck as reviewer 9p: only copy valid iattrs in 9P2000.L setattr implementation 9p: Use BUG_ON instead of if condition followed by BUG. net/p9: load default transports 9p/xen: autoload when xenbus service is available 9p/trans_fd: split into dedicated module fs: 9p: remove unneeded variable 9p/trans_virtio: Fix typo in the comment for p9_virtio_create()
2022-01-13net_sched: restore "mpu xxx" handlingKevin Bracey1-0/+5
commit 56b765b79e9a ("htb: improved accuracy at high rates") broke "overhead X", "linklayer atm" and "mpu X" attributes. "overhead X" and "linklayer atm" have already been fixed. This restores the "mpu X" handling, as might be used by DOCSIS or Ethernet shaping: tc class add ... htb rate X overhead 4 mpu 64 The code being fixed is used by htb, tbf and act_police. Cake has its own mpu handling. qdisc_calculate_pkt_len still uses the size table containing values adjusted for mpu by user space. iproute2 tc has always passed mpu into the kernel via a tc_ratespec structure, but the kernel never directly acted on it, merely stored it so that it could be read back by `tc class show`. Rather, tc would generate length-to-time tables that included the mpu (and linklayer) in their construction, and the kernel used those tables. Since v3.7, the tables were no longer used. Along with "mpu", this also broke "overhead" and "linklayer" which were fixed in 01cb71d2d47b ("net_sched: restore "overhead xxx" handling", v3.10) and 8a8e3d84b171 ("net_sched: restore "linklayer atm" handling", v3.11). "overhead" was fixed by simply restoring use of tc_ratespec::overhead - this had originally been used by the kernel but was initially omitted from the new non-table-based calculations. "linklayer" had been handled in the table like "mpu", but the mode was not originally passed in tc_ratespec. The new implementation was made to handle it by getting new versions of tc to pass the mode in an extended tc_ratespec, and for older versions of tc the table contents were analysed at load time to deduce linklayer. As "mpu" has always been given to the kernel in tc_ratespec, accompanying the mpu-based table, we can restore system functionality with no userspace change by making the kernel act on the tc_ratespec value. Fixes: 56b765b79e9a ("htb: improved accuracy at high rates") Signed-off-by: Kevin Bracey <kevin@bracey.fi> Cc: Eric Dumazet <edumazet@google.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: Vimalkumar <j.vimal@gmail.com> Link: https://lore.kernel.org/r/20220112170210.1014351-1-kevin@bracey.fi Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-13inet: frags: annotate races around fqdir->dead and fqdir->high_threshEric Dumazet2-3/+11
Both fields can be read/written without synchronization, add proper accessors and documentation. Fixes: d5dd88794a13 ("inet: fix various use-after-free in defrags units") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-11net: sched: do not allocate a tracker in tcf_exts_init()Eric Dumazet1-1/+3
While struct tcf_exts has a net pointer, it is not refcounted until tcf_exts_get_net() is called. Fixes: dbdcda634ce3 ("net: sched: add netns refcount tracker to struct tcf_exts") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220110094750.236478-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-2/+4
Merge in fixes directly in prep for the 5.17 merge window. No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-10net/p9: load default transportsThomas Weißschuh1-1/+1
Now that all transports are split into modules it may happen that no transports are registered when v9fs_get_default_trans() is called. When that is the case try to load more transports from modules. Link: https://lkml.kernel.org/r/20211103193823.111007-5-linux@weissschuh.net Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> [Dominique: constify v9fs_get_trans_by_name argument as per patch1v2] Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-01-109p/trans_fd: split into dedicated moduleThomas Weißschuh1-2/+0
This allows these transports only to be used when needed. Link: https://lkml.kernel.org/r/20211103193823.111007-3-linux@weissschuh.net Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> [Dominique: Kconfig NET_9P_FD: -depends VIRTIO, +default NET_9P] Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-01-09net: openvswitch: Fix ct_state nat flags for conns arriving from tcPaul Blakey1-1/+3
Netfilter conntrack maintains NAT flags per connection indicating whether NAT was configured for the connection. Openvswitch maintains NAT flags on the per packet flow key ct_state field, indicating whether NAT was actually executed on the packet. When a packet misses from tc to ovs the conntrack NAT flags are set. However, NAT was not necessarily executed on the packet because the connection's state might still be in NEW state. As such, openvswitch wrongly assumes that NAT was executed and sets an incorrect flow key NAT flags. Fix this, by flagging to openvswitch which NAT was actually done in act_ct via tc_skb_ext and tc_skb_cb to the openvswitch module, so the packet flow key NAT flags will be correctly set. Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct") Signed-off-by: Paul Blakey <paulb@nvidia.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20220106153804.26451-1-paulb@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextJakub Kicinski3-8/+49
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next. This includes one patch to update ovs and act_ct to use nf_ct_put() instead of nf_conntrack_put(). 1) Add netns_tracker to nfnetlink_log and masquerade, from Eric Dumazet. 2) Remove redundant rcu read-size lock in nf_tables packet path. 3) Replace BUG() by WARN_ON_ONCE() in nft_payload. 4) Consolidate rule verdict tracing. 5) Replace WARN_ON() by WARN_ON_ONCE() in nf_tables core. 6) Make counter support built-in in nf_tables. 7) Add new field to conntrack object to identify locally generated traffic, from Florian Westphal. 8) Prevent NAT from shadowing well-known ports, from Florian Westphal. 9) Merge nf_flow_table_{ipv4,ipv6} into nf_flow_table_inet, also from Florian. 10) Remove redundant pointer in nft_pipapo AVX2 support, from Colin Ian King. 11) Replace opencoded max() in conntrack, from Jiapeng Chong. 12) Update conntrack to use refcount_t API, from Florian Westphal. 13) Move ip_ct_attach indirection into the nf_ct_hook structure. 14) Constify several pointer object in the netfilter codebase, from Florian Westphal. 15) Tree-wide replacement of nf_conntrack_put() by nf_ct_put(), also from Florian. 16) Fix egress splat due to incorrect rcu notation, from Florian. 17) Move stateful fields of connlimit, last, quota, numgen and limit out of the expression data area. 18) Build a blob to represent the ruleset in nf_tables, this is a requirement of the new register tracking infrastructure. 19) Add NFT_REG32_NUM to define the maximum number of 32-bit registers. 20) Add register tracking infrastructure to skip redundant store-to-register operations, this includes support for payload, meta and bitwise expresssions. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next: (32 commits) netfilter: nft_meta: cancel register tracking after meta update netfilter: nft_payload: cancel register tracking after payload update netfilter: nft_bitwise: track register operations netfilter: nft_meta: track register operations netfilter: nft_payload: track register operations netfilter: nf_tables: add register tracking infrastructure netfilter: nf_tables: add NFT_REG32_NUM netfilter: nf_tables: add rule blob layout netfilter: nft_limit: move stateful fields out of expression data netfilter: nft_limit: rename stateful structure netfilter: nft_numgen: move stateful fields out of expression data netfilter: nft_quota: move stateful fields out of expression data netfilter: nft_last: move stateful fields out of expression data netfilter: nft_connlimit: move stateful fields out of expression data netfilter: egress: avoid a lockdep splat net: prefer nf_ct_put instead of nf_conntrack_put netfilter: conntrack: avoid useless indirection during conntrack destruction netfilter: make function op structures const netfilter: core: move ip_ct_attach indirection to struct nf_ct_hook netfilter: conntrack: convert to refcount_t api ... ==================== Link: https://lore.kernel.org/r/20220109231640.104123-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-09netfilter: nft_bitwise: track register operationsPablo Neira Ayuso1-0/+2
Check if the destination register already contains the data that this bitwise expression performs. This allows to skip this redundant operation. If the destination contains a different bitwise operation, cancel the register tracking information. If the destination contains no bitwise operation, update the register tracking information. Update the payload and meta expression to check if this bitwise operation has been already performed on the register. Hence, both the payload/meta and the bitwise expressions are reduced. There is also a special case: If source register != destination register and source register is not updated by a previous bitwise operation, then transfer selector from the source register to the destination register. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-09netfilter: nf_tables: add register tracking infrastructurePablo Neira Ayuso1-0/+12
This patch adds new infrastructure to skip redundant selector store operations on the same register to achieve a performance boost from the packet path. This is particularly noticeable in pure linear rulesets but it also helps in rulesets which are already heaving relying in maps to avoid ruleset linear inspection. The idea is to keep data of the most recurrent store operations on register to reuse them with cmp and lookup expressions. This infrastructure allows for dynamic ruleset updates since the ruleset blob reduction happens from the kernel. Userspace still needs to be updated to maximize register utilization to cooperate to improve register data reuse / reduce number of store on register operations. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-09netfilter: nf_tables: add NFT_REG32_NUMPablo Neira Ayuso1-1/+3
Add a definition including the maximum number of 32-bits registers that are used a scratchpad memory area to store data. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-09netfilter: nf_tables: add rule blob layoutPablo Neira Ayuso1-4/+18
This patch adds a blob layout per chain to represent the ruleset in the packet datapath. size (unsigned long) struct nft_rule_dp struct nft_expr ... struct nft_rule_dp struct nft_expr ... struct nft_rule_dp (is_last=1) The new structure nft_rule_dp represents the rule in a more compact way (smaller memory footprint) compared to the control-plane nft_rule structure. The ruleset blob is a read-only data structure. The first field contains the blob size, then the rules containing expressions. There is a trailing rule which is used by the tracing infrastructure which is equivalent to the NULL rule marker in the previous representation. The blob size field does not include the size of this trailing rule marker. The ruleset blob is generated from the commit path. This patch reuses the infrastructure available since 0cbc06b3faba ("netfilter: nf_tables: remove synchronize_rcu in commit phase") to build the array of rules per chain. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-09netfilter: conntrack: avoid useless indirection during conntrack destructionFlorian Westphal1-2/+6
nf_ct_put() results in a usesless indirection: nf_ct_put -> nf_conntrack_put -> nf_conntrack_destroy -> rcu readlock + indirect call of ct_hooks->destroy(). There are two _put helpers: nf_ct_put and nf_conntrack_put. The latter is what should be used in code that MUST NOT cause a linker dependency on the conntrack module (e.g. calls from core network stack). Everyone else should call nf_ct_put() instead. A followup patch will convert a few nf_conntrack_put() calls to nf_ct_put(), in particular from modules that already have a conntrack dependency such as act_ct or even nf_conntrack itself. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-09netfilter: conntrack: Use max() instead of doing it manuallyJiapeng Chong1-1/+1
Fix following coccicheck warning: ./include/net/netfilter/nf_conntrack.h:282:16-17: WARNING opportunity for max(). Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-09Merge tag 'for-net-next-2022-01-07' of ↵Jakub Kicinski1-5/+1
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - Add support for Foxconn QCA 0xe0d0 - Fix HCI init sequence on MacBook Air 8,1 and 8,2 - Fix Intel firmware loading on legacy ROM devices * tag 'for-net-next-2022-01-07' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: Bluetooth: hci_sock: fix endian bug in hci_sock_setsockopt() Bluetooth: L2CAP: uninitialized variables in l2cap_sock_setsockopt() Bluetooth: btqca: sequential validation Bluetooth: btusb: Add support for Foxconn QCA 0xe0d0 Bluetooth: btintel: Fix broken LED quirk for legacy ROM devices Bluetooth: hci_event: Rework hci_inquiry_result_with_rssi_evt Bluetooth: btbcm: disable read tx power for MacBook Air 8,1 and 8,2 Bluetooth: hci_qca: Fix NULL vs IS_ERR_OR_NULL check in qca_serdev_probe Bluetooth: hci_bcm: Check for error irq ==================== Link: https://lore.kernel.org/r/20220107210942.3750887-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski3-2/+13
Alexei Starovoitov says: ==================== pull-request: bpf-next 2022-01-06 We've added 41 non-merge commits during the last 2 day(s) which contain a total of 36 files changed, 1214 insertions(+), 368 deletions(-). The main changes are: 1) Various fixes in the verifier, from Kris and Daniel. 2) Fixes in sockmap, from John. 3) bpf_getsockopt fix, from Kuniyuki. 4) INET_POST_BIND fix, from Menglong. 5) arm64 JIT fix for bpf pseudo funcs, from Hou. 6) BPF ISA doc improvements, from Christoph. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits) bpf: selftests: Add bind retry for post_bind{4, 6} bpf: selftests: Use C99 initializers in test_sock.c net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() bpf/selftests: Test bpf_d_path on rdonly_mem. libbpf: Add documentation for bpf_map batch operations selftests/bpf: Don't rely on preserving volatile in PT_REGS macros in loop3 xdp: Add xdp_do_redirect_frame() for pre-computed xdp_frames xdp: Move conversion to xdp_frame out of map functions page_pool: Store the XDP mem id page_pool: Add callback to init pages when they are allocated xdp: Allow registering memory model without rxq reference samples/bpf: xdpsock: Add timestamp for Tx-only operation samples/bpf: xdpsock: Add time-out for cleaning Tx samples/bpf: xdpsock: Add sched policy and priority support samples/bpf: xdpsock: Add cyclic TX operation capability samples/bpf: xdpsock: Add clockid selection support samples/bpf: xdpsock: Add Dest and Src MAC setting for Tx-only operation samples/bpf: xdpsock: Add VLAN support for Tx-only operation libbpf 1.0: Deprecate bpf_object__find_map_by_offset() API libbpf 1.0: Deprecate bpf_map__is_offload_neutral() ... ==================== Link: https://lore.kernel.org/r/20220107013626.53943-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()Menglong Dong1-0/+1
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in __inet_bind() is not handled properly. While the return value is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and exit: err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk); if (err) { inet->inet_saddr = inet->inet_rcv_saddr = 0; goto out_release_sock; } Let's take UDP for example and see what will happen. For UDP socket, it will be added to 'udp_prot.h.udp_table->hash' and 'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port() called success. If 'inet->inet_rcv_saddr' is specified here, then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong to (because inet_saddr is changed to 0), and UDP packet received will not be passed to this sock. If 'inet->inet_rcv_saddr' is not specified here, the sock will work fine, as it can receive packet properly, which is wired, as the 'bind()' is already failed. To undo the get_port() operation, introduce the 'put_port' field for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP proto, it is udp_lib_unhash(); For icmp proto, it is ping_unhash(). Therefore, after sys_bind() fail caused by BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which means that it can try to be binded to another port. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
2022-01-06Bluetooth: hci_event: Rework hci_inquiry_result_with_rssi_evtLuiz Augusto von Dentz1-5/+1
This rework the handling of hci_inquiry_result_with_rssi_evt to not use a union to represent the different inquiry responses. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Tested-by: Soenke Huster <soenke.huster@eknoes.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2022-01-06net: dsa: warn about dsa_port and dsa_switch bit fields being non atomicVladimir Oltean1-0/+8
As discussed during review here: https://patchwork.kernel.org/project/netdevbpf/patch/20220105132141.2648876-3-vladimir.oltean@nxp.com/ we should inform developers about pitfalls of concurrent access to the boolean properties of dsa_switch and dsa_port, now that they've been converted to bit fields. No other measure than a comment needs to be taken, since the code paths that update these bit fields are not concurrent with each other. Suggested-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: don't enumerate dsa_switch and dsa_port bit fields using commasVladimir Oltean1-58/+56
This is a cosmetic incremental fixup to commits 7787ff776398 ("net: dsa: merge all bools of struct dsa_switch into a single u32") bde82f389af1 ("net: dsa: merge all bools of struct dsa_port into a single u8") The desire to make this change was enunciated after posting these patches here: https://patchwork.kernel.org/project/netdevbpf/cover/20220105132141.2648876-1-vladimir.oltean@nxp.com/ but due to a slight timing overlap (message posted at 2:28 p.m. UTC, merge commit is at 2:46 p.m. UTC), that comment was missed and the changes were applied as-is. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06Merge branch 'master' of ↵David S. Miller1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2022-01-06 1) Fix xfrm policy lookups for ipv6 gre packets by initializing fl6_gre_key properly. From Ghalem Boudour. 2) Fix the dflt policy check on forwarding when there is no policy configured. The check was done for the wrong direction. From Nicolas Dichtel. 3) Use the correct 'struct xfrm_user_offload' when calculating netlink message lenghts in xfrm_sa_len(). From Eric Dumazet. 4) Tread inserting xfrm interface id 0 as an error. From Antony Antony. 5) Fail if xfrm state or policy is inserted with XFRMA_IF_ID 0, xfrm interfaces with id 0 are not allowed. From Antony Antony. 6) Fix inner_ipproto setting in the sec_path for tunnel mode. From Raed Salem. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06Merge branch 'master' of ↵David S. Miller1-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2022-01-06 1) Fix some clang_analyzer warnings about never read variables. From luo penghao. 2) Check for pols[0] only once in xfrm_expand_policies(). From Jean Sacren. 3) The SA curlft.use_time was updated only on SA cration time. Update whenever the SA is used. From Antony Antony 4) Add support for SM3 secure hash. From Xu Jia. 5) Add support for SM4 symmetric cipher algorithm. From Xu Jia. 6) Add a rate limit for SA mapping change messages. From Antony Antony. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05page_pool: Store the XDP mem idToke Høiland-Jørgensen1-2/+7
Store the XDP mem ID inside the page_pool struct so it can be retrieved later for use in bpf_prog_run(). Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20220103150812.87914-4-toke@redhat.com
2022-01-05page_pool: Add callback to init pages when they are allocatedToke Høiland-Jørgensen1-0/+2
Add a new callback function to page_pool that, if set, will be called every time a new page is allocated. This will be used from bpf_test_run() to initialise the page data with the data provided by userspace when running XDP programs with redirect turned on. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20220103150812.87914-3-toke@redhat.com
2022-01-05xdp: Allow registering memory model without rxq referenceToke Høiland-Jørgensen1-0/+3
The functions that register an XDP memory model take a struct xdp_rxq as parameter, but the RXQ is not actually used for anything other than pulling out the struct xdp_mem_info that it embeds. So refactor the register functions and export variants that just take a pointer to the xdp_mem_info. This is in preparation for enabling XDP_REDIRECT in bpf_prog_run(), using a page_pool instance that is not connected to any network device. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220103150812.87914-2-toke@redhat.com
2022-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-2/+22
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-05net: dsa: combine two holes in struct dsa_switch_treeVladimir Oltean1-3/+3
There is a 7 byte hole after dst->setup and a 4 byte hole after dst->default_proto. Combining them, we have a single hole of just 3 bytes on 64 bit machines. Before: pahole -C dsa_switch_tree net/dsa/slave.o struct dsa_switch_tree { struct list_head list; /* 0 16 */ struct list_head ports; /* 16 16 */ struct raw_notifier_head nh; /* 32 8 */ unsigned int index; /* 40 4 */ struct kref refcount; /* 44 4 */ struct net_device * * lags; /* 48 8 */ bool setup; /* 56 1 */ /* XXX 7 bytes hole, try to pack */ /* --- cacheline 1 boundary (64 bytes) --- */ const struct dsa_device_ops * tag_ops; /* 64 8 */ enum dsa_tag_protocol default_proto; /* 72 4 */ /* XXX 4 bytes hole, try to pack */ struct dsa_platform_data * pd; /* 80 8 */ struct list_head rtable; /* 88 16 */ unsigned int lags_len; /* 104 4 */ unsigned int last_switch; /* 108 4 */ /* size: 112, cachelines: 2, members: 13 */ /* sum members: 101, holes: 2, sum holes: 11 */ /* last cacheline: 48 bytes */ }; After: pahole -C dsa_switch_tree net/dsa/slave.o struct dsa_switch_tree { struct list_head list; /* 0 16 */ struct list_head ports; /* 16 16 */ struct raw_notifier_head nh; /* 32 8 */ unsigned int index; /* 40 4 */ struct kref refcount; /* 44 4 */ struct net_device * * lags; /* 48 8 */ const struct dsa_device_ops * tag_ops; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ enum dsa_tag_protocol default_proto; /* 64 4 */ bool setup; /* 68 1 */ /* XXX 3 bytes hole, try to pack */ struct dsa_platform_data * pd; /* 72 8 */ struct list_head rtable; /* 80 16 */ unsigned int lags_len; /* 96 4 */ unsigned int last_switch; /* 100 4 */ /* size: 104, cachelines: 2, members: 13 */ /* sum members: 101, holes: 1, sum holes: 3 */ /* last cacheline: 40 bytes */ }; Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05net: dsa: move dsa_switch_tree :: ports and lags to first cache lineVladimir Oltean1-7/+9
dst->ports is accessed most notably by dsa_master_find_slave(), which is invoked in the RX path. dst->lags is accessed by dsa_lag_dev(), which is invoked in the RX path of tag_dsa.c. dst->tag_ops, dst->default_proto and dst->pd don't need to be in the first cache line, so they are moved out by this change. Before: pahole -C dsa_switch_tree net/dsa/slave.o struct dsa_switch_tree { struct list_head list; /* 0 16 */ struct raw_notifier_head nh; /* 16 8 */ unsigned int index; /* 24 4 */ struct kref refcount; /* 28 4 */ bool setup; /* 32 1 */ /* XXX 7 bytes hole, try to pack */ const struct dsa_device_ops * tag_ops; /* 40 8 */ enum dsa_tag_protocol default_proto; /* 48 4 */ /* XXX 4 bytes hole, try to pack */ struct dsa_platform_data * pd; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct list_head ports; /* 64 16 */ struct list_head rtable; /* 80 16 */ struct net_device * * lags; /* 96 8 */ unsigned int lags_len; /* 104 4 */ unsigned int last_switch; /* 108 4 */ /* size: 112, cachelines: 2, members: 13 */ /* sum members: 101, holes: 2, sum holes: 11 */ /* last cacheline: 48 bytes */ }; After: pahole -C dsa_switch_tree net/dsa/slave.o struct dsa_switch_tree { struct list_head list; /* 0 16 */ struct list_head ports; /* 16 16 */ struct raw_notifier_head nh; /* 32 8 */ unsigned int index; /* 40 4 */ struct kref refcount; /* 44 4 */ struct net_device * * lags; /* 48 8 */ bool setup; /* 56 1 */ /* XXX 7 bytes hole, try to pack */ /* --- cacheline 1 boundary (64 bytes) --- */ const struct dsa_device_ops * tag_ops; /* 64 8 */ enum dsa_tag_protocol default_proto; /* 72 4 */ /* XXX 4 bytes hole, try to pack */ struct dsa_platform_data * pd; /* 80 8 */ struct list_head rtable; /* 88 16 */ unsigned int lags_len; /* 104 4 */ unsigned int last_switch; /* 108 4 */ /* size: 112, cachelines: 2, members: 13 */ /* sum members: 101, holes: 2, sum holes: 11 */ /* last cacheline: 48 bytes */ }; Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05net: dsa: make dsa_switch :: num_ports an unsigned intVladimir Oltean1-1/+1
Currently, num_ports is declared as size_t, which is defined as __kernel_ulong_t, therefore it occupies 8 bytes of memory. Even switches with port numbers in the range of tens are exotic, so there is no need for this amount of storage. Additionally, because the max_num_bridges member right above it is also 4 bytes, it means the compiler needs to add padding between the last 2 fields. By reducing the size, we don't need that padding and can reduce the struct size. Before: pahole -C dsa_switch net/dsa/slave.o struct dsa_switch { struct device * dev; /* 0 8 */ struct dsa_switch_tree * dst; /* 8 8 */ unsigned int index; /* 16 4 */ u32 setup:1; /* 20: 0 4 */ u32 vlan_filtering_is_global:1; /* 20: 1 4 */ u32 needs_standalone_vlan_filtering:1; /* 20: 2 4 */ u32 configure_vlan_while_not_filtering:1; /* 20: 3 4 */ u32 untag_bridge_pvid:1; /* 20: 4 4 */ u32 assisted_learning_on_cpu_port:1; /* 20: 5 4 */ u32 vlan_filtering:1; /* 20: 6 4 */ u32 pcs_poll:1; /* 20: 7 4 */ u32 mtu_enforcement_ingress:1; /* 20: 8 4 */ /* XXX 23 bits hole, try to pack */ struct notifier_block nb; /* 24 24 */ /* XXX last struct has 4 bytes of padding */ void * priv; /* 48 8 */ void * tagger_data; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct dsa_chip_data * cd; /* 64 8 */ const struct dsa_switch_ops * ops; /* 72 8 */ u32 phys_mii_mask; /* 80 4 */ /* XXX 4 bytes hole, try to pack */ struct mii_bus * slave_mii_bus; /* 88 8 */ unsigned int ageing_time_min; /* 96 4 */ unsigned int ageing_time_max; /* 100 4 */ struct dsa_8021q_context * tag_8021q_ctx; /* 104 8 */ struct devlink * devlink; /* 112 8 */ unsigned int num_tx_queues; /* 120 4 */ unsigned int num_lag_ids; /* 124 4 */ /* --- cacheline 2 boundary (128 bytes) --- */ unsigned int max_num_bridges; /* 128 4 */ /* XXX 4 bytes hole, try to pack */ size_t num_ports; /* 136 8 */ /* size: 144, cachelines: 3, members: 27 */ /* sum members: 132, holes: 2, sum holes: 8 */ /* sum bitfield members: 9 bits, bit holes: 1, sum bit holes: 23 bits */ /* paddings: 1, sum paddings: 4 */ /* last cacheline: 16 bytes */ }; After: pahole -C dsa_switch net/dsa/slave.o struct dsa_switch { struct device * dev; /* 0 8 */ struct dsa_switch_tree * dst; /* 8 8 */ unsigned int index; /* 16 4 */ u32 setup:1; /* 20: 0 4 */ u32 vlan_filtering_is_global:1; /* 20: 1 4 */ u32 needs_standalone_vlan_filtering:1; /* 20: 2 4 */ u32 configure_vlan_while_not_filtering:1; /* 20: 3 4 */ u32 untag_bridge_pvid:1; /* 20: 4 4 */ u32 assisted_learning_on_cpu_port:1; /* 20: 5 4 */ u32 vlan_filtering:1; /* 20: 6 4 */ u32 pcs_poll:1; /* 20: 7 4 */ u32 mtu_enforcement_ingress:1; /* 20: 8 4 */ /* XXX 23 bits hole, try to pack */ struct notifier_block nb; /* 24 24 */ /* XXX last struct has 4 bytes of padding */ void * priv; /* 48 8 */ void * tagger_data; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct dsa_chip_data * cd; /* 64 8 */ const struct dsa_switch_ops * ops; /* 72 8 */ u32 phys_mii_mask; /* 80 4 */ /* XXX 4 bytes hole, try to pack */ struct mii_bus * slave_mii_bus; /* 88 8 */ unsigned int ageing_time_min; /* 96 4 */ unsigned int ageing_time_max; /* 100 4 */ struct dsa_8021q_context * tag_8021q_ctx; /* 104 8 */ struct devlink * devlink; /* 112 8 */ unsigned int num_tx_queues; /* 120 4 */ unsigned int num_lag_ids; /* 124 4 */ /* --- cacheline 2 boundary (128 bytes) --- */ unsigned int max_num_bridges; /* 128 4 */ unsigned int num_ports; /* 132 4 */ /* size: 136, cachelines: 3, members: 27 */ /* sum members: 128, holes: 1, sum holes: 4 */ /* sum bitfield members: 9 bits, bit holes: 1, sum bit holes: 23 bits */ /* paddings: 1, sum paddings: 4 */ /* last cacheline: 8 bytes */ }; Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05net: dsa: merge all bools of struct dsa_switch into a single u32Vladimir Oltean1-46/+51
struct dsa_switch has 9 boolean properties, many of which are in fact set by drivers for custom behavior (vlan_filtering_is_global, needs_standalone_vlan_filtering, etc etc). The binary layout of the structure could be improved. For example, the "bool setup" at the beginning introduces a gratuitous 7 byte hole in the first cache line. The change merges all boolean properties into bitfields of an u32, and places that u32 in the first cache line of the structure, since many bools are accessed from the data path (untag_bridge_pvid, vlan_filtering, vlan_filtering_is_global). We place this u32 after the existing ds->index, which is also 4 bytes in size. As a positive side effect, ds->tagger_data now fits into the first cache line too, because 4 bytes are saved. Before: pahole -C dsa_switch net/dsa/slave.o struct dsa_switch { bool setup; /* 0 1 */ /* XXX 7 bytes hole, try to pack */ struct device * dev; /* 8 8 */ struct dsa_switch_tree * dst; /* 16 8 */ unsigned int index; /* 24 4 */ /* XXX 4 bytes hole, try to pack */ struct notifier_block nb; /* 32 24 */ /* XXX last struct has 4 bytes of padding */ void * priv; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ void * tagger_data; /* 64 8 */ struct dsa_chip_data * cd; /* 72 8 */ const struct dsa_switch_ops * ops; /* 80 8 */ u32 phys_mii_mask; /* 88 4 */ /* XXX 4 bytes hole, try to pack */ struct mii_bus * slave_mii_bus; /* 96 8 */ unsigned int ageing_time_min; /* 104 4 */ unsigned int ageing_time_max; /* 108 4 */ struct dsa_8021q_context * tag_8021q_ctx; /* 112 8 */ struct devlink * devlink; /* 120 8 */ /* --- cacheline 2 boundary (128 bytes) --- */ unsigned int num_tx_queues; /* 128 4 */ bool vlan_filtering_is_global; /* 132 1 */ bool needs_standalone_vlan_filtering; /* 133 1 */ bool configure_vlan_while_not_filtering; /* 134 1 */ bool untag_bridge_pvid; /* 135 1 */ bool assisted_learning_on_cpu_port; /* 136 1 */ bool vlan_filtering; /* 137 1 */ bool pcs_poll; /* 138 1 */ bool mtu_enforcement_ingress; /* 139 1 */ unsigned int num_lag_ids; /* 140 4 */ unsigned int max_num_bridges; /* 144 4 */ /* XXX 4 bytes hole, try to pack */ size_t num_ports; /* 152 8 */ /* size: 160, cachelines: 3, members: 27 */ /* sum members: 141, holes: 4, sum holes: 19 */ /* paddings: 1, sum paddings: 4 */ /* last cacheline: 32 bytes */ }; After: pahole -C dsa_switch net/dsa/slave.o struct dsa_switch { struct device * dev; /* 0 8 */ struct dsa_switch_tree * dst; /* 8 8 */ unsigned int index; /* 16 4 */ u32 setup:1; /* 20: 0 4 */ u32 vlan_filtering_is_global:1; /* 20: 1 4 */ u32 needs_standalone_vlan_filtering:1; /* 20: 2 4 */ u32 configure_vlan_while_not_filtering:1; /* 20: 3 4 */ u32 untag_bridge_pvid:1; /* 20: 4 4 */ u32 assisted_learning_on_cpu_port:1; /* 20: 5 4 */ u32 vlan_filtering:1; /* 20: 6 4 */ u32 pcs_poll:1; /* 20: 7 4 */ u32 mtu_enforcement_ingress:1; /* 20: 8 4 */ /* XXX 23 bits hole, try to pack */ struct notifier_block nb; /* 24 24 */ /* XXX last struct has 4 bytes of padding */ void * priv; /* 48 8 */ void * tagger_data; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct dsa_chip_data * cd; /* 64 8 */ const struct dsa_switch_ops * ops; /* 72 8 */ u32 phys_mii_mask; /* 80 4 */ /* XXX 4 bytes hole, try to pack */ struct mii_bus * slave_mii_bus; /* 88 8 */ unsigned int ageing_time_min; /* 96 4 */ unsigned int ageing_time_max; /* 100 4 */ struct dsa_8021q_context * tag_8021q_ctx; /* 104 8 */ struct devlink * devlink; /* 112 8 */ unsigned int num_tx_queues; /* 120 4 */ unsigned int num_lag_ids; /* 124 4 */ /* --- cacheline 2 boundary (128 bytes) --- */ unsigned int max_num_bridges; /* 128 4 */ /* XXX 4 bytes hole, try to pack */ size_t num_ports; /* 136 8 */ /* size: 144, cachelines: 3, members: 27 */ /* sum members: 132, holes: 2, sum holes: 8 */ /* sum bitfield members: 9 bits, bit holes: 1, sum bit holes: 23 bits */ /* paddings: 1, sum paddings: 4 */ /* last cacheline: 16 bytes */ }; Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05net: dsa: move dsa_port :: type near dsa_port :: indexVladimir Oltean1-2/+4
Both dsa_port :: type and dsa_port :: index introduce a 4 octet hole after them, so we can group them together and the holes would be eliminated, turning 16 octets of storage into just 8. This makes the cpu_dp pointer fit in the first cache line, which is good, because dsa_slave_to_master(), called by dsa_enqueue_skb(), uses it. Before: pahole -C dsa_port net/dsa/slave.o struct dsa_port { union { struct net_device * master; /* 0 8 */ struct net_device * slave; /* 0 8 */ }; /* 0 8 */ const struct dsa_device_ops * tag_ops; /* 8 8 */ struct dsa_switch_tree * dst; /* 16 8 */ struct sk_buff * (*rcv)(struct sk_buff *, struct net_device *); /* 24 8 */ enum { DSA_PORT_TYPE_UNUSED = 0, DSA_PORT_TYPE_CPU = 1, DSA_PORT_TYPE_DSA = 2, DSA_PORT_TYPE_USER = 3, } type; /* 32 4 */ /* XXX 4 bytes hole, try to pack */ struct dsa_switch * ds; /* 40 8 */ unsigned int index; /* 48 4 */ /* XXX 4 bytes hole, try to pack */ const char * name; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct dsa_port * cpu_dp; /* 64 8 */ u8 mac[6]; /* 72 6 */ u8 stp_state; /* 78 1 */ u8 vlan_filtering:1; /* 79: 0 1 */ u8 learning:1; /* 79: 1 1 */ u8 lag_tx_enabled:1; /* 79: 2 1 */ u8 devlink_port_setup:1; /* 79: 3 1 */ u8 setup:1; /* 79: 4 1 */ /* XXX 3 bits hole, try to pack */ struct device_node * dn; /* 80 8 */ unsigned int ageing_time; /* 88 4 */ /* XXX 4 bytes hole, try to pack */ struct dsa_bridge * bridge; /* 96 8 */ struct devlink_port devlink_port; /* 104 288 */ /* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */ struct phylink * pl; /* 392 8 */ struct phylink_config pl_config; /* 400 40 */ struct net_device * lag_dev; /* 440 8 */ /* --- cacheline 7 boundary (448 bytes) --- */ struct net_device * hsr_dev; /* 448 8 */ struct list_head list; /* 456 16 */ const struct ethtool_ops * orig_ethtool_ops; /* 472 8 */ const struct dsa_netdevice_ops * netdev_ops; /* 480 8 */ struct mutex addr_lists_lock; /* 488 32 */ /* --- cacheline 8 boundary (512 bytes) was 8 bytes ago --- */ struct list_head fdbs; /* 520 16 */ struct list_head mdbs; /* 536 16 */ /* size: 552, cachelines: 9, members: 30 */ /* sum members: 539, holes: 3, sum holes: 12 */ /* sum bitfield members: 5 bits, bit holes: 1, sum bit holes: 3 bits */ /* last cacheline: 40 bytes */ }; After: pahole -C dsa_port net/dsa/slave.o struct dsa_port { union { struct net_device * master; /* 0 8 */ struct net_device * slave; /* 0 8 */ }; /* 0 8 */ const struct dsa_device_ops * tag_ops; /* 8 8 */ struct dsa_switch_tree * dst; /* 16 8 */ struct sk_buff * (*rcv)(struct sk_buff *, struct net_device *); /* 24 8 */ struct dsa_switch * ds; /* 32 8 */ unsigned int index; /* 40 4 */ enum { DSA_PORT_TYPE_UNUSED = 0, DSA_PORT_TYPE_CPU = 1, DSA_PORT_TYPE_DSA = 2, DSA_PORT_TYPE_USER = 3, } type; /* 44 4 */ const char * name; /* 48 8 */ struct dsa_port * cpu_dp; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ u8 mac[6]; /* 64 6 */ u8 stp_state; /* 70 1 */ u8 vlan_filtering:1; /* 71: 0 1 */ u8 learning:1; /* 71: 1 1 */ u8 lag_tx_enabled:1; /* 71: 2 1 */ u8 devlink_port_setup:1; /* 71: 3 1 */ u8 setup:1; /* 71: 4 1 */ /* XXX 3 bits hole, try to pack */ struct device_node * dn; /* 72 8 */ unsigned int ageing_time; /* 80 4 */ /* XXX 4 bytes hole, try to pack */ struct dsa_bridge * bridge; /* 88 8 */ struct devlink_port devlink_port; /* 96 288 */ /* --- cacheline 6 boundary (384 bytes) --- */ struct phylink * pl; /* 384 8 */ struct phylink_config pl_config; /* 392 40 */ struct net_device * lag_dev; /* 432 8 */ struct net_device * hsr_dev; /* 440 8 */ /* --- cacheline 7 boundary (448 bytes) --- */ struct list_head list; /* 448 16 */ const struct ethtool_ops * orig_ethtool_ops; /* 464 8 */ const struct dsa_netdevice_ops * netdev_ops; /* 472 8 */ struct mutex addr_lists_lock; /* 480 32 */ /* --- cacheline 8 boundary (512 bytes) --- */ struct list_head fdbs; /* 512 16 */ struct list_head mdbs; /* 528 16 */ /* size: 544, cachelines: 9, members: 30 */ /* sum members: 539, holes: 1, sum holes: 4 */ /* sum bitfield members: 5 bits, bit holes: 1, sum bit holes: 3 bits */ /* last cacheline: 32 bytes */ }; Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05net: dsa: merge all bools of struct dsa_port into a single u8Vladimir Oltean1-7/+10
struct dsa_port has 5 bool members which create quite a number of 7 byte holes in the structure layout. By merging them all into bitfields of an u8, and placing that u8 in the 1-byte hole after dp->mac and dp->stp_state, we can reduce the structure size from 576 bytes to 552 bytes on arm64. Before: pahole -C dsa_port net/dsa/slave.o struct dsa_port { union { struct net_device * master; /* 0 8 */ struct net_device * slave; /* 0 8 */ }; /* 0 8 */ const struct dsa_device_ops * tag_ops; /* 8 8 */ struct dsa_switch_tree * dst; /* 16 8 */ struct sk_buff * (*rcv)(struct sk_buff *, struct net_device *); /* 24 8 */ enum { DSA_PORT_TYPE_UNUSED = 0, DSA_PORT_TYPE_CPU = 1, DSA_PORT_TYPE_DSA = 2, DSA_PORT_TYPE_USER = 3, } type; /* 32 4 */ /* XXX 4 bytes hole, try to pack */ struct dsa_switch * ds; /* 40 8 */ unsigned int index; /* 48 4 */ /* XXX 4 bytes hole, try to pack */ const char * name; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct dsa_port * cpu_dp; /* 64 8 */ u8 mac[6]; /* 72 6 */ u8 stp_state; /* 78 1 */ /* XXX 1 byte hole, try to pack */ struct device_node * dn; /* 80 8 */ unsigned int ageing_time; /* 88 4 */ bool vlan_filtering; /* 92 1 */ bool learning; /* 93 1 */ /* XXX 2 bytes hole, try to pack */ struct dsa_bridge * bridge; /* 96 8 */ struct devlink_port devlink_port; /* 104 288 */ /* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */ bool devlink_port_setup; /* 392 1 */ /* XXX 7 bytes hole, try to pack */ struct phylink * pl; /* 400 8 */ struct phylink_config pl_config; /* 408 40 */ /* --- cacheline 7 boundary (448 bytes) --- */ struct net_device * lag_dev; /* 448 8 */ bool lag_tx_enabled; /* 456 1 */ /* XXX 7 bytes hole, try to pack */ struct net_device * hsr_dev; /* 464 8 */ struct list_head list; /* 472 16 */ const struct ethtool_ops * orig_ethtool_ops; /* 488 8 */ const struct dsa_netdevice_ops * netdev_ops; /* 496 8 */ struct mutex addr_lists_lock; /* 504 32 */ /* --- cacheline 8 boundary (512 bytes) was 24 bytes ago --- */ struct list_head fdbs; /* 536 16 */ struct list_head mdbs; /* 552 16 */ bool setup; /* 568 1 */ /* size: 576, cachelines: 9, members: 30 */ /* sum members: 544, holes: 6, sum holes: 25 */ /* padding: 7 */ }; After: pahole -C dsa_port net/dsa/slave.o struct dsa_port { union { struct net_device * master; /* 0 8 */ struct net_device * slave; /* 0 8 */ }; /* 0 8 */ const struct dsa_device_ops * tag_ops; /* 8 8 */ struct dsa_switch_tree * dst; /* 16 8 */ struct sk_buff * (*rcv)(struct sk_buff *, struct net_device *); /* 24 8 */ enum { DSA_PORT_TYPE_UNUSED = 0, DSA_PORT_TYPE_CPU = 1, DSA_PORT_TYPE_DSA = 2, DSA_PORT_TYPE_USER = 3, } type; /* 32 4 */ /* XXX 4 bytes hole, try to pack */ struct dsa_switch * ds; /* 40 8 */ unsigned int index; /* 48 4 */ /* XXX 4 bytes hole, try to pack */ const char * name; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct dsa_port * cpu_dp; /* 64 8 */ u8 mac[6]; /* 72 6 */ u8 stp_state; /* 78 1 */ u8 vlan_filtering:1; /* 79: 0 1 */ u8 learning:1; /* 79: 1 1 */ u8 lag_tx_enabled:1; /* 79: 2 1 */ u8 devlink_port_setup:1; /* 79: 3 1 */ u8 setup:1; /* 79: 4 1 */ /* XXX 3 bits hole, try to pack */ struct device_node * dn; /* 80 8 */ unsigned int ageing_time; /* 88 4 */ /* XXX 4 bytes hole, try to pack */ struct dsa_bridge * bridge; /* 96 8 */ struct devlink_port devlink_port; /* 104 288 */ /* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */ struct phylink * pl; /* 392 8 */ struct phylink_config pl_config; /* 400 40 */ struct net_device * lag_dev; /* 440 8 */ /* --- cacheline 7 boundary (448 bytes) --- */ struct net_device * hsr_dev; /* 448 8 */ struct list_head list; /* 456 16 */ const struct ethtool_ops * orig_ethtool_ops; /* 472 8 */ const struct dsa_netdevice_ops * netdev_ops; /* 480 8 */ struct mutex addr_lists_lock; /* 488 32 */ /* --- cacheline 8 boundary (512 bytes) was 8 bytes ago --- */ struct list_head fdbs; /* 520 16 */ struct list_head mdbs; /* 536 16 */ /* size: 552, cachelines: 9, members: 30 */ /* sum members: 539, holes: 3, sum holes: 12 */ /* sum bitfield members: 5 bits, bit holes: 1, sum bit holes: 3 bits */ /* last cacheline: 40 bytes */ }; Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05net: dsa: move dsa_port :: stp_state near dsa_port :: macVladimir Oltean1-1/+3
The MAC address of a port is 6 octets in size, and this creates a 2 octet hole after it. There are some other u8 members of struct dsa_port that we can put in that hole. One such member is the stp_state. Before: pahole -C dsa_port net/dsa/slave.o struct dsa_port { union { struct net_device * master; /* 0 8 */ struct net_device * slave; /* 0 8 */ }; /* 0 8 */ const struct dsa_device_ops * tag_ops; /* 8 8 */ struct dsa_switch_tree * dst; /* 16 8 */ struct sk_buff * (*rcv)(struct sk_buff *, struct net_device *); /* 24 8 */ enum { DSA_PORT_TYPE_UNUSED = 0, DSA_PORT_TYPE_CPU = 1, DSA_PORT_TYPE_DSA = 2, DSA_PORT_TYPE_USER = 3, } type; /* 32 4 */ /* XXX 4 bytes hole, try to pack */ struct dsa_switch * ds; /* 40 8 */ unsigned int index; /* 48 4 */ /* XXX 4 bytes hole, try to pack */ const char * name; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct dsa_port * cpu_dp; /* 64 8 */ u8 mac[6]; /* 72 6 */ /* XXX 2 bytes hole, try to pack */ struct device_node * dn; /* 80 8 */ unsigned int ageing_time; /* 88 4 */ bool vlan_filtering; /* 92 1 */ bool learning; /* 93 1 */ u8 stp_state; /* 94 1 */ /* XXX 1 byte hole, try to pack */ struct dsa_bridge * bridge; /* 96 8 */ struct devlink_port devlink_port; /* 104 288 */ /* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */ bool devlink_port_setup; /* 392 1 */ /* XXX 7 bytes hole, try to pack */ struct phylink * pl; /* 400 8 */ struct phylink_config pl_config; /* 408 40 */ /* --- cacheline 7 boundary (448 bytes) --- */ struct net_device * lag_dev; /* 448 8 */ bool lag_tx_enabled; /* 456 1 */ /* XXX 7 bytes hole, try to pack */ struct net_device * hsr_dev; /* 464 8 */ struct list_head list; /* 472 16 */ const struct ethtool_ops * orig_ethtool_ops; /* 488 8 */ const struct dsa_netdevice_ops * netdev_ops; /* 496 8 */ struct mutex addr_lists_lock; /* 504 32 */ /* --- cacheline 8 boundary (512 bytes) was 24 bytes ago --- */ struct list_head fdbs; /* 536 16 */ struct list_head mdbs; /* 552 16 */ bool setup; /* 568 1 */ /* size: 576, cachelines: 9, members: 30 */ /* sum members: 544, holes: 6, sum holes: 25 */ /* padding: 7 */ }; After: pahole -C dsa_port net/dsa/slave.o struct dsa_port { union { struct net_device * master; /* 0 8 */ struct net_device * slave; /* 0 8 */ }; /* 0 8 */ const struct dsa_device_ops * tag_ops; /* 8 8 */ struct dsa_switch_tree * dst; /* 16 8 */ struct sk_buff * (*rcv)(struct sk_buff *, struct net_device *); /* 24 8 */ enum { DSA_PORT_TYPE_UNUSED = 0, DSA_PORT_TYPE_CPU = 1, DSA_PORT_TYPE_DSA = 2, DSA_PORT_TYPE_USER = 3, } type; /* 32 4 */ /* XXX 4 bytes hole, try to pack */ struct dsa_switch * ds; /* 40 8 */ unsigned int index; /* 48 4 */ /* XXX 4 bytes hole, try to pack */ const char * name; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct dsa_port * cpu_dp; /* 64 8 */ u8 mac[6]; /* 72 6 */ u8 stp_state; /* 78 1 */ /* XXX 1 byte hole, try to pack */ struct device_node * dn; /* 80 8 */ unsigned int ageing_time; /* 88 4 */ bool vlan_filtering; /* 92 1 */ bool learning; /* 93 1 */ /* XXX 2 bytes hole, try to pack */ struct dsa_bridge * bridge; /* 96 8 */ struct devlink_port devlink_port; /* 104 288 */ /* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */ bool devlink_port_setup; /* 392 1 */ /* XXX 7 bytes hole, try to pack */ struct phylink * pl; /* 400 8 */ struct phylink_config pl_config; /* 408 40 */ /* --- cacheline 7 boundary (448 bytes) --- */ struct net_device * lag_dev; /* 448 8 */ bool lag_tx_enabled; /* 456 1 */ /* XXX 7 bytes hole, try to pack */ struct net_device * hsr_dev; /* 464 8 */ struct list_head list; /* 472 16 */ const struct ethtool_ops * orig_ethtool_ops; /* 488 8 */ const struct dsa_netdevice_ops * netdev_ops; /* 496 8 */ struct mutex addr_lists_lock; /* 504 32 */ /* --- cacheline 8 boundary (512 bytes) was 24 bytes ago --- */ struct list_head fdbs; /* 536 16 */ struct list_head mdbs; /* 552 16 */ bool setup; /* 568 1 */ /* size: 576, cachelines: 9, members: 30 */ /* sum members: 544, holes: 6, sum holes: 25 */ /* padding: 7 */ }; Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04mac80211: Add stations iterator where the iterator function may sleepMartin Blumenstingl1-0/+21
ieee80211_iterate_active_interfaces() and ieee80211_iterate_active_interfaces_atomic() already exist, where the former allows the iterator function to sleep. Add ieee80211_iterate_stations() which is similar to ieee80211_iterate_stations_atomic() but allows the iterator to sleep. This is needed for adding SDIO support to the rtw88 driver. Some interators there are reading or writing registers. With the SDIO ops (sdio_readb, sdio_writeb and friends) this means that the iterator function may sleep. Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Link: https://lore.kernel.org/r/20211228211501.468981-2-martin.blumenstingl@googlemail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-01-04Namespaceify mtu_expires sysctlxu xin1-0/+1
This patch enables the sysctl mtu_expires to be configured per net namespace. Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04Namespaceify min_pmtu sysctlxu xin1-0/+2
This patch enables the sysctl min_pmtu to be configured per net namespace. Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04udp6: Use Segment Routing Header for dest address if presentAndrew Lunn1-0/+19
When finding the socket to report an error on, if the invoking packet is using Segment Routing, the IPv6 destination address is that of an intermediate router, not the end destination. Extract the ultimate destination address from the segment address. This change allows traceroute to function in the presence of Segment Routing. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04icmp: ICMPV6: Examine invoking packet for Segment Route Headers.Andrew Lunn1-0/+1
RFC8754 says: ICMP error packets generated within the SR domain are sent to source nodes within the SR domain. The invoking packet in the ICMP error message may contain an SRH. Since the destination address of a packet with an SRH changes as each segment is processed, it may not be the destination used by the socket or application that generated the invoking packet. For the source of an invoking packet to process the ICMP error message, the ultimate destination address of the IPv6 header may be required. The following logic is used to determine the destination address for use by protocol-error handlers. * Walk all extension headers of the invoking IPv6 packet to the routing extension header preceding the upper-layer header. - If routing header is type 4 Segment Routing Header (SRH) o The SID at Segment List[0] may be used as the destination address of the invoking packet. Mangle the skb so the network header points to the invoking packet inside the ICMP packet. The seg6 helpers can then be used on the skb to find any segment routing headers. If found, mark this fact in the IPv6 control block of the skb, and store the offset into the packet of the SRH. Then restore the skb back to its old state. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04seg6: export get_srh() for ICMP handlingAndrew Lunn1-0/+1
An ICMP error message can contain in its message body part of an IPv6 packet which invoked the error. Such a packet might contain a segment router header. Export get_srh() so the ICMP code can make use of it. Since his changes the scope of the function from local to global, add the seg6_ prefix to keep the namespace clean. And move it into seg6.c so it is always available, not just when IPV6_SEG6_LWTUNNEL is enabled. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>