aboutsummaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)AuthorFilesLines
2022-11-03net: devlink: convert devlink port type-specific pointers to unionJiri Pirko1-4/+13
Instead of storing type_dev as a void pointer, convert it to union and use it to store either struct net_device or struct ib_device pointer. Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-11-03bridge: Add MAC Authentication Bypass (MAB) supportHans J. Schultz1-0/+5
Hosts that support 802.1X authentication are able to authenticate themselves by exchanging EAPOL frames with an authenticator (Ethernet bridge, in this case) and an authentication server. Access to the network is only granted by the authenticator to successfully authenticated hosts. The above is implemented in the bridge using the "locked" bridge port option. When enabled, link-local frames (e.g., EAPOL) can be locally received by the bridge, but all other frames are dropped unless the host is authenticated. That is, unless the user space control plane installed an FDB entry according to which the source address of the frame is located behind the locked ingress port. The entry can be dynamic, in which case learning needs to be enabled so that the entry will be refreshed by incoming traffic. There are deployments in which not all the devices connected to the authenticator (the bridge) support 802.1X. Such devices can include printers and cameras. One option to support such deployments is to unlock the bridge ports connecting these devices, but a slightly more secure option is to use MAB. When MAB is enabled, the MAC address of the connected device is used as the user name and password for the authentication. For MAB to work, the user space control plane needs to be notified about MAC addresses that are trying to gain access so that they will be compared against an allow list. This can be implemented via the regular learning process with the sole difference that learned FDB entries are installed with a new "locked" flag indicating that the entry cannot be used to authenticate the device. The flag cannot be set by user space, but user space can clear the flag by replacing the entry, thereby authenticating the device. Locked FDB entries implement the following semantics with regards to roaming, aging and forwarding: 1. Roaming: Locked FDB entries can roam to unlocked (authorized) ports, in which case the "locked" flag is cleared. FDB entries cannot roam to locked ports regardless of MAB being enabled or not. Therefore, locked FDB entries are only created if an FDB entry with the given {MAC, VID} does not already exist. This behavior prevents unauthenticated devices from disrupting traffic destined to already authenticated devices. 2. Aging: Locked FDB entries age and refresh by incoming traffic like regular entries. 3. Forwarding: Locked FDB entries forward traffic like regular entries. If user space detects an unauthorized MAC behind a locked port and wishes to prevent traffic with this MAC DA from reaching the host, it can do so using tc or a different mechanism. Enable the above behavior using a new bridge port option called "mab". It can only be enabled on a bridge port that is both locked and has learning enabled. Locked FDB entries are flushed from the port once MAB is disabled. A new option is added because there are pure 802.1X deployments that are not interested in notifications about locked FDB entries. Signed-off-by: Hans J. Schultz <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Reviewed-by: Vladimir Oltean <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-11-03Merge tag 'for-netdev' of ↵Jakub Kicinski2-8/+6
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== bpf 2022-11-04 We've added 8 non-merge commits during the last 3 day(s) which contain a total of 10 files changed, 113 insertions(+), 16 deletions(-). The main changes are: 1) Fix memory leak upon allocation failure in BPF verifier's stack state tracking, from Kees Cook. 2) Fix address leakage when BPF progs release reference to an object, from Youlin Li. 3) Fix BPF CI breakage from buggy in.h uapi header dependency, from Andrii Nakryiko. 4) Fix bpftool pin sub-command's argument parsing, from Pu Lehui. 5) Fix BPF sockmap lockdep warning by cancelling psock work outside of socket lock, from Cong Wang. 6) Follow-up for BPF sockmap to fix sk_forward_alloc accounting, from Wang Yufen. bpf-for-netdev * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Add verifier test for release_reference() bpf: Fix wrong reg type conversion in release_reference() bpf, sock_map: Move cancel_work_sync() out of sock lock tools/headers: Pull in stddef.h to uapi to fix BPF selftests build in CI net/ipv4: Fix linux/in.h header dependencies bpftool: Fix NULL pointer dereference when pin {PROG, MAP, LINK} without FILE bpf, sockmap: Fix the sk->sk_forward_alloc warning of sk_stream_kill_queues bpf, verifier: Fix memory leak in array reallocation for stack state ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-11-03bpf: make sure skb->len != 0 when redirecting to a tunneling deviceStanislav Fomichev1-0/+4
syzkaller managed to trigger another case where skb->len == 0 when we enter __dev_queue_xmit: WARNING: CPU: 0 PID: 2470 at include/linux/skbuff.h:2576 skb_assert_len include/linux/skbuff.h:2576 [inline] WARNING: CPU: 0 PID: 2470 at include/linux/skbuff.h:2576 __dev_queue_xmit+0x2069/0x35e0 net/core/dev.c:4295 Call Trace: dev_queue_xmit+0x17/0x20 net/core/dev.c:4406 __bpf_tx_skb net/core/filter.c:2115 [inline] __bpf_redirect_no_mac net/core/filter.c:2140 [inline] __bpf_redirect+0x5fb/0xda0 net/core/filter.c:2163 ____bpf_clone_redirect net/core/filter.c:2447 [inline] bpf_clone_redirect+0x247/0x390 net/core/filter.c:2419 bpf_prog_48159a89cb4a9a16+0x59/0x5e bpf_dispatcher_nop_func include/linux/bpf.h:897 [inline] __bpf_prog_run include/linux/filter.h:596 [inline] bpf_prog_run include/linux/filter.h:603 [inline] bpf_test_run+0x46c/0x890 net/bpf/test_run.c:402 bpf_prog_test_run_skb+0xbdc/0x14c0 net/bpf/test_run.c:1170 bpf_prog_test_run+0x345/0x3c0 kernel/bpf/syscall.c:3648 __sys_bpf+0x43a/0x6c0 kernel/bpf/syscall.c:5005 __do_sys_bpf kernel/bpf/syscall.c:5091 [inline] __se_sys_bpf kernel/bpf/syscall.c:5089 [inline] __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5089 do_syscall_64+0x54/0x70 arch/x86/entry/common.c:48 entry_SYSCALL_64_after_hwframe+0x61/0xc6 The reproducer doesn't really reproduce outside of syzkaller environment, so I'm taking a guess here. It looks like we do generate correct ETH_HLEN-sized packet, but we redirect the packet to the tunneling device. Before we do so, we __skb_pull l2 header and arrive again at skb->len == 0. Doesn't seem like we can do anything better than having an explicit check after __skb_pull? Cc: Eric Dumazet <[email protected]> Reported-by: [email protected] Signed-off-by: Stanislav Fomichev <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2022-11-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+1
No conflicts. Signed-off-by: Jakub Kicinski <[email protected]>
2022-11-03bpf, sock_map: Move cancel_work_sync() out of sock lockCong Wang2-8/+6
Stanislav reported a lockdep warning, which is caused by the cancel_work_sync() called inside sock_map_close(), as analyzed below by Jakub: psock->work.func = sk_psock_backlog() ACQUIRE psock->work_mutex sk_psock_handle_skb() skb_send_sock() __skb_send_sock() sendpage_unlocked() kernel_sendpage() sock->ops->sendpage = inet_sendpage() sk->sk_prot->sendpage = tcp_sendpage() ACQUIRE sk->sk_lock tcp_sendpage_locked() RELEASE sk->sk_lock RELEASE psock->work_mutex sock_map_close() ACQUIRE sk->sk_lock sk_psock_stop() sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED) cancel_work_sync() __cancel_work_timer() __flush_work() // wait for psock->work to finish RELEASE sk->sk_lock We can move the cancel_work_sync() out of the sock lock protection, but still before saved_close() was called. Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()") Reported-by: Stanislav Fomichev <[email protected]> Signed-off-by: Cong Wang <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Tested-by: Jakub Sitnicki <[email protected]> Acked-by: John Fastabend <[email protected]> Acked-by: Jakub Sitnicki <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2022-11-02net, neigh: Fix null-ptr-deref in neigh_table_clear()Chen Zhongjin1-1/+1
When IPv6 module gets initialized but hits an error in the middle, kenel panic with: KASAN: null-ptr-deref in range [0x0000000000000598-0x000000000000059f] CPU: 1 PID: 361 Comm: insmod Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370 RSP: 0018:ffff888012677908 EFLAGS: 00000202 ... Call Trace: <TASK> neigh_table_clear+0x94/0x2d0 ndisc_cleanup+0x27/0x40 [ipv6] inet6_init+0x21c/0x2cb [ipv6] do_one_initcall+0xd3/0x4d0 do_init_module+0x1ae/0x670 ... Kernel panic - not syncing: Fatal exception When ipv6 initialization fails, it will try to cleanup and calls: neigh_table_clear() neigh_ifdown(tbl, NULL) pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL)) # dev_net(NULL) triggers null-ptr-deref. Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev is NULL, to make kernel not panic immediately. Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop") Signed-off-by: Chen Zhongjin <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Denis V. Lunev <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-11-02Merge tag 'for-netdev' of ↵Jakub Kicinski1-32/+3
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== bpf-next 2022-11-02 We've added 70 non-merge commits during the last 14 day(s) which contain a total of 96 files changed, 3203 insertions(+), 640 deletions(-). The main changes are: 1) Make cgroup local storage available to non-cgroup attached BPF programs such as tc BPF ones, from Yonghong Song. 2) Avoid unnecessary deadlock detection and failures wrt BPF task storage helpers, from Martin KaFai Lau. 3) Add LLVM disassembler as default library for dumping JITed code in bpftool, from Quentin Monnet. 4) Various kprobe_multi_link fixes related to kernel modules, from Jiri Olsa. 5) Optimize x86-64 JIT with emitting BMI2-based shift instructions, from Jie Meng. 6) Improve BPF verifier's memory type compatibility for map key/value arguments, from Dave Marchevsky. 7) Only create mmap-able data section maps in libbpf when data is exposed via skeletons, from Andrii Nakryiko. 8) Add an autoattach option for bpftool to load all object assets, from Wang Yufen. 9) Various memory handling fixes for libbpf and BPF selftests, from Xu Kuohai. 10) Initial support for BPF selftest's vmtest.sh on arm64, from Manu Bretelle. 11) Improve libbpf's BTF handling to dedup identical structs, from Alan Maguire. 12) Add BPF CI and denylist documentation for BPF selftests, from Daniel Müller. 13) Check BPF cpumap max_entries before doing allocation work, from Florian Lehner. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (70 commits) samples/bpf: Fix typo in README bpf: Remove the obsolte u64_stats_fetch_*_irq() users. bpf: check max_entries before allocating memory bpf: Fix a typo in comment for DFS algorithm bpftool: Fix spelling mistake "disasembler" -> "disassembler" selftests/bpf: Fix bpftool synctypes checking failure selftests/bpf: Panic on hard/soft lockup docs/bpf: Add documentation for new cgroup local storage selftests/bpf: Add test cgrp_local_storage to DENYLIST.s390x selftests/bpf: Add selftests for new cgroup local storage selftests/bpf: Fix test test_libbpf_str/bpf_map_type_str bpftool: Support new cgroup local storage libbpf: Support new cgroup local storage bpf: Implement cgroup storage available to non-cgroup-attached bpf progs bpf: Refactor some inode/task/sk storage functions for reuse bpf: Make struct cgroup btf id global selftests/bpf: Tracing prog can still do lookup under busy lock selftests/bpf: Ensure no task storage failure for bpf_lsm.s prog due to deadlock detection bpf: Add new bpf_task_storage_delete proto with no deadlock detection bpf: bpf_task_storage_delete_recur does lookup first before the deadlock check ... ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-11-01net: core: inet[46]_pton strlen len typesDr. David Alan Gilbert1-2/+2
inet[46]_pton check the input length against a sane length limit (INET[6]_ADDRSTRLEN), but the strlen value gets truncated due to being stored in an int, so there's a theoretical potential for a >4G string to pass the limit test. Use size_t since that's what strlen actually returns. I've had a hunt for callers that could hit this, but I've not managed to find anything that doesn't get checked with some other limit first; but it's possible that I've missed something in the depth of the storage target paths. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-10-31net: dropreason: propagate drop_reason to skb_release_data()Eric Dumazet1-12/+12
When an skb with a frag list is consumed, we currently pretend all skbs in the frag list were dropped. In order to fix this, add a @reason argument to skb_release_data() and skb_release_all(). Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-10-31net: dropreason: add SKB_CONSUMED reasonEric Dumazet1-1/+5
This will allow to simply use in the future: kfree_skb_reason(skb, reason); Instead of repeating sequences like: if (dropped) kfree_skb_reason(skb, reason); else consume_skb(skb); For instance, following patch in the series is adding @reason to skb_release_data() and skb_release_all(), so that we can propagate a meaningful @reason whenever consume_skb()/kfree_skb() have to take care of a potential frag_list. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-10-31rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_linkHangbin Liu1-3/+4
This patch use the new helper unregister_netdevice_many_notify() for rtnl_delete_link(), so that the kernel could reply unicast when userspace set NLM_F_ECHO flag to request the new created interface info. At the same time, the parameters of rtnl_delete_link() need to be updated since we need nlmsghdr and portid info. Suggested-by: Guillaume Nault <[email protected]> Signed-off-by: Hangbin Liu <[email protected]> Reviewed-by: Guillaume Nault <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-10-31rtnetlink: Honour NLM_F_ECHO flag in rtnl_newlink_createHangbin Liu1-2/+4
This patch pass the netlink header message in rtnl_newlink_create() to the new updated rtnl_configure_link(), so that the kernel could reply unicast when userspace set NLM_F_ECHO flag to request the new created interface info. Suggested-by: Guillaume Nault <[email protected]> Signed-off-by: Hangbin Liu <[email protected]> Reviewed-by: Guillaume Nault <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-10-31net: add new helper unregister_netdevice_many_notifyHangbin Liu2-10/+20
Add new helper unregister_netdevice_many_notify(), pass netlink message header and portid, which could be used to notify userspace when flag NLM_F_ECHO is set. Make the unregister_netdevice_many() as a wrapper of new function unregister_netdevice_many_notify(). Suggested-by: Guillaume Nault <[email protected]> Signed-off-by: Hangbin Liu <[email protected]> Reviewed-by: Guillaume Nault <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-10-31rtnetlink: pass netlink message header and portid to rtnl_configure_link()Hangbin Liu3-28/+36
This patch pass netlink message header and portid to rtnl_configure_link() All the functions in this call chain need to add the parameters so we can use them in the last call rtnl_notify(), and notify the userspace about the new link info if NLM_F_ECHO flag is set. - rtnl_configure_link() - __dev_notify_flags() - rtmsg_ifinfo() - rtmsg_ifinfo_event() - rtmsg_ifinfo_build_skb() - rtmsg_ifinfo_send() - rtnl_notify() Also move __dev_notify_flags() declaration to net/core/dev.h, as Jakub suggested. Signed-off-by: Hangbin Liu <[email protected]> Reviewed-by: Guillaume Nault <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-10-28net: Remove the obsolte u64_stats_fetch_*_irq() users (net).Thomas Gleixner4-16/+16
Now that the 32bit UP oddity is gone and 32bit uses always a sequence count, there is no need for the fetch_irq() variants anymore. Convert to the regular interface. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-10-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+1
drivers/net/can/usb/kvaser_usb/kvaser_usb_leaf.c 2871edb32f46 ("can: kvaser_usb: Fix possible completions during init_completion") abb8670938b2 ("can: kvaser_usb_leaf: Ignore stale bus-off after start") 8d21f5927ae6 ("can: kvaser_usb_leaf: Fix improved state not being reported") Signed-off-by: Jakub Kicinski <[email protected]>
2022-10-27net: do not sense pfmemalloc status in skb_append_pagefrags()Eric Dumazet1-1/+1
skb_append_pagefrags() is used by af_unix and udp sendpage() implementation so far. In commit 326140063946 ("tcp: TX zerocopy should not sense pfmemalloc status") we explained why we should not sense pfmemalloc status for pages owned by user space. We should also use skb_fill_page_desc_noacc() in skb_append_pagefrags() to avoid following KCSAN report: BUG: KCSAN: data-race in lru_add_fn / skb_append_pagefrags write to 0xffffea00058fc1c8 of 8 bytes by task 17319 on cpu 0: __list_add include/linux/list.h:73 [inline] list_add include/linux/list.h:88 [inline] lruvec_add_folio include/linux/mm_inline.h:323 [inline] lru_add_fn+0x327/0x410 mm/swap.c:228 folio_batch_move_lru+0x1e1/0x2a0 mm/swap.c:246 lru_add_drain_cpu+0x73/0x250 mm/swap.c:669 lru_add_drain+0x21/0x60 mm/swap.c:773 free_pages_and_swap_cache+0x16/0x70 mm/swap_state.c:311 tlb_batch_pages_flush mm/mmu_gather.c:59 [inline] tlb_flush_mmu_free mm/mmu_gather.c:256 [inline] tlb_flush_mmu+0x5b2/0x640 mm/mmu_gather.c:263 tlb_finish_mmu+0x86/0x100 mm/mmu_gather.c:363 exit_mmap+0x190/0x4d0 mm/mmap.c:3098 __mmput+0x27/0x1b0 kernel/fork.c:1185 mmput+0x3d/0x50 kernel/fork.c:1207 copy_process+0x19fc/0x2100 kernel/fork.c:2518 kernel_clone+0x166/0x550 kernel/fork.c:2671 __do_sys_clone kernel/fork.c:2812 [inline] __se_sys_clone kernel/fork.c:2796 [inline] __x64_sys_clone+0xc3/0xf0 kernel/fork.c:2796 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffffea00058fc1c8 of 8 bytes by task 17325 on cpu 1: page_is_pfmemalloc include/linux/mm.h:1817 [inline] __skb_fill_page_desc include/linux/skbuff.h:2432 [inline] skb_fill_page_desc include/linux/skbuff.h:2453 [inline] skb_append_pagefrags+0x210/0x600 net/core/skbuff.c:3974 unix_stream_sendpage+0x45e/0x990 net/unix/af_unix.c:2338 kernel_sendpage+0x184/0x300 net/socket.c:3561 sock_sendpage+0x5a/0x70 net/socket.c:1054 pipe_to_sendpage+0x128/0x160 fs/splice.c:361 splice_from_pipe_feed fs/splice.c:415 [inline] __splice_from_pipe+0x222/0x4d0 fs/splice.c:559 splice_from_pipe fs/splice.c:594 [inline] generic_splice_sendpage+0x89/0xc0 fs/splice.c:743 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x80/0xa0 fs/splice.c:931 splice_direct_to_actor+0x305/0x620 fs/splice.c:886 do_splice_direct+0xfb/0x180 fs/splice.c:974 do_sendfile+0x3bf/0x910 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x10c/0x150 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x0000000000000000 -> 0xffffea00058fc188 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 17325 Comm: syz-executor.0 Not tainted 6.1.0-rc1-syzkaller-00158-g440b7895c990-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 Fixes: 326140063946 ("tcp: TX zerocopy should not sense pfmemalloc status") Reported-by: syzbot <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-10-27skbuff: Proactively round up to kmalloc bucket sizeKees Cook1-26/+26
Instead of discovering the kmalloc bucket size _after_ allocation, round up proactively so the allocation is explicitly made for the full size, allowing the compiler to correctly reason about the resulting size of the buffer through the existing __alloc_size() hint. This will allow for kernels built with CONFIG_UBSAN_BOUNDS or the coming dynamic bounds checking under CONFIG_FORTIFY_SOURCE to gain back the __alloc_size() hints that were temporarily reverted in commit 93dd04ab0b2b ("slab: remove __alloc_size attribute from __kmalloc_track_caller") Cc: "David S. Miller" <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: Paolo Abeni <[email protected]> Cc: [email protected] Cc: Greg Kroah-Hartman <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: David Rientjes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Link: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/ Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
2022-10-27kunit: Use KUNIT_EXPECT_MEMEQ macroMaíra Canal1-2/+2
Use KUNIT_EXPECT_MEMEQ to compare memory blocks in replacement of the KUNIT_EXPECT_EQ macro. Therefor, the statement KUNIT_EXPECT_EQ(test, memcmp(foo, bar, size), 0); is replaced by: KUNIT_EXPECT_MEMEQ(test, foo, bar, size); Signed-off-by: Maíra Canal <[email protected]> Acked-by: Daniel Latypov <[email protected]> Reviewed-by: David Gow <[email protected]> Signed-off-by: Shuah Khan <[email protected]>
2022-10-25bpf: Refactor some inode/task/sk storage functions for reuseYonghong Song1-32/+3
Refactor codes so that inode/task/sk storage implementation can maximally share the same code. I also added some comments in new function bpf_local_storage_unlink_nolock() to make codes easy to understand. There is no functionality change. Acked-by: David Vernet <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-10-25net: dev: Convert sa_data to flexible array in struct sockaddrKees Cook2-2/+2
One of the worst offenders of "fake flexible arrays" is struct sockaddr, as it is the classic example of why GCC and Clang have been traditionally forced to treat all trailing arrays as fake flexible arrays: in the distant misty past, sa_data became too small, and code started just treating it as a flexible array, even though it was fixed-size. The special case by the compiler is specifically that sizeof(sa->sa_data) and FORTIFY_SOURCE (which uses __builtin_object_size(sa->sa_data, 1)) do not agree (14 and -1 respectively), which makes FORTIFY_SOURCE treat it as a flexible array. However, the coming -fstrict-flex-arrays compiler flag will remove these special cases so that FORTIFY_SOURCE can gain coverage over all the trailing arrays in the kernel that are _not_ supposed to be treated as a flexible array. To deal with this change, convert sa_data to a true flexible array. To keep the structure size the same, move sa_data into a union with a newly introduced sa_data_min with the original size. The result is that FORTIFY_SOURCE can continue to have no idea how large sa_data may actually be, but anything using sizeof(sa->sa_data) must switch to sizeof(sa->sa_data_min). Cc: Jens Axboe <[email protected]> Cc: Pavel Begunkov <[email protected]> Cc: David Ahern <[email protected]> Cc: Dylan Yudaken <[email protected]> Cc: Yajun Deng <[email protected]> Cc: Petr Machata <[email protected]> Cc: Hangbin Liu <[email protected]> Cc: Leon Romanovsky <[email protected]> Cc: syzbot <[email protected]> Cc: Willem de Bruijn <[email protected]> Cc: Pablo Neira Ayuso <[email protected]> Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-10-25soreuseport: Fix socket selection for SO_INCOMING_CPU.Kuniyuki Iwashima2-6/+90
Kazuho Oku reported that setsockopt(SO_INCOMING_CPU) does not work with setsockopt(SO_REUSEPORT) since v4.6. With the combination of SO_REUSEPORT and SO_INCOMING_CPU, we could build a highly efficient server application. setsockopt(SO_INCOMING_CPU) associates a CPU with a TCP listener or UDP socket, and then incoming packets processed on the CPU will likely be distributed to the socket. Technically, a socket could even receive packets handled on another CPU if no sockets in the reuseport group have the same CPU receiving the flow. The logic exists in compute_score() so that a socket will get a higher score if it has the same CPU with the flow. However, the score gets ignored after the blamed two commits, which introduced a faster socket selection algorithm for SO_REUSEPORT. This patch introduces a counter of sockets with SO_INCOMING_CPU in a reuseport group to check if we should iterate all sockets to find a proper one. We increment the counter when * calling listen() if the socket has SO_INCOMING_CPU and SO_REUSEPORT * enabling SO_INCOMING_CPU if the socket is in a reuseport group Also, we decrement it when * detaching a socket out of the group to apply SO_INCOMING_CPU to migrated TCP requests * disabling SO_INCOMING_CPU if the socket is in a reuseport group When the counter reaches 0, we can get back to the O(1) selection algorithm. The overall changes are negligible for the non-SO_INCOMING_CPU case, and the only notable thing is that we have to update sk_incomnig_cpu under reuseport_lock. Otherwise, the race prevents transitioning to the O(n) algorithm and results in the wrong socket selection. cpu1 (setsockopt) cpu2 (listen) +-----------------+ +-------------+ lock_sock(sk1) lock_sock(sk2) reuseport_update_incoming_cpu(sk1, val) . | /* set CPU as 0 */ |- WRITE_ONCE(sk1->incoming_cpu, val) | | spin_lock_bh(&reuseport_lock) | reuseport_grow(sk2, reuse) | . | |- more_socks_size = reuse->max_socks * 2U; | |- if (more_socks_size > U16_MAX && | | reuse->num_closed_socks) | | . | | |- RCU_INIT_POINTER(sk1->sk_reuseport_cb, NULL); | | `- __reuseport_detach_closed_sock(sk1, reuse) | | . | | `- reuseport_put_incoming_cpu(sk1, reuse) | | . | | | /* Read shutdown()ed sk1's sk_incoming_cpu | | | * without lock_sock(). | | | */ | | `- if (sk1->sk_incoming_cpu >= 0) | | . | | | /* decrement not-yet-incremented | | | * count, which is never incremented. | | | */ | | `- __reuseport_put_incoming_cpu(reuse); | | | `- spin_lock_bh(&reuseport_lock) | |- spin_lock_bh(&reuseport_lock) | |- reuse = rcu_dereference_protected(sk1->sk_reuseport_cb, ...) |- if (!reuse) | . | | /* Cannot increment reuse->incoming_cpu. */ | `- goto out; | `- spin_unlock_bh(&reuseport_lock) Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection") Reported-by: Kazuho Oku <[email protected]> Signed-off-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2022-10-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+7
include/linux/net.h a5ef058dc4d9 ("net: introduce and use custom sockopt socket flag") e993ffe3da4b ("net: flag sockets supporting msghdr originated zerocopy") Signed-off-by: Jakub Kicinski <[email protected]>
2022-10-24net: skb: move skb_pp_recycle() to skbuff.cYunsheng Lin1-0/+7
skb_pp_recycle() is only used by skb_free_head() in skbuff.c, so move it to skbuff.c. Signed-off-by: Yunsheng Lin <[email protected]> Acked-by: Ilias Apalodimas <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-10-24net: remove useless parameter of __sock_cmsg_sendxu xin1-2/+2
The parameter 'msg' has never been used by __sock_cmsg_send, so we can remove it safely. Reported-by: Zeal Robot <[email protected]> Signed-off-by: xu xin <[email protected]> Reviewed-by: Zhang Yunkai <[email protected]> Acked-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-10-24net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failedZhengchao Shao1-0/+7
When the ops_init() interface is invoked to initialize the net, but ops->init() fails, data is released. However, the ptr pointer in net->gen is invalid. In this case, when nfqnl_nf_hook_drop() is invoked to release the net, invalid address access occurs. The process is as follows: setup_net() ops_init() data = kzalloc(...) ---> alloc "data" net_assign_generic() ---> assign "date" to ptr in net->gen ... ops->init() ---> failed ... kfree(data); ---> ptr in net->gen is invalid ... ops_exit_list() ... nfqnl_nf_hook_drop() *q = nfnl_queue_pernet(net) ---> q is invalid The following is the Call Trace information: BUG: KASAN: use-after-free in nfqnl_nf_hook_drop+0x264/0x280 Read of size 8 at addr ffff88810396b240 by task ip/15855 Call Trace: <TASK> dump_stack_lvl+0x8e/0xd1 print_report+0x155/0x454 kasan_report+0xba/0x1f0 nfqnl_nf_hook_drop+0x264/0x280 nf_queue_nf_hook_drop+0x8b/0x1b0 __nf_unregister_net_hook+0x1ae/0x5a0 nf_unregister_net_hooks+0xde/0x130 ops_exit_list+0xb0/0x170 setup_net+0x7ac/0xbd0 copy_net_ns+0x2e6/0x6b0 create_new_namespaces+0x382/0xa50 unshare_nsproxy_namespaces+0xa6/0x1c0 ksys_unshare+0x3a4/0x7e0 __x64_sys_unshare+0x2d/0x40 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> Allocated by task 15855: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 __kasan_kmalloc+0xa1/0xb0 __kmalloc+0x49/0xb0 ops_init+0xe7/0x410 setup_net+0x5aa/0xbd0 copy_net_ns+0x2e6/0x6b0 create_new_namespaces+0x382/0xa50 unshare_nsproxy_namespaces+0xa6/0x1c0 ksys_unshare+0x3a4/0x7e0 __x64_sys_unshare+0x2d/0x40 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Freed by task 15855: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_save_free_info+0x2a/0x40 ____kasan_slab_free+0x155/0x1b0 slab_free_freelist_hook+0x11b/0x220 __kmem_cache_free+0xa4/0x360 ops_init+0xb9/0x410 setup_net+0x5aa/0xbd0 copy_net_ns+0x2e6/0x6b0 create_new_namespaces+0x382/0xa50 unshare_nsproxy_namespaces+0xa6/0x1c0 ksys_unshare+0x3a4/0x7e0 __x64_sys_unshare+0x2d/0x40 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fixes: f875bae06533 ("net: Automatically allocate per namespace data.") Signed-off-by: Zhengchao Shao <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-10-24net: add a refcount tracker for kernel socketsEric Dumazet2-0/+19
Commit ffa84b5ffb37 ("net: add netns refcount tracker to struct sock") added a tracker to sockets, but did not track kernel sockets. We still have syzbot reports hinting about netns being destroyed while some kernel TCP sockets had not been dismantled. This patch tracks kernel sockets, and adds a ref_tracker_dir_print() call to net_free() right before the netns is freed. Normally, each layer is responsible for properly releasing its kernel sockets before last call to net_free(). This debugging facility is enabled with CONFIG_NET_NS_REFCNT_TRACKER=y Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Tested-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-10-20Merge tag 'net-6.1-rc2' of ↵Linus Torvalds3-4/+24
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter. Current release - regressions: - revert "net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and}" - revert "net: sched: fq_codel: remove redundant resource cleanup in fq_codel_init()" - dsa: uninitialized variable in dsa_slave_netdevice_event() - eth: sunhme: uninitialized variable in happy_meal_init() Current release - new code bugs: - eth: octeontx2: fix resource not freed after malloc Previous releases - regressions: - sched: fix return value of qdisc ingress handling on success - sched: fix race condition in qdisc_graft() - udp: update reuse->has_conns under reuseport_lock. - tls: strp: make sure the TCP skbs do not have overlapping data - hsr: avoid possible NULL deref in skb_clone() - tipc: fix an information leak in tipc_topsrv_kern_subscr - phylink: add mac_managed_pm in phylink_config structure - eth: i40e: fix DMA mappings leak - eth: hyperv: fix a RX-path warning - eth: mtk: fix memory leaks Previous releases - always broken: - sched: cake: fix null pointer access issue when cake_init() fails" * tag 'net-6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (43 commits) net: phy: dp83822: disable MDI crossover status change interrupt net: sched: fix race condition in qdisc_graft() net: hns: fix possible memory leak in hnae_ae_register() wwan_hwsim: fix possible memory leak in wwan_hwsim_dev_new() sfc: include vport_id in filter spec hash and equal() genetlink: fix kdoc warnings selftests: add selftest for chaining of tc ingress handling to egress net: Fix return value of qdisc ingress handling on success net: sched: sfb: fix null pointer access issue when sfb_init() fails Revert "net: sched: fq_codel: remove redundant resource cleanup in fq_codel_init()" net: sched: cake: fix null pointer access issue when cake_init() fails ethernet: marvell: octeontx2 Fix resource not freed after malloc netfilter: nf_tables: relax NFTA_SET_ELEM_KEY_END set flags requirements netfilter: rpfilter/fib: Set ->flowic_uid correctly for user namespaces. ionic: catch NULL pointer issue on reconfig net: hsr: avoid possible NULL deref in skb_clone() bnxt_en: fix memory leak in bnxt_nvm_test() ip6mr: fix UAF issue in ip6mr_sk_done() when addrconf_init_net() failed udp: Update reuse->has_conns under reuseport_lock. net: ethernet: mediatek: ppe: Remove the unused function mtk_foe_entry_usable() ...
2022-10-19net: Fix return value of qdisc ingress handling on successPaul Blakey1-0/+4
Currently qdisc ingress handling (sch_handle_ingress()) doesn't set a return value and it is left to the old return value of the caller (__netif_receive_skb_core()) which is RX drop, so if the packet is consumed, caller will stop and return this value as if the packet was dropped. This causes a problem in the kernel tcp stack when having a egress tc rule forwarding to a ingress tc rule. The tcp stack sending packets on the device having the egress rule will see the packets as not successfully transmitted (although they actually were), will not advance it's internal state of sent data, and packets returning on such tcp stream will be dropped by the tcp stack with reason ack-of-unsent-data. See reproduction in [0] below. Fix that by setting the return value to RX success if the packet was handled successfully. [0] Reproduction steps: $ ip link add veth1 type veth peer name peer1 $ ip link add veth2 type veth peer name peer2 $ ifconfig peer1 5.5.5.6/24 up $ ip netns add ns0 $ ip link set dev peer2 netns ns0 $ ip netns exec ns0 ifconfig peer2 5.5.5.5/24 up $ ifconfig veth2 0 up $ ifconfig veth1 0 up #ingress forwarding veth1 <-> veth2 $ tc qdisc add dev veth2 ingress $ tc qdisc add dev veth1 ingress $ tc filter add dev veth2 ingress prio 1 proto all flower \ action mirred egress redirect dev veth1 $ tc filter add dev veth1 ingress prio 1 proto all flower \ action mirred egress redirect dev veth2 #steal packet from peer1 egress to veth2 ingress, bypassing the veth pipe $ tc qdisc add dev peer1 clsact $ tc filter add dev peer1 egress prio 20 proto ip flower \ action mirred ingress redirect dev veth1 #run iperf and see connection not running $ iperf3 -s& $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1 #delete egress rule, and run again, now should work $ tc filter del dev peer1 egress $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1 Fixes: f697c3e8b35c ("[NET]: Avoid unnecessary cloning for ingress filtering") Signed-off-by: Paul Blakey <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-10-18udp: Update reuse->has_conns under reuseport_lock.Kuniyuki Iwashima1-0/+16
When we call connect() for a UDP socket in a reuseport group, we have to update sk->sk_reuseport_cb->has_conns to 1. Otherwise, the kernel could select a unconnected socket wrongly for packets sent to the connected socket. However, the current way to set has_conns is illegal and possible to trigger that problem. reuseport_has_conns() changes has_conns under rcu_read_lock(), which upgrades the RCU reader to the updater. Then, it must do the update under the updater's lock, reuseport_lock, but it doesn't for now. For this reason, there is a race below where we fail to set has_conns resulting in the wrong socket selection. To avoid the race, let's split the reader and updater with proper locking. cpu1 cpu2 +----+ +----+ __ip[46]_datagram_connect() reuseport_grow() . . |- reuseport_has_conns(sk, true) |- more_reuse = __reuseport_alloc(more_socks_size) | . | | |- rcu_read_lock() | |- reuse = rcu_dereference(sk->sk_reuseport_cb) | | | | | /* reuse->has_conns == 0 here */ | | |- more_reuse->has_conns = reuse->has_conns | |- reuse->has_conns = 1 | /* more_reuse->has_conns SHOULD BE 1 HERE */ | | | | | |- rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb, | | | more_reuse) | `- rcu_read_unlock() `- kfree_rcu(reuse, rcu) | |- sk->sk_state = TCP_ESTABLISHED Note the likely(reuse) in reuseport_has_conns_set() is always true, but we put the test there for ease of review. [0] For the record, usually, sk_reuseport_cb is changed under lock_sock(). The only exception is reuseport_grow() & TCP reqsk migration case. 1) shutdown() TCP listener, which is moved into the latter part of reuse->socks[] to migrate reqsk. 2) New listen() overflows reuse->socks[] and call reuseport_grow(). 3) reuse->max_socks overflows u16 with the new listener. 4) reuseport_grow() pops the old shutdown()ed listener from the array and update its sk->sk_reuseport_cb as NULL without lock_sock(). shutdown()ed TCP sk->sk_reuseport_cb can be changed without lock_sock(), but, reuseport_has_conns_set() is called only for UDP under lock_sock(), so likely(reuse) never be false in reuseport_has_conns_set(). [0]: https://lore.kernel.org/netdev/CANn89iLja=eQHbsM_Ta2sQF0tOGU8vAGrh_izRuuHjuO1ouUag@mail.gmail.com/ Fixes: acdcecc61285 ("udp: correct reuseport selection with connected sockets") Signed-off-by: Kuniyuki Iwashima <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
2022-10-16Merge tag 'random-6.1-rc1-for-linus' of ↵Linus Torvalds3-26/+25
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull more random number generator updates from Jason Donenfeld: "This time with some large scale treewide cleanups. The intent of this pull is to clean up the way callers fetch random integers. The current rules for doing this right are: - If you want a secure or an insecure random u64, use get_random_u64() - If you want a secure or an insecure random u32, use get_random_u32() The old function prandom_u32() has been deprecated for a while now and is just a wrapper around get_random_u32(). Same for get_random_int(). - If you want a secure or an insecure random u16, use get_random_u16() - If you want a secure or an insecure random u8, use get_random_u8() - If you want secure or insecure random bytes, use get_random_bytes(). The old function prandom_bytes() has been deprecated for a while now and has long been a wrapper around get_random_bytes() - If you want a non-uniform random u32, u16, or u8 bounded by a certain open interval maximum, use prandom_u32_max() I say "non-uniform", because it doesn't do any rejection sampling or divisions. Hence, it stays within the prandom_*() namespace, not the get_random_*() namespace. I'm currently investigating a "uniform" function for 6.2. We'll see what comes of that. By applying these rules uniformly, we get several benefits: - By using prandom_u32_max() with an upper-bound that the compiler can prove at compile-time is ≤65536 or ≤256, internally get_random_u16() or get_random_u8() is used, which wastes fewer batched random bytes, and hence has higher throughput. - By using prandom_u32_max() instead of %, when the upper-bound is not a constant, division is still avoided, because prandom_u32_max() uses a faster multiplication-based trick instead. - By using get_random_u16() or get_random_u8() in cases where the return value is intended to indeed be a u16 or a u8, we waste fewer batched random bytes, and hence have higher throughput. This series was originally done by hand while I was on an airplane without Internet. Later, Kees and I worked on retroactively figuring out what could be done with Coccinelle and what had to be done manually, and then we split things up based on that. So while this touches a lot of files, the actual amount of code that's hand fiddled is comfortably small" * tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: prandom: remove unused functions treewide: use get_random_bytes() when possible treewide: use get_random_u32() when possible treewide: use get_random_{u8,u16}() when possible, part 2 treewide: use get_random_{u8,u16}() when possible, part 1 treewide: use prandom_u32_max() when possible, part 2 treewide: use prandom_u32_max() when possible, part 1
2022-10-16skmsg: pass gfp argument to alloc_sk_msg()Eric Dumazet1-4/+4
syzbot found that alloc_sk_msg() could be called from a non sleepable context. sk_psock_verdict_recv() uses rcu_read_lock() protection. We need the callers to pass a gfp_t argument to avoid issues. syzbot report was: BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 3613, name: syz-executor414 preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 INFO: lockdep is turned off. CPU: 0 PID: 3613 Comm: syz-executor414 Not tainted 6.0.0-syzkaller-09589-g55be6084c8e0 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e3/0x2cb lib/dump_stack.c:106 __might_resched+0x538/0x6a0 kernel/sched/core.c:9877 might_alloc include/linux/sched/mm.h:274 [inline] slab_pre_alloc_hook mm/slab.h:700 [inline] slab_alloc_node mm/slub.c:3162 [inline] slab_alloc mm/slub.c:3256 [inline] kmem_cache_alloc_trace+0x59/0x310 mm/slub.c:3287 kmalloc include/linux/slab.h:600 [inline] kzalloc include/linux/slab.h:733 [inline] alloc_sk_msg net/core/skmsg.c:507 [inline] sk_psock_skb_ingress_self+0x5c/0x330 net/core/skmsg.c:600 sk_psock_verdict_apply+0x395/0x440 net/core/skmsg.c:1014 sk_psock_verdict_recv+0x34d/0x560 net/core/skmsg.c:1201 tcp_read_skb+0x4a1/0x790 net/ipv4/tcp.c:1770 tcp_rcv_established+0x129d/0x1a10 net/ipv4/tcp_input.c:5971 tcp_v4_do_rcv+0x479/0xac0 net/ipv4/tcp_ipv4.c:1681 sk_backlog_rcv include/net/sock.h:1109 [inline] __release_sock+0x1d8/0x4c0 net/core/sock.c:2906 release_sock+0x5d/0x1c0 net/core/sock.c:3462 tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1483 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg net/socket.c:734 [inline] __sys_sendto+0x46d/0x5f0 net/socket.c:2117 __do_sys_sendto net/socket.c:2129 [inline] __se_sys_sendto net/socket.c:2125 [inline] __x64_sys_sendto+0xda/0xf0 net/socket.c:2125 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: 43312915b5ba ("skmsg: Get rid of unncessary memset()") Reported-by: syzbot <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Cc: Cong Wang <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: John Fastabend <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-10-12ipv6: Fix data races around sk->sk_prot.Kuniyuki Iwashima1-2/+4
Commit 086d49058cd8 ("ipv6: annotate some data-races around sk->sk_prot") fixed some data-races around sk->sk_prot but it was not enough. Some functions in inet6_(stream|dgram)_ops still access sk->sk_prot without lock_sock() or rtnl_lock(), so they need READ_ONCE() to avoid load tearing. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-10-12xfrm: lwtunnel: squelch kernel warning in case XFRM encap type is not availableEyal Birger1-1/+3
Ido reported that a kernel warning [1] can be triggered from user space when the kernel is compiled with CONFIG_MODULES=y and CONFIG_XFRM=n when adding an xfrm encap type route, e.g: $ ip route add 198.51.100.0/24 dev dummy1 encap xfrm if_id 1 Error: lwt encapsulation type not supported. The reason for the warning is that the LWT infrastructure has an autoloading feature which is meant only for encap types that don't use a net device, which is not the case in xfrm encap. Mute this warning for xfrm encap as there's no encap module to autoload in this case. [1] WARNING: CPU: 3 PID: 2746262 at net/core/lwtunnel.c:57 lwtunnel_valid_encap_type+0x4f/0x120 [...] Call Trace: <TASK> rtm_to_fib_config+0x211/0x350 inet_rtm_newroute+0x3a/0xa0 rtnetlink_rcv_msg+0x154/0x3c0 netlink_rcv_skb+0x49/0xf0 netlink_unicast+0x22f/0x350 netlink_sendmsg+0x208/0x440 ____sys_sendmsg+0x21f/0x250 ___sys_sendmsg+0x83/0xd0 __sys_sendmsg+0x54/0xa0 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Reported-by: Ido Schimmel <[email protected]> Fixes: 2c2493b9da91 ("xfrm: lwtunnel: add lwtunnel support for xfrm interfaces in collect_md mode") Signed-off-by: Eyal Birger <[email protected]> Tested-by: Ido Schimmel <[email protected]> Reviewed-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
2022-10-11treewide: use get_random_u32() when possibleJason A. Donenfeld1-2/+2
The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Yury Norov <[email protected]> Reviewed-by: Jan Kara <[email protected]> # for ext4 Acked-by: Toke Høiland-Jørgensen <[email protected]> # for sch_cake Acked-by: Chuck Lever <[email protected]> # for nfsd Acked-by: Jakub Kicinski <[email protected]> Acked-by: Mika Westerberg <[email protected]> # for thunderbolt Acked-by: Darrick J. Wong <[email protected]> # for xfs Acked-by: Helge Deller <[email protected]> # for parisc Acked-by: Heiko Carstens <[email protected]> # for s390 Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-10-11treewide: use prandom_u32_max() when possible, part 1Jason A. Donenfeld3-24/+23
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done mechanically with this coccinelle script: @basic@ expression E; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u64; @@ ( - ((T)get_random_u32() % (E)) + prandom_u32_max(E) | - ((T)get_random_u32() & ((E) - 1)) + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2) | - ((u64)(E) * get_random_u32() >> 32) + prandom_u32_max(E) | - ((T)get_random_u32() & ~PAGE_MASK) + prandom_u32_max(PAGE_SIZE) ) @multi_line@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; identifier RAND; expression E; @@ - RAND = get_random_u32(); ... when != RAND - RAND %= (E); + RAND = prandom_u32_max(E); // Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@ ((T)get_random_u32()@p & (LITERAL)) // Add one to the literal. @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@ value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1: print("Skipping 0x%x for cleanup elsewhere" % (value)) cocci.include_match(False) elif value & (value + 1) != 0: print("Skipping 0x%x because it's not a power of two minus one" % (value)) cocci.include_match(False) elif literal.startswith('0x'): coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1)) else: coccinelle.RESULT = cocci.make_expr("%d" % (value + 1)) // Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; expression add_one.RESULT; identifier FUNC; @@ - (FUNC()@p & (LITERAL)) + prandom_u32_max(RESULT) @collapse_ret@ type T; identifier VAR; expression E; @@ { - T VAR; - VAR = (E); - return VAR; + return E; } @drop_var@ type T; identifier VAR; @@ { - T VAR; ... when != VAR } Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Yury Norov <[email protected]> Reviewed-by: KP Singh <[email protected]> Reviewed-by: Jan Kara <[email protected]> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <[email protected]> # for drbd Acked-by: Jakub Kicinski <[email protected]> Acked-by: Heiko Carstens <[email protected]> # for s390 Acked-by: Ulf Hansson <[email protected]> # for mmc Acked-by: Darrick J. Wong <[email protected]> # for xfs Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-10-03Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski3-21/+118
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-10-03 We've added 143 non-merge commits during the last 27 day(s) which contain a total of 151 files changed, 8321 insertions(+), 1402 deletions(-). The main changes are: 1) Add kfuncs for PKCS#7 signature verification from BPF programs, from Roberto Sassu. 2) Add support for struct-based arguments for trampoline based BPF programs, from Yonghong Song. 3) Fix entry IP for kprobe-multi and trampoline probes under IBT enabled, from Jiri Olsa. 4) Batch of improvements to veristat selftest tool in particular to add CSV output, a comparison mode for CSV outputs and filtering, from Andrii Nakryiko. 5) Add preparatory changes needed for the BPF core for upcoming BPF HID support, from Benjamin Tissoires. 6) Support for direct writes to nf_conn's mark field from tc and XDP BPF program types, from Daniel Xu. 7) Initial batch of documentation improvements for BPF insn set spec, from Dave Thaler. 8) Add a new BPF_MAP_TYPE_USER_RINGBUF map which provides single-user-space-producer / single-kernel-consumer semantics for BPF ring buffer, from David Vernet. 9) Follow-up fixes to BPF allocator under RT to always use raw spinlock for the BPF hashtab's bucket lock, from Hou Tao. 10) Allow creating an iterator that loops through only the resources of one task/thread instead of all, from Kui-Feng Lee. 11) Add support for kptrs in the per-CPU arraymap, from Kumar Kartikeya Dwivedi. 12) Add a new kfunc helper for nf to set src/dst NAT IP/port in a newly allocated CT entry which is not yet inserted, from Lorenzo Bianconi. 13) Remove invalid recursion check for struct_ops for TCP congestion control BPF programs, from Martin KaFai Lau. 14) Fix W^X issue with BPF trampoline and BPF dispatcher, from Song Liu. 15) Fix percpu_counter leakage in BPF hashtab allocation error path, from Tetsuo Handa. 16) Various cleanups in BPF selftests to use preferred ASSERT_* macros, from Wang Yufen. 17) Add invocation for cgroup/connect{4,6} BPF programs for ICMP pings, from YiFei Zhu. 18) Lift blinding decision under bpf_jit_harden = 1 to bpf_capable(), from Yauheni Kaliuta. 19) Various libbpf fixes and cleanups including a libbpf NULL pointer deref, from Xin Liu. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (143 commits) net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c Documentation: bpf: Add implementation notes documentations to table of contents bpf, docs: Delete misformatted table. selftests/xsk: Fix double free bpftool: Fix error message of strerror libbpf: Fix overrun in netlink attribute iteration selftests/bpf: Fix spelling mistake "unpriviledged" -> "unprivileged" samples/bpf: Fix typo in xdp_router_ipv4 sample bpftool: Remove unused struct event_ring_info bpftool: Remove unused struct btf_attach_point bpf, docs: Add TOC and fix formatting. bpf, docs: Add Clang note about BPF_ALU bpf, docs: Move Clang notes to a separate file bpf, docs: Linux byteswap note bpf, docs: Move legacy packet instructions to a separate file selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION) bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another function bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt() bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline ... ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-10-03gro: add support of (hw)gro packets to gro stackCoco Li1-4/+14
Current GRO stack only supports incoming packets containing one frame/MSS. This patch changes GRO to accept packets that are already GRO. HW-GRO (aka RSC for some vendors) is very often limited in presence of interleaved packets. Linux SW GRO stack can complete the job and provide larger GRO packets, thus reducing rate of ACK packets and cpu overhead. This also means BIG TCP can still be used, even if HW-GRO/RSC was able to cook ~64 KB GRO packets. v2: fix logic in tcp_gro_receive() Only support TCP for the moment (Paolo) Co-Developed-by: Eric Dumazet <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Coco Li <[email protected]> Acked-by: Paolo Abeni <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-10-03Merge branch 'master' of ↵David S. Miller1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== 1) Refactor selftests to use an array of structs in xfrm_fill_key(). From Gautam Menghani. 2) Drop an unused argument from xfrm_policy_match. From Hongbin Wang. 3) Support collect metadata mode for xfrm interfaces. From Eyal Birger. 4) Add netlink extack support to xfrm. From Sabrina Dubroca. Please note, there is a merge conflict in: include/net/dst_metadata.h between commit: 0a28bfd4971f ("net/macsec: Add MACsec skb_metadata_dst Tx Data path support") from the net-next tree and commit: 5182a5d48c3d ("net: allow storing xfrm interface metadata in metadata_dst") from the ipsec-next tree. Can be solved as done in linux-next. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <[email protected]>
2022-09-30net: devlink: add port_init/fini() helpers to allow ↵Jiri Pirko1-3/+43
pre-register/post-unregister functions Lifetime of some of the devlink objects, like regions, is currently forced to be different for devlink instance and devlink port instance (per-port regions). The reason is that for devlink ports, the internal structures initialization happens only after devlink_port_register() is called. To resolve this inconsistency, introduce new set of helpers to allow driver to initialize devlink pointer and region list before devlink_register() is called. That allows port regions to be created before devlink port registration and destroyed after devlink port unregistration. Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-09-30net: devlink: introduce a flag to indicate devlink port being registeredJiri Pirko1-2/+4
Instead of relying on devlink pointer not being initialized, introduce an extra flag to indicate if devlink port is registered. This is needed as later on devlink pointer is going to be initialized even in case devlink port is not registered yet. Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-09-30net: devlink: introduce port registered assert helper and use itJiri Pirko1-13/+19
Instead of checking devlink_port->devlink pointer for not being NULL which indicates that devlink port is registered, put this check to new pair of helpers similar to what we have for devlink and use them in other functions. Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-09-30net-sysfs: Convert to use sysfs_emit() APIsWang Yufen1-29/+29
Follow the advice of the Documentation/filesystems/sysfs.rst and show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: Wang Yufen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-09-29net: skb: introduce and use a single page frag cachePaolo Abeni2-22/+103
After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs") we are observing 10-20% regressions in performance tests with small packets. The perf trace points to high pressure on the slab allocator. This change tries to improve the allocation schema for small packets using an idea originally suggested by Eric: a new per CPU page frag is introduced and used in __napi_alloc_skb to cope with small allocation requests. To ensure that the above does not lead to excessive truesize underestimation, the frag size for small allocation is inflated to 1K and all the above is restricted to build with 4K page size. Note that we need to update accordingly the run-time check introduced with commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper"). Alex suggested a smart page refcount schema to reduce the number of atomic operations and deal properly with pfmemalloc pages. Under small packet UDP flood, I measure a 15% peak tput increases. Suggested-by: Eric Dumazet <[email protected]> Suggested-by: Alexander H Duyck <[email protected]> Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Alexander Duyck <[email protected]> Link: https://lore.kernel.org/r/6b6f65957c59f86a353fc09a5127e83a32ab5999.1664350652.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <[email protected]>
2022-09-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-7/+0
No conflicts. Signed-off-by: Jakub Kicinski <[email protected]>
2022-09-29bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itselfMartin KaFai Lau1-1/+27
When a bad bpf prog '.init' calls bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop: .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ... ... => .init => bpf_setsockopt(tcp_cc). It was prevented by the prog->active counter before but the prog->active detection cannot be used in struct_ops as explained in the earlier patch of the set. In this patch, the second bpf_setsockopt(tcp_cc) is not allowed in order to break the loop. This is done by using a bit of an existing 1 byte hole in tcp_sock to check if there is on-going bpf_setsockopt(TCP_CONGESTION) in this tcp_sock. Note that this essentially limits only the first '.init' can call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer does not support ECN) and the second '.init' cannot fallback to another cc. This applies even the second bpf_setsockopt(TCP_CONGESTION) will not cause a loop. Signed-off-by: Martin KaFai Lau <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-09-29bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another functionMartin KaFai Lau1-17/+28
This patch moves the bpf_setsockopt(TCP_CONGESTION) logic into another function. The next patch will add extra logic to avoid recursion and this will make the latter patch easier to follow. Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-09-29bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt()Martin KaFai Lau1-6/+7
The check on the tcp-cc, "cdg", is done in the bpf_sk_setsockopt which is used by the bpf_tcp_ca, bpf_lsm, cg_sockopt, and tcp_iter hooks. However, it is not done for cg sock_ddr, cg sockops, and some of the bpf_lsm_cgroup hooks. The tcp-cc "cdg" should have very limited usage. This patch is to move the "cdg" check to the common sol_tcp_sockopt() so that all hooks have a consistent behavior. The motivation to make this check consistent now is because the latter patch will refactor the bpf_setsockopt(TCP_CONGESTION) into another function, so it is better to take this chance to refactor this piece also. Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-09-28net: drop the weight argument from netif_napi_addJakub Kicinski1-2/+1
We tell driver developers to always pass NAPI_POLL_WEIGHT as the weight to netif_napi_add(). This may be confusing to newcomers, drop the weight argument, those who really need to tweak the weight can use netif_napi_add_weight(). Acked-by: Marc Kleine-Budde <[email protected]> # for CAN Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>