aboutsummaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2024-08-20mptcp: pm: check add_addr_accept_max before accepting new ADD_ADDRMatthieu Baerts (NGI0)1-2/+2
The limits might have changed in between, it is best to check them before accepting new ADD_ADDR. Fixes: d0876b2284cf ("mptcp: add the incoming RM_ADDR support") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-10-38035d40de5b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-20mptcp: pm: only decrement add_addr_accepted for MPJ reqMatthieu Baerts (NGI0)1-2/+6
Adding the following warning ... WARN_ON_ONCE(msk->pm.add_addr_accepted == 0) ... before decrementing the add_addr_accepted counter helped to find a bug when running the "remove single subflow" subtest from the mptcp_join.sh selftest. Removing a 'subflow' endpoint will first trigger a RM_ADDR, then the subflow closure. Before this patch, and upon the reception of the RM_ADDR, the other peer will then try to decrement this add_addr_accepted. That's not correct because the attached subflows have not been created upon the reception of an ADD_ADDR. A way to solve that is to decrement the counter only if the attached subflow was an MP_JOIN to a remote id that was not 0, and initiated by the host receiving the RM_ADDR. Fixes: d0876b2284cf ("mptcp: add the incoming RM_ADDR support") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-9-38035d40de5b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-20mptcp: pm: only mark 'subflow' endp as availableMatthieu Baerts (NGI0)1-9/+17
Adding the following warning ... WARN_ON_ONCE(msk->pm.local_addr_used == 0) ... before decrementing the local_addr_used counter helped to find a bug when running the "remove single address" subtest from the mptcp_join.sh selftests. Removing a 'signal' endpoint will trigger the removal of all subflows linked to this endpoint via mptcp_pm_nl_rm_addr_or_subflow() with rm_type == MPTCP_MIB_RMSUBFLOW. This will decrement the local_addr_used counter, which is wrong in this case because this counter is linked to 'subflow' endpoints, and here it is a 'signal' endpoint that is being removed. Now, the counter is decremented, only if the ID is being used outside of mptcp_pm_nl_rm_addr_or_subflow(), only for 'subflow' endpoints, and if the ID is not 0 -- local_addr_used is not taking into account these ones. This marking of the ID as being available, and the decrement is done no matter if a subflow using this ID is currently available, because the subflow could have been closed before. Fixes: 06faa2271034 ("mptcp: remove multi addresses and subflows in PM") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-8-38035d40de5b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-20mptcp: pm: remove mptcp_pm_remove_subflow()Matthieu Baerts (NGI0)3-22/+7
This helper is confusing. It is in pm.c, but it is specific to the in-kernel PM and it cannot be used by the userspace one. Also, it simply calls one in-kernel specific function with the PM lock, while the similar mptcp_pm_remove_addr() helper requires the PM lock. What's left is the pr_debug(), which is not that useful, because a similar one is present in the only function called by this helper: mptcp_pm_nl_rm_subflow_received() After these modifications, this helper can be marked as 'static', and the lock can be taken only once in mptcp_pm_flush_addrs_and_subflows(). Note that it is not a bug fix, but it will help backporting the following commits. Fixes: 0ee4261a3681 ("mptcp: implement mptcp_pm_remove_subflow") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-7-38035d40de5b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-20mptcp: pm: re-using ID of unused flushed subflowsMatthieu Baerts (NGI0)1-0/+7
If no subflows are attached to the 'subflow' endpoints that are being flushed, the corresponding addr IDs will not be marked as available again. Mark all ID as being available when flushing all the 'subflow' endpoints, and reset local_addr_used counter to cover these cases. Note that mptcp_pm_remove_addrs_and_subflows() helper is only called for flushing operations, not to remove a specific set of addresses and subflows. Fixes: 06faa2271034 ("mptcp: remove multi addresses and subflows in PM") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-5-38035d40de5b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-20mptcp: pm: re-using ID of unused removed subflowsMatthieu Baerts (NGI0)1-1/+10
If no subflow is attached to the 'subflow' endpoint that is being removed, the addr ID will not be marked as available again. Mark the linked ID as available when removing the 'subflow' endpoint if no subflow is attached to it. While at it, the local_addr_used counter is decremented if the ID was marked as being used to reflect the reality, but also to allow adding new endpoints after that. Fixes: b6c08380860b ("mptcp: remove addr and subflow in PM netlink") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-3-38035d40de5b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-20mptcp: pm: re-using ID of unused removed ADD_ADDRMatthieu Baerts (NGI0)1-1/+4
If no subflow is attached to the 'signal' endpoint that is being removed, the addr ID will not be marked as available again. Mark the linked ID as available when removing the address entry from the list to cover this case. Fixes: b6c08380860b ("mptcp: remove addr and subflow in PM netlink") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240819-net-mptcp-pm-reusing-id-v1-1-38035d40de5b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-20netem: fix return value if duplicate enqueue failsStephen Hemminger1-18/+29
There is a bug in netem_enqueue() introduced by commit 5845f706388a ("net: netem: fix skb length BUG_ON in __skb_to_sgvec") that can lead to a use-after-free. This commit made netem_enqueue() always return NET_XMIT_SUCCESS when a packet is duplicated, which can cause the parent qdisc's q.qlen to be mistakenly incremented. When this happens qlen_notify() may be skipped on the parent during destruction, leaving a dangling pointer for some classful qdiscs like DRR. There are two ways for the bug happen: - If the duplicated packet is dropped by rootq->enqueue() and then the original packet is also dropped. - If rootq->enqueue() sends the duplicated packet to a different qdisc and the original packet is dropped. In both cases NET_XMIT_SUCCESS is returned even though no packets are enqueued at the netem qdisc. The fix is to defer the enqueue of the duplicate packet until after the original packet has been guaranteed to return NET_XMIT_SUCCESS. Fixes: 5845f706388a ("net: netem: fix skb length BUG_ON in __skb_to_sgvec") Reported-by: Budimir Markovic <markovicbudimir@gmail.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240819175753.5151-1-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-20ip6_tunnel: Fix broken GROThomas Bogendoerfer1-5/+7
GRO code checks for matching layer 2 headers to see, if packet belongs to the same flow and because ip6 tunnel set dev->hard_header_len this check fails in cases, where it shouldn't. To fix this don't set hard_header_len, but use needed_headroom like ipv4/ip_tunnel.c does. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de> Link: https://patch.msgid.link/20240815151419.109864-1-tbogendoerfer@suse.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-20netfilter: nft_counter: Synchronize nft_counter_reset() against reader.Sebastian Andrzej Siewior1-0/+5
nft_counter_reset() resets the counter by subtracting the previously retrieved value from the counter. This is a write operation on the counter and as such it requires to be performed with a write sequence of nft_counter_seq to serialize against its possible reader. Update the packets/ bytes within write-sequence of nft_counter_seq. Fixes: d84701ecbcd6a ("netfilter: nft_counter: rework atomic dump and reset") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-20netfilter: nft_counter: Disable BH in nft_counter_offload_stats().Sebastian Andrzej Siewior1-2/+2
The sequence counter nft_counter_seq is a per-CPU counter. There is no lock associated with it. nft_counter_do_eval() is using the same counter and disables BH which suggest that it can be invoked from a softirq. This in turn means that nft_counter_offload_stats(), which disables only preemption, can be interrupted by nft_counter_do_eval() leading to two writer for one seqcount_t. This can lead to loosing stats or reading statistics while they are updated. Disable BH during stats update in nft_counter_offload_stats() to ensure one writer at a time. Fixes: b72920f6e4a9d ("netfilter: nftables: counter hardware offload support") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-19kcm: Serialise kcm_sendmsg() for the same socket.Kuniyuki Iwashima1-0/+4
syzkaller reported UAF in kcm_release(). [0] The scenario is 1. Thread A builds a skb with MSG_MORE and sets kcm->seq_skb. 2. Thread A resumes building skb from kcm->seq_skb but is blocked by sk_stream_wait_memory() 3. Thread B calls sendmsg() concurrently, finishes building kcm->seq_skb and puts the skb to the write queue 4. Thread A faces an error and finally frees skb that is already in the write queue 5. kcm_release() does double-free the skb in the write queue When a thread is building a MSG_MORE skb, another thread must not touch it. Let's add a per-sk mutex and serialise kcm_sendmsg(). [0]: BUG: KASAN: slab-use-after-free in __skb_unlink include/linux/skbuff.h:2366 [inline] BUG: KASAN: slab-use-after-free in __skb_dequeue include/linux/skbuff.h:2385 [inline] BUG: KASAN: slab-use-after-free in __skb_queue_purge_reason include/linux/skbuff.h:3175 [inline] BUG: KASAN: slab-use-after-free in __skb_queue_purge include/linux/skbuff.h:3181 [inline] BUG: KASAN: slab-use-after-free in kcm_release+0x170/0x4c8 net/kcm/kcmsock.c:1691 Read of size 8 at addr ffff0000ced0fc80 by task syz-executor329/6167 CPU: 1 PID: 6167 Comm: syz-executor329 Tainted: G B 6.8.0-rc5-syzkaller-g9abbc24128bc #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 Call trace: dump_backtrace+0x1b8/0x1e4 arch/arm64/kernel/stacktrace.c:291 show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:298 __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd0/0x124 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:377 [inline] print_report+0x178/0x518 mm/kasan/report.c:488 kasan_report+0xd8/0x138 mm/kasan/report.c:601 __asan_report_load8_noabort+0x20/0x2c mm/kasan/report_generic.c:381 __skb_unlink include/linux/skbuff.h:2366 [inline] __skb_dequeue include/linux/skbuff.h:2385 [inline] __skb_queue_purge_reason include/linux/skbuff.h:3175 [inline] __skb_queue_purge include/linux/skbuff.h:3181 [inline] kcm_release+0x170/0x4c8 net/kcm/kcmsock.c:1691 __sock_release net/socket.c:659 [inline] sock_close+0xa4/0x1e8 net/socket.c:1421 __fput+0x30c/0x738 fs/file_table.c:376 ____fput+0x20/0x30 fs/file_table.c:404 task_work_run+0x230/0x2e0 kernel/task_work.c:180 exit_task_work include/linux/task_work.h:38 [inline] do_exit+0x618/0x1f64 kernel/exit.c:871 do_group_exit+0x194/0x22c kernel/exit.c:1020 get_signal+0x1500/0x15ec kernel/signal.c:2893 do_signal+0x23c/0x3b44 arch/arm64/kernel/signal.c:1249 do_notify_resume+0x74/0x1f4 arch/arm64/kernel/entry-common.c:148 exit_to_user_mode_prepare arch/arm64/kernel/entry-common.c:169 [inline] exit_to_user_mode arch/arm64/kernel/entry-common.c:178 [inline] el0_svc+0xac/0x168 arch/arm64/kernel/entry-common.c:713 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598 Allocated by task 6166: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x40/0x78 mm/kasan/common.c:68 kasan_save_alloc_info+0x70/0x84 mm/kasan/generic.c:626 unpoison_slab_object mm/kasan/common.c:314 [inline] __kasan_slab_alloc+0x74/0x8c mm/kasan/common.c:340 kasan_slab_alloc include/linux/kasan.h:201 [inline] slab_post_alloc_hook mm/slub.c:3813 [inline] slab_alloc_node mm/slub.c:3860 [inline] kmem_cache_alloc_node+0x204/0x4c0 mm/slub.c:3903 __alloc_skb+0x19c/0x3d8 net/core/skbuff.c:641 alloc_skb include/linux/skbuff.h:1296 [inline] kcm_sendmsg+0x1d3c/0x2124 net/kcm/kcmsock.c:783 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg net/socket.c:745 [inline] sock_sendmsg+0x220/0x2c0 net/socket.c:768 splice_to_socket+0x7cc/0xd58 fs/splice.c:889 do_splice_from fs/splice.c:941 [inline] direct_splice_actor+0xec/0x1d8 fs/splice.c:1164 splice_direct_to_actor+0x438/0xa0c fs/splice.c:1108 do_splice_direct_actor fs/splice.c:1207 [inline] do_splice_direct+0x1e4/0x304 fs/splice.c:1233 do_sendfile+0x460/0xb3c fs/read_write.c:1295 __do_sys_sendfile64 fs/read_write.c:1362 [inline] __se_sys_sendfile64 fs/read_write.c:1348 [inline] __arm64_sys_sendfile64+0x160/0x3b4 fs/read_write.c:1348 __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:51 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:136 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:155 el0_svc+0x54/0x168 arch/arm64/kernel/entry-common.c:712 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598 Freed by task 6167: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x40/0x78 mm/kasan/common.c:68 kasan_save_free_info+0x5c/0x74 mm/kasan/generic.c:640 poison_slab_object+0x124/0x18c mm/kasan/common.c:241 __kasan_slab_free+0x3c/0x78 mm/kasan/common.c:257 kasan_slab_free include/linux/kasan.h:184 [inline] slab_free_hook mm/slub.c:2121 [inline] slab_free mm/slub.c:4299 [inline] kmem_cache_free+0x15c/0x3d4 mm/slub.c:4363 kfree_skbmem+0x10c/0x19c __kfree_skb net/core/skbuff.c:1109 [inline] kfree_skb_reason+0x240/0x6f4 net/core/skbuff.c:1144 kfree_skb include/linux/skbuff.h:1244 [inline] kcm_release+0x104/0x4c8 net/kcm/kcmsock.c:1685 __sock_release net/socket.c:659 [inline] sock_close+0xa4/0x1e8 net/socket.c:1421 __fput+0x30c/0x738 fs/file_table.c:376 ____fput+0x20/0x30 fs/file_table.c:404 task_work_run+0x230/0x2e0 kernel/task_work.c:180 exit_task_work include/linux/task_work.h:38 [inline] do_exit+0x618/0x1f64 kernel/exit.c:871 do_group_exit+0x194/0x22c kernel/exit.c:1020 get_signal+0x1500/0x15ec kernel/signal.c:2893 do_signal+0x23c/0x3b44 arch/arm64/kernel/signal.c:1249 do_notify_resume+0x74/0x1f4 arch/arm64/kernel/entry-common.c:148 exit_to_user_mode_prepare arch/arm64/kernel/entry-common.c:169 [inline] exit_to_user_mode arch/arm64/kernel/entry-common.c:178 [inline] el0_svc+0xac/0x168 arch/arm64/kernel/entry-common.c:713 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598 The buggy address belongs to the object at ffff0000ced0fc80 which belongs to the cache skbuff_head_cache of size 240 The buggy address is located 0 bytes inside of freed 240-byte region [ffff0000ced0fc80, ffff0000ced0fd70) The buggy address belongs to the physical page: page:00000000d35f4ae4 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10ed0f flags: 0x5ffc00000000800(slab|node=0|zone=2|lastcpupid=0x7ff) page_type: 0xffffffff() raw: 05ffc00000000800 ffff0000c1cbf640 fffffdffc3423100 dead000000000004 raw: 0000000000000000 00000000000c000c 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff0000ced0fb80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff0000ced0fc00: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc >ffff0000ced0fc80: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff0000ced0fd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc ffff0000ced0fd80: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module") Reported-by: syzbot+b72d86aa5df17ce74c60@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b72d86aa5df17ce74c60 Tested-by: syzbot+b72d86aa5df17ce74c60@syzkaller.appspotmail.com Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240815220437.69511-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-19net: mctp: test: Use correct skb for route input checkJeremy Kerr1-1/+1
In the MCTP route input test, we're routing one skb, then (when delivery is expected) checking the resulting routed skb. However, we're currently checking the original skb length, rather than the routed skb. Check the routed skb instead; the original will have been freed at this point. Fixes: 8892c0490779 ("mctp: Add route input to socket tests") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/kernel-janitors/4ad204f0-94cf-46c5-bdab-49592addf315@kili.mountain/ Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240816-mctp-kunit-skb-fix-v1-1-3c367ac89c27@codeconstruct.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-19rpcrdma: Trace connection registration and unregistrationChuck Lever1-0/+2
These new trace points record xarray indices and the time of endpoint registration and unregistration, to co-ordinate with device removal events. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-08-19rpcrdma: Use XA_FLAGS_ALLOC instead of XA_FLAGS_ALLOC1Chuck Lever1-1/+1
Nit: The built-in xa_limit_32b range starts at 0, but XA_FLAGS_ALLOC1 configures the xarray's allocator to start at 1. Adopt the more conventional XA_FLAGS_ALLOC because there's no mechanical reason to skip 0. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-08-19rpcrdma: Device kref is over-incremented on error from xa_allocChuck Lever1-1/+1
If the device's reference count is too high, the device completion callback never fires. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-08-19tcp: prevent concurrent execution of tcp_sk_exit_batchFlorian Westphal1-0/+14
Its possible that two threads call tcp_sk_exit_batch() concurrently, once from the cleanup_net workqueue, once from a task that failed to clone a new netns. In the latter case, error unwinding calls the exit handlers in reverse order for the 'failed' netns. tcp_sk_exit_batch() calls tcp_twsk_purge(). Problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"), this function picks up twsk in any dying netns, not just the one passed in via exit_batch list. This means that the error unwind of setup_net() can "steal" and destroy timewait sockets belonging to the exiting netns. This allows the netns exit worker to proceed to call WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount)); without the expected 1 -> 0 transition, which then splats. At same time, error unwind path that is also running inet_twsk_purge() will splat as well: WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 ... refcount_dec include/linux/refcount.h:351 [inline] inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70 inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304 tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522 ops_exit_list+0x128/0x180 net/core/net_namespace.c:178 setup_net+0x714/0xb40 net/core/net_namespace.c:375 copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508 create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110 ... because refcount_dec() of tw_refcount unexpectedly dropped to 0. This doesn't seem like an actual bug (no tw sockets got lost and I don't see a use-after-free) but as erroneous trigger of debug check. Add a mutex to force strict ordering: the task that calls tcp_twsk_purge() blocks other task from doing final _dec_and_test before mutex-owner has removed all tw sockets of dying netns. Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.") Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/ Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/ Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240812222857.29837-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-16Merge tag 'for-net-2024-08-15' of ↵Jakub Kicinski4-85/+84
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - MGMT: Add error handling to pair_device() - HCI: Invert LE State quirk to be opt-out rather then opt-in - hci_core: Fix LE quote calculation - SMP: Fix assumption of Central always being Initiator * tag 'for-net-2024-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: MGMT: Add error handling to pair_device() Bluetooth: SMP: Fix assumption of Central always being Initiator Bluetooth: hci_core: Fix LE quote calculation Bluetooth: HCI: Invert LE State quirk to be opt-out rather then opt-in ==================== Link: https://patch.msgid.link/20240815171950.1082068-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-16net: dsa: provide a software untagging function on RX for VLAN-aware bridgesVladimir Oltean2-32/+108
Through code analysis, I realized that the ds->untag_bridge_pvid logic is contradictory - see the newly added FIXME above the kernel-doc for dsa_software_untag_vlan_unaware_bridge(). Moreover, for the Felix driver, I need something very similar, but which is actually _not_ contradictory: untag the bridge PVID on RX, but for VLAN-aware bridges. The existing logic does it for VLAN-unaware bridges. Since I don't want to change the functionality of drivers which were supposedly properly tested with the ds->untag_bridge_pvid flag, I have introduced a new one: ds->untag_vlan_aware_bridge_pvid, and I have refactored the DSA reception code into a common path for both flags. TODO: both flags should be unified under a single ds->software_vlan_untag, which users of both current flags should set. This is not something that can be carried out right away. It needs very careful examination of all drivers which make use of this functionality, since some of them actually get this wrong in the first place. For example, commit 9130c2d30c17 ("net: dsa: microchip: ksz8795: Use software untagging on CPU port") uses this in a driver which has ds->configure_vlan_while_not_filtering = true. The latter mechanism has been known for many years to be broken by design: https://lore.kernel.org/netdev/CABumfLzJmXDN_W-8Z=p9KyKUVi_HhS7o_poBkeKHS2BkAiyYpw@mail.gmail.com/ and we have the situation of 2 bugs canceling each other. There is no private VLAN, and the port follows the PVID of the VLAN-unaware bridge. So, it's kinda ok for that driver to use the ds->untag_bridge_pvid mechanism, in a broken way. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-16net: mscc: ocelot: use ocelot_xmit_get_vlan_info() also for FDMA and ↵Vladimir Oltean1-35/+2
register injection Problem description ------------------- On an NXP LS1028A (felix DSA driver) with the following configuration: - ocelot-8021q tagging protocol - VLAN-aware bridge (with STP) spanning at least swp0 and swp1 - 8021q VLAN upper interfaces on swp0 and swp1: swp0.700, swp1.700 - ptp4l on swp0.700 and swp1.700 we see that the ptp4l instances do not see each other's traffic, and they all go to the grand master state due to the ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES condition. Jumping to the conclusion for the impatient ------------------------------------------- There is a zero-day bug in the ocelot switchdev driver in the way it handles VLAN-tagged packet injection. The correct logic already exists in the source code, in function ocelot_xmit_get_vlan_info() added by commit 5ca721c54d86 ("net: dsa: tag_ocelot: set the classified VLAN during xmit"). But it is used only for normal NPI-based injection with the DSA "ocelot" tagging protocol. The other injection code paths (register-based and FDMA-based) roll their own wrong logic. This affects and was noticed on the DSA "ocelot-8021q" protocol because it uses register-based injection. By moving ocelot_xmit_get_vlan_info() to a place that's common for both the DSA tagger and the ocelot switch library, it can also be called from ocelot_port_inject_frame() in ocelot.c. We need to touch the lines with ocelot_ifh_port_set()'s prototype anyway, so let's rename it to something clearer regarding what it does, and add a kernel-doc. ocelot_ifh_set_basic() should do. Investigation notes ------------------- Debugging reveals that PTP event (aka those carrying timestamps, like Sync) frames injected into swp0.700 (but also swp1.700) hit the wire with two VLAN tags: 00000000: 01 1b 19 00 00 00 00 01 02 03 04 05 81 00 02 bc ~~~~~~~~~~~ 00000010: 81 00 02 bc 88 f7 00 12 00 2c 00 00 02 00 00 00 ~~~~~~~~~~~ 00000020: 00 00 00 00 00 00 00 00 00 00 00 01 02 ff fe 03 00000030: 04 05 00 01 00 04 00 00 00 00 00 00 00 00 00 00 00000040: 00 00 The second (unexpected) VLAN tag makes felix_check_xtr_pkt() -> ptp_classify_raw() fail to see these as PTP packets at the link partner's receiving end, and return PTP_CLASS_NONE (because the BPF classifier is not written to expect 2 VLAN tags). The reason why packets have 2 VLAN tags is because the transmission code treats VLAN incorrectly. Neither ocelot switchdev, nor felix DSA, declare the NETIF_F_HW_VLAN_CTAG_TX feature. Therefore, at xmit time, all VLANs should be in the skb head, and none should be in the hwaccel area. This is done by: static struct sk_buff *validate_xmit_vlan(struct sk_buff *skb, netdev_features_t features) { if (skb_vlan_tag_present(skb) && !vlan_hw_offload_capable(features, skb->vlan_proto)) skb = __vlan_hwaccel_push_inside(skb); return skb; } But ocelot_port_inject_frame() handles things incorrectly: ocelot_ifh_port_set(ifh, port, rew_op, skb_vlan_tag_get(skb)); void ocelot_ifh_port_set(struct sk_buff *skb, void *ifh, int port, u32 rew_op) { (...) if (vlan_tag) ocelot_ifh_set_vlan_tci(ifh, vlan_tag); (...) } The way __vlan_hwaccel_push_inside() pushes the tag inside the skb head is by calling: static inline void __vlan_hwaccel_clear_tag(struct sk_buff *skb) { skb->vlan_present = 0; } which does _not_ zero out skb->vlan_tci as seen by skb_vlan_tag_get(). This means that ocelot, when it calls skb_vlan_tag_get(), sees (and uses) a residual skb->vlan_tci, while the same VLAN tag is _already_ in the skb head. The trivial fix for double VLAN headers is to replace the content of ocelot_ifh_port_set() with: if (skb_vlan_tag_present(skb)) ocelot_ifh_set_vlan_tci(ifh, skb_vlan_tag_get(skb)); but this would not be correct either, because, as mentioned, vlan_hw_offload_capable() is false for us, so we'd be inserting dead code and we'd always transmit packets with VID=0 in the injection frame header. I can't actually test the ocelot switchdev driver and rely exclusively on code inspection, but I don't think traffic from 8021q uppers has ever been injected properly, and not double-tagged. Thus I'm blaming the introduction of VLAN fields in the injection header - early driver code. As hinted at in the early conclusion, what we _want_ to happen for VLAN transmission was already described once in commit 5ca721c54d86 ("net: dsa: tag_ocelot: set the classified VLAN during xmit"). ocelot_xmit_get_vlan_info() intends to ensure that if the port through which we're transmitting is under a VLAN-aware bridge, the outer VLAN tag from the skb head is stripped from there and inserted into the injection frame header (so that the packet is processed in hardware through that actual VLAN). And in all other cases, the packet is sent with VID=0 in the injection frame header, since the port is VLAN-unaware and has logic to strip this VID on egress (making it invisible to the wire). Fixes: 08d02364b12f ("net: mscc: fix the injection header") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-15Merge tag 'net-6.11-rc4' of ↵Linus Torvalds13-101/+222
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from wireless and netfilter Current release - regressions: - udp: fall back to software USO if IPv6 extension headers are present - wifi: iwlwifi: correctly lookup DMA address in SG table Current release - new code bugs: - eth: mlx5e: fix queue stats access to non-existing channels splat Previous releases - regressions: - eth: mlx5e: take state lock during tx timeout reporter - eth: mlxbf_gige: disable RX filters until RX path initialized - eth: igc: fix reset adapter logics when tx mode change Previous releases - always broken: - tcp: update window clamping condition - netfilter: - nf_queue: drop packets with cloned unconfirmed conntracks - nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests - vsock: fix recursive ->recvmsg calls - dsa: vsc73xx: fix MDIO bus access and PHY opera - eth: gtp: pull network headers in gtp_dev_xmit() - eth: igc: fix packet still tx after gate close by reducing i226 MAC retry buffer - eth: mana: fix RX buf alloc_size alignment and atomic op panic - eth: hns3: fix a deadlock problem when config TC during resetting" * tag 'net-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits) net: hns3: use correct release function during uninitialization net: hns3: void array out of bound when loop tnl_num net: hns3: fix a deadlock problem when config TC during resetting net: hns3: use the user's cfg after reset net: hns3: fix wrong use of semaphore up selftests: net: lib: kill PIDs before del netns pse-core: Conditionally set current limit during PI regulator registration net: thunder_bgx: Fix netdev structure allocation net: ethtool: Allow write mechanism of LPL and both LPL and EPL vsock: fix recursive ->recvmsg calls selftest: af_unix: Fix kselftest compilation warnings netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests netfilter: nf_tables: Introduce nf_tables_getobj_single netfilter: nf_tables: Audit log dump reset after the fact selftests: netfilter: add test for br_netfilter+conntrack+queue combination netfilter: nf_queue: drop packets with cloned unconfirmed conntracks netfilter: flowtable: initialise extack before use netfilter: nfnetlink: Initialise extack before use in ACKs netfilter: allow ipv6 fragments to arrive on different devices tcp: Update window clamping condition ...
2024-08-15Bluetooth: MGMT: Add error handling to pair_device()Griffin Kroah-Hartman1-0/+4
hci_conn_params_add() never checks for a NULL value and could lead to a NULL pointer dereference causing a crash. Fixed by adding error handling in the function. Cc: Stable <stable@kernel.org> Fixes: 5157b8a503fa ("Bluetooth: Fix initializing conn_params in scan phase") Signed-off-by: Griffin Kroah-Hartman <griffin@kroah.com> Reported-by: Yiwei Zhang <zhan4630@purdue.edu> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-08-15Bluetooth: SMP: Fix assumption of Central always being InitiatorLuiz Augusto von Dentz1-72/+72
SMP initiator role shall be considered the one that initiates the pairing procedure with SMP_CMD_PAIRING_REQ: BLUETOOTH CORE SPECIFICATION Version 5.3 | Vol 3, Part H page 1557: Figure 2.1: LE pairing phases Note that by sending SMP_CMD_SECURITY_REQ it doesn't change the role to be Initiator. Link: https://github.com/bluez/bluez/issues/567 Fixes: b28b4943660f ("Bluetooth: Add strict checks for allowed SMP PDUs") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-08-15Bluetooth: hci_core: Fix LE quote calculationLuiz Augusto von Dentz1-12/+7
Function hci_sched_le needs to update the respective counter variable inplace other the likes of hci_quote_sent would attempt to use the possible outdated value of conn->{le_cnt,acl_cnt}. Link: https://github.com/bluez/bluez/issues/915 Fixes: 73d80deb7bdf ("Bluetooth: prioritizing data over HCI") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-08-15Bluetooth: HCI: Invert LE State quirk to be opt-out rather then opt-inLuiz Augusto von Dentz1-1/+1
This inverts the LE State quirk so by default we assume the controllers would report valid states rather than invalid which is how quirks normally behave, also this would result in HCI command failing it the LE States are really broken thus exposing the controllers that are really broken in this respect. Link: https://github.com/bluez/bluez/issues/584 Fixes: 220915857e29 ("Bluetooth: Adding driver and quirk defs for multi-role LE") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-08-15Merge tag 'nf-24-08-15' of ↵Paolo Abeni6-50/+149
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Ignores ifindex for types other than mcast/linklocal in ipv6 frag reasm, from Tom Hughes. 2) Initialize extack for begin/end netlink message marker in batch, from Donald Hunter. 3) Initialize extack for flowtable offload support, also from Donald. 4) Dropped packets with cloned unconfirmed conntracks in nfqueue, later it should be possible to explore lookup after reinject but Florian prefers this approach at this stage. From Florian Westphal. 5) Add selftest for cloned unconfirmed conntracks in nfqueue for previous update. 6) Audit after filling netlink header successfully in object dump, from Phil Sutter. 7-8) Fix concurrent dump and reset which could result in underflow counter / quota objects. netfilter pull request 24-08-15 * tag 'nf-24-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests netfilter: nf_tables: Introduce nf_tables_getobj_single netfilter: nf_tables: Audit log dump reset after the fact selftests: netfilter: add test for br_netfilter+conntrack+queue combination netfilter: nf_queue: drop packets with cloned unconfirmed conntracks netfilter: flowtable: initialise extack before use netfilter: nfnetlink: Initialise extack before use in ACKs netfilter: allow ipv6 fragments to arrive on different devices ==================== Link: https://patch.msgid.link/20240814222042.150590-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-15net: ethtool: Allow write mechanism of LPL and both LPL and EPLDanielle Ratson1-2/+6
CMIS 5.2 standard section 9.4.2 defines four types of firmware update supported mechanism: None, only LPL, only EPL, both LPL and EPL. Currently, only LPL (Local Payload) type of write firmware block is supported. However, if the module supports both LPL and EPL the flashing process wrongly fails for no supporting LPL. Fix that, by allowing the write mechanism to be LPL or both LPL and EPL. Fixes: c4f78134d45c ("ethtool: cmis_fw_update: add a layer for supporting firmware update using CDB") Reported-by: Vladyslav Mykhaliuk <vmykhaliuk@nvidia.com> Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20240812140824.3718826-1-danieller@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-15vsock: fix recursive ->recvmsg callsCong Wang2-23/+31
After a vsock socket has been added to a BPF sockmap, its prot->recvmsg has been replaced with vsock_bpf_recvmsg(). Thus the following recursiion could happen: vsock_bpf_recvmsg() -> __vsock_recvmsg() -> vsock_connectible_recvmsg() -> prot->recvmsg() -> vsock_bpf_recvmsg() again We need to fix it by calling the original ->recvmsg() without any BPF sockmap logic in __vsock_recvmsg(). Fixes: 634f1a7110b4 ("vsock: support sockmap") Reported-by: syzbot+bdb4bd87b5e22058e2a4@syzkaller.appspotmail.com Tested-by: syzbot+bdb4bd87b5e22058e2a4@syzkaller.appspotmail.com Cc: Bobby Eshleman <bobby.eshleman@bytedance.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20240812022153.86512-1-xiyou.wangcong@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-14netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requestsPhil Sutter1-13/+59
Objects' dump callbacks are not concurrency-safe per-se with reset bit set. If two CPUs perform a reset at the same time, at least counter and quota objects suffer from value underrun. Prevent this by introducing dedicated locking callbacks for nfnetlink and the asynchronous dump handling to serialize access. Fixes: 43da04a593d8 ("netfilter: nf_tables: atomic dump and reset for stateful objects") Signed-off-by: Phil Sutter <phil@nwl.cc> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14netfilter: nf_tables: Introduce nf_tables_getobj_singlePhil Sutter1-31/+44
Outsource the reply skb preparation for non-dump getrule requests into a distinct function. Prep work for object reset locking. Signed-off-by: Phil Sutter <phil@nwl.cc> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14netfilter: nf_tables: Audit log dump reset after the factPhil Sutter1-15/+13
In theory, dumpreset may fail and invalidate the preceeding log message. Fix this and use the occasion to prepare for object reset locking, which benefits from a few unrelated changes: * Add an early call to nfnetlink_unicast if not resetting which effectively skips the audit logging but also unindents it. * Extract the table's name from the netlink attribute (which is verified via earlier table lookup) to not rely upon validity of the looked up table pointer. * Do not use local variable family, it will vanish. Fixes: 8e6cf365e1d5 ("audit: log nftables configuration change events") Signed-off-by: Phil Sutter <phil@nwl.cc> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14netfilter: nf_queue: drop packets with cloned unconfirmed conntracksFlorian Westphal2-3/+38
Conntrack assumes an unconfirmed entry (not yet committed to global hash table) has a refcount of 1 and is not visible to other cores. With multicast forwarding this assumption breaks down because such skbs get cloned after being picked up, i.e. ct->use refcount is > 1. Likewise, bridge netfilter will clone broad/mutlicast frames and all frames in case they need to be flood-forwarded during learning phase. For ip multicast forwarding or plain bridge flood-forward this will "work" because packets don't leave softirq and are implicitly serialized. With nfqueue this no longer holds true, the packets get queued and can be reinjected in arbitrary ways. Disable this feature, I see no other solution. After this patch, nfqueue cannot queue packets except the last multicast/broadcast packet. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14netfilter: flowtable: initialise extack before useDonald Hunter1-1/+1
Fix missing initialisation of extack in flow offload. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14netfilter: nfnetlink: Initialise extack before use in ACKsDonald Hunter1-1/+4
Add missing extack initialisation when ACKing BATCH_BEGIN and BATCH_END. Fixes: bf2ac490d28c ("netfilter: nfnetlink: Handle ACK flags for batch messages") Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14netfilter: allow ipv6 fragments to arrive on different devicesTom Hughes1-0/+4
Commit 264640fc2c5f4 ("ipv6: distinguish frag queues by device for multicast and link-local packets") modified the ipv6 fragment reassembly logic to distinguish frag queues by device for multicast and link-local packets but in fact only the main reassembly code limits the use of the device to those address types and the netfilter reassembly code uses the device for all packets. This means that if fragments of a packet arrive on different interfaces then netfilter will fail to reassemble them and the fragments will be expired without going any further through the filters. Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units") Signed-off-by: Tom Hughes <tom@compton.nu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14tcp: Update window clamping conditionSubash Abhinov Kasiviswanathan1-16/+12
This patch is based on the discussions between Neal Cardwell and Eric Dumazet in the link https://lore.kernel.org/netdev/20240726204105.1466841-1-quic_subashab@quicinc.com/ It was correctly pointed out that tp->window_clamp would not be updated in cases where net.ipv4.tcp_moderate_rcvbuf=0 or if (copied <= tp->rcvq_space.space). While it is expected for most setups to leave the sysctl enabled, the latter condition may not end up hitting depending on the TCP receive queue size and the pattern of arriving data. The updated check should be hit only on initial MSS update from TCP_MIN_MSS to measured MSS value and subsequently if there was an update to a larger value. Fixes: 05f76b2d634e ("tcp: Adjust clamping window for applications specifying SO_RCVBUF") Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com> Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-13mptcp: correct MPTCP_SUBFLOW_ATTR_SSN_OFFSET reserved sizeEugene Syromiatnikov1-1/+1
ssn_offset field is u32 and is placed into the netlink response with nla_put_u32(), but only 2 bytes are reserved for the attribute payload in subflow_get_info_size() (even though it makes no difference in the end, as it is aligned up to 4 bytes). Supply the correct argument to the relevant nla_total_size() call to make it less confusing. Fixes: 5147dfb50832 ("mptcp: allow dumping subflow context to userspace") Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240812065024.GA19719@asgard.redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-10Merge tag 'nfsd-6.11-1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Two minor fixes for recent changes * tag 'nfsd-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: don't set SVC_SOCK_ANONYMOUS when creating nfsd sockets sunrpc: avoid -Wformat-security warning
2024-08-09udp: Fall back to software USO if IPv6 extension headers are presentJakub Sitnicki1-0/+6
In commit 10154dbded6d ("udp: Allow GSO transmit from devices with no checksum offload") we have intentionally allowed UDP GSO packets marked CHECKSUM_NONE to pass to the GSO stack, so that they can be segmented and checksummed by a software fallback when the egress device lacks these features. What was not taken into consideration is that a CHECKSUM_NONE skb can be handed over to the GSO stack also when the egress device advertises the tx-udp-segmentation / NETIF_F_GSO_UDP_L4 feature. This will happen when there are IPv6 extension headers present, which we check for in __ip6_append_data(). Syzbot has discovered this scenario, producing a warning as below: ip6tnl0: caps=(0x00000006401d7869, 0x00000006401d7869) WARNING: CPU: 0 PID: 5112 at net/core/dev.c:3293 skb_warn_bad_offload+0x166/0x1a0 net/core/dev.c:3291 Modules linked in: CPU: 0 PID: 5112 Comm: syz-executor391 Not tainted 6.10.0-rc7-syzkaller-01603-g80ab5445da62 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/07/2024 RIP: 0010:skb_warn_bad_offload+0x166/0x1a0 net/core/dev.c:3291 [...] Call Trace: <TASK> __skb_gso_segment+0x3be/0x4c0 net/core/gso.c:127 skb_gso_segment include/net/gso.h:83 [inline] validate_xmit_skb+0x585/0x1120 net/core/dev.c:3661 __dev_queue_xmit+0x17a4/0x3e90 net/core/dev.c:4415 neigh_output include/net/neighbour.h:542 [inline] ip6_finish_output2+0xffa/0x1680 net/ipv6/ip6_output.c:137 ip6_finish_output+0x41e/0x810 net/ipv6/ip6_output.c:222 ip6_send_skb+0x112/0x230 net/ipv6/ip6_output.c:1958 udp_v6_send_skb+0xbf5/0x1870 net/ipv6/udp.c:1292 udpv6_sendmsg+0x23b3/0x3270 net/ipv6/udp.c:1588 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0xef/0x270 net/socket.c:745 ____sys_sendmsg+0x525/0x7d0 net/socket.c:2585 ___sys_sendmsg net/socket.c:2639 [inline] __sys_sendmmsg+0x3b2/0x740 net/socket.c:2725 __do_sys_sendmmsg net/socket.c:2754 [inline] __se_sys_sendmmsg net/socket.c:2751 [inline] __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2751 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f [...] </TASK> We are hitting the bad offload warning because when an egress device is capable of handling segmentation offload requested by skb_shinfo(skb)->gso_type, the chain of gso_segment callbacks won't produce any segment skbs and return NULL. See the skb_gso_ok() branch in {__udp,tcp,sctp}_gso_segment helpers. To fix it, force a fallback to software USO when processing a packet with IPv6 extension headers, since we don't know if these can checksummed by all devices which offer USO. Fixes: 10154dbded6d ("udp: Allow GSO transmit from devices with no checksum offload") Reported-by: syzbot+e15b7e15b8a751a91d9a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000e1609a061d5330ce@google.com/ Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://patch.msgid.link/20240808-udp-gso-egress-from-tunnel-v4-2-f5c5b4149ab9@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09net: Make USO depend on CSUM offloadJakub Sitnicki1-9/+17
UDP segmentation offload inherently depends on checksum offload. It should not be possible to disable checksum offload while leaving USO enabled. Enforce this dependency in code. There is a single tx-udp-segmentation feature flag to indicate support for both IPv4/6, hence the devices wishing to support USO must offer checksum offload for both IP versions. Fixes: 10154dbded6d ("udp: Allow GSO transmit from devices with no checksum offload") Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240808-udp-gso-egress-from-tunnel-v4-1-f5c5b4149ab9@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-08ethtool: Fix context creation with no parametersGal Pressman1-5/+8
The 'at least one change' requirement is not applicable for context creation, skip the check in such case. This allows a command such as 'ethtool -X eth0 context new' to work. The command works by mistake when using older versions of userspace ethtool due to an incompatibility issue where rxfh.input_xfrm is passed as zero (unset) instead of RXH_XFRM_NO_CHANGE as done with recent userspace. This patch does not try to solve the incompatibility issue. Link: https://lore.kernel.org/netdev/05ae8316-d3aa-4356-98c6-55ed4253c8a7@nvidia.com/ Fixes: 84a1d9c48200 ("net: ethtool: extend RXNFC API to support RSS spreading of filter matches") Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/20240807173352.3501746-1-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-08net: ethtool: fix off-by-one error in max RSS context IDsEdward Cree1-2/+3
Both ethtool_ops.rxfh_max_context_id and the default value used when it's not specified are supposed to be exclusive maxima (the former is documented as such; the latter, U32_MAX, cannot be used as an ID since it equals ETH_RXFH_CONTEXT_ALLOC), but xa_alloc() expects an inclusive maximum. Subtract one from 'limit' to produce an inclusive maximum, and pass that to xa_alloc(). Increase bnxt's max by one to prevent a (very minor) regression, as BNXT_MAX_ETH_RSS_CTX is an inclusive max. This is safe since bnxt is not actually hard-limited; BNXT_MAX_ETH_RSS_CTX is just a leftover from old driver code that managed context IDs itself. Rename rxfh_max_context_id to rxfh_max_num_contexts to make its semantics (hopefully) more obvious. Fixes: 847a8ab18676 ("net: ethtool: let the core choose RSS context IDs") Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/5a2d11a599aa5b0cc6141072c01accfb7758650c.1723045898.git.ecree.xilinx@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-08l2tp: fix lockdep splatJames Chapman1-2/+13
When l2tp tunnels use a socket provided by userspace, we can hit lockdep splats like the below when data is transmitted through another (unrelated) userspace socket which then gets routed over l2tp. This issue was previously discussed here: https://lore.kernel.org/netdev/87sfialu2n.fsf@cloudflare.com/ The solution is to have lockdep treat socket locks of l2tp tunnel sockets separately than those of standard INET sockets. To do so, use a different lockdep subclass where lock nesting is possible. ============================================ WARNING: possible recursive locking detected 6.10.0+ #34 Not tainted -------------------------------------------- iperf3/771 is trying to acquire lock: ffff8881027601d8 (slock-AF_INET/1){+.-.}-{2:2}, at: l2tp_xmit_skb+0x243/0x9d0 but task is already holding lock: ffff888102650d98 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1848/0x1e10 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(slock-AF_INET/1); lock(slock-AF_INET/1); *** DEADLOCK *** May be due to missing lock nesting notation 10 locks held by iperf3/771: #0: ffff888102650258 (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_sendmsg+0x1a/0x40 #1: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x4b/0xbc0 #2: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x17a/0x1130 #3: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x28b/0x9f0 #4: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0xf9/0x260 #5: ffff888102650d98 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1848/0x1e10 #6: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x4b/0xbc0 #7: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x17a/0x1130 #8: ffffffff822ac1e0 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0xcc/0x1450 #9: ffff888101f33258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock#2){+...}-{2:2}, at: __dev_queue_xmit+0x513/0x1450 stack backtrace: CPU: 2 UID: 0 PID: 771 Comm: iperf3 Not tainted 6.10.0+ #34 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Call Trace: <IRQ> dump_stack_lvl+0x69/0xa0 dump_stack+0xc/0x20 __lock_acquire+0x135d/0x2600 ? srso_alias_return_thunk+0x5/0xfbef5 lock_acquire+0xc4/0x2a0 ? l2tp_xmit_skb+0x243/0x9d0 ? __skb_checksum+0xa3/0x540 _raw_spin_lock_nested+0x35/0x50 ? l2tp_xmit_skb+0x243/0x9d0 l2tp_xmit_skb+0x243/0x9d0 l2tp_eth_dev_xmit+0x3c/0xc0 dev_hard_start_xmit+0x11e/0x420 sch_direct_xmit+0xc3/0x640 __dev_queue_xmit+0x61c/0x1450 ? ip_finish_output2+0xf4c/0x1130 ip_finish_output2+0x6b6/0x1130 ? srso_alias_return_thunk+0x5/0xfbef5 ? __ip_finish_output+0x217/0x380 ? srso_alias_return_thunk+0x5/0xfbef5 __ip_finish_output+0x217/0x380 ip_output+0x99/0x120 __ip_queue_xmit+0xae4/0xbc0 ? srso_alias_return_thunk+0x5/0xfbef5 ? srso_alias_return_thunk+0x5/0xfbef5 ? tcp_options_write.constprop.0+0xcb/0x3e0 ip_queue_xmit+0x34/0x40 __tcp_transmit_skb+0x1625/0x1890 __tcp_send_ack+0x1b8/0x340 tcp_send_ack+0x23/0x30 __tcp_ack_snd_check+0xa8/0x530 ? srso_alias_return_thunk+0x5/0xfbef5 tcp_rcv_established+0x412/0xd70 tcp_v4_do_rcv+0x299/0x420 tcp_v4_rcv+0x1991/0x1e10 ip_protocol_deliver_rcu+0x50/0x220 ip_local_deliver_finish+0x158/0x260 ip_local_deliver+0xc8/0xe0 ip_rcv+0xe5/0x1d0 ? __pfx_ip_rcv+0x10/0x10 __netif_receive_skb_one_core+0xce/0xe0 ? process_backlog+0x28b/0x9f0 __netif_receive_skb+0x34/0xd0 ? process_backlog+0x28b/0x9f0 process_backlog+0x2cb/0x9f0 __napi_poll.constprop.0+0x61/0x280 net_rx_action+0x332/0x670 ? srso_alias_return_thunk+0x5/0xfbef5 ? find_held_lock+0x2b/0x80 ? srso_alias_return_thunk+0x5/0xfbef5 ? srso_alias_return_thunk+0x5/0xfbef5 handle_softirqs+0xda/0x480 ? __dev_queue_xmit+0xa2c/0x1450 do_softirq+0xa1/0xd0 </IRQ> <TASK> __local_bh_enable_ip+0xc8/0xe0 ? __dev_queue_xmit+0xa2c/0x1450 __dev_queue_xmit+0xa48/0x1450 ? ip_finish_output2+0xf4c/0x1130 ip_finish_output2+0x6b6/0x1130 ? srso_alias_return_thunk+0x5/0xfbef5 ? __ip_finish_output+0x217/0x380 ? srso_alias_return_thunk+0x5/0xfbef5 __ip_finish_output+0x217/0x380 ip_output+0x99/0x120 __ip_queue_xmit+0xae4/0xbc0 ? srso_alias_return_thunk+0x5/0xfbef5 ? srso_alias_return_thunk+0x5/0xfbef5 ? tcp_options_write.constprop.0+0xcb/0x3e0 ip_queue_xmit+0x34/0x40 __tcp_transmit_skb+0x1625/0x1890 tcp_write_xmit+0x766/0x2fb0 ? __entry_text_end+0x102ba9/0x102bad ? srso_alias_return_thunk+0x5/0xfbef5 ? __might_fault+0x74/0xc0 ? srso_alias_return_thunk+0x5/0xfbef5 __tcp_push_pending_frames+0x56/0x190 tcp_push+0x117/0x310 tcp_sendmsg_locked+0x14c1/0x1740 tcp_sendmsg+0x28/0x40 inet_sendmsg+0x5d/0x90 sock_write_iter+0x242/0x2b0 vfs_write+0x68d/0x800 ? __pfx_sock_write_iter+0x10/0x10 ksys_write+0xc8/0xf0 __x64_sys_write+0x3d/0x50 x64_sys_call+0xfaf/0x1f50 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f4d143af992 Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 01 cc ff ff 41 54 b8 02 00 00 0 RSP: 002b:00007ffd65032058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f4d143af992 RDX: 0000000000000025 RSI: 00007f4d143f3bcc RDI: 0000000000000005 RBP: 00007f4d143f2b28 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4d143f3bcc R13: 0000000000000005 R14: 0000000000000000 R15: 00007ffd650323f0 </TASK> Fixes: 0b2c59720e65 ("l2tp: close all race conditions in l2tp_tunnel_register()") Suggested-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+6acef9e0a4d1f46c83d4@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=6acef9e0a4d1f46c83d4 CC: gnault@redhat.com CC: cong.wang@bytedance.com Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: Tom Parkin <tparkin@katalix.com> Link: https://patch.msgid.link/20240806160626.1248317-1-jchapman@katalix.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-07Merge tag 'for-net-2024-08-07' of ↵Jakub Kicinski2-0/+15
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - hci_sync: avoid dup filtering when passive scanning with adv monitor - hci_qca: don't call pwrseq_power_off() twice for QCA6390 - hci_qca: fix QCA6390 support on non-DT platforms - hci_qca: fix a NULL-pointer derefence at shutdown - l2cap: always unlock channel in l2cap_conless_channel() * tag 'for-net-2024-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: hci_sync: avoid dup filtering when passive scanning with adv monitor Bluetooth: l2cap: always unlock channel in l2cap_conless_channel() Bluetooth: hci_qca: fix a NULL-pointer derefence at shutdown Bluetooth: hci_qca: fix QCA6390 support on non-DT platforms Bluetooth: hci_qca: don't call pwrseq_power_off() twice for QCA6390 ==================== Link: https://patch.msgid.link/20240807210103.142483-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-07net/smc: add the max value of fallback reason countZhengchao Shao1-1/+1
The number of fallback reasons defined in the smc_clc.h file has reached 36. For historical reasons, some are no longer quoted, and there's 33 actually in use. So, add the max value of fallback reason count to 36. Fixes: 6ac1e6563f59 ("net/smc: support smc v2.x features validate") Fixes: 7f0620b9940b ("net/smc: support max connections per lgr negotiation") Fixes: 69b888e3bb4b ("net/smc: support max links per lgr negotiation in clc handshake") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: D. Wythe <alibuda@linux.alibaba.com> Link: https://patch.msgid.link/20240805043856.565677-1-shaozhengchao@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-07Bluetooth: hci_sync: avoid dup filtering when passive scanning with adv monitorAnton Khirnov1-0/+14
This restores behaviour (including the comment) from now-removed hci_request.c, and also matches existing code for active scanning. Without this, the duplicates filter is always active when passive scanning, which makes it impossible to work with devices that send nontrivial dynamic data in their advertisement reports. Fixes: abfeea476c68 ("Bluetooth: hci_sync: Convert MGMT_OP_START_DISCOVERY") Signed-off-by: Anton Khirnov <anton@khirnov.net> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-08-07Bluetooth: l2cap: always unlock channel in l2cap_conless_channel()Dmitry Antipov1-0/+1
Add missing call to 'l2cap_chan_unlock()' on receive error handling path in 'l2cap_conless_channel()'. Fixes: a24cce144b98 ("Bluetooth: Fix reference counting of global L2CAP channels") Reported-by: syzbot+45ac74737e866894acb0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=45ac74737e866894acb0 Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-08-06net: linkwatch: use system_unbound_wqEric Dumazet1-2/+2
linkwatch_event() grabs possibly very contended RTNL mutex. system_wq is not suitable for such work. Inspired by many noisy syzbot reports. 3 locks held by kworker/0:7/5266: #0: ffff888015480948 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3206 [inline] #0: ffff888015480948 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0x90a/0x1830 kernel/workqueue.c:3312 #1: ffffc90003f6fd00 ((linkwatch_work).work){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3207 [inline] , at: process_scheduled_works+0x945/0x1830 kernel/workqueue.c:3312 #2: ffffffff8fa6f208 (rtnl_mutex){+.+.}-{3:3}, at: linkwatch_event+0xe/0x60 net/core/link_watch.c:276 Reported-by: syzbot <syzkaller@googlegroups.com> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240805085821.1616528-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-06can: bcm: Remove proc entry when dev is unregistered.Kuniyuki Iwashima1-0/+4
syzkaller reported a warning in bcm_connect() below. [0] The repro calls connect() to vxcan1, removes vxcan1, and calls connect() with ifindex == 0. Calling connect() for a BCM socket allocates a proc entry. Then, bcm_sk(sk)->bound is set to 1 to prevent further connect(). However, removing the bound device resets bcm_sk(sk)->bound to 0 in bcm_notify(). The 2nd connect() tries to allocate a proc entry with the same name and sets NULL to bcm_sk(sk)->bcm_proc_read, leaking the original proc entry. Since the proc entry is available only for connect()ed sockets, let's clean up the entry when the bound netdev is unregistered. [0]: proc_dir_entry 'can-bcm/2456' already registered WARNING: CPU: 1 PID: 394 at fs/proc/generic.c:376 proc_register+0x645/0x8f0 fs/proc/generic.c:375 Modules linked in: CPU: 1 PID: 394 Comm: syz-executor403 Not tainted 6.10.0-rc7-g852e42cc2dd4 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 RIP: 0010:proc_register+0x645/0x8f0 fs/proc/generic.c:375 Code: 00 00 00 00 00 48 85 ed 0f 85 97 02 00 00 4d 85 f6 0f 85 9f 02 00 00 48 c7 c7 9b cb cf 87 48 89 de 4c 89 fa e8 1c 6f eb fe 90 <0f> 0b 90 90 48 c7 c7 98 37 99 89 e8 cb 7e 22 05 bb 00 00 00 10 48 RSP: 0018:ffa0000000cd7c30 EFLAGS: 00010246 RAX: 9e129be1950f0200 RBX: ff1100011b51582c RCX: ff1100011857cd80 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002 RBP: 0000000000000000 R08: ffd400000000000f R09: ff1100013e78cac0 R10: ffac800000cd7980 R11: ff1100013e12b1f0 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: ff1100011a99a2ec FS: 00007fbd7086f740(0000) GS:ff1100013fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000200071c0 CR3: 0000000118556004 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> proc_create_net_single+0x144/0x210 fs/proc/proc_net.c:220 bcm_connect+0x472/0x840 net/can/bcm.c:1673 __sys_connect_file net/socket.c:2049 [inline] __sys_connect+0x5d2/0x690 net/socket.c:2066 __do_sys_connect net/socket.c:2076 [inline] __se_sys_connect net/socket.c:2073 [inline] __x64_sys_connect+0x8f/0x100 net/socket.c:2073 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xd9/0x1c0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7fbd708b0e5d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007fff8cd33f08 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fbd708b0e5d RDX: 0000000000000010 RSI: 0000000020000040 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000000040 R09: 0000000000000040 R10: 0000000000000040 R11: 0000000000000246 R12: 00007fff8cd34098 R13: 0000000000401280 R14: 0000000000406de8 R15: 00007fbd70ab9000 </TASK> remove_proc_entry: removing non-empty directory 'net/can-bcm', leaking at least '2456' Fixes: ffd980f976e7 ("[CAN]: Add broadcast manager (bcm) protocol") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/all/20240722192842.37421-1-kuniyu@amazon.com Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2024-08-05net: bridge: mcast: wait for previous gc cycles when removing portNikolay Aleksandrov1-3/+1
syzbot hit a use-after-free[1] which is caused because the bridge doesn't make sure that all previous garbage has been collected when removing a port. What happens is: CPU 1 CPU 2 start gc cycle remove port acquire gc lock first wait for lock call br_multicasg_gc() directly acquire lock now but free port the port can be freed while grp timers still running Make sure all previous gc cycles have finished by using flush_work before freeing the port. [1] BUG: KASAN: slab-use-after-free in br_multicast_port_group_expired+0x4c0/0x550 net/bridge/br_multicast.c:861 Read of size 8 at addr ffff888071d6d000 by task syz.5.1232/9699 CPU: 1 PID: 9699 Comm: syz.5.1232 Not tainted 6.10.0-rc5-syzkaller-00021-g24ca36a562d6 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/07/2024 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114 print_address_description mm/kasan/report.c:377 [inline] print_report+0xc3/0x620 mm/kasan/report.c:488 kasan_report+0xd9/0x110 mm/kasan/report.c:601 br_multicast_port_group_expired+0x4c0/0x550 net/bridge/br_multicast.c:861 call_timer_fn+0x1a3/0x610 kernel/time/timer.c:1792 expire_timers kernel/time/timer.c:1843 [inline] __run_timers+0x74b/0xaf0 kernel/time/timer.c:2417 __run_timer_base kernel/time/timer.c:2428 [inline] __run_timer_base kernel/time/timer.c:2421 [inline] run_timer_base+0x111/0x190 kernel/time/timer.c:2437 Reported-by: syzbot+263426984509be19c9a0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=263426984509be19c9a0 Fixes: e12cec65b554 ("net: bridge: mcast: destroy all entries via gc") Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20240802080730.3206303-1-razor@blackwall.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>