Age | Commit message (Collapse) | Author | Files | Lines |
|
KMSAN caught uninit-value in tcp_create_openreq_child() [1]
This is caused by a recent change, combined by the fact
that TCP cleared num_timeout, num_retrans and sk fields only
when a request socket was about to be queued.
Under syncookie mode, a temporary request socket is used,
and req->num_timeout could contain garbage.
Lets clear these three fields sooner, there is really no
point trying to defer this and risk other bugs.
[1]
BUG: KMSAN: uninit-value in tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526
CPU: 1 PID: 13357 Comm: syz-executor591 Not tainted 5.2.0-rc4+ #3
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x191/0x1f0 lib/dump_stack.c:113
kmsan_report+0x162/0x2d0 mm/kmsan/kmsan.c:611
__msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:304
tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526
tcp_v6_syn_recv_sock+0x761/0x2d80 net/ipv6/tcp_ipv6.c:1152
tcp_get_cookie_sock+0x16e/0x6b0 net/ipv4/syncookies.c:209
cookie_v6_check+0x27e0/0x29a0 net/ipv6/syncookies.c:252
tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline]
tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344
tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554
ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397
ip6_input_finish net/ipv6/ip6_input.c:438 [inline]
NF_HOOK include/linux/netfilter.h:305 [inline]
ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447
dst_input include/net/dst.h:439 [inline]
ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
NF_HOOK include/linux/netfilter.h:305 [inline]
ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272
__netif_receive_skb_one_core net/core/dev.c:4981 [inline]
__netif_receive_skb net/core/dev.c:5095 [inline]
process_backlog+0x721/0x1410 net/core/dev.c:5906
napi_poll net/core/dev.c:6329 [inline]
net_rx_action+0x738/0x1940 net/core/dev.c:6395
__do_softirq+0x4ad/0x858 kernel/softirq.c:293
do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052
</IRQ>
do_softirq kernel/softirq.c:338 [inline]
__local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190
local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline]
ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117
ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150
NF_HOOK_COND include/linux/netfilter.h:294 [inline]
ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167
dst_output include/net/dst.h:433 [inline]
NF_HOOK include/linux/netfilter.h:305 [inline]
ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271
inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135
__tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156
tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline]
tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397
__tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573
tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118
tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403
inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427
inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470
__sock_release net/socket.c:601 [inline]
sock_close+0x156/0x490 net/socket.c:1273
__fput+0x4c9/0xba0 fs/file_table.c:280
____fput+0x37/0x40 fs/file_table.c:313
task_work_run+0x22e/0x2a0 kernel/task_work.c:113
tracehook_notify_resume include/linux/tracehook.h:185 [inline]
exit_to_usermode_loop arch/x86/entry/common.c:168 [inline]
prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199
syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279
do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305
entry_SYSCALL_64_after_hwframe+0x63/0xe7
RIP: 0033:0x401d50
Code: 01 f0 ff ff 0f 83 40 0d 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d dd 8d 2d 00 00 75 14 b8 03 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 14 0d 00 00 c3 48 83 ec 08 e8 7a 02 00 00
RSP: 002b:00007fff1cf58cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000401d50
RDX: 000000000000001c RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00000000004a9050 R08: 0000000020000040 R09: 000000000000001c
R10: 0000000020004004 R11: 0000000000000246 R12: 0000000000402ef0
R13: 0000000000402f80 R14: 0000000000000000 R15: 0000000000000000
Uninit was created at:
kmsan_save_stack_with_flags mm/kmsan/kmsan.c:201 [inline]
kmsan_internal_poison_shadow+0x53/0xa0 mm/kmsan/kmsan.c:160
kmsan_kmalloc+0xa4/0x130 mm/kmsan/kmsan_hooks.c:177
kmem_cache_alloc+0x534/0xb00 mm/slub.c:2781
reqsk_alloc include/net/request_sock.h:84 [inline]
inet_reqsk_alloc+0xa8/0x600 net/ipv4/tcp_input.c:6384
cookie_v6_check+0xadb/0x29a0 net/ipv6/syncookies.c:173
tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline]
tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344
tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554
ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397
ip6_input_finish net/ipv6/ip6_input.c:438 [inline]
NF_HOOK include/linux/netfilter.h:305 [inline]
ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447
dst_input include/net/dst.h:439 [inline]
ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
NF_HOOK include/linux/netfilter.h:305 [inline]
ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272
__netif_receive_skb_one_core net/core/dev.c:4981 [inline]
__netif_receive_skb net/core/dev.c:5095 [inline]
process_backlog+0x721/0x1410 net/core/dev.c:5906
napi_poll net/core/dev.c:6329 [inline]
net_rx_action+0x738/0x1940 net/core/dev.c:6395
__do_softirq+0x4ad/0x858 kernel/softirq.c:293
do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052
do_softirq kernel/softirq.c:338 [inline]
__local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190
local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline]
ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117
ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150
NF_HOOK_COND include/linux/netfilter.h:294 [inline]
ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167
dst_output include/net/dst.h:433 [inline]
NF_HOOK include/linux/netfilter.h:305 [inline]
ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271
inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135
__tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156
tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline]
tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397
__tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573
tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118
tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403
inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427
inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470
__sock_release net/socket.c:601 [inline]
sock_close+0x156/0x490 net/socket.c:1273
__fput+0x4c9/0xba0 fs/file_table.c:280
____fput+0x37/0x40 fs/file_table.c:313
task_work_run+0x22e/0x2a0 kernel/task_work.c:113
tracehook_notify_resume include/linux/tracehook.h:185 [inline]
exit_to_usermode_loop arch/x86/entry/common.c:168 [inline]
prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199
syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279
do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305
entry_SYSCALL_64_after_hwframe+0x63/0xe7
Fixes: 336c39a03151 ("tcp: undo init congestion window on false SYNACK timeout")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Yuchung Cheng <[email protected]>
Cc: Neal Cardwell <[email protected]>
Cc: Soheil Hassas Yeganeh <[email protected]>
Reported-by: syzbot <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Acked-by: Yuchung Cheng <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
empty_child_inc/dec() use the ternary operator for conditional
operations. The conditions involve the post/pre in/decrement
operator and the operation is only performed when the condition
is *not* true. This is hard to parse for humans, use a regular
'if' construct instead and perform the in/decrement separately.
This also fixes two warnings that are emitted about the value
of the ternary expression being unused, when building the kernel
with clang + "kbuild: Remove unnecessary -Wno-unused-value"
(https://lore.kernel.org/patchwork/patch/1089869/):
CC net/ipv4/fib_trie.o
net/ipv4/fib_trie.c:351:2: error: expression result unused [-Werror,-Wunused-value]
++tn_info(n)->empty_children ? : ++tn_info(n)->full_children;
Fixes: 95f60ea3e99a ("fib_trie: Add collapse() and should_collapse() to resize")
Signed-off-by: Matthias Kaehlcke <[email protected]>
Reviewed-by: Douglas Anderson <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Acked-by: Alexander Duyck <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
A user reported that routes are getting installed with type 0 (RTN_UNSPEC)
where before the routes were RTN_UNICAST. One example is from accel-ppp
which apparently still uses the ioctl interface and does not set
rtmsg_type. Another is the netlink interface where ipv6 does not require
rtm_type to be set (v4 does). Prior to the commit in the Fixes tag the
ipv6 stack converted type 0 to RTN_UNICAST, so restore that behavior.
Fixes: e8478e80e5a7 ("net/ipv6: Save route type in rt6_info")
Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The DRC appears to be effectively empty after an RPC/RDMA transport
reconnect. The problem is that each connection uses a different
source port, which defeats the DRC hash.
Clients always have to disconnect before they send retransmissions
to reset the connection's credit accounting, thus every retransmit
on NFS/RDMA will miss the DRC.
An NFS/RDMA client's IP source port is meaningless for RDMA
transports. The transport layer typically sets the source port value
on the connection to a random ephemeral port. The server already
ignores it for the "secure port" check. See commit 16e4d93f6de7
("NFSD: Ignore client's source port on RDMA transports").
The Linux NFS server's DRC resolves XID collisions from the same
source IP address by using the checksum of the first 200 bytes of
the RPC call header.
Signed-off-by: Chuck Lever <[email protected]>
Cc: [email protected] # v4.14+
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Even when running as VM guest (ie pr_iucv != NULL), af_iucv can still
open HiperTransport-based connections. For robust operation these
connections require the af_iucv_netdev_notifier, so register it
unconditionally.
Also handle any error that register_netdevice_notifier() returns.
Fixes: 9fbd87d41392 ("af_iucv: handle netdev events")
Signed-off-by: Julian Wiedmann <[email protected]>
Reviewed-by: Ursula Braun <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The HiperSockets-based transport path in af_iucv is still too closely
entangled with qeth.
With commit a647a02512ca ("s390/qeth: speed-up L3 IQD xmit"), the
relevant xmit code in qeth has begun to use skb_cow_head(). So to avoid
unnecessary skb head expansions, af_iucv must learn to
1) respect dev->needed_headroom when allocating skbs, and
2) drop the header reference before cloning the skb.
While at it, also stop hard-coding the LL-header creation stage and just
use the appropriate helper.
Fixes: a647a02512ca ("s390/qeth: speed-up L3 IQD xmit")
Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
af_iucv sockets over z/VM IUCV require that their skbs are allocated
in DMA memory. This restriction doesn't apply to connections over
HiperSockets. So only set this limit for z/VM IUCV sockets, thereby
increasing the likelihood that the large (and linear!) allocations for
HiperTransport messages succeed.
Fixes: 3881ac441f64 ("af_iucv: add HiperSockets transport")
Signed-off-by: Julian Wiedmann <[email protected]>
Reviewed-by: Ursula Braun <[email protected]>
Reviewed-by: Hendrik Brueckner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Currently, the expiration of every element in a set or map
is a read-only parameter generated at kernel side.
This change will permit to set a certain expiration date
per element that will be required, for example, during
stateful replication among several nodes.
This patch handles the NFTA_SET_ELEM_EXPIRATION in order
to configure the expiration parameter per element, or
will use the timeout in the case that the expiration
is not set.
Signed-off-by: Laura Garcia Liebana <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
nf_ct_helper_ext_add may return null, which must then be checked.
Fixes: 857b46027d6f ("netfilter: nft_ct: add ct expectations support")
Reported-by: Colin Ian King <[email protected]>
Signed-off-by: Stéphane Veyret <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
Currently functions nf_synproxy_{ipc4|ipv6}_init return an uninitialized
garbage value in variable ret on a successful return. Fix this by
returning zero on success.
Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: d7f9b2f18eae ("netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXY")
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
syzbot reported another issue caused by my recent patches. [1]
The issue here is that fqdir_exit() is initiating a work queue
and immediately returns. A bit later cleanup_net() was able
to free the MIB (percpu data) and the whole struct net was freed,
but we had active frag timers that fired and triggered use-after-free.
We need to make sure that timers can catch fqdir->dead being set,
to bailout.
Since RCU is used for the reader side, this means
we want to respect an RCU grace period between these operations :
1) qfdir->dead = 1;
2) netns dismantle (freeing of various data structure)
This patch uses new new (struct pernet_operations)->pre_exit
infrastructure to ensures a full RCU grace period
happens between fqdir_pre_exit() and fqdir_exit()
This also means we can use a regular work queue, we no
longer need rcu_work.
Tested:
$ time for i in {1..1000}; do unshare -n /bin/false;done
real 0m2.585s
user 0m0.160s
sys 0m2.214s
[1]
BUG: KASAN: use-after-free in ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
Read of size 8 at addr ffff88808b9fe330 by task syz-executor.4/11860
CPU: 1 PID: 11860 Comm: syz-executor.4 Not tainted 5.2.0-rc2+ #22
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
__kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
kasan_report+0x12/0x20 mm/kasan/common.c:614
__asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
call_timer_fn+0x193/0x720 kernel/time/timer.c:1322
expire_timers kernel/time/timer.c:1366 [inline]
__run_timers kernel/time/timer.c:1685 [inline]
__run_timers kernel/time/timer.c:1653 [inline]
run_timer_softirq+0x66f/0x1740 kernel/time/timer.c:1698
__do_softirq+0x25c/0x94c kernel/softirq.c:293
invoke_softirq kernel/softirq.c:374 [inline]
irq_exit+0x180/0x1d0 kernel/softirq.c:414
exiting_irq arch/x86/include/asm/apic.h:536 [inline]
smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806
</IRQ>
RIP: 0010:tomoyo_domain_quota_is_ok+0x131/0x540 security/tomoyo/util.c:1035
Code: 24 4c 3b 65 d0 0f 84 9c 00 00 00 e8 19 1d 73 fe 49 8d 7c 24 18 48 ba 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 0f b6 04 10 <48> 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 69 03 00 00 41 0f b6 5c
RSP: 0018:ffff88806ae079c0 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000000 RBX: 0000000000000010 RCX: ffffc9000e655000
RDX: dffffc0000000000 RSI: ffffffff82fd88a7 RDI: ffff888086202398
RBP: ffff88806ae07a00 R08: ffff88808b6c8700 R09: ffffed100d5c0f4d
R10: ffffed100d5c0f4c R11: 0000000000000000 R12: ffff888086202380
R13: 0000000000000030 R14: 00000000000000d3 R15: 0000000000000000
tomoyo_supervisor+0x2e8/0xef0 security/tomoyo/common.c:2087
tomoyo_audit_path_number_log security/tomoyo/file.c:235 [inline]
tomoyo_path_number_perm+0x42f/0x520 security/tomoyo/file.c:734
tomoyo_file_ioctl+0x23/0x30 security/tomoyo/tomoyo.c:335
security_file_ioctl+0x77/0xc0 security/security.c:1370
ksys_ioctl+0x57/0xd0 fs/ioctl.c:711
__do_sys_ioctl fs/ioctl.c:720 [inline]
__se_sys_ioctl fs/ioctl.c:718 [inline]
__x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4592c9
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f8db5e44c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004592c9
RDX: 0000000020000080 RSI: 00000000000089f1 RDI: 0000000000000006
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8db5e456d4
R13: 00000000004cc770 R14: 00000000004d5cd8 R15: 00000000ffffffff
Allocated by task 9047:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_kmalloc mm/kasan/common.c:489 [inline]
__kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
slab_post_alloc_hook mm/slab.h:437 [inline]
slab_alloc mm/slab.c:3326 [inline]
kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3488
kmem_cache_zalloc include/linux/slab.h:732 [inline]
net_alloc net/core/net_namespace.c:386 [inline]
copy_net_ns+0xed/0x340 net/core/net_namespace.c:426
create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
ksys_unshare+0x440/0x980 kernel/fork.c:2692
__do_sys_unshare kernel/fork.c:2760 [inline]
__se_sys_unshare kernel/fork.c:2758 [inline]
__x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Freed by task 2541:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
__cache_free mm/slab.c:3432 [inline]
kmem_cache_free+0x86/0x260 mm/slab.c:3698
net_free net/core/net_namespace.c:402 [inline]
net_drop_ns.part.0+0x70/0x90 net/core/net_namespace.c:409
net_drop_ns net/core/net_namespace.c:408 [inline]
cleanup_net+0x538/0x960 net/core/net_namespace.c:571
process_one_work+0x989/0x1790 kernel/workqueue.c:2269
worker_thread+0x98/0xe40 kernel/workqueue.c:2415
kthread+0x354/0x420 kernel/kthread.c:255
ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
The buggy address belongs to the object at ffff88808b9fe100
which belongs to the cache net_namespace of size 6784
The buggy address is located 560 bytes inside of
6784-byte region [ffff88808b9fe100, ffff88808b9ffb80)
The buggy address belongs to the page:
page:ffffea00022e7f80 refcount:1 mapcount:0 mapping:ffff88821b6f60c0 index:0x0 compound_mapcount: 0
flags: 0x1fffc0000010200(slab|head)
raw: 01fffc0000010200 ffffea000256f288 ffffea0001bbef08 ffff88821b6f60c0
raw: 0000000000000000 ffff88808b9fe100 0000000100000001 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88808b9fe200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88808b9fe280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88808b9fe300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88808b9fe380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88808b9fe400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 3c8fc8782044 ("inet: frags: rework rhashtable dismantle")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: syzbot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Current struct pernet_operations exit() handlers are highly
discouraged to call synchronize_rcu().
There are cases where we need them, and exit_batch() does
not help the common case where a single netns is dismantled.
This patch leverages the existing synchronize_rcu() call
in cleanup_net()
Calling optional ->pre_exit() method before ->exit() or
->exit_batch() allows to benefit from a single synchronize_rcu()
call.
Note that the synchronize_rcu() calls added in this patch
are only in error paths or slow paths.
Tested:
$ time for i in {1..1000}; do unshare -n /bin/false;done
real 0m2.612s
user 0m0.171s
sys 0m2.216s
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
For DMA mapping use-case the page_pool keeps a pointer
to the struct device, which is used in DMA map/unmap calls.
For our in-flight handling, we also need to make sure that
the struct device have not disappeared. This is assured
via using get_device/put_device API.
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Reported-by: Ivan Khoronzhuk <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The xdp tracepoints for mem id disconnect don't carry information about, why
it was not safe_to_remove. The tracepoint page_pool:page_pool_inflight in
this patch can be used for extract this info for further debugging.
This patchset also adds tracepoint for the pages_state_* release/hold
transitions, including a pointer to the page. This can be used for stats
about in-flight pages, or used to debug page leakage via keeping track of
page pointer and combining this with kprobe for __put_page().
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
These tracepoints make it easier to troubleshoot XDP mem id disconnect.
The xdp:mem_disconnect tracepoint cannot be replaced via kprobe. It is
placed at the last stable place for the pointer to struct xdp_mem_allocator,
just before it's scheduled for RCU removal. It also extract info on
'safe_to_remove' and 'force'.
Detailed info about in-flight pages is not available at this layer. The next
patch will added tracepoints needed at the page_pool layer for this.
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
If bugs exists or are introduced later e.g. by drivers misusing the API,
then we want to warn about the issue, such that developer notice. This patch
will generate a bit of noise in form of periodic pr_warn every 30 seconds.
It is not nice to have this stall warning running forever. Thus, this patch
will (after 120 attempts) force disconnect the mem id (from the rhashtable)
and free the page_pool object. This will cause fallback to the put_page() as
before, which only potentially leak DMA-mappings, if objects are really
stuck for this long. In that unlikely case, a WARN_ONCE should show us the
call stack.
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This patch is needed before we can allow drivers to use page_pool for
DMA-mappings. Today with page_pool and XDP return API, it is possible to
remove the page_pool object (from rhashtable), while there are still
in-flight packet-pages. This is safely handled via RCU and failed lookups in
__xdp_return() fallback to call put_page(), when page_pool object is gone.
In-case page is still DMA mapped, this will result in page note getting
correctly DMA unmapped.
To solve this, the page_pool is extended with tracking in-flight pages. And
XDP disconnect system queries page_pool and waits, via workqueue, for all
in-flight pages to be returned.
To avoid killing performance when tracking in-flight pages, the implement
use two (unsigned) counters, that in placed on different cache-lines, and
can be used to deduct in-flight packets. This is done by mapping the
unsigned "sequence" counters onto signed Two's complement arithmetic
operations. This is e.g. used by kernel's time_after macros, described in
kernel commit 1ba3aab3033b and 5a581b367b5, and also explained in RFC1982.
The trick is these two incrementing counters only need to be read and
compared, when checking if it's safe to free the page_pool structure. Which
will only happen when driver have disconnected RX/alloc side. Thus, on a
non-fast-path.
It is chosen that page_pool tracking is also enabled for the non-DMA
use-case, as this can be used for statistics later.
After this patch, using page_pool requires more strict resource "release",
e.g. via page_pool_release_page() that was introduced in this patchset, and
previous patches implement/fix this more strict requirement.
Drivers no-longer call page_pool_destroy(). Drivers already call
xdp_rxq_info_unreg() which call xdp_rxq_info_unreg_mem_model(), which will
attempt to disconnect the mem id, and if attempt fails schedule the
disconnect for later via delayed workqueue.
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Reviewed-by: Ilias Apalodimas <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
In case driver fails to register the page_pool with XDP return API (via
xdp_rxq_info_reg_mem_model()), then the driver can free the page_pool
resources more directly than calling page_pool_destroy(), which does a
unnecessarily RCU free procedure.
This patch is preparing for removing page_pool_destroy(), from driver
invocation.
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
When converting an xdp_frame into an SKB, and sending this into the network
stack, then the underlying XDP memory model need to release associated
resources, because the network stack don't have callbacks for XDP memory
models. The only memory model that needs this is page_pool, when a driver
use the DMA-mapping feature.
Introduce page_pool_release_page(), which basically does the same as
page_pool_unmap_page(). Add xdp_release_frame() as the XDP memory model
interface for calling it, if the memory model match MEM_TYPE_PAGE_POOL, to
save the function call overhead for others. Have cpumap call
xdp_release_frame() before xdp_scrub_frame().
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Fix error handling case, where inserting ID with rhashtable_insert_slow
fails in xdp_rxq_info_reg_mem_model, which leads to never releasing the IDA
ID, as the lookup in xdp_rxq_info_unreg_mem_model fails and thus
ida_simple_remove() is never called.
Fix by releasing ID via ida_simple_remove(), and mark xdp_rxq->mem.id with
zero, which is already checked in xdp_rxq_info_unreg_mem_model().
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Reviewed-by: Ilias Apalodimas <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
On a previous patch dma addr was stored in 'struct page'.
Use that to unmap DMA addresses used by network drivers
Signed-off-by: Ilias Apalodimas <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Based on 1 normalized pattern(s):
gplv2
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 58 file(s).
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Enrico Weigelt <[email protected]>
Reviewed-by: Allison Randal <[email protected]>
Reviewed-by: Kate Stewart <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation see readme and copying for
more details
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 9 file(s).
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Kate Stewart <[email protected]>
Reviewed-by: Enrico Weigelt <[email protected]>
Reviewed-by: Allison Randal <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Based on 2 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 4122 file(s).
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Enrico Weigelt <[email protected]>
Reviewed-by: Kate Stewart <[email protected]>
Reviewed-by: Allison Randal <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Based on 1 normalized pattern(s):
this source code is licensed under general public license version 2
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 5 file(s).
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Allison Randal <[email protected]>
Reviewed-by: Enrico Weigelt <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Based on 1 normalized pattern(s):
this work is licensed under the terms of the gnu gpl version 2
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 48 file(s).
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Allison Randal <[email protected]>
Reviewed-by: Enrico Weigelt <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license this
program is distributed in the hope that it will be useful but
without any warranty without even the implied warranty of
merchantability or fitness for a particular purpose see the gnu
general public license for more details
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 53 file(s).
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Allison Randal <[email protected]>
Reviewed-by: Alexios Zavras <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation this program is
distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details you should have received a copy of the gnu general
public license along with this program if not see http www gnu org
licenses
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 503 file(s).
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Alexios Zavras <[email protected]>
Reviewed-by: Allison Randal <[email protected]>
Reviewed-by: Enrico Weigelt <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Based on 2 normalized pattern(s):
this source code is licensed under the gnu general public license
version 2 see the file copying for more details
this source code is licensed under general public license version 2
see
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 52 file(s).
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Enrico Weigelt <[email protected]>
Reviewed-by: Allison Randal <[email protected]>
Reviewed-by: Alexios Zavras <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Implement support for previously added flow dissector meta key.
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Use previously introduced infra to obtain and store ingress ifindex
instead doing it locally.
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Add new key meta that contains ingress ifindex value and add a function
to dissect this from skb. The key and function is prepared to cover
other potential skb metadata values dissection.
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Take mlx5-next so we can take a dependent two patch series next.
Signed-off-by: Doug Ledford <[email protected]>
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Module autoload for masquerade and redirection does not work.
2) Leak in unqueued packets in nf_ct_frag6_queue(). Ignore duplicated
fragments, pretend they are placed into the queue. Patches from
Guillaume Nault.
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
Currently, hvsock can enter into a state where epoll_wait on EPOLLOUT will
not return even when the hvsock socket is writable, under some race
condition. This can happen under the following sequence:
- fd = socket(hvsocket)
- fd_out = dup(fd)
- fd_in = dup(fd)
- start a writer thread that writes data to fd_out with a combination of
epoll_wait(fd_out, EPOLLOUT) and
- start a reader thread that reads data from fd_in with a combination of
epoll_wait(fd_in, EPOLLIN)
- On the host, there are two threads that are reading/writing data to the
hvsocket
stack:
hvs_stream_has_space
hvs_notify_poll_out
vsock_poll
sock_poll
ep_poll
Race condition:
check for epollout from ep_poll():
assume no writable space in the socket
hvs_stream_has_space() returns 0
check for epollin from ep_poll():
assume socket has some free space < HVS_PKT_LEN(HVS_SEND_BUF_SIZE)
hvs_stream_has_space() will clear the channel pending send size
host will not notify the guest because the pending send size has
been cleared and so the hvsocket will never mark the
socket writable
Now, the EPOLLOUT will never return even if the socket write buffer is
empty.
The fix is to set the pending size to the default size and never change it.
This way the host will always notify the guest whenever the writable space
is bigger than the pending size. The host is already optimized to *only*
notify the guest when the pending size threshold boundary is crossed and
not everytime.
This change also reduces the cpu usage somewhat since hv_stream_has_space()
is in the hotpath of send:
vsock_stream_sendmsg()->hv_stream_has_space()
Earlier hv_stream_has_space was setting/clearing the pending size on every
call.
Signed-off-by: Sunil Muthuswamy <[email protected]>
Reviewed-by: Dexuan Cui <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Fixes an issue where TX Timestamps are not arriving on the error queue
when UDP_SEGMENT CMSG type is combined with CMSG type SO_TIMESTAMPING.
This can be illustrated with an updated updgso_bench_tx program which
includes the '-T' option to test for this condition. It also introduces
the '-P' option which will call poll() before reading the error queue.
./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18
poll timeout
udp tx: 0 MB/s 1 calls/s 1 msg/s
The "poll timeout" message above indicates that TX timestamp never
arrived.
This patch preserves tx_flags for the first UDP GSO segment. Only the
first segment is timestamped, even though in some cases there may be
benefital in timestamping both the first and last segment.
Factors in deciding on first segment timestamp only:
- Timestamping both first and last segmented is not feasible. Hardware
can only have one outstanding TS request at a time.
- Timestamping last segment may under report network latency of the
previous segments. Even though the doorbell is suppressed, the ring
producer counter has been incremented.
- Timestamping the first segment has the upside in that it reports
timestamps from the application's view, e.g. RTT.
- Timestamping the first segment has the downside that it may
underreport tx host network latency. It appears that we have to pick
one or the other. And possibly follow-up with a config flag to choose
behavior.
v2: Remove tests as noted by Willem de Bruijn <[email protected]>
Moving tests from net to net-next
v3: Update only relevant tx_flag bits as per
Willem de Bruijn <[email protected]>
v4: Update comments and commit message as per
Willem de Bruijn <[email protected]>
Fixes: ee80d1ebe5ba ("udp: add udp gso")
Signed-off-by: Fred Klassen <[email protected]>
Acked-by: Willem de Bruijn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Brendan reports that the use of netem's packet corruption capability
leads to strange crashes. This seems to be caused by
commit d66280b12bd7 ("net: netem: use a list in addition to rbtree")
which uses skb->next pointer to construct a fast-path queue of
in-order skbs.
Packet corruption code has to invoke skb_gso_segment() in case
of skbs in need of GSO. skb_gso_segment() returns a list of
skbs. If next pointers of the skbs on that list do not get cleared
fast path list may point to freed skbs or skbs which are also on
the RB tree.
Let's say skb gets segmented into 3 frames:
A -> B -> C
A gets hooked to the t_head t_tail list by tfifo_enqueue(), but it's
next pointer didn't get cleared so we have:
h t
|/
A -> B -> C
Now if B and C get also get enqueued successfully all is fine, because
tfifo_enqueue() will overwrite the list in order. IOW:
Enqueue B:
h t
| |
A -> B C
Enqueue C:
h t
| |
A -> B -> C
But if B and C get reordered we may end up with:
h t RB tree
|/ |
A -> B -> C B
\
C
Or if they get dropped just:
h t
|/
A -> B -> C
where A and B are already freed.
To reproduce either limit has to be set low to cause freeing of
segs or reorders have to happen (due to delay jitter).
Note that we only have to mark the first segment as not on the
list, "finish_segs" handling of other frags already does that.
Another caveat is that qdisc_drop_all() still has to free all
segments correctly in case of drop of first segment, therefore
we re-link segs before calling it.
v2:
- re-link before drop, v1 was leaking non-first segs if limit
was hit at the first seg
- better commit message which lead to discovering the above :)
Reported-by: Brendan Galloway <[email protected]>
Fixes: d66280b12bd7 ("net: netem: use a list in addition to rbtree")
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Dirk van der Merwe <[email protected]>
Acked-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
When GSO frame has to be corrupted netem uses skb_gso_segment()
to produce the list of frames, and re-enqueues the segments one
by one. The backlog length has to be adjusted to account for
new frames.
The current calculation is incorrect, leading to wrong backlog
lengths in the parent qdisc (both bytes and packets), and
incorrect packet backlog count in netem itself.
Parent backlog goes negative, netem's packet backlog counts
all non-first segments twice (thus remaining non-zero even
after qdisc is emptied).
Move the variables used to count the adjustment into local
scope to make 100% sure they aren't used at any stage in
backports.
Fixes: 6071bd1aa13e ("netem: Segment GSO packets on enqueue")
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Dirk van der Merwe <[email protected]>
Acked-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
udp_tunnel(6)_xmit_skb() called by tipc_udp_xmit() expects a tunnel device
to count packets on dev->tstats, a perpcu variable. However, TIPC is using
udp tunnel with no tunnel device, and pass the lower dev, like veth device
that only initializes dev->lstats(a perpcu variable) when creating it.
Later iptunnel_xmit_stats() called by ip(6)tunnel_xmit() thinks the dev as
a tunnel device, and uses dev->tstats instead of dev->lstats. tstats' each
pointer points to a bigger struct than lstats, so when tstats->tx_bytes is
increased, other percpu variable's members could be overwritten.
syzbot has reported quite a few crashes due to fib_nh_common percpu member
'nhc_pcpu_rth_output' overwritten, call traces are like:
BUG: KASAN: slab-out-of-bounds in rt_cache_valid+0x158/0x190
net/ipv4/route.c:1556
rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556
__mkroute_output net/ipv4/route.c:2332 [inline]
ip_route_output_key_hash_rcu+0x819/0x2d50 net/ipv4/route.c:2564
ip_route_output_key_hash+0x1ef/0x360 net/ipv4/route.c:2393
__ip_route_output_key include/net/route.h:125 [inline]
ip_route_output_flow+0x28/0xc0 net/ipv4/route.c:2651
ip_route_output_key include/net/route.h:135 [inline]
...
or:
kasan: GPF could be caused by NULL-ptr deref or user memory access
RIP: 0010:dst_dev_put+0x24/0x290 net/core/dst.c:168
<IRQ>
rt_fibinfo_free_cpus net/ipv4/fib_semantics.c:200 [inline]
free_fib_info_rcu+0x2e1/0x490 net/ipv4/fib_semantics.c:217
__rcu_reclaim kernel/rcu/rcu.h:240 [inline]
rcu_do_batch kernel/rcu/tree.c:2437 [inline]
invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline]
rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697
...
The issue exists since tunnel stats update is moved to iptunnel_xmit by
Commit 039f50629b7f ("ip_tunnel: Move stats update to iptunnel_xmit()"),
and here to fix it by passing a NULL tunnel dev to udp_tunnel(6)_xmit_skb
so that the packets counting won't happen on dev->tstats.
Reported-by: [email protected]
Reported-by: [email protected]
Reported-by: [email protected]
Reported-by: [email protected]
Reported-by: [email protected]
Reported-by: [email protected]
Fixes: 039f50629b7f ("ip_tunnel: Move stats update to iptunnel_xmit()")
Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
iptunnel_xmit() works as a common function, also used by a udp tunnel
which doesn't have to have a tunnel device, like how TIPC works with
udp media.
In these cases, we should allow not to count pkts on dev's tstats, so
that udp tunnel can work with no tunnel device safely.
Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
in IPoIB case we can't see a VF broadcast address for but
can see for PF
Before:
11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast
state UP mode DEFAULT group default qlen 256
link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
vf 0 MAC 14:80:00:00:66:fe, spoof checking off, link-state disable,
trust off, query_rss off
...
After:
11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast
state UP mode DEFAULT group default qlen 256
link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
vf 0 link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof
checking off, link-state disable, trust off, query_rss off
v1->v2: add the IFLA_VF_BROADCAST constant
v2->v3: put IFLA_VF_BROADCAST at the end
to avoid KABI breakage and set NLA_REJECT
dev_setlink
Signed-off-by: Denis Kirjanov <[email protected]>
Acked-by: Doug Ledford <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
In sock_getsockopt(), 'optlen' is fetched the first time from userspace.
'len < 0' is then checked. Then in condition 'SO_MEMINFO', 'optlen' is
fetched the second time from userspace.
If change it between two fetches may cause security problems or unexpected
behaivor, and there is no reason to fetch it a second time.
To fix this, we need to remove the second fetch.
Signed-off-by: JingYi Hou <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
It appears that a FAILOVER_MSG can come from peer even when the failure
link is resetting (i.e. just after the 'node_write_unlock()'...). This
means the failover procedure on the node has not been started yet.
The situation is as follows:
node1 node2
linkb linka linka linkb
| | | |
| | x failure |
| | RESETTING |
| | | |
| x failure RESET |
| RESETTING FAILINGOVER |
| | (FAILOVER_MSG) | |
|<-------------------------------------------------|
| *FAILINGOVER | | |
| | (dummy FAILOVER_MSG) | |
|------------------------------------------------->|
| RESET | | FAILOVER_END
| FAILINGOVER RESET |
. . . .
. . . .
. . . .
Once this happens, the link failover procedure will be triggered
wrongly on the receiving node since the node isn't in FAILINGOVER state
but then another link failover will be carried out.
The consequences are:
1) A peer might get stuck in FAILINGOVER state because the 'sync_point'
was set, reset and set incorrectly, the criteria to end the failover
would not be met, it could keep waiting for a message that has already
received.
2) The early FAILOVER_MSG(s) could be queued in the link failover
deferdq but would be purged or not pulled out because the 'drop_point'
was not set correctly.
3) The early FAILOVER_MSG(s) could be dropped too.
4) The dummy FAILOVER_MSG could make the peer leaving FAILINGOVER state
shortly, but later on it would be restarted.
The same situation can also happen when the link is in PEER_RESET state
and a FAILOVER_MSG arrives.
The commit resolves the issues by forcing the link down immediately, so
the failover procedure will be started normally (which is the same as
when receiving a FAILOVER_MSG and the link is in up state).
Also, the function "tipc_node_link_failover()" is toughen to avoid such
a situation from happening.
Acked-by: Jon Maloy <[email protected]>
Signed-off-by: Tuong Lien <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Both listeners - mlxsw and netdevsim - of IPv6 FIB notifications are now
ready to handle IPv6 multipath notifications.
Therefore, stop ignoring such notifications in both drivers and stop
sending notification for each added / deleted nexthop.
v2:
* Remove 'multipath_rt' from 'struct fib6_entry_notifier_info'
Signed-off-by: Ido Schimmel <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
If all the nexthops of a multipath route are being deleted, send one
notification for the entire route, instead of one per-nexthop.
Signed-off-by: Ido Schimmel <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Emit a notification when a multipath routes is added or replace.
Note that unlike the replace notifications sent from fib6_add_rt2node(),
it is possible we are sending a 'FIB_EVENT_ENTRY_REPLACE' when a route
was merely added and not replaced.
Signed-off-by: Ido Schimmel <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Extend the IPv6 FIB notifier info with number of sibling routes being
notified.
This will later allow listeners to process one notification for a
multipath routes instead of N, where N is the number of nexthops.
Signed-off-by: Ido Schimmel <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Honestly all the conflicts were simple overlapping changes,
nothing really interesting to report.
Signed-off-by: David S. Miller <[email protected]>
|
|
Causes crash when lifetime expires on an adress as garbage is
dereferenced soon after.
This used to look like this:
for (ifap = &ifa->ifa_dev->ifa_list;
*ifap != NULL; ifap = &(*ifap)->ifa_next) {
if (*ifap == ifa) ...
but this was changed to:
struct in_ifaddr *tmp;
ifap = &ifa->ifa_dev->ifa_list;
tmp = rtnl_dereference(*ifap);
while (tmp) {
tmp = rtnl_dereference(tmp->ifa_next); // Bogus
if (rtnl_dereference(*ifap) == ifa) {
...
ifap = &tmp->ifa_next; // Can be NULL
tmp = rtnl_dereference(*ifap); // Dereference
}
}
Remove the bogus assigment/list entry skip.
Fixes: 2638eb8b50cf ("net: ipv4: provide __rcu annotation for ifa_list")
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Pull networking fixes from David Miller:
"Lots of bug fixes here:
1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer.
2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John
Crispin.
3) Use after free in psock backlog workqueue, from John Fastabend.
4) Fix source port matching in fdb peer flow rule of mlx5, from Raed
Salem.
5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet.
6) Network header needs to be set for packet redirect in nfp, from
John Hurley.
7) Fix udp zerocopy refcnt, from Willem de Bruijn.
8) Don't assume linear buffers in vxlan and geneve error handlers,
from Stefano Brivio.
9) Fix TOS matching in mlxsw, from Jiri Pirko.
10) More SCTP cookie memory leak fixes, from Neil Horman.
11) Fix VLAN filtering in rtl8366, from Linus Walluij.
12) Various TCP SACK payload size and fragmentation memory limit fixes
from Eric Dumazet.
13) Use after free in pneigh_get_next(), also from Eric Dumazet.
14) LAPB control block leak fix from Jeremy Sowden"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits)
lapb: fixed leak of control-blocks.
tipc: purge deferredq list for each grp member in tipc_group_delete
ax25: fix inconsistent lock state in ax25_destroy_timer
neigh: fix use-after-free read in pneigh_get_next
tcp: fix compile error if !CONFIG_SYSCTL
hv_sock: Suppress bogus "may be used uninitialized" warnings
be2net: Fix number of Rx queues used for flow hashing
net: handle 802.1P vlan 0 packets properly
tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
tcp: add tcp_min_snd_mss sysctl
tcp: tcp_fragment() should apply sane memory limits
tcp: limit payload size of sacked skbs
Revert "net: phylink: set the autoneg state in phylink_phy_change"
bpf: fix nested bpf tracepoints with per-cpu data
bpf: Fix out of bounds memory access in bpf_sk_storage
vsock/virtio: set SOCK_DONE on peer shutdown
net: dsa: rtl8366: Fix up VLAN filtering
net: phylink: set the autoneg state in phylink_phy_change
net: add high_order_alloc_disable sysctl/static key
tcp: add tcp_tx_skb_cache sysctl
...
|