aboutsummaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2023-02-20scm: add user copy checks to put_cmsg()Eric Dumazet1-0/+2
This is a followup of commit 2558b8039d05 ("net: use a bounce buffer for copying skb->mark") x86 and powerpc define user_access_begin, meaning that they are not able to perform user copy checks when using user_write_access_begin() / unsafe_copy_to_user() and friends [1] Instead of waiting bugs to trigger on other arches, add a check_object_size() in put_cmsg() to make sure that new code tested on x86 with CONFIG_HARDENED_USERCOPY=y will perform more security checks. [1] We can not generically call check_object_size() from unsafe_copy_to_user() because UACCESS is enabled at this point. Signed-off-by: Eric Dumazet <[email protected]> Cc: Kees Cook <[email protected]> Acked-by: Kees Cook <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-20devlink: drop leftover duplicate/unused codePaolo Abeni1-13/+0
The recent merge from net left-over some unused code in leftover.c - nomen omen. Just drop the unused bits. Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-20net: make default_rps_mask a per netns attributePaolo Abeni2-20/+54
That really was meant to be a per netns attribute from the beginning. The idea is that once proper isolation is in place in the main namespace, additional demux in the child namespaces will be redundant. Let's make child netns default rps mask empty by default. To avoid bloating the netns with a possibly large cpumask, allocate it on-demand during the first write operation. Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextDavid S. Miller9-11/+75
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for net-next: 1) Add safeguard to check for NULL tupe in objects updates via NFT_MSG_NEWOBJ, this should not ever happen. From Alok Tiwari. 2) Incorrect pointer check in the new destroy rule command, from Yang Yingliang. 3) Incorrect status bitcheck in nf_conntrack_udp_packet(), from Florian Westphal. 4) Simplify seq_print_acct(), from Ilia Gavrilov. 5) Use 2-arg optimal variant of kfree_rcu() in IPVS, from Julian Anastasov. 6) TCP connection enters CLOSE state in conntrack for locally originated TCP reset packet from the reject target, from Florian Westphal. The fixes #2 and #3 in this series address issues from the previous pull nf-next request in this net-next cycle. ==================== Signed-off-by: David S. Miller <[email protected]>
2023-02-20l2tp: Avoid possible recursive deadlock in l2tp_tunnel_register()Shigeru Yoshida1-58/+67
When a file descriptor of pppol2tp socket is passed as file descriptor of UDP socket, a recursive deadlock occurs in l2tp_tunnel_register(). This situation is reproduced by the following program: int main(void) { int sock; struct sockaddr_pppol2tp addr; sock = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP); if (sock < 0) { perror("socket"); return 1; } addr.sa_family = AF_PPPOX; addr.sa_protocol = PX_PROTO_OL2TP; addr.pppol2tp.pid = 0; addr.pppol2tp.fd = sock; addr.pppol2tp.addr.sin_family = PF_INET; addr.pppol2tp.addr.sin_port = htons(0); addr.pppol2tp.addr.sin_addr.s_addr = inet_addr("192.168.0.1"); addr.pppol2tp.s_tunnel = 1; addr.pppol2tp.s_session = 0; addr.pppol2tp.d_tunnel = 0; addr.pppol2tp.d_session = 0; if (connect(sock, (const struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("connect"); return 1; } return 0; } This program causes the following lockdep warning: ============================================ WARNING: possible recursive locking detected 6.2.0-rc5-00205-gc96618275234 #56 Not tainted -------------------------------------------- repro/8607 is trying to acquire lock: ffff8880213c8130 (sk_lock-AF_PPPOX){+.+.}-{0:0}, at: l2tp_tunnel_register+0x2b7/0x11c0 but task is already holding lock: ffff8880213c8130 (sk_lock-AF_PPPOX){+.+.}-{0:0}, at: pppol2tp_connect+0xa82/0x1a30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(sk_lock-AF_PPPOX); lock(sk_lock-AF_PPPOX); *** DEADLOCK *** May be due to missing lock nesting notation 1 lock held by repro/8607: #0: ffff8880213c8130 (sk_lock-AF_PPPOX){+.+.}-{0:0}, at: pppol2tp_connect+0xa82/0x1a30 stack backtrace: CPU: 0 PID: 8607 Comm: repro Not tainted 6.2.0-rc5-00205-gc96618275234 #56 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x100/0x178 __lock_acquire.cold+0x119/0x3b9 ? lockdep_hardirqs_on_prepare+0x410/0x410 lock_acquire+0x1e0/0x610 ? l2tp_tunnel_register+0x2b7/0x11c0 ? lock_downgrade+0x710/0x710 ? __fget_files+0x283/0x3e0 lock_sock_nested+0x3a/0xf0 ? l2tp_tunnel_register+0x2b7/0x11c0 l2tp_tunnel_register+0x2b7/0x11c0 ? sprintf+0xc4/0x100 ? l2tp_tunnel_del_work+0x6b0/0x6b0 ? debug_object_deactivate+0x320/0x320 ? lockdep_init_map_type+0x16d/0x7a0 ? lockdep_init_map_type+0x16d/0x7a0 ? l2tp_tunnel_create+0x2bf/0x4b0 ? l2tp_tunnel_create+0x3c6/0x4b0 pppol2tp_connect+0x14e1/0x1a30 ? pppol2tp_put_sk+0xd0/0xd0 ? aa_sk_perm+0x2b7/0xa80 ? aa_af_perm+0x260/0x260 ? bpf_lsm_socket_connect+0x9/0x10 ? pppol2tp_put_sk+0xd0/0xd0 __sys_connect_file+0x14f/0x190 __sys_connect+0x133/0x160 ? __sys_connect_file+0x190/0x190 ? lockdep_hardirqs_on+0x7d/0x100 ? ktime_get_coarse_real_ts64+0x1b7/0x200 ? ktime_get_coarse_real_ts64+0x147/0x200 ? __audit_syscall_entry+0x396/0x500 __x64_sys_connect+0x72/0xb0 do_syscall_64+0x38/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd This patch fixes the issue by getting/creating the tunnel before locking the pppol2tp socket. Fixes: 0b2c59720e65 ("l2tp: close all race conditions in l2tp_tunnel_register()") Cc: Cong Wang <[email protected]> Signed-off-by: Shigeru Yoshida <[email protected]> Reviewed-by: Guillaume Nault <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-20ipv6: icmp6: add drop reason support to icmpv6_echo_reply()Eric Dumazet1-5/+8
Change icmpv6_echo_reply() to return a drop reason. For the moment, return NOT_SPECIFIED or SKB_CONSUMED. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-20ipv6: icmp6: add SKB_DROP_REASON_IPV6_NDISC_NS_OTHERHOSTEric Dumazet1-1/+3
Hosts can often receive neighbour discovery messages that are not for them. Use a dedicated drop reason to make clear the packet is dropped for this normal case. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-20ipv6: icmp6: add SKB_DROP_REASON_IPV6_NDISC_BAD_OPTIONSEric Dumazet1-17/+10
This is a generic drop reason for any error detected in ndisc_parse_options(). Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-20ipv6: icmp6: add drop reason support to ndisc_redirect_rcv()Eric Dumazet1-10/+11
Change ndisc_redirect_rcv() to return a drop reason. For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED and values from icmpv6_notify(). More reasons are added later. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-20ipv6: icmp6: add drop reason support to ndisc_router_discovery()Eric Dumazet1-18/+19
Change ndisc_router_discovery() to return a drop reason. For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED and SKB_CONSUMED. More reasons are added later. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-20ipv6: icmp6: add drop reason support to ndisc_recv_rs()Eric Dumazet1-5/+7
Change ndisc_recv_rs() to return a drop reason. For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED or SKB_CONSUMED. More reasons are added later. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-20ipv6: icmp6: add drop reason support to ndisc_recv_na()Eric Dumazet1-14/+14
Change ndisc_recv_na() to return a drop reason. For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED or SKB_CONSUMED. More reasons are added later. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-20ipv6: icmp6: add drop reason support to ndisc_recv_ns()Eric Dumazet1-16/+18
Change ndisc_recv_ns() to return a drop reason. For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED or SKB_CONSUMED. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-20net: add location to trace_consume_skb()Eric Dumazet2-5/+5
kfree_skb() includes the location, it makes sense to add it to consume_skb() as well. After patch: taskd_EventMana 8602 [004] 420.406239: skb:consume_skb: skbaddr=0xffff893a4a6d0500 location=unix_stream_read_generic swapper 0 [011] 422.732607: skb:consume_skb: skbaddr=0xffff89597f68cee0 location=mlx4_en_free_tx_desc discipline 9141 [043] 423.065653: skb:consume_skb: skbaddr=0xffff893a487e9c00 location=skb_consume_udp swapper 0 [010] 423.073166: skb:consume_skb: skbaddr=0xffff8949ce9cdb00 location=icmpv6_rcv borglet 8672 [014] 425.628256: skb:consume_skb: skbaddr=0xffff8949c42e9400 location=netlink_dump swapper 0 [028] 426.263317: skb:consume_skb: skbaddr=0xffff893b1589dce0 location=net_rx_action wget 14339 [009] 426.686380: skb:consume_skb: skbaddr=0xffff893a51b552e0 location=tcp_rcv_state_process Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-20xsk: support use vaddr as ringXuan Zhuo3-13/+8
When we try to start AF_XDP on some machines with long running time, due to the machine's memory fragmentation problem, there is no sufficient contiguous physical memory that will cause the start failure. If the size of the queue is 8 * 1024, then the size of the desc[] is 8 * 1024 * 8 = 16 * PAGE, but we also add struct xdp_ring size, so it is 16page+. This is necessary to apply for a 4-order memory. If there are a lot of queues, it is difficult to these machine with long running time. Here, that we actually waste 15 pages. 4-Order memory is 32 pages, but we only use 17 pages. This patch replaces __get_free_pages() by vmalloc() to allocate memory to solve these problems. Signed-off-by: Xuan Zhuo <[email protected]> Acked-by: Magnus Karlsson <[email protected]> Reviewed-by: Alexander Lobakin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-20net/smc: fix application data exceptionD. Wythe1-9/+8
There is a certain probability that following exceptions will occur in the wrk benchmark test: Running 10s test @ http://11.213.45.6:80 8 threads and 64 connections Thread Stats Avg Stdev Max +/- Stdev Latency 3.72ms 13.94ms 245.33ms 94.17% Req/Sec 1.96k 713.67 5.41k 75.16% 155262 requests in 10.10s, 23.10MB read Non-2xx or 3xx responses: 3 We will find that the error is HTTP 400 error, which is a serious exception in our test, which means the application data was corrupted. Consider the following scenarios: CPU0 CPU1 buf_desc->used = 0; cmpxchg(buf_desc->used, 0, 1) deal_with(buf_desc) memset(buf_desc->cpu_addr,0); This will cause the data received by a victim connection to be cleared, thus triggering an HTTP 400 error in the server. This patch exchange the order between clear used and memset, add barrier to ensure memory consistency. Fixes: 1c5526968e27 ("net/smc: Clear memory when release and reuse buffer") Signed-off-by: D. Wythe <[email protected]> Reviewed-by: Wenjia Zhang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-20net/smc: fix potential panic dues to unprotected smc_llc_srv_add_link()D. Wythe1-0/+2
There is a certain chance to trigger the following panic: PID: 5900 TASK: ffff88c1c8af4100 CPU: 1 COMMAND: "kworker/1:48" #0 [ffff9456c1cc79a0] machine_kexec at ffffffff870665b7 #1 [ffff9456c1cc79f0] __crash_kexec at ffffffff871b4c7a #2 [ffff9456c1cc7ab0] crash_kexec at ffffffff871b5b60 #3 [ffff9456c1cc7ac0] oops_end at ffffffff87026ce7 #4 [ffff9456c1cc7ae0] page_fault_oops at ffffffff87075715 #5 [ffff9456c1cc7b58] exc_page_fault at ffffffff87ad0654 #6 [ffff9456c1cc7b80] asm_exc_page_fault at ffffffff87c00b62 [exception RIP: ib_alloc_mr+19] RIP: ffffffffc0c9cce3 RSP: ffff9456c1cc7c38 RFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000004 RDX: 0000000000000010 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88c1ea281d00 R8: 000000020a34ffff R9: ffff88c1350bbb20 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000010 R14: ffff88c1ab040a50 R15: ffff88c1ea281d00 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff9456c1cc7c60] smc_ib_get_memory_region at ffffffffc0aff6df [smc] #8 [ffff9456c1cc7c88] smcr_buf_map_link at ffffffffc0b0278c [smc] #9 [ffff9456c1cc7ce0] __smc_buf_create at ffffffffc0b03586 [smc] The reason here is that when the server tries to create a second link, smc_llc_srv_add_link() has no protection and may add a new link to link group. This breaks the security environment protected by llc_conf_mutex. Fixes: 2d2209f20189 ("net/smc: first part of add link processing as SMC server") Signed-off-by: D. Wythe <[email protected]> Reviewed-by: Larysa Zaremba <[email protected]> Reviewed-by: Wenjia Zhang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-20net/sched: taprio: dynamic max_sdu larger than the max_mtu is unlimitedVladimir Oltean1-0/+2
It makes no sense to keep randomly large max_sdu values, especially if larger than the device's max_mtu. These are visible in "tc qdisc show". Such a max_sdu is practically unlimited and will cause no packets for that traffic class to be dropped on enqueue. Just set max_sdu_dynamic to U32_MAX, which in the logic below causes taprio to save a max_frm_len of U32_MAX and a max_sdu presented to user space of 0 (unlimited). Signed-off-by: Vladimir Oltean <[email protected]> Reviewed-by: Kurt Kanzenbach <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-02-20net/sched: taprio: don't allow dynamic max_sdu to go negative after stab ↵Vladimir Oltean1-1/+7
adjustment The overhead specified in the size table comes from the user. With small time intervals (or gates always closed), the overhead can be larger than the max interval for that traffic class, and their difference is negative. What we want to happen is for max_sdu_dynamic to have the smallest non-zero value possible (1) which means that all packets on that traffic class are dropped on enqueue. However, since max_sdu_dynamic is u32, a negative is represented as a large value and oversized dropping never happens. Use max_t with int to force a truncation of max_frm_len to no smaller than dev->hard_header_len + 1, which in turn makes max_sdu_dynamic no smaller than 1. Fixes: fed87cc6718a ("net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations") Signed-off-by: Vladimir Oltean <[email protected]> Reviewed-by: Kurt Kanzenbach <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-02-20net/sched: taprio: fix calculation of maximum gate durationsVladimir Oltean1-17/+17
taprio_calculate_gate_durations() depends on netdev_get_num_tc() and this returns 0. So it calculates the maximum gate durations for no traffic class. I had tested the blamed commit only with another patch in my tree, one which in the end I decided isn't valuable enough to submit ("net/sched: taprio: mask off bits in gate mask that exceed number of TCs"). The problem is that having this patch threw off my testing. By moving the netdev_set_num_tc() call earlier, we implicitly gave to taprio_calculate_gate_durations() the information it needed. Extract only the portion from the unsubmitted change which applies the mqprio configuration to the netdev earlier. Link: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/ Fixes: a306a90c8ffe ("net/sched: taprio: calculate tc gate durations") Signed-off-by: Vladimir Oltean <[email protected]> Reviewed-by: Kurt Kanzenbach <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-02-20rxrpc: Fix overproduction of wakeups to recvmsg()David Howells2-2/+16
Fix three cases of overproduction of wakeups: (1) rxrpc_input_split_jumbo() conditionally notifies the app that there's data for recvmsg() to collect if it queues some data - and then its only caller, rxrpc_input_data(), goes and wakes up recvmsg() anyway. Fix the rxrpc_input_data() to only do the wakeup in failure cases. (2) If a DATA packet is received for a call by the I/O thread whilst recvmsg() is busy draining the call's rx queue in the app thread, the call will left on the recvmsg() queue for recvmsg() to pick up, even though there isn't any data on it. This can cause an unexpected recvmsg() with a 0 return and no MSG_EOR set after the reply has been posted to a service call. Fix this by discarding pending calls from the recvmsg() queue that don't need servicing yet. (3) Not-yet-completed calls get requeued after having data read from them, even if they have no data to read. Fix this by only requeuing them if they have data waiting on them; if they don't, the I/O thread will requeue them when data arrives or they fail. Signed-off-by: David Howells <[email protected]> cc: Marc Dionne <[email protected]> cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
2023-02-18ieee802154: Drop device trackersMiquel Raynal1-20/+4
In order to prevent a device from disappearing when a background job was started, dev_hold() and dev_put() calls were made. During the stabilization phase of the scan/beacon features, it was later decided that removing the device while a background job was ongoing was a valid use case, and we should instead stop the background job and then remove the device, rather than prevent the device from being removed. This is what is currently done, which means manually reference counting the device during background jobs is no longer needed. Fixes: ed3557c947e1 ("ieee802154: Add support for user scanning requests") Fixes: 9bc114504b07 ("ieee802154: Add support for user beaconing requests") Reported-by: Jakub Kicinski <[email protected]> Signed-off-by: Miquel Raynal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Stefan Schmidt <[email protected]>
2023-02-18mac802154: Fix an always true conditionMiquel Raynal1-3/+2
At this stage we simply do not care about the delayed work value, because active scan is not yet supported, so we can blindly queue another work once a beacon has been sent. It fixes a smatch warning: mac802154_beacon_worker() warn: always true condition '(local->beacon_interval >= 0) => (0-u32max >= 0)' Fixes: 3accf4762734 ("mac802154: Handle basic beaconing") Reported-by: kernel test robot <[email protected]> Signed-off-by: Miquel Raynal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Stefan Schmidt <[email protected]>
2023-02-18mac802154: Send beacons using the MLME Tx pathMiquel Raynal1-1/+19
Using ieee802154_subif_start_xmit() to bypass the net queue when sending beacons is broken because it does not acquire the HARD_TX_LOCK(), hence not preventing datagram buffers to be smashed by beacons upon contention situation. Using the mlme_tx helper is not the best fit either but at least it is not buggy and has little-to-no performance hit. More details are given in the comment explaining this choice in the code. Fixes: 3accf4762734 ("mac802154: Handle basic beaconing") Reported-by: Alexander Aring <[email protected]> Signed-off-by: Miquel Raynal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Stefan Schmidt <[email protected]>
2023-02-18ieee802154: Change error code on monitor scan netlink requestMiquel Raynal1-1/+1
Returning EPERM gives the impression that "right now" it is not possible, but "later" it could be, while what we want to express is the fact that this is not currently supported at all (might change in the future). So let's return EOPNOTSUPP instead. Fixes: ed3557c947e1 ("ieee802154: Add support for user scanning requests") Suggested-by: Alexander Aring <[email protected]> Signed-off-by: Miquel Raynal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Stefan Schmidt <[email protected]>
2023-02-18ieee802154: Convert scan error messages to extackMiquel Raynal1-6/+13
Instead of printing error messages in the kernel log, let's use extack. When there is a netlink error returned that could be further specified with a string, use extack as well. Apply this logic to the very recent scan/beacon infrastructure. Fixes: ed3557c947e1 ("ieee802154: Add support for user scanning requests") Fixes: 9bc114504b07 ("ieee802154: Add support for user beaconing requests") Suggested-by: Jakub Kicinski <[email protected]> Signed-off-by: Miquel Raynal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Stefan Schmidt <[email protected]>
2023-02-18ieee802154: Use netlink policies when relevant on scan parametersMiquel Raynal1-56/+28
Instead of open-coding scan parameters (page, channels, duration, etc), let's use the existing NLA_POLICY* macros. This help greatly reducing the error handling and clarifying the overall logic. Fixes: ed3557c947e1 ("ieee802154: Add support for user scanning requests") Suggested-by: Jakub Kicinski <[email protected]> Signed-off-by: Miquel Raynal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Stefan Schmidt <[email protected]>
2023-02-17bpf: Add BPF_FIB_LOOKUP_SKIP_NEIGH for bpf_fib_lookupMartin KaFai Lau1-13/+26
The bpf_fib_lookup() also looks up the neigh table. This was done before bpf_redirect_neigh() was added. In the use case that does not manage the neigh table and requires bpf_fib_lookup() to lookup a fib to decide if it needs to redirect or not, the bpf prog can depend only on using bpf_redirect_neigh() to lookup the neigh. It also keeps the neigh entries fresh and connected. This patch adds a bpf_fib_lookup flag, SKIP_NEIGH, to avoid the double neigh lookup when the bpf prog always call bpf_redirect_neigh() to do the neigh lookup. The params->smac output is skipped together when SKIP_NEIGH is set because bpf_redirect_neigh() will figure out the smac also. Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-02-17Revert "bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES"Martin KaFai Lau1-23/+6
This reverts commit 6c20822fada1b8adb77fa450d03a0d449686a4a9. build bot failed on arch with different cache line size: https://lore.kernel.org/bpf/[email protected]/ Signed-off-by: Martin KaFai Lau <[email protected]>
2023-02-17bpf: bpf_fib_lookup should not return neigh in NUD_FAILED stateMartin KaFai Lau1-2/+2
The bpf_fib_lookup() helper does not only look up the fib (ie. route) but it also looks up the neigh. Before returning the neigh, the helper does not check for NUD_VALID. When a neigh state (neigh->nud_state) is in NUD_FAILED, its dmac (neigh->ha) could be all zeros. The helper still returns SUCCESS instead of NO_NEIGH in this case. Because of the SUCCESS return value, the bpf prog directly uses the returned dmac and ends up filling all zero in the eth header. This patch checks for NUD_VALID and returns NO_NEIGH if the neigh is not valid. Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-02-17bpf: Disable bh in bpf_test_run for xdp and tc progMartin KaFai Lau1-0/+2
Some of the bpf helpers require bh disabled. eg. The bpf_fib_lookup helper that will be used in a latter selftest. In particular, it calls ___neigh_lookup_noref that expects the bh disabled. This patch disables bh before calling bpf_prog_run[_xdp], so the testing prog can also use those helpers. Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-02-17xsk: check IFF_UP earlier in Tx pathMaciej Fijalkowski1-26/+33
Xsk Tx can be triggered via either sendmsg() or poll() syscalls. These two paths share a call to common function xsk_xmit() which has two sanity checks within. A pseudo code example to show the two paths: __xsk_sendmsg() : xsk_poll(): if (unlikely(!xsk_is_bound(xs))) if (unlikely(!xsk_is_bound(xs))) return -ENXIO; return mask; if (unlikely(need_wait)) (...) return -EOPNOTSUPP; xsk_xmit() mark napi id (...) xsk_xmit() xsk_xmit(): if (unlikely(!(xs->dev->flags & IFF_UP))) return -ENETDOWN; if (unlikely(!xs->tx)) return -ENOBUFS; As it can be observed above, in sendmsg() napi id can be marked on interface that was not brought up and this causes a NULL ptr dereference: [31757.505631] BUG: kernel NULL pointer dereference, address: 0000000000000018 [31757.512710] #PF: supervisor read access in kernel mode [31757.517936] #PF: error_code(0x0000) - not-present page [31757.523149] PGD 0 P4D 0 [31757.525726] Oops: 0000 [#1] PREEMPT SMP NOPTI [31757.530154] CPU: 26 PID: 95641 Comm: xdpsock Not tainted 6.2.0-rc5+ #40 [31757.536871] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019 [31757.547457] RIP: 0010:xsk_sendmsg+0xde/0x180 [31757.551799] Code: 00 75 a2 48 8b 00 a8 04 75 9b 84 d2 74 69 8b 85 14 01 00 00 85 c0 75 1b 48 8b 85 28 03 00 00 48 8b 80 98 00 00 00 48 8b 40 20 <8b> 40 18 89 85 14 01 00 00 8b bd 14 01 00 00 81 ff 00 01 00 00 0f [31757.570840] RSP: 0018:ffffc90034f27dc0 EFLAGS: 00010246 [31757.576143] RAX: 0000000000000000 RBX: ffffc90034f27e18 RCX: 0000000000000000 [31757.583389] RDX: 0000000000000001 RSI: ffffc90034f27e18 RDI: ffff88984cf3c100 [31757.590631] RBP: ffff88984714a800 R08: ffff88984714a800 R09: 0000000000000000 [31757.597877] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000fffffffa [31757.605123] R13: 0000000000000000 R14: 0000000000000003 R15: 0000000000000000 [31757.612364] FS: 00007fb4c5931180(0000) GS:ffff88afdfa00000(0000) knlGS:0000000000000000 [31757.620571] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [31757.626406] CR2: 0000000000000018 CR3: 000000184b41c003 CR4: 00000000007706e0 [31757.633648] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [31757.640894] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [31757.648139] PKRU: 55555554 [31757.650894] Call Trace: [31757.653385] <TASK> [31757.655524] sock_sendmsg+0x8f/0xa0 [31757.659077] ? sockfd_lookup_light+0x12/0x70 [31757.663416] __sys_sendto+0xfc/0x170 [31757.667051] ? do_sched_setscheduler+0xdb/0x1b0 [31757.671658] __x64_sys_sendto+0x20/0x30 [31757.675557] do_syscall_64+0x38/0x90 [31757.679197] entry_SYSCALL_64_after_hwframe+0x72/0xdc [31757.687969] Code: 8e f6 ff 44 8b 4c 24 2c 4c 8b 44 24 20 41 89 c4 44 8b 54 24 28 48 8b 54 24 18 b8 2c 00 00 00 48 8b 74 24 10 8b 7c 24 08 0f 05 <48> 3d 00 f0 ff ff 77 3a 44 89 e7 48 89 44 24 08 e8 b5 8e f6 ff 48 [31757.707007] RSP: 002b:00007ffd49c73c70 EFLAGS: 00000293 ORIG_RAX: 000000000000002c [31757.714694] RAX: ffffffffffffffda RBX: 000055a996565380 RCX: 00007fb4c5727c16 [31757.721939] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 [31757.729184] RBP: 0000000000000040 R08: 0000000000000000 R09: 0000000000000000 [31757.736429] R10: 0000000000000040 R11: 0000000000000293 R12: 0000000000000000 [31757.743673] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [31757.754940] </TASK> To fix this, let's make xsk_xmit a function that will be responsible for generic Tx, where RCU is handled accordingly and pull out sanity checks and xs->zc handling. Populate sanity checks to __xsk_sendmsg() and xsk_poll(). Fixes: ca2e1a627035 ("xsk: Mark napi_id on sendmsg()") Fixes: 18b1ab7aa76b ("xsk: Fix race at socket teardown") Signed-off-by: Maciej Fijalkowski <[email protected]> Reviewed-by: Alexander Lobakin <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2023-02-17netfilter: let reset rules clean out conntrack entriesFlorian Westphal5-0/+65
iptables/nftables support responding to tcp packets with tcp resets. The generated tcp reset packet passes through both output and postrouting netfilter hooks, but conntrack will never see them because the generated skb has its ->nfct pointer copied over from the packet that triggered the reset rule. If the reset rule is used for established connections, this may result in the conntrack entry to be around for a very long time (default timeout is 5 days). One way to avoid this would be to not copy the nf_conn pointer so that the rest packet passes through conntrack too. Problem is that output rules might not have the same conntrack zone setup as the prerouting ones, so its possible that the reset skb won't find the correct entry. Generating a template entry for the skb seems error prone as well. Add an explicit "closing" function that switches a confirmed conntrack entry to closed state and wire this up for tcp. If the entry isn't confirmed, no action is needed because the conntrack entry will never be committed to the table. Reported-by: Russel King <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2023-02-17Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/netDavid S. Miller16-37/+51
Some of the devlink bits were tricky, but I think I got it right. Signed-off-by: David S. Miller <[email protected]>
2023-02-16Merge tag 'wireless-next-2023-03-16' of ↵Jakub Kicinski21-439/+1021
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Johannes Berg says: ==================== Major stack changes: * EHT channel puncturing support (client & AP) * some support for AP MLD without mac80211 * fixes for A-MSDU on mesh connections Major driver changes: iwlwifi * EHT rate reporting * Bump FW API to 74 for AX devices * STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware mt76 * switch to using page pool allocator * mt7996 EHT (Wi-Fi 7) support * Wireless Ethernet Dispatch (WED) reset support libertas * WPS enrollee support brcmfmac * Rename Cypress 89459 to BCM4355 * BCM4355 and BCM4377 support mwifiex * SD8978 chipset support rtl8xxxu * LED support ath12k * new driver for Qualcomm Wi-Fi 7 devices ath11k * IPQ5018 support * Fine Timing Measurement (FTM) responder role support * channel 177 support ath10k * store WLAN firmware version in SMEM image table * tag 'wireless-next-2023-03-16' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (207 commits) wifi: brcmfmac: p2p: Introduce generic flexible array frame member wifi: mac80211: add documentation for amsdu_mesh_control wifi: cfg80211: remove gfp parameter from cfg80211_obss_color_collision_notify description wifi: mac80211: always initialize link_sta with sta wifi: mac80211: pass 'sta' to ieee80211_rx_data_set_sta() wifi: cfg80211: Set SSID if it is not already set wifi: rtw89: move H2C of del_pkt_offload before polling FW status ready wifi: rtw89: use readable return 0 in rtw89_mac_cfg_ppdu_status() wifi: rtw88: usb: drop now unnecessary URB size check wifi: rtw88: usb: send Zero length packets if necessary wifi: rtw88: usb: Set qsel correctly wifi: mac80211: fix off-by-one link setting wifi: mac80211: Fix for Rx fragmented action frames wifi: mac80211: avoid u32_encode_bits() warning wifi: mac80211: Don't translate MLD addresses for multicast wifi: cfg80211: call reg_notifier for self managed wiphy from driver hint wifi: cfg80211: get rid of gfp in cfg80211_bss_color_notify wifi: nl80211: Allow authentication frames and set keys on NAN interface wifi: mac80211: fix non-MLO station association wifi: mac80211: Allow NSS change only up to capability ... ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-02-16seg6: add PSP flavor support for SRv6 End behaviorAndrea Mayer1-3/+333
The "flavors" framework defined in RFC8986 [1] represents additional operations that can modify or extend a subset of existing behaviors such as SRv6 End, End.X and End.T. We report these flavors hereafter: - Penultimate Segment Pop (PSP); - Ultimate Segment Pop (USP); - Ultimate Segment Decapsulation (USD). Depending on how the Segment Routing Header (SRH) has to be handled, an SRv6 End* behavior can support these flavors either individually or in combinations. In this patch, we only consider the PSP flavor for the SRv6 End behavior. A PSP enabled SRv6 End behavior is used by the Source/Ingress SR node (i.e., the one applying the SRv6 Policy) when it needs to instruct the penultimate SR Endpoint node listed in the SID List (carried by the SRH) to remove the SRH from the IPv6 header. Specifically, a PSP enabled SRv6 End behavior processes the SRH by: i) decreasing the Segment Left (SL) from 1 to 0; ii) copying the Last Segment IDentifier (SID) into the IPv6 Destination Address (DA); iii) removing (i.e., popping) the outer SRH from the extension headers following the IPv6 header. It is important to note that PSP operation (steps i, ii, iii) takes place only at a penultimate SR Segment Endpoint node (i.e., when the SL=1) and does not happen at non-penultimate Endpoint nodes. Indeed, when a SID of PSP flavor is processed at a non-penultimate SR Segment Endpoint node, the PSP operation is not performed because it would not be possible to decrease the SL from 1 to 0. SL=2 SL=1 SL=0 | | | For example, given the SRv6 policy (SID List := < X, Y, Z >): - a PSP enabled SRv6 End behavior bound to SID "Y" will apply the PSP operation as Segment Left (SL) is 1, corresponding to the Penultimate Segment of the SID List; - a PSP enabled SRv6 End behavior bound to SID "X" will *NOT* apply the PSP operation as the Segment Left is 2. This behavior instance will apply the "standard" End packet processing, ignoring the configured PSP flavor at all. [1] - RFC8986: https://datatracker.ietf.org/doc/html/rfc8986 Signed-off-by: Andrea Mayer <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-02-16seg6: factor out End lookup nexthop processing to a dedicated functionAndrea Mayer1-6/+10
The End nexthop lookup/input operations are moved into a new helper function named input_action_end_finish(). This avoids duplicating the code needed to compute the nexthop in the different flavors of the End behavior. Signed-off-by: Andrea Mayer <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-02-16devlink: Fix netdev notifier chain corruptionIdo Schimmel2-12/+1
Cited commit changed devlink to register its netdev notifier block on the global netdev notifier chain instead of on the per network namespace one. However, when changing the network namespace of the devlink instance, devlink still tries to unregister its notifier block from the chain of the old namespace and register it on the chain of the new namespace. This results in corruption of the notifier chains, as the same notifier block is registered on two different chains: The global one and the per network namespace one. In turn, this causes other problems such as the inability to dismantle namespaces due to netdev reference count issues. Fix by preventing devlink from moving its notifier block between namespaces. Reproducer: # echo "10 1" > /sys/bus/netdevsim/new_device # ip netns add test123 # devlink dev reload netdevsim/netdevsim10 netns test123 # ip netns del test123 [ 71.935619] unregister_netdevice: waiting for lo to become free. Usage count = 2 [ 71.938348] leaked reference. Fixes: 565b4824c39f ("devlink: change port event netdev notifier from per-net to global") Signed-off-by: Ido Schimmel <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
2023-02-16net/sched: act_pedit: use percpu overlimit counter when availablePedro Tammela1-3/+1
Since act_pedit now has access to percpu counters, use the tcf_action_inc_overlimit_qstats wrapper that will use the percpu counter whenever they are available. Reviewed-by: Jamal Hadi Salim <[email protected]> Signed-off-by: Pedro Tammela <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-02-16net/sched: act_gate: use percpu statsPedro Tammela1-14/+16
The tc action act_gate was using shared stats, move it to percpu stats. tdc results: 1..12 ok 1 5153 - Add gate action with priority and sched-entry ok 2 7189 - Add gate action with base-time ok 3 a721 - Add gate action with cycle-time ok 4 c029 - Add gate action with cycle-time-ext ok 5 3719 - Replace gate base-time action ok 6 d821 - Delete gate action with valid index ok 7 3128 - Delete gate action with invalid index ok 8 7837 - List gate actions ok 9 9273 - Flush gate actions ok 10 c829 - Add gate action with duplicate index ok 11 3043 - Add gate action with invalid index ok 12 2930 - Add gate action with cookie Reviewed-by: Jamal Hadi Salim <[email protected]> Signed-off-by: Pedro Tammela <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-02-16net/sched: act_connmark: transition to percpu stats and rcuPedro Tammela1-39/+68
The tc action act_connmark was using shared stats and taking the per action lock in the datapath. Improve it by using percpu stats and rcu. perf before: - 13.55% tcf_connmark_act - 81.18% _raw_spin_lock 80.46% native_queued_spin_lock_slowpath perf after: - 2.85% tcf_connmark_act tdc results: 1..15 ok 1 2002 - Add valid connmark action with defaults ok 2 56a5 - Add valid connmark action with control pass ok 3 7c66 - Add valid connmark action with control drop ok 4 a913 - Add valid connmark action with control pipe ok 5 bdd8 - Add valid connmark action with control reclassify ok 6 b8be - Add valid connmark action with control continue ok 7 d8a6 - Add valid connmark action with control jump ok 8 aae8 - Add valid connmark action with zone argument ok 9 2f0b - Add valid connmark action with invalid zone argument ok 10 9305 - Add connmark action with unsupported argument ok 11 71ca - Add valid connmark action and replace it ok 12 5f8f - Add valid connmark action with cookie ok 13 c506 - Replace connmark with invalid goto chain control ok 14 6571 - Delete connmark action with valid index ok 15 3426 - Delete connmark action with invalid index Reviewed-by: Jamal Hadi Salim <[email protected]> Signed-off-by: Pedro Tammela <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-02-16net/sched: act_nat: transition to percpu stats and rcuPedro Tammela1-23/+49
The tc action act_nat was using shared stats and taking the per action lock in the datapath. Improve it by using percpu stats and rcu. perf before: - 10.48% tcf_nat_act - 81.83% _raw_spin_lock 81.08% native_queued_spin_lock_slowpath perf after: - 0.48% tcf_nat_act tdc results: 1..27 ok 1 7565 - Add nat action on ingress with default control action ok 2 fd79 - Add nat action on ingress with pipe control action ok 3 eab9 - Add nat action on ingress with continue control action ok 4 c53a - Add nat action on ingress with reclassify control action ok 5 76c9 - Add nat action on ingress with jump control action ok 6 24c6 - Add nat action on ingress with drop control action ok 7 2120 - Add nat action on ingress with maximum index value ok 8 3e9d - Add nat action on ingress with invalid index value ok 9 f6c9 - Add nat action on ingress with invalid IP address ok 10 be25 - Add nat action on ingress with invalid argument ok 11 a7bd - Add nat action on ingress with DEFAULT IP address ok 12 ee1e - Add nat action on ingress with ANY IP address ok 13 1de8 - Add nat action on ingress with ALL IP address ok 14 8dba - Add nat action on egress with default control action ok 15 19a7 - Add nat action on egress with pipe control action ok 16 f1d9 - Add nat action on egress with continue control action ok 17 6d4a - Add nat action on egress with reclassify control action ok 18 b313 - Add nat action on egress with jump control action ok 19 d9fc - Add nat action on egress with drop control action ok 20 a895 - Add nat action on egress with DEFAULT IP address ok 21 2572 - Add nat action on egress with ANY IP address ok 22 37f3 - Add nat action on egress with ALL IP address ok 23 6054 - Add nat action on egress with cookie ok 24 79d6 - Add nat action on ingress with cookie ok 25 4b12 - Replace nat action with invalid goto chain control ok 26 b811 - Delete nat action with valid index ok 27 a521 - Delete nat action with invalid index Reviewed-by: Jamal Hadi Salim <[email protected]> Signed-off-by: Pedro Tammela <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-02-16net/core: refactor promiscuous mode messageJesse Brandeburg1-3/+2
The kernel stack can be more consistent by printing the IFF_PROMISC aka promiscuous enable/disable messages with the standard netdev_info message which can include bus and driver info as well as the device. typical command usage from user space looks like: ip link set eth0 promisc <on|off> But lots of utilities such as bridge, tcpdump, etc put the interface into promiscuous mode. old message: [ 406.034418] device eth0 entered promiscuous mode [ 408.424703] device eth0 left promiscuous mode new message: [ 406.034431] ice 0000:17:00.0 eth0: entered promiscuous mode [ 408.424715] ice 0000:17:00.0 eth0: left promiscuous mode Signed-off-by: Jesse Brandeburg <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-02-16net/core: print message for allmulticastJesse Brandeburg1-0/+2
When the user sets or clears the IFF_ALLMULTI flag in the netdev, there are no log messages printed to the kernel log to indicate anything happened. This is inexplicably different from most other dev->flags changes, and could suprise the user. Typically this occurs from user-space when a user: ip link set eth0 allmulticast <on|off> However, other devices like bridge set allmulticast as well, and many other flows might trigger entry into allmulticast as well. The new message uses the standard netdev_info print and looks like: [ 413.246110] ixgbe 0000:17:00.0 eth0: entered allmulticast mode [ 415.977184] ixgbe 0000:17:00.0 eth0: left allmulticast mode Signed-off-by: Jesse Brandeburg <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-02-16net/sched: Retire rsvp classifierJamal Hadi Salim5-846/+0
The rsvp classifier has served us well for about a quarter of a century but has has not been getting much maintenance attention due to lack of known users. Signed-off-by: Jamal Hadi Salim <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-02-16net/sched: Retire tcindex classifierJamal Hadi Salim3-728/+0
The tcindex classifier has served us well for about a quarter of a century but has not been getting much TLC due to lack of known users. Most recently it has become easy prey to syzkaller. For this reason, we are retiring it. Signed-off-by: Jamal Hadi Salim <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-02-16net/sched: Retire dsmark qdiscJamal Hadi Salim3-530/+0
The dsmark qdisc has served us well over the years for diffserv but has not been getting much attention due to other more popular approaches to do diffserv services. Most recently it has become a shooting target for syzkaller. For this reason, we are retiring it. Signed-off-by: Jamal Hadi Salim <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-02-16net/sched: Retire ATM qdiscJamal Hadi Salim3-721/+0
The ATM qdisc has served us well over the years but has not been getting much TLC due to lack of known users. Most recently it has become a shooting target for syzkaller. For this reason, we are retiring it. Signed-off-by: Jamal Hadi Salim <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-02-16net/sched: Retire CBQ qdiscJamal Hadi Salim3-1745/+0
While this amazing qdisc has served us well over the years it has not been getting any tender love and care and has bitrotted over time. It has become mostly a shooting target for syzkaller lately. For this reason, we are retiring it. Goodbye CBQ - we loved you. Signed-off-by: Jamal Hadi Salim <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-02-15net: msg_zerocopy: elide page accounting if RLIM_INFINITYWillem de Bruijn1-2/+6
MSG_ZEROCOPY ensures that pinned user pages do not exceed the limit. If no limit is set, skip this accounting as otherwise expensive atomic_long operations are called for no reason. This accounting is already skipped for privileged (CAP_IPC_LOCK) users. Rely on the same mechanism: if no mmp->user is set, mm_unaccount_pinned_pages does not decrement either. Tested by running tools/testing/selftests/net/msg_zerocopy.sh with an unprivileged user for the TXMODE binary: ip netns exec "${NS1}" sudo -u "{$USER}" "${BIN}" "-${IP}" ... Signed-off-by: Willem de Bruijn <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>