aboutsummaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)AuthorFilesLines
2021-04-09net: fix hangup on napi_disable for threaded napiPaolo Abeni1-1/+2
napi_disable() is subject to an hangup, when the threaded mode is enabled and the napi is under heavy traffic. If the relevant napi has been scheduled and the napi_disable() kicks in before the next napi_threaded_wait() completes - so that the latter quits due to the napi_disable_pending() condition, the existing code leaves the NAPI_STATE_SCHED bit set and the napi_disable() loop waiting for such bit will hang. This patch addresses the issue by dropping the NAPI_STATE_DISABLE bit test in napi_thread_wait(). The later napi_threaded_poll() iteration will take care of clearing the NAPI_STATE_SCHED. This also addresses a related problem reported by Jakub: before this patch a napi_disable()/napi_enable() pair killed the napi thread, effectively disabling the threaded mode. On the patched kernel napi_disable() simply stops scheduling the relevant thread. v1 -> v2: - let the main napi_thread_poll() loop clear the SCHED bit Reported-by: Jakub Kicinski <[email protected]> Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support") Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/883923fa22745a9589e8610962b7dc59df09fb1f.1617981844.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <[email protected]>
2021-04-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller1-7/+5
Daniel Borkmann says: ==================== pull-request: bpf 2021-04-08 The following pull-request contains BPF updates for your *net* tree. We've added 4 non-merge commits during the last 2 day(s) which contain a total of 4 files changed, 31 insertions(+), 10 deletions(-). The main changes are: 1) Validate and reject invalid JIT branch displacements, from Piotr Krysiuk. 2) Fix incorrect unhash restore as well as fwd_alloc memory accounting in sock map, from John Fastabend. ==================== Signed-off-by: David S. Miller <[email protected]>
2021-04-08ipv6: report errors for iftoken via netlink extackStephen Hemminger1-1/+1
Setting iftoken can fail for several different reasons but there and there was no report to user as to the cause. Add netlink extended errors to the processing of the request. This requires adding additional argument through rtnl_af_ops set_link_af callback. Reported-by: Hongren Zheng <[email protected]> Signed-off-by: Stephen Hemminger <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-04-07net: remove the new_ifindex argument from dev_change_net_namespaceAndrei Vagin2-7/+7
Here is only one place where we want to specify new_ifindex. In all other cases, callers pass 0 as new_ifindex. It looks reasonable to add a low-level function with new_ifindex and to convert dev_change_net_namespace to a static inline wrapper. Fixes: eeb85a14ee34 ("net: Allow to specify ifindex when device is moved to another namespace") Suggested-by: Jakub Kicinski <[email protected]> Signed-off-by: Andrei Vagin <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-04-07net: introduce nla_policy for IFLA_NEW_IFINDEXAndrei Vagin1-3/+1
In this case, we don't need to check that new_ifindex is positive in validate_linkmsg. Fixes: eeb85a14ee34 ("net: Allow to specify ifindex when device is moved to another namespace") Suggested-by: Jakub Kicinski <[email protected]> Signed-off-by: Andrei Vagin <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-04-07bpf, sockmap: Fix incorrect fwd_alloc accountingJohn Fastabend1-7/+5
Incorrect accounting fwd_alloc can result in a warning when the socket is torn down, [18455.319240] WARNING: CPU: 0 PID: 24075 at net/core/stream.c:208 sk_stream_kill_queues+0x21f/0x230 [...] [18455.319543] Call Trace: [18455.319556] inet_csk_destroy_sock+0xba/0x1f0 [18455.319577] tcp_rcv_state_process+0x1b4e/0x2380 [18455.319593] ? lock_downgrade+0x3a0/0x3a0 [18455.319617] ? tcp_finish_connect+0x1e0/0x1e0 [18455.319631] ? sk_reset_timer+0x15/0x70 [18455.319646] ? tcp_schedule_loss_probe+0x1b2/0x240 [18455.319663] ? lock_release+0xb2/0x3f0 [18455.319676] ? __release_sock+0x8a/0x1b0 [18455.319690] ? lock_downgrade+0x3a0/0x3a0 [18455.319704] ? lock_release+0x3f0/0x3f0 [18455.319717] ? __tcp_close+0x2c6/0x790 [18455.319736] ? tcp_v4_do_rcv+0x168/0x370 [18455.319750] tcp_v4_do_rcv+0x168/0x370 [18455.319767] __release_sock+0xbc/0x1b0 [18455.319785] __tcp_close+0x2ee/0x790 [18455.319805] tcp_close+0x20/0x80 This currently happens because on redirect case we do skb_set_owner_r() with the original sock. This increments the fwd_alloc memory accounting on the original sock. Then on redirect we may push this into the queue of the psock we are redirecting to. When the skb is flushed from the queue we give the memory back to the original sock. The problem is if the original sock is destroyed/closed with skbs on another psocks queue then the original sock will not have a way to reclaim the memory before being destroyed. Then above warning will be thrown sockA sockB sk_psock_strp_read() sk_psock_verdict_apply() -- SK_REDIRECT -- sk_psock_skb_redirect() skb_queue_tail(psock_other->ingress_skb..) sk_close() sock_map_unref() sk_psock_put() sk_psock_drop() sk_psock_zap_ingress() At this point we have torn down our own psock, but have the outstanding skb in psock_other. Note that SK_PASS doesn't have this problem because the sk_psock_drop() logic releases the skb, its still associated with our psock. To resolve lets only account for sockets on the ingress queue that are still associated with the current socket. On the redirect case we will check memory limits per 6fa9201a89898, but will omit fwd_alloc accounting until skb is actually enqueued. When the skb is sent via skb_send_sock_locked or received with sk_psock_skb_ingress memory will be claimed on psock_other. Fixes: 6fa9201a89898 ("bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self") Reported-by: Andrii Nakryiko <[email protected]> Signed-off-by: John Fastabend <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/161731444013.68884.4021114312848535993.stgit@john-XPS-13-9370
2021-04-05net: Allow to specify ifindex when device is moved to another namespaceAndrei Vagin2-11/+32
Currently, we can specify ifindex on link creation. This change allows to specify ifindex when a device is moved to another network namespace. Even now, a device ifindex can be changed if there is another device with the same ifindex in the target namespace. So this change doesn't introduce completely new behavior, it adds more control to the process. CRIU users want to restore containers with pre-created network devices. A user will provide network devices and instructions where they have to be restored, then CRIU will restore network namespaces and move devices into them. The problem is that devices have to be restored with the same indexes that they have before C/R. Cc: Alexander Mikhalitsyn <[email protected]> Suggested-by: Christian Brauner <[email protected]> Signed-off-by: Andrei Vagin <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-04-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller4-104/+247
Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-04-01 The following pull-request contains BPF updates for your *net-next* tree. We've added 68 non-merge commits during the last 7 day(s) which contain a total of 70 files changed, 2944 insertions(+), 1139 deletions(-). The main changes are: 1) UDP support for sockmap, from Cong. 2) Verifier merge conflict resolution fix, from Daniel. 3) xsk selftests enhancements, from Maciej. 4) Unstable helpers aka kernel func calling, from Martin. 5) Batches ops for LPM map, from Pedro. 6) Fix race in bpf_get_local_storage, from Yonghong. ==================== Signed-off-by: David S. Miller <[email protected]>
2021-04-01sock_map: Update sock type checks for UDPCong Wang1-1/+4
Now UDP supports sockmap and redirection, we can safely update the sock type checks for it accordingly. Signed-off-by: Cong Wang <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-04-01skmsg: Extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()Cong Wang1-0/+98
Although these two functions are only used by TCP, they are not specific to TCP at all, both operate on skmsg and ingress_msg, so fit in net/core/skmsg.c very well. And we will need them for non-TCP, so rename and move them to skmsg.c and export them to modules. Signed-off-by: Cong Wang <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-04-01sock: Introduce sk->sk_prot->psock_update_sk_prot()Cong Wang2-25/+4
Currently sockmap calls into each protocol to update the struct proto and replace it. This certainly won't work when the protocol is implemented as a module, for example, AF_UNIX. Introduce a new ops sk->sk_prot->psock_update_sk_prot(), so each protocol can implement its own way to replace the struct proto. This also helps get rid of symbol dependencies on CONFIG_INET. Signed-off-by: Cong Wang <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-04-01sock_map: Introduce BPF_SK_SKB_VERDICTCong Wang2-1/+31
Reusing BPF_SK_SKB_STREAM_VERDICT is possible but its name is confusing and more importantly we still want to distinguish them from user-space. So we can just reuse the stream verdict code but introduce a new type of eBPF program, skb_verdict. Users are not allowed to attach stream_verdict and skb_verdict programs to the same map. Signed-off-by: Cong Wang <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-04-01sock_map: Kill sock_map_link_no_progs()Cong Wang1-40/+15
Now we can fold sock_map_link_no_progs() into sock_map_link() and get rid of sock_map_link_no_progs(). Signed-off-by: Cong Wang <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-04-01sock_map: Simplify sock_map_link() a bitCong Wang1-4/+5
sock_map_link() passes down map progs, but it is confusing to see both map progs and psock progs. Make the map progs more obvious by retrieving it directly with sock_map_progs() inside sock_map_link(). Now it is aligned with sock_map_link_no_progs() too. Signed-off-by: Cong Wang <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-04-01skmsg: Use GFP_KERNEL in sk_psock_create_ingress_msg()Cong Wang1-1/+1
This function is only called in process context. Signed-off-by: Cong Wang <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-04-01skmsg: Use rcu work for destroying psockCong Wang1-12/+5
The RCU callback sk_psock_destroy() only queues work psock->gc, so we can just switch to rcu work to simplify the code. Signed-off-by: Cong Wang <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-04-01skmsg: Avoid lock_sock() in sk_psock_backlog()Cong Wang2-16/+35
We do not have to lock the sock to avoid losing sk_socket, instead we can purge all the ingress queues when we close the socket. Sending or receiving packets after orphaning socket makes no sense. We do purge these queues when psock refcnt reaches zero but here we want to purge them explicitly in sock_map_close(). There are also some nasty race conditions on testing bit SK_PSOCK_TX_ENABLED and queuing/canceling the psock work, we can expand psock->ingress_lock a bit to protect them too. As noticed by John, we still have to lock the psock->work, because the same work item could be running concurrently on different CPU's. Signed-off-by: Cong Wang <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-04-01net: Introduce skb_send_sock() for sock_mapCong Wang1-7/+48
We only have skb_send_sock_locked() which requires callers to use lock_sock(). Introduce a variant skb_send_sock() which locks on its own, callers do not need to lock it any more. This will save us from adding a ->sendmsg_locked for each protocol. To reuse the code, pass function pointers to __skb_send_sock() and build skb_send_sock() and skb_send_sock_locked() on top. Signed-off-by: Cong Wang <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Reviewed-by: Jakub Sitnicki <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-04-01skmsg: Introduce a spinlock to protect ingress_msgCong Wang1-0/+3
Currently we rely on lock_sock to protect ingress_msg, it is too big for this, we can actually just use a spinlock to protect this list like protecting other skb queues. __tcp_bpf_recvmsg() is still special because of peeking, it still has to use lock_sock. Signed-off-by: Cong Wang <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Jakub Sitnicki <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-04-01skmsg: Lock ingress_skb when purgingCong Wang1-1/+1
Currently we purge the ingress_skb queue only when psock refcnt goes down to 0, so locking the queue is not necessary, but in order to be called during ->close, we have to lock it here. Signed-off-by: Cong Wang <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Jakub Sitnicki <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-03-31xdp: fix xdp_return_frame() kernel BUG throw for page_pool memory modelOng Boon Leong1-1/+2
xdp_return_frame() may be called outside of NAPI context to return xdpf back to page_pool. xdp_return_frame() calls __xdp_return() with napi_direct = false. For page_pool memory model, __xdp_return() calls xdp_return_frame_no_direct() unconditionally and below false negative kernel BUG throw happened under preempt-rt build: [ 430.450355] BUG: using smp_processor_id() in preemptible [00000000] code: modprobe/3884 [ 430.451678] caller is __xdp_return+0x1ff/0x2e0 [ 430.452111] CPU: 0 PID: 3884 Comm: modprobe Tainted: G U E 5.12.0-rc2+ #45 Changes in v2: - This patch fixes the issue by making xdp_return_frame_no_direct() is only called if napi_direct = true, as recommended for better by Jesper Dangaard Brouer. Thanks! Fixes: 2539650fadbf ("xdp: Helpers for disabling napi_direct of xdp_return_frame") Signed-off-by: Ong Boon Leong <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-31neighbour: Disregard DEAD dst in neigh_updateTong Zhu1-1/+1
After a short network outage, the dst_entry is timed out and put in DST_OBSOLETE_DEAD. We are in this code because arp reply comes from this neighbour after network recovers. There is a potential race condition that dst_entry is still in DST_OBSOLETE_DEAD. With that, another neighbour lookup causes more harm than good. In best case all packets in arp_queue are lost. This is counterproductive to the original goal of finding a better path for those packets. I observed a worst case with 4.x kernel where a dst_entry in DST_OBSOLETE_DEAD state is associated with loopback net_device. It leads to an ethernet header with all zero addresses. A packet with all zero source MAC address is quite deadly with mac80211, ath9k and 802.11 block ack. It fails ieee80211_find_sta_by_ifaddr in ath9k (xmit.c). Ath9k flushes tx queue (ath_tx_complete_aggr). BAW (block ack window) is not updated. BAW logic is damaged and ath9k transmission is disabled. Signed-off-by: Tong Zhu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-30net: let skb_orphan_partial wake-up waiters.Paolo Abeni1-9/+3
Currently the mentioned helper can end-up freeing the socket wmem without waking-up any processes waiting for more write memory. If the partially orphaned skb is attached to an UDP (or raw) socket, the lack of wake-up can hang the user-space. Even for TCP sockets not calling the sk destructor could have bad effects on TSQ. Address the issue using skb_orphan to release the sk wmem before setting the new sock_efree destructor. Additionally bundle the whole ownership update in a new helper, so that later other potential users could avoid duplicate code. v1 -> v2: - use skb_orphan() instead of sort of open coding it (Eric) - provide an helper for the ownership change (Eric) Fixes: f6ba8d33cfbb ("netem: fix skb_orphan_partial()") Suggested-by: Eric Dumazet <[email protected]> Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-28net: core: Correct function name netevent_unregister_notifier() in the kerneldocXiongfeng Wang1-1/+1
Fix the following W=1 kernel build warning(s): net/core/netevent.c:45: warning: expecting prototype for netevent_unregister_notifier(). Prototype was for unregister_netevent_notifier() instead Reported-by: Hulk Robot <[email protected]> Signed-off-by: Xiongfeng Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-28net: core: Correct function name dev_uc_flush() in the kerneldocXiongfeng Wang1-1/+1
Fix the following W=1 kernel build warning(s): net/core/dev_addr_lists.c:732: warning: expecting prototype for dev_uc_flush(). Prototype was for dev_uc_init() instead Reported-by: Hulk Robot <[email protected]> Signed-off-by: Xiongfeng Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-26bpf: selftests: Add kfunc_call testMartin KaFai Lau1-0/+1
This patch adds a few kernel function bpf_kfunc_call_test*() for the selftest's test_run purpose. They will be allowed for tc_cls prog. The selftest calling the kernel function bpf_kfunc_call_test*() is also added in this patch. Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-03-25net: change netdev_unregister_timeout_secs min value to 1Dmitry Vyukov2-2/+2
netdev_unregister_timeout_secs=0 can lead to printing the "waiting for dev to become free" message every jiffy. This is too frequent and unnecessary. Set the min value to 1 second. Also fix the merge issue introduced by "net: make unregister netdev warning timeout configurable": it changed "refcnt != 1" to "refcnt". Signed-off-by: Dmitry Vyukov <[email protected]> Suggested-by: Eric Dumazet <[email protected]> Fixes: 5aa3afe107d9 ("net: make unregister netdev warning timeout configurable") Cc: [email protected] Cc: [email protected] Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-25net: core: Fix a typo in dev_addr_lists.cLu Wei1-1/+1
Modify "funciton" to "function" in net/core/dev_addr_lists.c. Reported-by: Hulk Robot <[email protected]> Signed-off-by: Lu Wei <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-25Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-0/+6
Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-03-24 The following pull-request contains BPF updates for your *net-next* tree. We've added 37 non-merge commits during the last 15 day(s) which contain a total of 65 files changed, 3200 insertions(+), 738 deletions(-). The main changes are: 1) Static linking of multiple BPF ELF files, from Andrii. 2) Move drop error path to devmap for XDP_REDIRECT, from Lorenzo. 3) Spelling fixes from various folks. ==================== Signed-off-by: David S. Miller <[email protected]>
2021-03-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller6-38/+135
Signed-off-by: David S. Miller <[email protected]>
2021-03-24net: resolve forwarding path from virtual netdevice and HW destination addressPablo Neira Ayuso1-0/+46
This patch adds dev_fill_forward_path() which resolves the path to reach the real netdevice from the IP forwarding side. This function takes as input the netdevice and the destination hardware address and it walks down the devices calling .ndo_fill_forward_path() for each device until the real device is found. For instance, assuming the following topology: IP forwarding / \ br0 eth0 / \ eth1 eth2 . . . ethX ab:cd:ef:ab:cd:ef where eth1 and eth2 are bridge ports and eth0 provides WAN connectivity. ethX is the interface in another box which is connected to the eth1 bridge port. For packets going through IP forwarding to br0 whose destination MAC address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the following path: br0 -> eth1 .ndo_fill_forward_path for br0 looks up at the FDB for the bridge port from the destination MAC address to get the bridge port eth1. This information allows to create a fast path that bypasses the classic bridge and IP forwarding paths, so packets go directly from the bridge port eth1 to eth0 (wan interface) and vice versa. fast path .------------------------. / \ | IP forwarding | | / \ \/ | br0 eth0 . / \ -> eth1 eth2 . . . ethX ab:cd:ef:ab:cd:ef Signed-off-by: Pablo Neira Ayuso <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-23net: make unregister netdev warning timeout configurableDmitry Vyukov2-1/+15
netdev_wait_allrefs() issues a warning if refcount does not drop to 0 after 10 seconds. While 10 second wait generally should not happen under normal workload in normal environment, it seems to fire falsely very often during fuzzing and/or in qemu emulation (~10x slower). At least it's not possible to understand if it's really a false positive or not. Automated testing generally bumps all timeouts to very high values to avoid flake failures. Add net.core.netdev_unregister_timeout_secs sysctl to make the timeout configurable for automated testing systems. Lowering the timeout may also be useful for e.g. manual bisection. The default value matches the current behavior. Signed-off-by: Dmitry Vyukov <[email protected]> Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877 Cc: [email protected] Cc: [email protected] Signed-off-by: David S. Miller <[email protected]>
2021-03-22net: set initial device refcount to 1Eric Dumazet1-3/+6
When adding CONFIG_PCPU_DEV_REFCNT, I forgot that the initial net device refcount was 0. When CONFIG_PCPU_DEV_REFCNT is not set, this means the first dev_hold() triggers an illegal refcount operation (addition on 0) refcount_t: addition on 0; use-after-free. WARNING: CPU: 0 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x128/0x1a4 Fix is to change initial (and final) refcount to be 1. Also add a missing kerneldoc piece, as reported by Stephen Rothwell. Fixes: 919067cc845f ("net: add CONFIG_PCPU_DEV_REFCNT") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: Guenter Roeck <[email protected]> Tested-by: Guenter Roeck <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-22net-sysfs: remove possible sleep from an RCU read-side critical sectionAntoine Tenart1-1/+1
xps_queue_show is mostly made of an RCU read-side critical section and calls bitmap_zalloc with GFP_KERNEL in the middle of it. That is not allowed as this call may sleep and such behaviours aren't allowed in RCU read-side critical sections. Fix this by using GFP_NOWAIT instead. Fixes: 5478fcd0f483 ("net: embed nr_ids in the xps maps") Reported-by: kernel test robot <[email protected]> Suggested-by: Matthew Wilcox <[email protected]> Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-22net: move the ptype_all and ptype_base declarations to include/linux/netdevice.hVladimir Oltean1-3/+0
ptype_all and ptype_base are declared in net/core/dev.c as non-static, because they are used by net-procfs.c too. However, a "make W=1" build complains that there was no previous declaration of ptype_all and ptype_base in a header file, so this way of declaring things constitutes a violation of coding style. Let's move the extern declarations of ptype_all and ptype_base to the linux/netdevice.h file, which is included by net-procfs.c too. Signed-off-by: Vladimir Oltean <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-22net: make xps_needed and xps_rxqs_needed staticVladimir Oltean1-4/+2
Since their introduction in commit 04157469b7b8 ("net: Use static_key for XPS maps"), xps_needed and xps_rxqs_needed were never used outside net/core/dev.c, so I don't really understand why they were exported as symbols in the first place. This is needed in order to silence a "make W=1" warning about these static keys not being declared as static variables, but not having a previous declaration in a header file nonetheless. Cc: Amritha Nambiar <[email protected]> Signed-off-by: Vladimir Oltean <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-19net: add CONFIG_PCPU_DEV_REFCNTEric Dumazet1-0/+10
I was working on a syzbot issue, claiming one device could not be dismantled because its refcount was -1 unregister_netdevice: waiting for sit0 to become free. Usage count = -1 It would be nice if syzbot could trigger a warning at the time this reference count became negative. This patch adds CONFIG_PCPU_DEV_REFCNT options which defaults to per cpu variables (as before this patch) on SMP builds. v2: free_dev label in alloc_netdev_mqs() is moved to avoid a compiler warning (-Wunused-label), as reported by kernel test robot <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-18/net/core/: fix misspellings using codespell toolXiong Zhenwu1-1/+1
A typo is found out by codespell tool in 1734th line of drop_monitor.c: $ codespell ./net/core/ ./net/core/drop_monitor.c:1734: guarnateed ==> guaranteed Fix a typo found by codespell. Signed-off-by: Xiong Zhenwu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-18net: NULL the old xps map entries when freeing themAntoine Tenart1-0/+1
In __netif_set_xps_queue, old map entries from the old dev_maps are freed but their corresponding entry in the old dev_maps aren't NULLed. Fix this. Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-18net: fix use after free in xpsAntoine Tenart1-5/+12
When setting up an new dev_maps in __netif_set_xps_queue, we remove and free maps from unused CPUs/rx-queues near the end of the function; by calling remove_xps_queue. However it's possible those maps are also part of the old not-freed-yet dev_maps, which might be used concurrently. When that happens, a map can be freed while its corresponding entry in the old dev_maps table isn't NULLed, leading to: "BUG: KASAN: use-after-free" in different places. This fixes the map freeing logic for unused CPUs/rx-queues, to also NULL the map entries from the old dev_maps table. Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-18net-sysfs: move the xps cpus/rxqs retrieval in a common functionAntoine Tenart1-77/+48
Most of the xps_cpus_show and xps_rxqs_show functions share the same logic. Having it in two different functions does not help maintenance. This patch moves their common logic into a new function, xps_queue_show, to improve this. Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-18net-sysfs: move the rtnl unlock up in the xps show helpersAntoine Tenart1-14/+11
Now that nr_ids and num_tc are stored in the xps dev_maps, which are RCU protected, we do not have the need to protect the maps in the rtnl lock. Move the rtnl unlock up so we reduce the rtnl locking section. We also increase the reference count on the subordinate device if any, as we don't want this device to be freed while we use it (now that the rtnl lock isn't protecting it in the whole function). Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-18net: improve queue removal readability in __netif_set_xps_queueAntoine Tenart1-6/+9
Improve the readability of the loop removing tx-queue from unused CPUs/rx-queues in __netif_set_xps_queue. The change should only be cosmetic. Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-18net: add an helper to copy xps maps to the new dev_mapsAntoine Tenart1-20/+25
This patch adds an helper, xps_copy_dev_maps, to copy maps from dev_maps to new_dev_maps at a given index. The logic should be the same, with an improved code readability and maintenance. Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-18net: move the xps maps to an arrayAntoine Tenart2-45/+34
Move the xps maps (xps_cpus_map and xps_rxqs_map) to an array in net_device. That will simplify a lot the code removing the need for lots of if/else conditionals as the correct map will be available using its offset in the array. This should not modify the xps maps behaviour in any way. Suggested-by: Alexander Duyck <[email protected]> Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-18net: remove the xps possible_maskAntoine Tenart2-29/+15
Remove the xps possible_mask. It was an optimization but we can just loop from 0 to nr_ids now that it is embedded in the xps dev_maps. That simplifies the code a bit. Suggested-by: Alexander Duyck <[email protected]> Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-18net: embed nr_ids in the xps mapsAntoine Tenart2-40/+43
Embed nr_ids (the number of cpu for the xps cpus map, and the number of rxqs for the xps cpus map) in dev_maps. That will help not accessing out of bound memory if those values change after dev_maps was allocated. Suggested-by: Alexander Duyck <[email protected]> Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-18net: embed num_tc in the xps mapsAntoine Tenart2-50/+58
The xps cpus/rxqs map is accessed using dev->num_tc, which is used when allocating the map. But later updates of dev->num_tc can lead to having a mismatch between the maps and how they're accessed. In such cases the map values do not make any sense and out of bound accesses can occur (that can be easily seen using KASAN). This patch aims at fixing this by embedding num_tc into the maps, using the value at the time the map is created. This brings two improvements: - The maps can be accessed using the embedded num_tc, so we know for sure we won't have out of bound accesses. - Checks can be made before accessing the maps so we know the values retrieved will make sense. We also update __netif_set_xps_queue to conditionally copy old maps from dev_maps in the new one only if the number of traffic classes from both maps match. Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-18net-sysfs: make xps_cpus_show and xps_rxqs_show consistentAntoine Tenart1-15/+18
Make the implementations of xps_cpus_show and xps_rxqs_show to converge, as the two share the same logic but diverted over time. This should not modify their behaviour but will help future changes and improve maintenance. Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-03-18net-sysfs: store the return of get_netdev_queue_index in an unsigned intAntoine Tenart1-4/+7
In net-sysfs, get_netdev_queue_index returns an unsigned int. Some of its callers use an unsigned long to store the returned value. Update the code to be consistent, this should only be cosmetic. Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>