aboutsummaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2016-11-30bpf, xdp: allow to pass flags to dev_change_xdp_fdDaniel Borkmann2-3/+31
Add an IFLA_XDP_FLAGS attribute that can be passed for setting up XDP along with IFLA_XDP_FD, which eventually allows user space to implement typical add/replace/delete logic for programs. Right now, calling into dev_change_xdp_fd() will always replace previous programs. When passed XDP_FLAGS_UPDATE_IF_NOEXIST, we can handle this more graceful when requested by returning -EBUSY in case we try to attach a new program, but we find that another one is already attached. This will be used by upcoming front-end for iproute2 as well. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-30tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPINGFrancis Yan4-4/+44
This patch exports the sender chronograph stats via the socket SO_TIMESTAMPING channel. Currently we can instrument how long a particular application unit of data was queued in TCP by tracking SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having these sender chronograph stats exported simultaneously along with these timestamps allow further breaking down the various sender limitation. For example, a video server can tell if a particular chunk of video on a connection takes a long time to deliver because TCP was experiencing small receive window. It is not possible to tell before this patch without packet traces. To prepare these stats, the user needs to set SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags while requesting other SOF_TIMESTAMPING TX timestamps. When the timestamps are available in the error queue, the stats are returned in a separate control message of type SCM_TIMESTAMPING_OPT_STATS, in a list of TLVs (struct nlattr) of types: TCP_NLA_BUSY_TIME, TCP_NLA_RWND_LIMITED, TCP_NLA_SNDBUF_LIMITED. Unit is microsecond. Signed-off-by: Francis Yan <[email protected]> Signed-off-by: Yuchung Cheng <[email protected]> Signed-off-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Neal Cardwell <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-30tcp: export sender limits chronographs to TCP_INFOFrancis Yan1-0/+20
This patch exports all the sender chronograph measurements collected in the previous patches to TCP_INFO interface. Note that busy time exported includes all the other sending limits (rwnd-limited, sndbuf-limited). Internally the time unit is jiffy but externally the measurements are in microseconds for future extensions. Signed-off-by: Francis Yan <[email protected]> Signed-off-by: Yuchung Cheng <[email protected]> Signed-off-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Neal Cardwell <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-30tcp: instrument how long TCP is limited by insufficient send bufferFrancis Yan3-3/+24
This patch measures the amount of time when TCP runs out of new data to send to the network due to insufficient send buffer, while TCP is still busy delivering (i.e. write queue is not empty). The goal is to indicate either the send buffer autotuning or user SO_SNDBUF setting has resulted network under-utilization. The measurement starts conservatively by checking various conditions to minimize false claims (i.e. under-estimation is more likely). The measurement stops when the SOCK_NOSPACE flag is cleared. But it does not account the time elapsed till the next application write. Also the measurement only starts if the sender is still busy sending data, s.t. the limit accounted is part of the total busy time. Signed-off-by: Francis Yan <[email protected]> Signed-off-by: Yuchung Cheng <[email protected]> Signed-off-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Neal Cardwell <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-30tcp: instrument how long TCP is limited by receive windowFrancis Yan1-2/+9
This patch measures the total time when the TCP stops sending because the receiver's advertised window is not large enough. Note that once the limit is lifted we are likely in the busy status if we have data pending. Signed-off-by: Francis Yan <[email protected]> Signed-off-by: Yuchung Cheng <[email protected]> Signed-off-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Neal Cardwell <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-30tcp: instrument how long TCP is busy sendingFrancis Yan2-3/+19
This patch measures TCP busy time, which is defined as the period of time when sender has data (or FIN) to send. The time starts when data is buffered and stops when the write queue is flushed by ACKs or error events. Note the busy time does not include SYN time, unless data is included in SYN (i.e. Fast Open). It does include FIN time even if the FIN carries no payload. Excluding pure FIN is possible but would incur one additional test in the fast path, which may not be worth it. Signed-off-by: Francis Yan <[email protected]> Signed-off-by: Yuchung Cheng <[email protected]> Signed-off-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Neal Cardwell <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-30tcp: instrument tcp sender limits chronographsFrancis Yan1-0/+30
This patch implements the skeleton of the TCP chronograph instrumentation on sender side limits: 1) idle (unspec) 2) busy sending data other than 3-4 below 3) rwnd-limited 4) sndbuf-limited The limits are enumerated 'tcp_chrono'. Since a connection in theory can idle forever, we do not track the actual length of this uninteresting idle period. For the rest we track how long the sender spends in each limit. At any point during the life time of a connection, the sender must be in one of the four states. If there are multiple conditions worthy of tracking in a chronograph then the highest priority enum takes precedence over the other conditions. So that if something "more interesting" starts happening, stop the previous chrono and start a new one. The time unit is jiffy(u32) in order to save space in tcp_sock. This implies application must sample the stats no longer than every 49 days of 1ms jiffy. Signed-off-by: Francis Yan <[email protected]> Signed-off-by: Yuchung Cheng <[email protected]> Signed-off-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Neal Cardwell <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-30ieee802154: check device type[email protected]1-1/+5
I've observed a NULL pointer dereference in ieee802154_del_iface() during netlink fuzzing. It's the ->wpan_phy dereference here: phy = dev->ieee802154_ptr->wpan_phy; My bet is that we're not checking that this is an IEEE802154 interface, so let's do what ieee802154_nl_get_dev() is doing. (Maybe we should even be calling this directly?) Cc: Lennert Buytenhek <[email protected]> Cc: Alexander Aring <[email protected]> Cc: Marcel Holtmann <[email protected]> Cc: Dmitry Eremin-Solenikov <[email protected]> Cc: Sergey Lapin <[email protected]> Signed-off-by: Vegard Nossum <[email protected]> Acked-by: Alexander Aring <[email protected]> Signed-off-by: Stefan Schmidt <[email protected]>
2016-11-30esp6: Fix integrity verification when ESN are usedTobias Brunner1-1/+1
When handling inbound packets, the two halves of the sequence number stored on the skb are already in network order. Fixes: 000ae7b2690e ("esp6: Switch to new AEAD interface") Signed-off-by: Tobias Brunner <[email protected]> Acked-by: Herbert Xu <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
2016-11-30esp4: Fix integrity verification when ESN are usedTobias Brunner1-1/+1
When handling inbound packets, the two halves of the sequence number stored on the skb are already in network order. Fixes: 7021b2e1cddd ("esp4: Switch to new AEAD interface") Signed-off-by: Tobias Brunner <[email protected]> Acked-by: Herbert Xu <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
2016-11-30xfrm_user: fix return value from xfrm_user_rcv_msgYi Zhao1-1/+1
It doesn't support to run 32bit 'ip' to set xfrm objdect on 64bit host. But the return value is unknown for user program: ip xfrm policy list RTNETLINK answers: Unknown error 524 Replace ENOTSUPP with EOPNOTSUPP: ip xfrm policy list RTNETLINK answers: Operation not supported Signed-off-by: Yi Zhao <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
2016-11-29net: dsa: slave: fix fixed-link phydev leaksJohan Hovold1-1/+11
Make sure to deregister and free any fixed-link PHY registered using of_phy_register_fixed_link() on slave-setup errors and on slave destroy. Fixes: 0d8bcdd383b8 ("net: dsa: allow for more complex PHY setups") Signed-off-by: Johan Hovold <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-29of_mdio: add helper to deregister fixed-link PHYsJohan Hovold1-10/+2
Add helper to deregister fixed-link PHYs registered using of_phy_register_fixed_link(). Convert the two drivers that care to deregister their fixed-link PHYs to use the new helper, but note that most drivers currently fail to do so. Signed-off-by: Johan Hovold <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-29net: dsa: slave: fix of-node leak and phy priorityJohan Hovold1-2/+5
Make sure to drop the reference taken by of_parse_phandle() before returning from dsa_slave_phy_setup(). Note that this also modifies the PHY priority so that any fixed-link node is only parsed when no phy-handle is given, which is in accordance with the common scheme for this. Fixes: 0d8bcdd383b8 ("net: dsa: allow for more complex PHY setups") Signed-off-by: Johan Hovold <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-29GSO: Reload iph after pskb_may_pullArnaldo Carvalho de Melo1-1/+1
As it may get stale and lead to use after free. Acked-by: Eric Dumazet <[email protected]> Cc: Alexander Duyck <[email protected]> Cc: Andrey Konovalov <[email protected]> Fixes: cbc53e08a793 ("GSO: Add GSO type for fixed IPv4 ID") Signed-off-by: Arnaldo Carvalho de Melo <[email protected]> Acked-by: Alexander Duyck <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-29sched: cls_flower: remove from hashtable only in case skip sw flag is not setJiri Pirko1-4/+6
Be symmetric to hashtable insert and remove filter from hashtable only in case skip sw flag is not set. Fixes: e69985c67c33 ("net/sched: cls_flower: Introduce support in SKIP SW flag") Signed-off-by: Jiri Pirko <[email protected]> Reviewed-by: Amir Vadai <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-29net/dccp: fix use-after-free in dccp_invalid_packetEric Dumazet1-5/+7
pskb_may_pull() can reallocate skb->head, we need to reload dh pointer in dccp_invalid_packet() or risk use after free. Bug found by Andrey Konovalov using syzkaller. Signed-off-by: Eric Dumazet <[email protected]> Reported-by: Andrey Konovalov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-29netlink: Call cb->done from a worker threadHerbert Xu2-4/+25
The cb->done interface expects to be called in process context. This was broken by the netlink RCU conversion. This patch fixes it by adding a worker struct to make the cb->done call where necessary. Fixes: 21e4902aea80 ("netlink: Lockless lookup with RCU grace...") Reported-by: Subash Abhinov Kasiviswanathan <[email protected]> Signed-off-by: Herbert Xu <[email protected]> Acked-by: Cong Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-29net/sched: pedit: make sure that offset is validAmir Vadai1-4/+20
Add a validation function to make sure offset is valid: 1. Not below skb head (could happen when offset is negative). 2. Validate both 'offset' and 'at'. Signed-off-by: Amir Vadai <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-29xprtrdma: Relocate connection helper functionsChuck Lever3-41/+31
Clean up: Disentangle connection helpers from RPC-over-RDMA reply decoding functions. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2016-11-29xprtrdma: Update dprintk in rpcrdma_count_chunksChuck Lever1-1/+1
Clean up: offset and handle should be zero-filled, just like in the chunk encoders. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2016-11-29xprtrdma: Shorten QP access error messageChuck Lever1-3/+3
Clean up: The convention for this type of warning message is not to show the function name or "RPC: ". Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2016-11-29xprtrdma: Squelch "max send, max recv" messages at connect timeChuck Lever1-2/+2
Clean up: This message was intended to be a dprintk, as it is on the server-side. Fixes: 87cfb9a0c85c ('xprtrdma: Client-side support for ...') Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2016-11-29xprtrdma: Update documenting commentChuck Lever1-4/+0
Clean up: If reset fails, FRMRs are no longer abandoned, rather they are released immediately. Update the comment to reflect this. Fixes: 2ffc871a574d ('xprtrdma: Release orphaned MRs immediately') Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2016-11-29xprtrdma: Refactor FRMR invalidationChuck Lever1-36/+21
Clean up: After some recent updates, clarifications can be made to the FRMR invalidation logic. - Both the remote and local invalidation case mark the frmr INVALID, so make that a common path. - Manage the WR list more "tastefully" by replacing the conditional that discriminates between the list head and ->next pointers. - Use mw->mw_handle in all cases, since that has the same value as f->fr_mr->rkey, and is already in cache. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2016-11-29xprtrdma: Avoid calls to ro_unmap_safe()Chuck Lever1-2/+4
Micro-optimization: Most of the time, calls to ro_unmap_safe are expensive no-ops. Call only when there is work to do. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2016-11-29xprtrdma: Address coverity complaint about wait_for_completion()Chuck Lever1-4/+13
> ** CID 114101: Error handling issues (CHECKED_RETURN) > /net/sunrpc/xprtrdma/verbs.c: 355 in rpcrdma_create_id() Commit 5675add36e76 ("RPC/RDMA: harden connection logic against missing/late rdma_cm upcalls.") replaced wait_for_completion() calls with these two call sites. The original wait_for_completion() calls were added in the initial commit of verbs.c, which was commit c56c65fb67d6 ("RPCRDMA: rpc rdma verbs interface implementation"), but these returned void. rpcrdma_create_id() is called by the RDMA connect worker, which probably won't ever be interrupted. It is also called by rpcrdma_ia_open which is in the synchronous mount path, and ^C is possible there. Add a bit of logic at those two call sites to return if the waits return ERESTARTSYS. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2016-11-29SUNRPC: Proper metric accounting when RPC is not transmittedChuck Lever1-4/+6
I noticed recently that during an xfstests on a krb5i mount, the retransmit count for certain operations had gone negative, and the backlog value became unreasonably large. I recall that Andy has pointed this out to me in the past. When call_refresh fails to find a valid credential for an RPC, the RPC exits immediately without sending anything on the wire. This leaves rq_ntrans, rq_xtime, and rq_rtt set to zero. The solution for om_queue is to not add the to RPC's running backlog queue total whenever rq_xtime is zero. For om_ntrans, it's a bit more difficult. A zero rq_ntrans causes om_ops to become larger than om_ntrans. The design of the RPC metrics API assumes that ntrans will always be equal to or larger than the ops count. The result is that when an RPC fails to find credentials, the RPC operation's reported retransmit count, which is computed in user space as the difference between ops and ntrans, goes negative. Ideally the kernel API should report a separate retransmit and "exited before initial transmission" metric, so that user space can sort out the difference properly. To avoid kernel API changes and changes to the way rq_ntrans is used when performing transport locking, account for untransmitted RPCs so that om_ntrans keeps up with om_ops: always add one or more to om_ntrans. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2016-11-29xprtrdma: Support for SG_GAP devicesChuck Lever2-7/+14
Some devices (such as the Mellanox CX-4) can register, under a single R_key, a set of memory regions that are not contiguous. When this is done, all the segments in a Reply list, say, can then be invalidated in a single LocalInv Work Request (or via Remote Invalidation, which can invalidate exactly one R_key when completing a Receive). This means a single FastReg WR is used to register, and one or zero LocalInv WRs can invalidate, the memory involved with RDMA transfers on behalf of an RPC. In addition, xprtrdma constructs some Reply chunks from three or more segments. By registering them with SG_GAP, only one segment is needed for the Reply chunk, allowing the whole chunk to be invalidated remotely. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2016-11-29xprtrdma: Make FRWR send queue entry accounting more accurateChuck Lever3-13/+30
Verbs providers may perform house-keeping on the Send Queue during each signaled send completion. It is necessary therefore for a verbs consumer (like xprtrdma) to occasionally force a signaled send completion if it runs unsignaled most of the time. xprtrdma does not require signaled completions for Send or FastReg Work Requests, but does signal some LocalInv Work Requests. To ensure that Send Queue house-keeping can run before the Send Queue is more than half-consumed, xprtrdma forces a signaled completion on occasion by counting the number of Send Queue Entries it consumes. It currently does this by counting each ib_post_send as one Entry. Commit c9918ff56dfb ("xprtrdma: Add ro_unmap_sync method for FRWR") introduced the ability for frwr_op_unmap_sync to post more than one Work Request with a single post_send. Thus the underlying assumption of one Send Queue Entry per ib_post_send is no longer true. Also, FastReg Work Requests are currently never signaled. They should be signaled once in a while, just as Send is, to keep the accounting of consumed SQEs accurate. While we're here, convert the CQCOUNT macros to the currently preferred kernel coding style, which is inline functions. Fixes: c9918ff56dfb ("xprtrdma: Add ro_unmap_sync method for FRWR") Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2016-11-29xprtrdma: Cap size of callback buffer resourcesChuck Lever1-1/+3
When the inline threshold size is set to large values (say, 32KB) any NFSv4.1 CB request from the server gets a reply with status NFS4ERR_RESOURCE. Looks like there are some upper layer assumptions about the maximum size of a reply (for example, in process_op). Cap the size of the NFSv4 client's reply resources at a page. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2016-11-29netfilter: ipv6: nf_defrag: drop mangled skb on ream errorFlorian Westphal2-3/+3
Dmitry Vyukov reported GPF in network stack that Andrey traced down to negative nh offset in nf_ct_frag6_queue(). Problem is that all network headers before fragment header are pulled. Normal ipv6 reassembly will drop the skb when errors occur further down the line. netfilter doesn't do this, and instead passed the original fragment along. That was also fine back when netfilter ipv6 defrag worked with cloned fragments, as the original, pristine fragment was passed on. So we either have to undo the pull op, or discard such fragments. Since they're malformed after all (e.g. overlapping fragment) it seems preferrable to just drop them. Same for temporary errors -- it doesn't make sense to accept (and perhaps forward!) only some fragments of same datagram. Fixes: 029f7f3b8701cc7ac ("netfilter: ipv6: nf_defrag: avoid/free clone operations") Reported-by: Dmitry Vyukov <[email protected]> Debugged-by: Andrey Konovalov <[email protected]> Diagnosed-by: Eric Dumazet <Eric Dumazet <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Acked-by: Eric Dumazet <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-11-28net: dsa: fix unbalanced dsa_switch_tree reference countingNikita Yushchenko1-1/+3
_dsa_register_switch() gets a dsa_switch_tree object either via dsa_get_dst() or via dsa_add_dst(). Former path does not increase kref in returned object (resulting into caller not owning a reference), while later path does create a new object (resulting into caller owning a reference). The rest of _dsa_register_switch() assumes that it owns a reference, and calls dsa_put_dst(). This causes a memory breakage if first switch in the tree initialized successfully, but second failed to initialize. In particular, freed dsa_swith_tree object is left referenced by switch that was initialized, and later access to sysfs attributes of that switch cause OOPS. To fix, need to add kref_get() call to dsa_get_dst(). Fixes: 83c0afaec7b7 ("net: dsa: Add new binding implementation") Signed-off-by: Nikita Yushchenko <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-28net: handle no dst on skb in icmp6_sendDavid Ahern1-2/+4
Andrey reported the following while fuzzing the kernel with syzkaller: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Modules linked in: CPU: 0 PID: 3859 Comm: a.out Not tainted 4.9.0-rc6+ #429 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff8800666d4200 task.stack: ffff880067348000 RIP: 0010:[<ffffffff833617ec>] [<ffffffff833617ec>] icmp6_send+0x5fc/0x1e30 net/ipv6/icmp.c:451 RSP: 0018:ffff88006734f2c0 EFLAGS: 00010206 RAX: ffff8800666d4200 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000018 RBP: ffff88006734f630 R08: ffff880064138418 R09: 0000000000000003 R10: dffffc0000000000 R11: 0000000000000005 R12: 0000000000000000 R13: ffffffff84e7e200 R14: ffff880064138484 R15: ffff8800641383c0 FS: 00007fb3887a07c0(0000) GS:ffff88006cc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000000 CR3: 000000006b040000 CR4: 00000000000006f0 Stack: ffff8800666d4200 ffff8800666d49f8 ffff8800666d4200 ffffffff84c02460 ffff8800666d4a1a 1ffff1000ccdaa2f ffff88006734f498 0000000000000046 ffff88006734f440 ffffffff832f4269 ffff880064ba7456 0000000000000000 Call Trace: [<ffffffff83364ddc>] icmpv6_param_prob+0x2c/0x40 net/ipv6/icmp.c:557 [< inline >] ip6_tlvopt_unknown net/ipv6/exthdrs.c:88 [<ffffffff83394405>] ip6_parse_tlv+0x555/0x670 net/ipv6/exthdrs.c:157 [<ffffffff8339a759>] ipv6_parse_hopopts+0x199/0x460 net/ipv6/exthdrs.c:663 [<ffffffff832ee773>] ipv6_rcv+0xfa3/0x1dc0 net/ipv6/ip6_input.c:191 ... icmp6_send / icmpv6_send is invoked for both rx and tx paths. In both cases the dst->dev should be preferred for determining the L3 domain if the dst has been set on the skb. Fallback to the skb->dev if it has not. This covers the case reported here where icmp6_send is invoked on Rx before the route lookup. Fixes: 5d41ce29e ("net: icmp6_send should use dst dev to determine L3 domain") Reported-by: Andrey Konovalov <[email protected]> Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-28tcp: Set DEFAULT_TCP_CONG to bbr if DEFAULT_BBR is setJulian Wollrath1-0/+1
Signed-off-by: Julian Wollrath <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-28net/iucv: Use explicit clean up labels in iucv_init()Sebastian Andrzej Siewior1-7/+7
Ursula suggested to use explicit labels for clean up in the error path instead of one `out_free' label, which handles multiple exits, introduced in commit 38b482929e8f ("net/iucv: Convert to hotplug state machine"). Suggested-by: Ursula Braun <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: "David S. Miller" <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-11-28net, sched: respect rcu grace period on cls destructionDaniel Borkmann8-21/+31
Roi reported a crash in flower where tp->root was NULL in ->classify() callbacks. Reason is that in ->destroy() tp->root is set to NULL via RCU_INIT_POINTER(). It's problematic for some of the classifiers, because this doesn't respect RCU grace period for them, and as a result, still outstanding readers from tc_classify() will try to blindly dereference a NULL tp->root. The tp->root object is strictly private to the classifier implementation and holds internal data the core such as tc_ctl_tfilter() doesn't know about. Within some classifiers, such as cls_bpf, cls_basic, etc, tp->root is only checked for NULL in ->get() callback, but nowhere else. This is misleading and seemed to be copied from old classifier code that was not cleaned up properly. For example, d3fa76ee6b4a ("[NET_SCHED]: cls_basic: fix NULL pointer dereference") moved tp->root initialization into ->init() routine, where before it was part of ->change(), so ->get() had to deal with tp->root being NULL back then, so that was indeed a valid case, after d3fa76ee6b4a, not really anymore. We used to set tp->root to NULL long ago in ->destroy(), see 47a1a1d4be29 ("pkt_sched: remove unnecessary xchg() in packet classifiers"); but the NULLifying was reintroduced with the RCUification, but it's not correct for every classifier implementation. In the cases that are fixed here with one exception of cls_cgroup, tp->root object is allocated and initialized inside ->init() callback, which is always performed at a point in time after we allocate a new tp, which means tp and thus tp->root was not globally visible in the tp chain yet (see tc_ctl_tfilter()). Also, on destruction tp->root is strictly kfree_rcu()'ed in ->destroy() handler, same for the tp which is kfree_rcu()'ed right when we return from ->destroy() in tcf_destroy(). This means, the head object's lifetime for such classifiers is always tied to the tp lifetime. The RCU callback invocation for the two kfree_rcu() could be out of order, but that's fine since both are independent. Dropping the RCU_INIT_POINTER(tp->root, NULL) for these classifiers here means that 1) we don't need a useless NULL check in fast-path and, 2) that outstanding readers of that tp in tc_classify() can still execute under respect with RCU grace period as it is actually expected. Things that haven't been touched here: cls_fw and cls_route. They each handle tp->root being NULL in ->classify() path for historic reasons, so their ->destroy() implementation can stay as is. If someone actually cares, they could get cleaned up at some point to avoid the test in fast path. cls_u32 doesn't set tp->root to NULL. For cls_rsvp, I just added a !head should anyone actually be using/testing it, so it at least aligns with cls_fw and cls_route. For cls_flower we additionally need to defer rhashtable destruction (to a sleepable context) after RCU grace period as concurrent readers might still access it. (Note that in this case we need to hold module reference to keep work callback address intact, since we only wait on module unload for all call_rcu()s to finish.) This fixes one race to bring RCU grace period guarantees back. Next step as worked on by Cong however is to fix 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone") to get the order of unlinking the tp in tc_ctl_tfilter() for the RTM_DELTFILTER case right by moving RCU_INIT_POINTER() before tcf_destroy() and let the notification for removal be done through the prior ->delete() callback. Both are independant issues. Once we have that right, we can then clean tp->root up for a number of classifiers by not making them RCU pointers, which requires a new callback (->uninit) that is triggered from tp's RCU callback, where we just kfree() tp->root from there. Fixes: 1f947bf151e9 ("net: sched: rcu'ify cls_bpf") Fixes: 9888faefe132 ("net: sched: cls_basic use RCU") Fixes: 70da9f0bf999 ("net: sched: cls_flow use RCU") Fixes: 77b9900ef53a ("tc: introduce Flower classifier") Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier") Fixes: 952313bd6258 ("net: sched: cls_cgroup use RCU") Reported-by: Roi Dayan <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Cc: Cong Wang <[email protected]> Cc: John Fastabend <[email protected]> Cc: Roi Dayan <[email protected]> Cc: Jiri Pirko <[email protected]> Acked-by: John Fastabend <[email protected]> Acked-by: Cong Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-27bpf: reuse dev_is_mac_header_xmit for redirectDaniel Borkmann2-24/+5
Commit dcf800344a91 ("net/sched: act_mirred: Refactor detection whether dev needs xmit at mac header") added dev_is_mac_header_xmit(); since it's also useful elsewhere, move it to if_arp.h and reuse it for BPF. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-27bpf: drop useless bpf_fd member from cls/actDaniel Borkmann2-15/+1
After setup we don't need to keep user space fd number around anymore, as it also has no useful meaning for anyone, just remove it. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-27tipc: fix link statistics counter errorsJon Paul Maloy1-16/+19
In commit e4bf4f76962b ("tipc: simplify packet sequence number handling") we changed the internal representation of the packet sequence number counters from u32 to u16, reflecting what is really sent over the wire. Since then some link statistics counters have been displaying incorrect values, partially because the counters meant to be used as sequence number snapshots are now used as direct counters, stored as u32, and partially because some counter updates are just missing in the code. In this commit we correct this in two ways. First, we base the displayed packet sent/received values on direct counters instead of as previously a calculated difference between current sequence number and a snapshot. Second, we add the missing updates of the counters. This change is compatible with the current netlink API, and requires no changes to the user space tools. Signed-off-by: Jon Maloy <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-27Merge tag 'wireless-drivers-next-for-davem-2016-11-25' of ↵David S. Miller1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next Kalle Valo says: ==================== wireless-drivers-next patches for 4.10 Major changes: iwlwifi * finalize and enable dynamic queue allocation * use dev_coredumpmsg() to prevent locking the driver * small fix to pass the AID to the FW * use FW PS decisions with multi-queue ath9k * add device tree bindings * switch to use mac80211 intermediate software queues to reduce latency and fix bufferbloat wl18xx * allow scanning in AP mode ==================== Signed-off-by: David S. Miller <[email protected]>
2016-11-27Merge branch 'master' of ↵David S. Miller3-8/+39
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2016-11-25 1) Fix a refcount leak in vti6. From Nicolas Dichtel. 2) Fix a wrong if statement in xfrm_sk_policy_lookup. From Florian Westphal. 3) The flowcache watermarks are per cpu. Take this into account when comparing to the threshold where we refusing new allocations. From Miroslav Urbanek. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <[email protected]>
2016-11-27net: dsa: fix fixed-link-phy device leaksJohan Hovold1-1/+4
Make sure to drop the reference taken by of_phy_find_device() when registering and deregistering the fixed-link PHY-device. Fixes: 39b0c705195e ("net: dsa: Allow configuration of CPU & DSA port speeds/duplex") Signed-off-by: Johan Hovold <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller19-37/+69
udplite conflict is resolved by taking what 'net-next' did which removed the backlog receive method assignment, since it is no longer necessary. Two entries were added to the non-priv ethtool operations switch statement, one in 'net' and one in 'net-next, so simple overlapping changes. Signed-off-by: David S. Miller <[email protected]>
2016-11-25tipc: resolve connection flow control compatibility problemJon Paul Maloy1-1/+1
In commit 10724cc7bb78 ("tipc: redesign connection-level flow control") we replaced the previous message based flow control with one based on 1k blocks. In order to ensure backwards compatibility the mechanism falls back to using message as base unit when it senses that the peer doesn't support the new algorithm. The default flow control window, i.e., how many units can be sent before the sender blocks and waits for an acknowledge (aka advertisement) is 512. This was tested against the previous version, which uses an acknowledge frequency of on ack per 256 received message, and found to work fine. However, we missed the fact that versions older than Linux 3.15 use an acknowledge frequency of 512, which is exactly the limit where a 4.6+ sender will stop and wait for acknowledge. This would also work fine if it weren't for the fact that if the first sent message on a 4.6+ server side is an empty SYNACK, this one is also is counted as a sent message, while it is not counted as a received message on a legacy 3.15-receiver. This leads to the sender always being one step ahead of the receiver, a scenario causing the sender to block after 512 sent messages, while the receiver only has registered 511 read messages. Hence, the legacy receiver is not trigged to send an acknowledge, with a permanently blocked sender as result. We solve this deadlock by simply allowing the sender to send one more message before it blocks, i.e., by a making minimal change to the condition used for determining connection congestion. Signed-off-by: Jon Maloy <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-25net: ethtool: don't require CAP_NET_ADMIN for ETHTOOL_GLINKSETTINGSMiroslav Lichvar1-0/+1
The ETHTOOL_GLINKSETTINGS command is deprecating the ETHTOOL_GSET command and likewise it shouldn't require the CAP_NET_ADMIN capability. Signed-off-by: Miroslav Lichvar <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-25tipc: improve sanity check for received domain recordsJon Paul Maloy1-5/+5
In commit 35c55c9877f8 ("tipc: add neighbor monitoring framework") we added a data area to the link monitor STATE messages under the assumption that previous versions did not use any such data area. For versions older than Linux 4.3 this assumption is not correct. In those version, all STATE messages sent out from a node inadvertently contain a 16 byte data area containing a string; -a leftover from previous RESET messages which were using this during the setup phase. This string serves no purpose in STATE messages, and should no be there. Unfortunately, this data area is delivered to the link monitor framework, where a sanity check catches that it is not a correct domain record, and drops it. It also issues a rate limited warning about the event. Since such events occur much more frequently than anticipated, we now choose to remove the warning in order to not fill the kernel log with useless contents. We also make the sanity check stricter, to further reduce the risk that such data is inavertently admitted. Signed-off-by: Jon Maloy <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-25tipc: fix compatibility bug in link monitoringJon Paul Maloy1-2/+3
commit 817298102b0b ("tipc: fix link priority propagation") introduced a compatibility problem between TIPC versions newer than Linux 4.6 and those older than Linux 4.4. In versions later than 4.4, link STATE messages only contain a non-zero link priority value when the sender wants the receiver to change its priority. This has the effect that the receiver resets itself in order to apply the new priority. This works well, and is consistent with the said commit. However, in versions older than 4.4 a valid link priority is present in all sent link STATE messages, leading to cyclic link establishment and reset on the 4.6+ node. We fix this by adding a test that the received value should not only be valid, but also differ from the current value in order to cause the receiving link endpoint to reset. Reported-by: Amar Nv <[email protected]> Signed-off-by: Jon Maloy <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-25net: properly flush delay-freed skbsEric Dumazet1-2/+3
Typical NAPI drivers use napi_consume_skb(skb) at TX completion time. This put skb in a percpu special queue, napi_alloc_cache, to get bulk frees. It turns out the queue is not flushed and hits the NAPI_SKB_CACHE_SIZE limit quite often, with skbs that were queued hundreds of usec earlier. I measured this can take ~6000 nsec to perform one flush. __kfree_skb_flush() can be called from two points right now : 1) From net_tx_action(), but only for skbs that were queued to sd->completion_queue. -> Irrelevant for NAPI drivers in normal operation. 2) From net_rx_action(), but only under high stress or if RPS/RFS has a pending action. This patch changes net_rx_action() to perform the flush in all cases and after more urgent operations happened (like kicking remote CPUS for RPS/RFS). Signed-off-by: Eric Dumazet <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Cc: Alexander Duyck <[email protected]> Acked-by: Alexander Duyck <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-11-25net: ipv4, ipv6: run cgroup eBPF egress programsDaniel Mack2-2/+33
If the cgroup associated with the receiving socket has an eBPF programs installed, run them from ip_output(), ip6_output() and ip_mc_output(). From mentioned functions we have two socket contexts as per 7026b1ddb6b8 ("netfilter: Pass socket pointer down through okfn()."). We explicitly need to use sk instead of skb->sk here, since otherwise the same program would run multiple times on egress when encap devices are involved, which is not desired in our case. eBPF programs used in this context are expected to either return 1 to let the packet pass, or != 1 to drop them. The programs have access to the skb through bpf_skb_load_bytes(), and the payload starts at the network headers (L3). Note that cgroup_bpf_run_filter() is stubbed out as static inline nop for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if the feature is unused. Signed-off-by: Daniel Mack <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>