path: root/include/net
Age  Commit message  (Author, files changed, lines -removed/+added)
2018-12-04  tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT  (Eric Dumazet, 2 files, -7/+21)
TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12 as a step to enable bigger tcp sndbuf limits. It works reasonably well, but the following happens: Once the limit is reached, TCP stack generates an [E]POLLOUT event for every incoming ACK packet. This causes a high number of context switches.

This patch implements the strategy David Miller added in sock_def_write_space():

- If TCP socket has a notsent_lowat constraint of X bytes, allow sendmsg() to fill up to X bytes, but send [E]POLLOUT only if number of notsent bytes is below X/2

This considerably reduces TCP_NOTSENT_LOWAT overhead, while allowing to keep the pipe full.

Tested: 100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM

  A:/# cat /proc/sys/net/ipv4/tcp_wmem
  4096 262144 64000000
  A:/# super_netperf 100 -H B -l 1000 -- -K bbr &
  A:/# grep TCP /proc/net/sockstat
  TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904   # This is about 54 MB of memory per flow :/
  A:/# vmstat 5 5
  procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
   r  b  swpd      free   buff  cache  si  so   bi  bo     in    cs us sy id wa st
   0  0     0 256220672  13532 694976   0   0   10   0     28    14  0  1 99  0  0
   2  0     0 256320016  13532 698480   0   0  512   0 715901  5927  0 10 90  0  0
   0  0     0 256197232  13532 700992   0   0  735  13 771161  5849  0 11 89  0  0
   1  0     0 256233824  13532 703320   0   0  512  23 719650  6635  0 11 89  0  0
   2  0     0 256226880  13532 705780   0   0  642   4 775650  6009  0 12 88  0  0

  A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat
  A:/# grep TCP /proc/net/sockstat
  TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411   # 3.5 MB per flow
  A:/# vmstat 5 5   # check that context switches have not inflated too much.
  procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
   r  b  swpd      free   buff  cache  si  so   bi  bo     in    cs us sy id wa st
   2  0     0 260386512  13592 662148   0   0   10   0     17    14  0  1 99  0  0
   0  0     0 260519680  13592 604184   0   0  512  13 726843 12424  0 10 90  0  0
   1  1     0 260435424  13592 598360   0   0  512  25 764645 12925  0 10 90  0  0
   1  0     0 260855392  13592 578380   0   0  512   7 722943 13624  0 11 88  0  0
   1  0     0 260445008  13592 601176   0   0  614  34 772288 14317  0 10 90  0  0

Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Signed-off-by: David S. Miller <[email protected]>
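The hysteresis above is simple enough to sketch in a few lines (illustrative only, not the kernel code; the names are made up): a writer may keep queuing until the lowat limit X is reached, but write-space wakeups are only generated once the unsent backlog has drained below X/2.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: lowat is the notsent_lowat limit X, in bytes. */
static bool can_queue_more(uint64_t notsent_bytes, uint64_t lowat)
{
    return notsent_bytes < lowat;      /* sendmsg() may fill up to X */
}

static bool should_signal_pollout(uint64_t notsent_bytes, uint64_t lowat)
{
    return notsent_bytes < lowat / 2;  /* wake writers only below X/2 */
}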
2018-12-03  sctp: kfree_rcu asoc  (Xin Long, 1 file, -0/+2)
In sctp_hash_transport/sctp_epaddr_lookup_transport, a transport's asoc is dereferenced under rcu_read_lock while the asoc is freed without waiting for a grace period, which leads to a use-after-free panic. This patch fixes it by calling kfree_rcu so that the asoc is freed only after a grace period. Note that only the asoc's memory is delayed to free in the patch, it won't cause sk to linger longer. Thanks to Neil and Marcelo for making this clear. Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport rhashtable") Fixes: cd2b70875058 ("sctp: check duplicate node before inserting a new transport") Reported-by: [email protected] Reported-by: [email protected] Suggested-by: Neil Horman <[email protected]> Signed-off-by: Xin Long <[email protected]> Acked-by: Marcelo Ricardo Leitner <[email protected]> Acked-by: Neil Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
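The fix relies on the standard kfree_rcu() pattern, sketched here with illustrative names (the real sctp structures differ): the object embeds a struct rcu_head so the free can be deferred past a grace period, keeping concurrent rcu_read_lock() readers safe.

#include <linux/slab.h>
#include <linux/rcupdate.h>

struct my_assoc {
    int             id;
    struct rcu_head rcu;    /* needed so the free can be deferred */
};

static void my_assoc_release(struct my_assoc *asoc)
{
    /* Defer the actual kfree() until after an RCU grace period, so
     * readers holding rcu_read_lock() never see freed memory.
     */
    kfree_rcu(asoc, rcu);
}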
2018-12-03  l3mdev: add function to retrieve upper master  (Alexis Bauvin, 1 file, -0/+22)
Existing functions to retrieve the l3mdev of a device did not walk the master chain to find the upper master. This patch adds a function to find the l3mdev, even indirectly through e.g. a bridge:

  +----------+
  |          |
  | vrf-blue |
  |          |
  +----+-----+
       |
       |
  +----+-----+
  |          |
  |  br-blue |
  |          |
  +----+-----+
       |
       |
  +----+-----+
  |          |
  |   eth0   |
  |          |
  +----------+

This will properly resolve the l3mdev of eth0 to vrf-blue.

Signed-off-by: Alexis Bauvin <[email protected]> Reviewed-by: Amine Kherbouche <[email protected]> Reviewed-by: David Ahern <[email protected]> Tested-by: Amine Kherbouche <[email protected]> Signed-off-by: David S. Miller <[email protected]>
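Roughly, the lookup has to do a walk of the kind sketched below (simplified, with an illustrative function name; the real helper also deals with ifindex lookup and the exact locking rules): follow master uppers until an L3 master device such as a VRF is found.

#include <linux/netdevice.h>

/* Illustrative only: walk master uppers (eth0 -> br-blue -> vrf-blue)
 * and return the ifindex of the first L3 master found, or 0.
 * Caller is assumed to hold rcu_read_lock().
 */
static int example_l3mdev_upper_ifindex(struct net_device *dev)
{
    while (dev) {
        if (netif_is_l3_master(dev))
            return dev->ifindex;
        dev = netdev_master_upper_dev_get_rcu(dev);
    }
    return 0;
}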
2018-12-03  udp_tunnel: add config option to bind to a device  (Alexis Bauvin, 1 file, -0/+1)
UDP tunnel sockets are always opened unbound to a specific device. This patch allows the socket to be bound to a custom device, which incidentally makes UDP tunnels VRF-aware if bound to an l3mdev. Signed-off-by: Alexis Bauvin <[email protected]> Reviewed-by: Amine Kherbouche <[email protected]> Tested-by: Amine Kherbouche <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-12-03  devlink: Add 'fw_load_policy' generic parameter  (Shalom Toledo, 1 file, -0/+4)
Many drivers load the device's firmware image during the initialization flow either from the flash or from the disk. Currently this option is not controlled by the user and the driver decides from where to load the firmware image. 'fw_load_policy' gives the ability to control this option which allows the user to choose between different loading policies supported by the driver. This parameter can be useful while testing and/or debugging the device. For example, testing a firmware bug fix. Signed-off-by: Shalom Toledo <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-12-01  netfilter: nat: remove l4 protocol port rovers  (Florian Westphal, 1 file, -1/+1)
This is a leftover from days where single-cpu systems were common: store the last port used to resolve a clash and use it as a starting point when the next conflict needs to be resolved. When we have parallel attempts to connect to the same address:port pair, it's likely that both cores end up computing the same "available" port, as both use the same starting port, and newly used ports won't become visible to other cores until the conntrack gets confirmed later. One of the cores then has to drop the packet at insertion time because the chosen new tuple turns out to be in use after all. Let's simplify this: remove the port rover and use a pseudo-random starting point. Note that this doesn't make netfilter default to 'fully random' mode; the 'rover' was only used if NAT could not reuse the source port as-is. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
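The replacement strategy can be sketched as follows (illustrative, not the actual netfilter code): pick the starting point of the port search at random instead of reading a shared per-protocol rover, so parallel CPUs are unlikely to race for the same candidate port.

#include <linux/random.h>

/* Illustrative: choose a random starting port in [min_port, max_port]
 * rather than a stored, last-used "rover" value shared by all CPUs.
 */
static u16 example_pick_start_port(u16 min_port, u16 max_port)
{
    u32 range = (u32)max_port - min_port + 1;

    return min_port + (u16)(prandom_u32() % range);
}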
2018-11-30  net: reorder flowi_common fields to avoid holes  (Paolo Abeni, 1 file, -1/+1)
The flowi* structures are used and memset by several functions in the critical path. Currently flowi_common has a couple of holes that we can eliminate by reordering the struct fields. As a side effect, both flowi4 and flowi6 shrink by 8 bytes.

Before:

  pahole -EC flowi_common
  struct flowi_common {
          // ...
          /* size: 40, cachelines: 1, members: 10 */
          /* sum members: 32, holes: 1, sum holes: 4 */
          /* padding: 4 */
          /* last cacheline: 40 bytes */
  };
  pahole -EC flowi6
  struct flowi6 {
          // ...
          /* size: 88, cachelines: 2, members: 6 */
          /* padding: 4 */
          /* last cacheline: 24 bytes */
  };
  pahole -EC flowi4
  struct flowi4 {
          // ...
          /* size: 56, cachelines: 1, members: 4 */
          /* padding: 4 */
          /* last cacheline: 56 bytes */
  };

After:

  struct flowi_common {
          // ...
          /* size: 32, cachelines: 1, members: 10 */
          /* last cacheline: 32 bytes */
  };
  struct flowi6 {
          // ...
          /* size: 80, cachelines: 2, members: 6 */
          /* padding: 4 */
          /* last cacheline: 16 bytes */
  };
  struct flowi4 {
          // ...
          /* size: 48, cachelines: 1, members: 4 */
          /* padding: 4 */
          /* last cacheline: 48 bytes */
  };

Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-30  tcp: md5: add tcp_md5_needed jump label  (Eric Dumazet, 1 file, -3/+15)
Most linux hosts never set up TCP MD5 keys. We can avoid a cache line miss (accessing tp->md5sig_info) on RX and TX using a jump label. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
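A minimal sketch of the jump-label pattern (the key name follows the commit subject; the surrounding functions are illustrative): the branch compiles to a NOP until the key is enabled, so hosts that never configure MD5 keys pay nothing for the check.

#include <linux/jump_label.h>

DEFINE_STATIC_KEY_FALSE(tcp_md5_needed);

static bool md5_lookup_needed(void)
{
    /* Patched out at runtime: a NOP until the key is enabled. */
    return static_branch_unlikely(&tcp_md5_needed);
}

/* Enabling side, e.g. when the first MD5 key is installed on a socket. */
static void enable_md5_path(void)
{
    static_branch_enable(&tcp_md5_needed);
}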
2018-11-30  tcp: make tcp_space() aware of socket backlog  (Eric Dumazet, 1 file, -1/+1)
Jean-Louis Dupond reported poor iscsi TCP receive performance that we tracked to backlog drops. Apparently we fail to send window updates reflecting the fact that we are under stress. Note that we might lack a proper window increase when backlog is fully processed, since __release_sock() clears sk->sk_backlog.len _after_ all skbs have been processed. This should not matter in practice. If we had a significant load through socket backlog, we are in a dangerous situation. Reported-by: Jean-Louis Dupond <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Neal Cardwell <[email protected]> Acked-by: Yuchung Cheng <[email protected]> Tested-by: Jean-Louis Dupond<[email protected]> Signed-off-by: David S. Miller <[email protected]>
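An approximate sketch of the idea (not the exact diff): bytes queued in the socket backlog are treated as already-consumed receive memory, so the computed receive space, and hence the advertised window, shrinks while the backlog builds up.

#include <net/tcp.h>

/* Illustrative approximation: discount backlogged bytes when computing
 * the available receive space.
 */
static int example_tcp_space(const struct sock *sk)
{
    return tcp_win_from_space(sk, sk->sk_rcvbuf -
                                  sk->sk_backlog.len -
                                  atomic_read(&sk->sk_rmem_alloc));
}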
2018-11-30  tcp: hint compiler about sack flows  (Eric Dumazet, 1 file, -1/+1)
Tell the compiler that most TCP flows are using SACK these days. There is no need to add the unlikely() clause in tcp_is_reno(), the compiler is able to infer it. Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Neal Cardwell <[email protected]> Acked-by: Yuchung Cheng <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-30  net/flow_dissector: correct comments on enum flow_dissector_key_id  (Edward Cree, 1 file, -3/+3)
There are no such structs flow_dissector_key_flow_vlan or flow_dissector_key_flow_tags, the actual structs used are struct flow_dissector_key_vlan and struct flow_dissector_key_tags. So correct the comments against FLOW_DISSECTOR_KEY_VLAN, FLOW_DISSECTOR_KEY_FLOW_LABEL and FLOW_DISSECTOR_KEY_CVLAN to refer to those. Signed-off-by: Edward Cree <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-28  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller, 2 files, -2/+2)
Trivial conflict in net/core/filter.c, a locally computed 'sdif' is now an argument to the function. Signed-off-by: David S. Miller <[email protected]>
2018-11-28  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf  (David S. Miller, 2 files, -2/+2)
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Disable BH while holding list spinlock in nf_conncount, from Taehee Yoo.
2) List corruption in nf_conncount, also from Taehee.
3) Fix race that results in leaving around an empty list node in nf_conncount, from Taehee Yoo.
4) Proper chain handling for inactive chains from the commit path, from Florian Westphal. This includes a selftest for this.
5) Do duplicate rule handles when replacing rules, also from Florian.
6) Remove net_exit path in xt_RATEEST that results in splat, from Taehee.
7) Possible use-after-free in nft_compat when releasing extensions. From Florian.
8) Memory leak in xt_hashlimit, from Taehee.
9) Call ip_vs_dst_notifier after ipv6_dev_notf, from Xin Long.
10) Fix cttimeout with udplite and gre, from Florian.
11) Preserve oif for IPv6 link-local generated traffic from mangle table, from Alin Nastac.
12) Missing error handling in masquerade notifiers, from Taehee Yoo.
13) Use mutex to protect registration/unregistration of masquerade extensions in order to prevent a race, from Taehee.
14) Incorrect condition check in tree_nodes_free(), also from Taehee.
15) Fix chain counter leak in rule replacement path, from Taehee.
====================

Signed-off-by: David S. Miller <[email protected]>
2018-11-27  netfilter: add missing error handling code for register functions  (Taehee Yoo, 2 files, -2/+2)
register_{netdevice/inetaddr/inet6addr}_notifier may return an error value, this patch adds the code to handle these error paths. Signed-off-by: Taehee Yoo <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-11-24  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller, 1 file, -0/+12)
2018-11-23  switchdev: Replace port obj add/del SDO with a notification  (Petr Machata, 1 file, -9/+0)
Drop switchdev_ops.switchdev_port_obj_add and _del. Drop the uses of this field from all clients, which were migrated to use switchdev notification in the previous patches. Add a new function switchdev_port_obj_notify() that sends the switchdev notifications SWITCHDEV_PORT_OBJ_ADD and _DEL. Update switchdev_port_obj_del_now() to dispatch to this new function. Drop __switchdev_port_obj_add() and update switchdev_port_obj_add() likewise. Signed-off-by: Petr Machata <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-23  switchdev: Add helpers to aid traversal through lower devices  (Petr Machata, 1 file, -0/+33)
After the transition from switchdev operations to notifier chain (which will take place in following patches), the onus is on the driver to find its own devices below possible layer of LAG or other uppers. The logic to do so is fairly repetitive: each driver is looking for its own devices among the lowers of the notified device. For those that it finds, it calls a handler. To indicate that the event was handled, struct switchdev_notifier_port_obj_info.handled is set. The differences lie only in what constitutes an "own" device and what handler to call. Therefore abstract this logic into two helpers, switchdev_handle_port_obj_add() and switchdev_handle_port_obj_del(). If a driver only supports physical ports under a bridge device, it will simply avoid this layer of indirection. One area where this helper diverges from the current switchdev behavior is the case of mixed lowers, some of which are switchdev ports and some of which are not. Previously, such scenario would fail with -EOPNOTSUPP. The helper could do that for lowers for which the passed-in predicate doesn't hold. That would however break the case that switchdev ports from several different drivers are stashed under one master, a scenario that switchdev currently happily supports. Therefore tolerate any and all unknown netdevices, whether they are backed by a switchdev driver or not. Signed-off-by: Petr Machata <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
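The traversal pattern can be sketched like this (condensed, with illustrative names; the field names follow the commit message and the real helpers handle the callback signatures and error propagation in more detail): run the handler on every lower device the predicate recognizes, mark the event handled, and tolerate unknown netdevices.

#include <linux/netdevice.h>
#include <net/switchdev.h>

/* Illustrative: dispatch a port-object add to all "own" lower devices. */
static int example_handle_port_obj_add(struct net_device *dev,
        struct switchdev_notifier_port_obj_info *info,
        bool (*check_cb)(const struct net_device *dev),
        int (*add_cb)(struct net_device *dev,
                      const struct switchdev_obj *obj))
{
    struct net_device *lower;
    struct list_head *iter;
    int err;

    if (check_cb(dev)) {
        err = add_cb(dev, info->obj);
        if (!err)
            info->handled = true;   /* tell switchdev someone cared */
        return err;
    }

    /* Tolerate lowers we don't recognize; another driver may own them. */
    netdev_for_each_lower_dev(dev, lower, iter) {
        err = example_handle_port_obj_add(lower, info, check_cb, add_cb);
        if (err)
            return err;
    }
    return 0;
}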
2018-11-23  switchdev: Add SWITCHDEV_PORT_OBJ_ADD, SWITCHDEV_PORT_OBJ_DEL  (Petr Machata, 1 file, -0/+10)
An offloading driver may need to have access to switchdev events on ports that aren't directly under its control. An example is a VXLAN port attached to a bridge offloaded by a driver. The driver needs to know about VLANs configured on the VXLAN device. However the VXLAN device isn't stashed between the bridge and a front-panel-port device (such as is the case e.g. for LAG devices), so the usual switchdev ops don't reach the driver. VXLAN is likely not the only device type like this: in theory any L2 tunnel device that needs offloading will prompt requirement of this sort. This falsifies the assumption that only the lower devices of a front panel port need to be notified to achieve flawless offloading. A way to fix this is to give up the notion of port object addition / deletion as a switchdev operation, which assumes somewhat tight coupling between the message producer and consumer. And instead send the message over a notifier chain. To that end, introduce two new switchdev notifier types, SWITCHDEV_PORT_OBJ_ADD and SWITCHDEV_PORT_OBJ_DEL. These notifier types communicate the same event as the corresponding switchdev op, except in a form of a notification. struct switchdev_notifier_port_obj_info was added to carry the fields that the switchdev op carries. An additional field, handled, will be used to communicate back to switchdev that the event has reached an interested party, which will be important for the two-phase commit. The two switchdev operations themselves are kept in place. Following patches first convert individual clients to the notifier protocol, and only then are the operations removed. Signed-off-by: Petr Machata <[email protected]> Acked-by: Jiri Pirko <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-23  switchdev: Add a blocking notifier chain  (Petr Machata, 1 file, -0/+27)
In general one can't assume that a switchdev notifier is called in a non-atomic context, and correspondingly, the switchdev notifier chain is an atomic one. However, port object addition and deletion messages are delivered from a process context. Even the MDB addition messages, whose delivery is scheduled from atomic context, are queued and the delivery itself takes place in blocking context. For VLAN messages in particular, keeping the blocking nature is important for error reporting. Therefore introduce a blocking notifier chain and related service functions to distribute the notifications for which a blocking context can be assumed. Signed-off-by: Petr Machata <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-23  switchdev: SWITCHDEV_OBJ_PORT_{VLAN, MDB}(): Sanitize  (Petr Machata, 1 file, -4/+4)
The two macros SWITCHDEV_OBJ_PORT_VLAN() and SWITCHDEV_OBJ_PORT_MDB() expand to a container_of() call, yielding an appropriate container of their sole argument. However, due to a name collision, the first argument, i.e. the contained object pointer, is not the only one to get expanded. The third argument, which is a structure member name, and should be kept literal, gets expanded as well. The only safe way to use these two macros is therefore to name the local variable passed to them "obj". To fix this, rename the sole argument of the two macros from "obj" (which collides with the member name) to "OBJ". Additionally, instead of passing "OBJ" to container_of() verbatim, parenthesize it, so that a comma in the passed-in expression doesn't pollute the container_of() invocation. Signed-off-by: Petr Machata <[email protected]> Acked-by: Jiri Pirko <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
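Sketched from the description above, the sanitized macro ends up shaped roughly like this (reconstructed, shown for the VLAN variant only):

/* Before (argument name collides with the member name, so the third,
 * literal container_of() argument "obj" got expanded too):
 *
 *   #define SWITCHDEV_OBJ_PORT_VLAN(obj) \
 *           container_of(obj, struct switchdev_obj_port_vlan, obj)
 *
 * After (argument renamed to OBJ and parenthesized on use, so a
 * comma-containing expression can't pollute the expansion):
 */
#define SWITCHDEV_OBJ_PORT_VLAN(OBJ) \
        container_of((OBJ), struct switchdev_obj_port_vlan, obj)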
2018-11-23  xfrm_user: fix freeing of xfrm states on acquire  (Mathias Krause, 1 file, -0/+1)
Commit 565f0fa902b6 ("xfrm: use a dedicated slab cache for struct xfrm_state") moved xfrm state objects to use their own slab cache. However, it missed to adapt xfrm_user to use this new cache when freeing xfrm states. Fix this by introducing and make use of a new helper for freeing xfrm_state objects. Fixes: 565f0fa902b6 ("xfrm: use a dedicated slab cache for struct xfrm_state") Reported-by: Pan Bian <[email protected]> Cc: <[email protected]> # v4.18+ Signed-off-by: Mathias Krause <[email protected]> Acked-by: Herbert Xu <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
2018-11-21  vxlan: Add hardware FDB learning  (Petr Machata, 1 file, -0/+2)
In order to allow devices to signal learning events to VXLAN, introduce two new switchdev messages: SWITCHDEV_VXLAN_FDB_ADD_TO_BRIDGE and SWITCHDEV_VXLAN_FDB_DEL_TO_BRIDGE. Listen to these notifications in the vxlan driver. The FDB entries learned this way have an NTF_EXT_LEARNED flag, and only entries marked as such can be unlearned by the _DEL_ event. They are also immediately marked as offloaded. This is the same behavior that the bridge driver observes. Signed-off-by: Petr Machata <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-21  vxlan: Mark user-added FDB entries  (Petr Machata, 1 file, -0/+1)
The VXLAN driver needs to differentiate between FDB entries learned by the VXLAN driver, and those added by the user. The latter ones shouldn't be taken over by external learning events. This is in accordance with bridge behavior. Therefore, extend the flags bitfield to 16 bits and add a new private NTF flag to mark the user-added entries. This seems preferable to adding a dedicated boolean, because passing the flag, unlike passing e.g. a true, makes it clear what the meaning of the bit is. Signed-off-by: Petr Machata <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
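A sketch of the approach (the flag name and bit value here are illustrative, not necessarily the driver's): widen the FDB flags storage to 16 bits and reserve a bit above the 8-bit UAPI NTF_* range for entries added by the user.

#include <linux/types.h>
#include <linux/neighbour.h>    /* UAPI NTF_* flags occupy the low 8 bits */

/* Hypothetical private flag, chosen outside the 8-bit UAPI range. */
#define NTF_EXAMPLE_ADDED_BY_USER   0x100

struct example_fdb_entry {
    u16 flags;  /* widened from 8 bits so a private flag fits */
};

static bool example_added_by_user(const struct example_fdb_entry *f)
{
    return f->flags & NTF_EXAMPLE_ADDED_BY_USER;
}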
2018-11-19  net: sched: cls_u32: add res to offload information  (Jakub Kicinski, 1 file, -0/+1)
In case of egress offloads the class/flowid assigned by the filter may be very important for offloaded Qdisc selection. Provide this info to drivers. Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: John Hurley <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-19  net: sched: gred: support reporting stats from offloads  (Jakub Kicinski, 1 file, -0/+8)
Allow drivers which offload GRED to report back statistics. Since a lot of the GRED stats are fairly ad hoc in nature, pass to drivers the standard struct gnet_stats_basic/gnet_stats_queue pairs, and untangle the values in the core. Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: John Hurley <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-19  net: sched: gred: add basic Qdisc offload  (Jakub Kicinski, 1 file, -0/+36)
Add basic offload for the GRED Qdisc. Inform the drivers any time Qdisc or virtual queue configuration changes. Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: John Hurley <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-19  Revert "sctp: remove sctp_transport_pmtu_check"  (Xin Long, 1 file, -0/+12)
This reverts commit 22d7be267eaa8114dcc28d66c1c347f667d7878a. The dst's mtu in transport can be updated by a non sctp place like in xfrm where the MTU information didn't get synced between asoc, transport and dst, so it is still needed to do the pmtu check in sctp_packet_config. Acked-by: Neil Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-19  sctp: rename enum sctp_event to sctp_event_type  (Xin Long, 2 files, -3/+3)
sctp_event is a structure name defined in RFC for sockopt SCTP_EVENT. To avoid the conflict, rename it. Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-19  sctp: add subscribe per asoc  (Xin Long, 1 file, -0/+2)
The member subscribe should be per asoc, so that the sockopt SCTP_EVENT in the next patch can subscribe to an event on one asoc only. Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-19  sctp: define subscribe in sctp_sock as __u16  (Xin Long, 2 files, -16/+25)
The member subscribe in sctp_sock is used to indicate to which of the events it is subscribed, more like a group of flags. So it's better to be defined as __u16 (2 bytes), instead of struct sctp_event_subscribe (13 bytes). Note that sctp_event_subscribe is a UAPI struct, used on sockopt calls, and thus it will not be removed. This patch only changes the internal storage of the flags. Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
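The internal representation amounts to a per-event bit in a 16-bit mask, sketched here with illustrative helper names (not the actual sctp helpers):

#include <linux/types.h>

/* Illustrative: map an event type (1-based, as in the SCTP UAPI enums)
 * to a bit in a 16-bit subscription mask.
 */
static inline __u16 example_event_bit(int event_type)
{
    return 1 << (event_type - 1);
}

static inline bool example_event_enabled(__u16 subscribe, int event_type)
{
    return subscribe & example_event_bit(event_type);
}

static inline void example_event_set(__u16 *subscribe, int event_type, bool on)
{
    if (on)
        *subscribe |= example_event_bit(event_type);
    else
        *subscribe &= ~example_event_bit(event_type);
}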
2018-11-19  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller, 1 file, -1/+2)
2018-11-17  net: align gnet_stats_basic_cpu struct  (Eric Dumazet, 1 file, -1/+1)
This structure is small (12 or 16 bytes depending on 64bit or 32bit kernels), but we do not want it spanning two cache lines. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-16  tcp: clean up STATE_TRACE  (Yafang Shao, 1 file, -12/+0)
Currently we can use bpf or tcp tracepoint to conveniently trace the tcp state transition at the run time. So we don't need to do this stuff at the compile time anymore. Signed-off-by: Yafang Shao <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-15  net: get rid of __tcp_checksum_complete()  (Cong Wang, 1 file, -6/+1)
__tcp_checksum_complete() is 100% same with __skb_checksum_complete() and there is no other caller except tcp_checksum_complete(). So, just use __skb_checksum_complete() there. Signed-off-by: Cong Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-15  rxrpc: Fix life check  (David Howells, 1 file, -1/+2)
The life-checking function, which is used by kAFS to make sure that a call is still live in the event of a pending signal, only samples the received packet serial number counter; it doesn't actually provoke a change in the counter, rather relying on the server to happen to give us a packet in the time window. Fix this by adding a function to force a ping to be transmitted. kAFS then keeps track of whether there's been a stall, and if so, uses the new function to ping the server, resetting the timeout to allow the reply to come back. If there's a stall, a ping is sent, and if the call is *still* stalled in the same place after another period, then the call will be aborted. Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals") Fixes: f4d15fb6f99a ("rxrpc: Provide functions for allowing cleaner handling of signals") Signed-off-by: David Howells <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-14  net: sched: red: notify drivers about RED's limit parameter  (Jakub Kicinski, 1 file, -0/+1)
RED qdisc's limit parameter changes the behaviour of the qdisc, for instance if it's set to 0 qdisc will drop all the packets. When replace operation happens and parameter is set to non-0 a new fifo qdisc will be instantiated and replace the old child qdisc which will be destroyed. Drivers need to know the parameter, even if they don't impose the actual limit to be able to reliably reconstruct the Qdisc hierarchy. Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: John Hurley <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-14  net: sched: mq: offload a graft notification  (Jakub Kicinski, 1 file, -1/+10)
Drivers offloading Qdiscs should have reasonable certainty the offloaded behaviour matches the SW path. This is impossible if the driver does not know about all Qdiscs or when Qdiscs move and are reused. Send a graft notification from MQ. Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: John Hurley <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-14  net: sched: red: offload a graft notification  (Jakub Kicinski, 1 file, -0/+2)
Drivers offloading Qdiscs should have reasonable certainty the offloaded behaviour matches the SW path. This is impossible if the driver does not know about all Qdiscs or when Qdiscs move and are reused. Send a graft notification from RED. The drivers are expected to simply stop offloading the Qdisc, if a non-standard child is ever grafted onto it. Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: John Hurley <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-14  net: sched: provide notification for graft on root  (Jakub Kicinski, 1 file, -0/+10)
Drivers are currently not notified when a Qdisc is grafted as root. This requires special-casing Qdiscs added with parent = TC_H_ROOT in the driver. Also there is no notification sent to the driver when an existing Qdisc is grafted as root. Add these very simple notifications; drivers should now be able to track their Qdisc tree fully. Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: John Hurley <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-12  sctp: process sk_reuseport in sctp_get_port_local  (Xin Long, 1 file, -1/+3)
When socks' sk_reuseport is set, the same port and address are allowed to be bound by multiple socks that have the same uid. Note that the difference from sk_reuse is that it allows multiple socks to listen on the same port and address. Acked-by: Neil Horman <[email protected]> Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-12  sctp: add sock_reuseport for the sock in __sctp_hash_endpoint  (Xin Long, 2 files, -1/+3)
This is a part of sk_reuseport support for sctp. It defines a helper sctp_bind_addrs_check() to check if the bind_addrs in two socks are matched. It will add sock_reuseport if they are completely matched, and return err if they are partly matched, and alloc sock_reuseport if all socks are not matched at all. It will work until sk_reuseport support is added in sctp_get_port_local() in the next patch. v1->v2: - use 'laddr->valid && laddr2->valid' check instead as Marcelo pointed in sctp_bind_addrs_check(). Acked-by: Neil Horman <[email protected]> Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-12  netfilter: nf_flow_table: make nf_flow_table_iterate() static  (Taehee Yoo, 1 file, -4/+0)
nf_flow_table_iterate() is local function, make it static. Signed-off-by: Taehee Yoo <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-11-11  net: sched: register callbacks for indirect tc block binds  (John Hurley, 2 files, -0/+37)
Currently drivers can register to receive TC block bind/unbind callbacks by implementing the setup_tc ndo in any of their given netdevs. However, drivers may also be interested in binds to higher level devices (e.g. tunnel drivers) to potentially offload filters applied to them. Introduce indirect block devs which allow drivers to register callbacks for block binds on other devices. The callback is triggered when the device is bound to a block, allowing the driver to register for rules applied to that block using already available functions. Freeing an indirect block callback will trigger an unbind event (if necessary) to direct the driver to remove any offloaded rules and unreg any block rule callbacks. It is the responsibility of the implementing driver to clean any registered indirect block callbacks before exiting, if the block is still active at such a time. Allow registering an indirect block dev callback for a device that is already bound to a block. In this case (if it is an ingress block), register and also trigger the callback meaning that any already installed rules can be replayed to the calling driver. Signed-off-by: John Hurley <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-11-09  xfrm: policy: store inexact policies in a tree ordered by destination address  (Florian Westphal, 1 file, -0/+1)
This adds inexact lists per destination network, stored in a search tree. Inexact lookups now return two 'candidate lists', the 'any' policies ('any' destinations), and a list of policies that share the same daddr/prefix. Next patch will add a second search tree for 'saddr:any' policies so we can avoid placing those on the 'any:any' list too. Signed-off-by: Florian Westphal <[email protected]> Acked-by: David S. Miller <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
2018-11-09  xfrm: policy: add inexact policy search tree infrastructure  (Florian Westphal, 1 file, -0/+1)
At this time inexact policies are all searched in-order until the first match is found. After removal of the flow cache, this resolution has to be performed for every packet, resulting in major slowdown when the number of inexact policies is high. This adds infrastructure to later sort inexact policies into a tree. This only introduces a single class: any:any. Next patch will add a search tree to pre-sort policies that have a fixed daddr/prefixlen, so in this patch the any:any class will still be used for all policies. Signed-off-by: Florian Westphal <[email protected]> Acked-by: David S. Miller <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
2018-11-09  xfrm: policy: store inexact policies in an rhashtable  (Florian Westphal, 2 files, -0/+3)
Switch packet-path lookups for inexact policies to rhashtable. In this initial version, we now no longer need to search policies with non-matching address family and type. Next patch will add the if_id as well so lookups from the xfrm interface driver only need to search inexact policies for that device. Future patches will augment the hlist in each rhash bucket with a tree and pre-sort policies according to daddr/prefix. A single rhashtable is used. In order to avoid a full rhashtable walk on netns exit, the bins get placed on a pernet list, i.e. we add almost no cost for network namespaces that had no xfrm policies. The inexact lists are kept in place, and policies are added to both the per-rhash-inexact list and a pernet one. The latter is needed for the control plane to handle migrate -- these requests do not consider the if_id, so if we'd remove the inexact_list now we would have to search all hash buckets and then figure out which matching policy candidate is the most recent one -- this appears a bit harder than just keeping the 'old' inexact list for this purpose. Signed-off-by: Florian Westphal <[email protected]> Acked-by: David S. Miller <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
2018-11-09  {nl,mac}80211: add rssi to mesh candidates  (Bob Copeland, 1 file, -1/+2)
When peering is in userspace, some implementations may want to control which peers are accepted based on RSSI in addition to the information elements being sent today. Add signal level so that info is available to clients. Signed-off-by: Bob Copeland <[email protected]> Signed-off-by: Johannes Berg <[email protected]>
2018-11-09  {nl,mac}80211: add dot11MeshConnectedToMeshGate to meshconf  (Bob Copeland, 1 file, -0/+5)
When userspace is controlling mesh routing, it may have better knowledge about whether a mesh STA is connected to a mesh gate than the kernel mpath table. Add dot11MeshConnectedToMeshGate to the mesh config so that such applications can explicitly signal that a mesh STA is connected to a gate, which will then be advertised in the beacon. Signed-off-by: Bob Copeland <[email protected]> Signed-off-by: Johannes Berg <[email protected]>
2018-11-09  {nl,mac}80211: report gate connectivity in station info  (Bob Copeland, 1 file, -0/+3)
Capture the current state of gate connectivity from the mesh formation field in mesh config whenever we receive a beacon, and report that via GET_STATION. This allows applications doing mesh peering in userspace to make peering decisions based on peers' current upstream connectivity. Signed-off-by: Bob Copeland <[email protected]> Signed-off-by: Johannes Berg <[email protected]>
2018-11-09  mac80211: allow hardware scan to fall back to software  (Johannes Berg, 1 file, -0/+5)
In some cases, like in the rsi driver hardware scan offload, there may be scenarios in which hardware scan might not be available or desirable. Allow drivers to cope with this by letting them fall back to software scan by returning the special value 1 from the hardware scan method. Requested-by: Sushant Kumar Mishra <[email protected]> Requested-by: Siva Rebbagondla <[email protected]> Signed-off-by: Johannes Berg <[email protected]>
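A sketch of how a driver could use the fallback (the driver structure and condition are hypothetical; the contract described by the commit is only that returning 1 from the hw_scan op makes mac80211 fall back to software scan):

#include <net/mac80211.h>

struct example_priv {
    bool fw_scan_supported;     /* hypothetical driver state */
};

static int example_fw_start_scan(struct example_priv *priv,
                                 struct ieee80211_scan_request *req)
{
    /* hypothetical: hand the scan request to firmware */
    return 0;
}

static int example_hw_scan(struct ieee80211_hw *hw,
                           struct ieee80211_vif *vif,
                           struct ieee80211_scan_request *req)
{
    struct example_priv *priv = hw->priv;

    /* Hardware scan unavailable or undesirable right now: returning
     * the special value 1 tells mac80211 to do a software scan instead.
     */
    if (!priv->fw_scan_supported)
        return 1;

    return example_fw_start_scan(priv, req);
}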