aboutsummaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2016-04-14soreuseport: fix ordering for mixed v4/v6 socketsCraig Gallek1-2/+7
With the SO_REUSEPORT socket option, it is possible to create sockets in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address. This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on the AF_INET6 sockets. Prior to the commits referenced below, an incoming IPv4 packet would always be routed to a socket of type AF_INET when this mixed-mode was used. After those changes, the same packet would be routed to the most recently bound socket (if this happened to be an AF_INET6 socket, it would have an IPv4 mapped IPv6 address). The change in behavior occurred because the recent SO_REUSEPORT optimizations short-circuit the socket scoring logic as soon as they find a match. They did not take into account the scoring logic that favors AF_INET sockets over AF_INET6 sockets in the event of a tie. To fix this problem, this patch changes the insertion order of AF_INET and AF_INET6 addresses in the TCP and UDP socket lists when the sockets have SO_REUSEPORT set. AF_INET sockets will be inserted at the head of the list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at the tail of the list. This will force AF_INET sockets to always be considered first. Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection") Reported-by: Maciej Żenczykowski <[email protected]> Signed-off-by: Craig Gallek <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-14packet: uses kfree_skb() for errors.Weongyo Jeong1-2/+12
consume_skb() isn't for error cases that kfree_skb() is more proper one. At this patch, it fixed tpacket_rcv() and packet_rcv() to be consistent for error or non-error cases letting perf trace its event properly. Signed-off-by: Weongyo Jeong <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-14tipc: fix a race condition leading to subscriber refcnt bugParthasarathy Bhuvaragan3-10/+17
Until now, the requests sent to topology server are queued to a workqueue by the generic server framework. These messages are processed by worker threads and trigger the registered callbacks. To reduce latency on uniprocessor systems, explicit rescheduling is performed using cond_resched() after MAX_RECV_MSG_COUNT(25) messages. This implementation on SMP systems leads to an subscriber refcnt error as described below: When a worker thread yields by calling cond_resched() in a SMP system, a new worker is created on another CPU to process the pending workitem. Sometimes the sleeping thread wakes up before the new thread finishes execution. This breaks the assumption on ordering and being single threaded. The fault is more frequent when MAX_RECV_MSG_COUNT is lowered. If the first thread was processing subscription create and the second thread processing close(), the close request will free the subscriber and the create request oops as follows: [31.224137] WARNING: CPU: 2 PID: 266 at include/linux/kref.h:46 tipc_subscrb_rcv_cb+0x317/0x380 [tipc] [31.228143] CPU: 2 PID: 266 Comm: kworker/u8:1 Not tainted 4.5.0+ #97 [31.228377] Workqueue: tipc_rcv tipc_recv_work [tipc] [...] [31.228377] Call Trace: [31.228377] [<ffffffff812fbb6b>] dump_stack+0x4d/0x72 [31.228377] [<ffffffff8105a311>] __warn+0xd1/0xf0 [31.228377] [<ffffffff8105a3fd>] warn_slowpath_null+0x1d/0x20 [31.228377] [<ffffffffa0098067>] tipc_subscrb_rcv_cb+0x317/0x380 [tipc] [31.228377] [<ffffffffa00a4984>] tipc_receive_from_sock+0xd4/0x130 [tipc] [31.228377] [<ffffffffa00a439b>] tipc_recv_work+0x2b/0x50 [tipc] [31.228377] [<ffffffff81071925>] process_one_work+0x145/0x3d0 [31.246554] ---[ end trace c3882c9baa05a4fd ]--- [31.248327] BUG: spinlock bad magic on CPU#2, kworker/u8:1/266 [31.249119] BUG: unable to handle kernel NULL pointer dereference at 0000000000000428 [31.249323] IP: [<ffffffff81099d0c>] spin_dump+0x5c/0xe0 [31.249323] PGD 0 [31.249323] Oops: 0000 [#1] SMP In this commit, we - rename tipc_conn_shutdown() to tipc_conn_release(). - move connection release callback execution from tipc_close_conn() to a new function tipc_sock_release(), which is executed before we free the connection. Thus we release the subscriber during connection release procedure rather than connection shutdown procedure. Signed-off-by: Parthasarathy Bhuvaragan <[email protected]> Acked-by: Ying Xue <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-14ipv6: udp: Do a route lookup and update during release_cbMartin KaFai Lau2-0/+21
This patch adds a release_cb for UDPv6. It does a route lookup and updates sk->sk_dst_cache if it is needed. It picks up the left-over job from ip6_sk_update_pmtu() if the sk was owned by user during the pmtu update. It takes a rcu_read_lock to protect the __sk_dst_get() operations because another thread may do ip6_dst_store() without taking the sk lock (e.g. sendmsg). Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception") Signed-off-by: Martin KaFai Lau <[email protected]> Reported-by: Wei Wang <[email protected]> Cc: Cong Wang <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Wei Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-14ipv6: datagram: Update dst cache of a connected datagram sk during pmtu updateMartin KaFai Lau2-9/+23
There is a case in connected UDP socket such that getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible sequence could be the following: 1. Create a connected UDP socket 2. Send some datagrams out 3. Receive a ICMPV6_PKT_TOOBIG 4. No new outgoing datagrams to trigger the sk_dst_check() logic to update the sk->sk_dst_cache. 5. getsockopt(IPV6_MTU) returns the mtu from the invalid sk->sk_dst_cache instead of the newly created RTF_CACHE clone. This patch updates the sk->sk_dst_cache for a connected datagram sk during pmtu-update code path. Note that the sk->sk_v6_daddr is used to do the route lookup instead of skb->data (i.e. iph). It is because a UDP socket can become connected after sending out some datagrams in un-connected state. or It can be connected multiple times to different destinations. Hence, iph may not be related to where sk is currently connected to. It is done under '!sock_owned_by_user(sk)' condition because the user may make another ip6_datagram_connect() (i.e changing the sk->sk_v6_daddr) while dst lookup is happening in the pmtu-update code path. For the sock_owned_by_user(sk) == true case, the next patch will introduce a release_cb() which will update the sk->sk_dst_cache. Test: Server (Connected UDP Socket): ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Route Details: [root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac' 2fac::/64 dev eth0 proto kernel metric 256 pref medium 2fac:face::/64 via 2fac::face dev eth0 metric 1024 pref medium A simple python code to create a connected UDP socket: import socket import errno HOST = '2fac::1' PORT = 8080 s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) s.bind((HOST, PORT)) s.connect(('2fac:face::face', 53)) print("connected") while True: try: data = s.recv(1024) except socket.error as se: if se.errno == errno.EMSGSIZE: pmtu = s.getsockopt(41, 24) print("PMTU:%d" % pmtu) break s.close() Python program output after getting a ICMPV6_PKT_TOOBIG: [root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py connected PMTU:1300 Cache routes after recieving TOOBIG: [root@arch-fb-vm1 ~]# ip -6 r show table cache 2fac:face::face via 2fac::face dev eth0 metric 0 cache expires 463sec mtu 1300 pref medium Client (Send the ICMPV6_PKT_TOOBIG): ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ scapy is used to generate the TOOBIG message. Here is the scapy script I have used: >>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', type=0x86dd)/IPv6(src='2fac::face', dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac:: 1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53) >>> sendp(p, iface='qemubr0') Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception") Signed-off-by: Martin KaFai Lau <[email protected]> Reported-by: Wei Wang <[email protected]> Cc: Cong Wang <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Wei Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-14ipv6: datagram: Refactor dst lookup and update codes to a new functionMartin KaFai Lau1-46/+57
This patch moves the route lookup and update codes for connected datagram sk to a newly created function ip6_datagram_dst_update() It will be reused during the pmtu update in the later patch. Signed-off-by: Martin KaFai Lau <[email protected]> Cc: Cong Wang <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Wei Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-14ipv6: datagram: Refactor flowi6 init codes to a new functionMartin KaFai Lau1-20/+30
Move flowi6 init codes for connected datagram sk to a newly created function ip6_datagram_flow_key_init(). Notes: 1. fl6_flowlabel is used instead of fl6.flowlabel in __ip6_datagram_connect 2. ipv6_addr_is_multicast(&fl6->daddr) is used instead of (addr_type & IPV6_ADDR_MULTICAST) in ip6_datagram_flow_key_init() This new function will be reused during pmtu update in the later patch. Signed-off-by: Martin KaFai Lau <[email protected]> Cc: Cong Wang <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Wei Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-14GSO: Support partial segmentation offloadAlexander Duyck8-22/+137
This patch adds support for something I am referring to as GSO partial. The basic idea is that we can support a broader range of devices for segmentation if we use fixed outer headers and have the hardware only really deal with segmenting the inner header. The idea behind the naming is due to the fact that everything before csum_start will be fixed headers, and everything after will be the region that is handled by hardware. With the current implementation it allows us to add support for the following GSO types with an inner TSO_MANGLEID or TSO6 offload: NETIF_F_GSO_GRE NETIF_F_GSO_GRE_CSUM NETIF_F_GSO_IPIP NETIF_F_GSO_SIT NETIF_F_UDP_TUNNEL NETIF_F_UDP_TUNNEL_CSUM In the case of hardware that already supports tunneling we may be able to extend this further to support TSO_TCPV4 without TSO_MANGLEID if the hardware can support updating inner IPv4 headers. Signed-off-by: Alexander Duyck <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-14GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID valuesAlexander Duyck4-10/+50
This patch does two things. First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field. As a result we should now be able to aggregate flows that were converted from IPv6 to IPv4. In addition this allows us more flexibility for future implementations of segmentation as we may be able to use a fixed IP ID when segmenting the flow. The second thing this does is that it places limitations on the outer IPv4 ID header in the case of tunneled frames. Specifically it forces the IP ID to be incrementing by 1 unless the DF bit is set in the outer IPv4 header. This way we can avoid creating overlapping series of IP IDs that could possibly be fragmented if the frame goes through GRO and is then resegmented via GSO. Signed-off-by: Alexander Duyck <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-14GSO: Add GSO type for fixed IPv4 IDAlexander Duyck7-15/+48
This patch adds support for TSO using IPv4 headers with a fixed IP ID field. This is meant to allow us to do a lossless GRO in the case of TCP flows that use a fixed IP ID such as those that convert IPv6 header to IPv4 headers. In addition I am adding a feature that for now I am referring to TSO with IP ID mangling. Basically when this flag is enabled the device has the option to either output the flow with incrementing IP IDs or with a fixed IP ID regardless of what the original IP ID ordering was. This is useful in cases where the DF bit is set and we do not care if the original IP ID value is maintained. Signed-off-by: Alexander Duyck <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-14ethtool: Add support for toggling any of the GSO offloadsAlexander Duyck1-0/+2
The strings were missing for several of the GSO offloads that are available. This patch provides the missing strings so that we can toggle or query any of them via the ethtool command. Signed-off-by: Alexander Duyck <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-14devlink: implement shared buffer occupancy monitoring interfaceJiri Pirko1-6/+92
User needs to monitor shared buffer occupancy. For that, he issues a snapshot command in order to instruct hardware to catch current and maximal occupancy values, and clear command in order to clear the historical maximal values. Also port-pool and tc-pool-bind command response messages are extended to carry occupancy values. Signed-off-by: Jiri Pirko <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-14devlink: add shared buffer configurationJiri Pirko1-0/+940
Define userspace API and drivers API for configuration of shared buffers. Four basic objects are defined: shared buffer - attributes are size, number of pools and TCs pool - chunk of sharedbuffer definition, it has some size and either static or dynamic threshold port pool threshold - to set per-port threshold for each pool port tc threshold bind - to bind port and TC to specified pool with threshold. Signed-off-by: Jiri Pirko <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-14Merge tag 'mac80211-for-davem-2016-04-14' of ↵David S. Miller1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== This has just the single fix from Dmitry Ivanov, adding the missing netlink notifier family check to avoid the socket close DoS problem. ==================== Signed-off-by: David S. Miller <[email protected]>
2016-04-14net: sched: do not requeue a NULL skbLars Persson1-1/+4
A failure in validate_xmit_skb_list() triggered an unconditional call to dev_requeue_skb with skb=NULL. This slowly grows the queue discipline's qlen count until all traffic through the queue stops. We take the optimistic approach and continue running the queue after a failure since it is unknown if later packets also will fail in the validate path. Fixes: 55a93b3ea780 ("qdisc: validate skb without holding lock") Signed-off-by: Lars Persson <[email protected]> Acked-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-14packet: fix heap info leak in PACKET_DIAG_MCLIST sock_diag interfaceMathias Krause1-0/+1
Because we miss to wipe the remainder of i->addr[] in packet_mc_add(), pdiag_put_mclist() leaks uninitialized heap bytes via the PACKET_DIAG_MCLIST netlink attribute. Fix this by explicitly memset(0)ing the remaining bytes in i->addr[]. Fixes: eea68e2f1a00 ("packet: Report socket mclist info via diag module") Signed-off-by: Mathias Krause <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Pavel Emelyanov <[email protected]> Acked-by: Pavel Emelyanov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-14Merge branch 'for-davem' of ↵David S. Miller1-13/+10
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
2016-04-14net: remove netdevice gso_min_segsEric Dumazet1-2/+1
After introduction of ndo_features_check(), we believe that very specific checks for rare features should not be done in core networking stack. No driver uses gso_min_segs yet, so we revert this feature and save few instructions per tx packet in fast path. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-14qdisc: constify meta_type_ops structuresJulia Lawall1-4/+4
The meta_type_ops structures are never modified, so declare them as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-13route: do not cache fib route info on local routes with oifChris Friesen1-0/+12
For local routes that require a particular output interface we do not want to cache the result. Caching the result causes incorrect behaviour when there are multiple source addresses on the interface. The end result being that if the intended recipient is waiting on that interface for the packet he won't receive it because it will be delivered on the loopback interface and the IP_PKTINFO ipi_ifindex will be set to the loopback interface as well. This can be tested by running a program such as "dhcp_release" which attempts to inject a packet on a particular interface so that it is received by another program on the same board. The receiving process should see an IP_PKTINFO ipi_ifndex value of the source interface (e.g., eth1) instead of the loopback interface (e.g., lo). The packet will still appear on the loopback interface in tcpdump but the important aspect is that the CMSG info is correct. Sample dhcp_release command line: dhcp_release eth1 192.168.204.222 02:11:33:22:44:66 Signed-off-by: Allain Legacy <[email protected]> Signed off-by: Chris Friesen <[email protected]> Reviewed-by: Julian Anastasov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-13sctp: delay calls to sk_data_ready() as much as possibleMarcelo Ricardo Leitner2-2/+9
Currently processing of multiple chunks in a single SCTP packet leads to multiple calls to sk_data_ready, causing multiple wake up signals which are costy and doesn't make it wake up any faster. With this patch it will note that the wake up is pending and will do it before leaving the state machine interpreter, latest place possible to do it realiably and cleanly. Note that sk_data_ready events are not dependent on asocs, unlike waking up writers. v2: series re-checked v3: use local vars to cleanup the code, suggested by Jakub Sitnicki Signed-off-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-13net: ipv6: Do not keep linklocal and loopback addressesDavid Ahern1-2/+10
f1705ec197e7 added the option to retain user configured addresses on an admin down. A comment to one of the later revisions suggested using the IFA_F_PERMANENT flag rather than adding a user_managed boolean to the ifaddr struct. A side effect of this change is that link local and loopback addresses are also retained which is not part of the objective of f1705ec197e7. Add check to drop those addresses. Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional") Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-13bridge: a netlink notification should be sent when those attributes are ↵Xin Long1-16/+24
changed by ioctl Now when we change the attributes of bridge or br_port by netlink, a relevant netlink notification will be sent, but if we change them by ioctl or sysfs, no notification will be sent. We should ensure that whenever those attributes change internally or from sysfs/ioctl, that a netlink notification is sent out to listeners. Also, NetworkManager will use this in the future to listen for out-of-band bridge master attribute updates and incorporate them into the runtime configuration. This patch is used for ioctl. Signed-off-by: Xin Long <[email protected]> Reviewed-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-13bridge: a netlink notification should be sent when those attributes are ↵Xin Long1-2/+3
changed by br_sysfs_if Now when we change the attributes of bridge or br_port by netlink, a relevant netlink notification will be sent, but if we change them by ioctl or sysfs, no notification will be sent. We should ensure that whenever those attributes change internally or from sysfs/ioctl, that a netlink notification is sent out to listeners. Also, NetworkManager will use this in the future to listen for out-of-band bridge master attribute updates and incorporate them into the runtime configuration. This patch is used for br_sysfs_if, and we also move br_ifinfo_notify out of store_flag. Signed-off-by: Xin Long <[email protected]> Reviewed-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-13bridge: a netlink notification should be sent when those attributes are ↵Xin Long2-37/+14
changed by br_sysfs_br Now when we change the attributes of bridge or br_port by netlink, a relevant netlink notification will be sent, but if we change them by ioctl or sysfs, no notification will be sent. We should ensure that whenever those attributes change internally or from sysfs/ioctl, that a netlink notification is sent out to listeners. Also, NetworkManager will use this in the future to listen for out-of-band bridge master attribute updates and incorporate them into the runtime configuration. This patch is used for br_sysfs_br. and we also need to remove some rtnl_trylock in old functions so that we can call it in a common one. For group_addr_store, we cannot make it use store_bridge_parm, because it's not a string-to-long convert, we will add notification on it individually. Signed-off-by: Xin Long <[email protected]> Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-13bridge: simplify the stp_state_store by calling store_bridge_parmXin Long1-15/+9
There are some repetitive codes in stp_state_store, we can remove them by calling store_bridge_parm. Signed-off-by: Xin Long <[email protected]> Reviewed-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-13bridge: simplify the forward_delay_store by calling store_bridge_parmXin Long1-17/+10
There are some repetitive codes in forward_delay_store, we can remove them by calling store_bridge_parm. Signed-off-by: Xin Long <[email protected]> Reviewed-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-13bridge: simplify the flush_store by calling store_bridge_parmXin Long1-7/+7
There are some repetitive codes in flush_store, we can remove them by calling store_bridge_parm, also, it would send rtnl notification after we add it in store_bridge_parm in the following patches. Signed-off-by: Xin Long <[email protected]> Reviewed-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-13ipv6, token: allow for clearing the current device tokenDaniel Borkmann1-4/+6
The original tokenized iid support implemented via f53adae4eae5 ("net: ipv6: add tokenized interface identifier support") didn't allow for clearing a device token as it was intended that this addressing mode was the only one active for globally scoped IPv6 addresses. Later we relaxed that restriction via 617fe29d45bd ("net: ipv6: only invalidate previously tokenized addresses"), and we should also allow for clearing tokens as there's no good reason why it shouldn't be allowed. Fixes: 617fe29d45bd ("net: ipv6: only invalidate previously tokenized addresses") Reported-by: Robin H. Johnson <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Cc: Hannes Frederic Sowa <[email protected]> Acked-by: Hannes Frederic Sowa <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-13sock: tigthen lockdep checks for sock_owned_by_userHannes Frederic Sowa4-6/+4
sock_owned_by_user should not be used without socket lock held. It seems to be a common practice to check .owned before lock reclassification, so provide a little help to abstract this check away. Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Hannes Frederic Sowa <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-13udp: do not expect udp headers in recv cmsg IP_CMSG_CHECKSUMWillem de Bruijn2-2/+3
On udp sockets, recv cmsg IP_CMSG_CHECKSUM returns a checksum over the packet payload. Since commit e6afc8ace6dd pulled the headers, taking skb->data as the start of transport header is incorrect. Use the transport header pointer. Also, when peeking at an offset from the start of the packet, only return a checksum from the start of the peeked data. Note that the cmsg does not subtract a tail checkum when reading truncated data. Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-13udp: do not expect udp headers on ioctl SIOCINQWillem de Bruijn1-2/+0
On udp sockets, ioctl SIOCINQ returns the payload size of the first packet. Since commit e6afc8ace6dd pulled the headers, the result is incorrect when subtracting header length. Remove that operation. Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-04-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller3-1/+15
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree. More specifically, they are: 1) Fix missing filter table per-netns registration in arptables, from Florian Westphal. 2) Resolve out of bound access when parsing TCP options in nf_conntrack_tcp, patch from Jozsef Kadlecsik. 3) Prefer NFPROTO_BRIDGE extensions over NFPROTO_UNSPEC in ebtables, this resolves conflict between xt_limit and ebt_limit, from Phil Sutter. ==================== Signed-off-by: David S. Miller <[email protected]>
2016-04-14netfilter: ctnetlink: remove unnecessary inliningPablo Neira Ayuso1-69/+48
Many of these functions are called from control plane path. Move ctnetlink_nlmsg_size() under CONFIG_NF_CONNTRACK_EVENTS to avoid a compilation warning when CONFIG_NF_CONNTRACK_EVENTS=n. Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-04-14netfilter: x_tables: introduce and use xt_copy_counters_from_userFlorian Westphal4-130/+89
The three variants use same copy&pasted code, condense this into a helper and use that. Make sure info.name is 0-terminated. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-04-14netfilter: x_tables: remove obsolete checkFlorian Westphal3-22/+0
Since 'netfilter: x_tables: validate targets of jumps' change we validate that the target aligns exactly with beginning of a rule, so offset test is now redundant. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-04-14netfilter: x_tables: remove obsolete overflow check for compat case tooFlorian Westphal3-6/+0
commit 9e67d5a739327c44885adebb4f3a538050be73e4 ("[NETFILTER]: x_tables: remove obsolete overflow check") left the compat parts alone, but we can kill it there as well. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-04-14netfilter: x_tables: do compat validation via translate_tableFlorian Westphal4-342/+83
This looks like refactoring, but its also a bug fix. Problem is that the compat path (32bit iptables, 64bit kernel) lacks a few sanity tests that are done in the normal path. For example, we do not check for underflows and the base chain policies. While its possible to also add such checks to the compat path, its more copy&pastry, for instance we cannot reuse check_underflow() helper as e->target_offset differs in the compat case. Other problem is that it makes auditing for validation errors harder; two places need to be checked and kept in sync. At a high level 32 bit compat works like this: 1- initial pass over blob: validate match/entry offsets, bounds checking lookup all matches and targets do bookkeeping wrt. size delta of 32/64bit structures assign match/target.u.kernel pointer (points at kernel implementation, needed to access ->compatsize etc.) 2- allocate memory according to the total bookkeeping size to contain the translated ruleset 3- second pass over original blob: for each entry, copy the 32bit representation to the newly allocated memory. This also does any special match translations (e.g. adjust 32bit to 64bit longs, etc). 4- check if ruleset is free of loops (chase all jumps) 5-first pass over translated blob: call the checkentry function of all matches and targets. The alternative implemented by this patch is to drop steps 3&4 from the compat process, the translation is changed into an intermediate step rather than a full 1:1 translate_table replacement. In the 2nd pass (step #3), change the 64bit ruleset back to a kernel representation, i.e. put() the kernel pointer and restore ->u.user.name . This gets us a 64bit ruleset that is in the format generated by a 64bit iptables userspace -- we can then use translate_table() to get the 'native' sanity checks. This has two drawbacks: 1. we re-validate all the match and target entry structure sizes even though compat translation is supposed to never generate bogus offsets. 2. we put and then re-lookup each match and target. THe upside is that we get all sanity tests and ruleset validations provided by the normal path and can remove some duplicated compat code. iptables-restore time of autogenerated ruleset with 300k chains of form -A CHAIN0001 -m limit --limit 1/s -j CHAIN0002 -A CHAIN0002 -m limit --limit 1/s -j CHAIN0003 shows no noticeable differences in restore times: old: 0m30.796s new: 0m31.521s 64bit: 0m25.674s Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-04-14netfilter: x_tables: xt_compat_match_from_user doesn't need a retvalFlorian Westphal4-50/+25
Always returned 0. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-04-14netfilter: arp_tables: simplify translate_compat_table argsFlorian Westphal1-46/+36
Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-04-14netfilter: ip6_tables: simplify translate_compat_table argsFlorian Westphal1-35/+24
Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-04-14netfilter: ip_tables: simplify translate_compat_table argsFlorian Westphal1-35/+24
Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-04-14netfilter: x_tables: validate all offsets and sizes in a ruleFlorian Westphal1-5/+76
Validate that all matches (if any) add up to the beginning of the target and that each match covers at least the base structure size. The compat path should be able to safely re-use the function as the structures only differ in alignment; added a BUILD_BUG_ON just in case we have an arch that adds padding as well. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-04-14netfilter: x_tables: check for bogus target offsetFlorian Westphal4-8/+24
We're currently asserting that targetoff + targetsize <= nextoff. Extend it to also check that targetoff is >= sizeof(xt_entry). Since this is generic code, add an argument pointing to the start of the match/target, we can then derive the base structure size from the delta. We also need the e->elems pointer in a followup change to validate matches. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-04-14netfilter: x_tables: check standard target size tooFlorian Westphal1-0/+15
We have targets and standard targets -- the latter carries a verdict. The ip/ip6tables validation functions will access t->verdict for the standard targets to fetch the jump offset or verdict for chainloop detection, but this happens before the targets get checked/validated. Thus we also need to check for verdict presence here, else t->verdict can point right after a blob. Spotted with UBSAN while testing malformed blobs. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-04-14netfilter: x_tables: add compat version of xt_check_entry_offsetsFlorian Westphal4-3/+28
32bit rulesets have different layout and alignment requirements, so once more integrity checks get added to xt_check_entry_offsets it will reject well-formed 32bit rulesets. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-04-14netfilter: x_tables: assert minimum target sizeFlorian Westphal1-0/+3
The target size includes the size of the xt_entry_target struct. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-04-14netfilter: x_tables: kill check_entry helperFlorian Westphal3-35/+24
Once we add more sanity testing to xt_check_entry_offsets it becomes relvant if we're expecting a 32bit 'config_compat' blob or a normal one. Since we already have a lot of similar-named functions (check_entry, compat_check_entry, find_and_check_entry, etc.) and the current incarnation is short just fold its contents into the callers. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-04-14netfilter: x_tables: add and use xt_check_entry_offsetsFlorian Westphal4-32/+37
Currently arp/ip and ip6tables each implement a short helper to check that the target offset is large enough to hold one xt_entry_target struct and that t->u.target_size fits within the current rule. Unfortunately these checks are not sufficient. To avoid adding new tests to all of ip/ip6/arptables move the current checks into a helper, then extend this helper in followup patches. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-04-14netfilter: x_tables: validate targets of jumpsFlorian Westphal3-0/+48
When we see a jump also check that the offset gets us to beginning of a rule (an ipt_entry). The extra overhead is negible, even with absurd cases. 300k custom rules, 300k jumps to 'next' user chain: [ plus one jump from INPUT to first userchain ]: Before: real 0m24.874s user 0m7.532s sys 0m16.076s After: real 0m27.464s user 0m7.436s sys 0m18.840s Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>