aboutsummaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2016-12-06tcp: warn on bogus MSS and try to amend itMarcelo Ricardo Leitner1-1/+21
There have been some reports lately about TCP connection stalls caused by NIC drivers that aren't setting gso_size on aggregated packets on rx path. This causes TCP to assume that the MSS is actually the size of the aggregated packet, which is invalid. Although the proper fix is to be done at each driver, it's often hard and cumbersome for one to debug, come to such root cause and report/fix it. This patch amends this situation in two ways. First, it adds a warning on when this situation occurs, so it gives a hint to those trying to debug this. It also limit the maximum probed MSS to the adverised MSS, as it should never be any higher than that. The result is that the connection may not have the best performance ever but it shouldn't stall, and the admin will have a hint on what to look for. Tested with virtio by forcing gso_size to 0. v2: updated msg per David's suggestion v3: use skb_iif to find the interface and also log its name, per Eric Dumazet's suggestion. As the skb may be backlogged and the interface gone by then, we need to check if the number still has a meaning. v4: use helper tcp_gro_dev_warn() and avoid pr_warn_once inside __once, per David's suggestion Cc: Jonathan Maxwell <[email protected]> Signed-off-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-06net/udp: do not touch skb->peeked unless really neededEric Dumazet1-9/+10
In UDP recvmsg() path we currently access 3 cache lines from an skb while holding receive queue lock, plus another one if packet is dequeued, since we need to change skb->next->prev 1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08) 2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e) 3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4) skb->peeked is only needed to make sure 0-length packets are properly handled while MSG_PEEK is operated. I had first the intent to remove skb->peeked but the "MSG_PEEK at non-zero offset" support added by Sam Kumar makes this not possible. This patch avoids one cache line miss during the locked section, when skb->len and skb->peeked do not have to be read. It also avoids the skb_set_peeked() cost for non empty UDP datagrams. Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Paolo Abeni <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05netlink: Do not schedule work from sk_destructHerbert Xu1-17/+15
It is wrong to schedule a work from sk_destruct using the socket as the memory reserve because the socket will be freed immediately after the return from sk_destruct. Instead we should do the deferral prior to sk_free. This patch does just that. Fixes: 707693c8a498 ("netlink: Call cb->done from a worker thread") Signed-off-by: Herbert Xu <[email protected]> Tested-by: Andrey Konovalov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05bpf: add prog_digest and expose it via fdinfo/netlinkDaniel Borkmann2-0/+17
When loading a BPF program via bpf(2), calculate the digest over the program's instruction stream and store it in struct bpf_prog's digest member. This is done at a point in time before any instructions are rewritten by the verifier. Any unstable map file descriptor number part of the imm field will be zeroed for the hash. fdinfo example output for progs: # cat /proc/1590/fdinfo/5 pos: 0 flags: 02000002 mnt_id: 11 prog_type: 1 prog_jited: 1 prog_digest: b27e8b06da22707513aa97363dfb11c7c3675d28 memlock: 4096 When programs are pinned and retrieved by an ELF loader, the loader can check the program's digest through fdinfo and compare it against one that was generated over the ELF file's program section to see if the program needs to be reloaded. Furthermore, this can also be exposed through other means such as netlink in case of a tc cls/act dump (or xdp in future), but also through tracepoints or other facilities to identify the program. Other than that, the digest can also serve as a base name for the work in progress kallsyms support of programs. The digest doesn't depend/select the crypto layer, since we need to keep dependencies to a minimum. iproute2 will get support for this facility. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05bpf, cls: consolidate prog deletion pathDaniel Borkmann1-17/+13
Commit 18cdb37ebf4c ("net: sched: do not use tcf_proto 'tp' argument from call_rcu") removed the last usage of tp from cls_bpf_delete_prog(), so also remove it from the function as argument to not give a wrong impression. tp is illegal to access from this callback, since it could already have been freed. Refactor the deletion code a bit, so that cls_bpf_destroy() can call into the same code for prog deletion as cls_bpf_delete() op, instead of having it unnecessarily duplicated. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05bpf: remove type arg from __is_valid_{,xdp_}accessDaniel Borkmann1-9/+6
Commit d691f9e8d440 ("bpf: allow programs to write to certain skb fields") pushed access type check outside of __is_valid_access() to have different restrictions for socket filters and tc programs. type is thus not used anymore within __is_valid_access() and should be removed as a function argument. Same for __is_valid_xdp_access() introduced by 6a773a15a1e8 ("bpf: add XDP prog type for early driver filter"). Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05net_sched: gen_estimator: complete rewrite of rate estimatorsEric Dumazet13-253/+164
1) Old code was hard to maintain, due to complex lock chains. (We probably will be able to remove some kfree_rcu() in callers) 2) Using a single timer to update all estimators does not scale. 3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity is not supposed to work well) In this rewrite : - I removed the RB tree that had to be scanned in gen_estimator_active(). qdisc dumps should be much faster. - Each estimator has its own timer. - Estimations are maintained in net_rate_estimator structure, instead of dirtying the qdisc. Minor, but part of the simplification. - Reading the estimator uses RCU and a seqcount to provide proper support for 32bit kernels. - We reduce memory need when estimators are not used, since we store a pointer, instead of the bytes/packets counters. - xt_rateest_mt() no longer has to grab a spinlock. (In the future, xt_rateest_tg() could be switched to per cpu counters) Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05net/sched: cls_flower: Set the filter Hardware device for all use-casesHadar Hen Zion1-1/+4
Check if the returned device from tcf_exts_get_dev function supports tc offload and in case the rule can't be offloaded, set the filter hw_dev parameter to the original device given by the user. The filter hw_device parameter should always be set by fl_hw_replace_filter function, since this pointer is used by dump stats and destroy filter for each flower rule (offloaded or not). Fixes: 7091d8c7055d ('net/sched: cls_flower: Add offload support using egress Hardware device') Signed-off-by: Hadar Hen Zion <[email protected]> Reported-by: Simon Horman <[email protected]> Tested-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05ipv6: Allow IPv4-mapped address as next-hopErik Nordmark1-1/+4
Made kernel accept IPv6 routes with IPv4-mapped address as next-hop. It is possible to configure IP interfaces with IPv4-mapped addresses, and one can add IPv6 routes for IPv4-mapped destinations/prefixes, yet prior to this fix the kernel returned an EINVAL when attempting to add an IPv6 route with an IPv4-mapped address as a nexthop/gateway. RFC 4798 (a proposed standard RFC) uses IPv4-mapped addresses as nexthops, thus in order to support that type of address configuration the kernel needs to allow IPv4-mapped addresses as nexthops. Signed-off-by: Erik Nordmark <[email protected]> Signed-off-by: Bob Gilligan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05net: caif: remove ineffective checkPan Bian1-4/+1
The check of the return value of sock_register() is ineffective. "if(!err)" seems to be a typo. It is better to propagate the error code to the callers of caif_sktinit_module(). This patch removes the check statment and directly returns the result of sock_register(). Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188751 Signed-off-by: Pan Bian <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05switch getfrag callbacks to ..._full() primitivesAl Viro2-6/+6
Signed-off-by: Al Viro <[email protected]>
2016-12-05[iov_iter] new primitives - copy_from_iter_full() and friendsAl Viro4-10/+7
copy_from_iter_full(), copy_from_iter_full_nocache() and csum_and_copy_from_iter_full() - counterparts of copy_from_iter() et.al., advancing iterator only in case of successful full copy and returning whether it had been successful or not. Convert some obvious users. *NOTE* - do not blindly assume that something is a good candidate for those unless you are sure that not advancing iov_iter in failure case is the right thing in this case. Anything that does short read/short write kind of stuff (or is in a loop, etc.) is unlikely to be a good one. Signed-off-by: Al Viro <[email protected]>
2016-12-05Merge branch 'for-upstream' of ↵David S. Miller1-1/+5
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2016-12-03 Here's a set of Bluetooth & 802.15.4 patches for net-next (i.e. 4.10 kernel): - Fix for a potential NULL deref in the ieee802154 netlink code - Fix for the ED values of the at86rf2xx driver - Documentation updates to ieee802154 - Cleanups to u8 vs __u8 usage - Timer API usage cleanups in HCI drivers Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <[email protected]>
2016-12-05net: ping: check minimum size on ICMP header lengthKees Cook1-0/+4
Prior to commit c0371da6047a ("put iov_iter into msghdr") in v3.19, there was no check that the iovec contained enough bytes for an ICMP header, and the read loop would walk across neighboring stack contents. Since the iov_iter conversion, bad arguments are noticed, but the returned error is EFAULT. Returning EINVAL is a clearer error and also solves the problem prior to v3.19. This was found using trinity with KASAN on v3.18: BUG: KASAN: stack-out-of-bounds in memcpy_fromiovec+0x60/0x114 at addr ffffffc071077da0 Read of size 8 by task trinity-c2/9623 page:ffffffbe034b9a08 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0x0() page dumped because: kasan: bad access detected CPU: 0 PID: 9623 Comm: trinity-c2 Tainted: G BU 3.18.0-dirty #15 Hardware name: Google Tegra210 Smaug Rev 1,3+ (DT) Call trace: [<ffffffc000209c98>] dump_backtrace+0x0/0x1ac arch/arm64/kernel/traps.c:90 [<ffffffc000209e54>] show_stack+0x10/0x1c arch/arm64/kernel/traps.c:171 [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffc000f18dc4>] dump_stack+0x7c/0xd0 lib/dump_stack.c:50 [< inline >] print_address_description mm/kasan/report.c:147 [< inline >] kasan_report_error mm/kasan/report.c:236 [<ffffffc000373dcc>] kasan_report+0x380/0x4b8 mm/kasan/report.c:259 [< inline >] check_memory_region mm/kasan/kasan.c:264 [<ffffffc00037352c>] __asan_load8+0x20/0x70 mm/kasan/kasan.c:507 [<ffffffc0005b9624>] memcpy_fromiovec+0x5c/0x114 lib/iovec.c:15 [< inline >] memcpy_from_msg include/linux/skbuff.h:2667 [<ffffffc000ddeba0>] ping_common_sendmsg+0x50/0x108 net/ipv4/ping.c:674 [<ffffffc000dded30>] ping_v4_sendmsg+0xd8/0x698 net/ipv4/ping.c:714 [<ffffffc000dc91dc>] inet_sendmsg+0xe0/0x12c net/ipv4/af_inet.c:749 [< inline >] __sock_sendmsg_nosec net/socket.c:624 [< inline >] __sock_sendmsg net/socket.c:632 [<ffffffc000cab61c>] sock_sendmsg+0x124/0x164 net/socket.c:643 [< inline >] SYSC_sendto net/socket.c:1797 [<ffffffc000cad270>] SyS_sendto+0x178/0x1d8 net/socket.c:1761 CVE-2016-8399 Reported-by: Qidan He <[email protected]> Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind") Cc: [email protected] Signed-off-by: Kees Cook <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05tcp: tsq: move tsq_flags close to sk_wmem_allocEric Dumazet5-19/+17
tsq_flags being in the same cache line than sk_wmem_alloc makes a lot of sense. Both fields are changed from tcp_wfree() and more generally by various TSQ related functions. Prior patch made room in struct sock and added sk_tsq_flags, this patch deletes tsq_flags from struct tcp_sock. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05tcp: tcp_mtu_probe() is likely to exit earlyEric Dumazet1-9/+9
Adding a likely() in tcp_mtu_probe() moves its code which used to be inlined in front of tcp_write_xmit() We still have a cache line miss to access icsk->icsk_mtup.enabled, we will probably have to reorganize fields to help data locality. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05tcp: tsq: add a shortcut in tcp_small_queue_check()Eric Dumazet1-0/+9
Always allow the two first skbs in write queue to be sent, regardless of sk_wmem_alloc/sk_pacing_rate values. This helps a lot in situations where TX completions are delayed either because of driver latencies or softirq latencies. Test is done with no cache line misses. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05tcp: tsq: avoid one atomic in tcp_wfree()Eric Dumazet1-1/+4
Under high load, tcp_wfree() has an atomic operation trying to schedule a tasklet over and over. We can schedule it only if our per cpu list was empty. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05tcp: tsq: add shortcut in tcp_tasklet_func()Eric Dumazet1-10/+12
Under high stress, I've seen tcp_tasklet_func() consuming ~700 usec, handling ~150 tcp sockets. By setting TCP_TSQ_DEFERRED in tcp_wfree(), we give a chance for other cpus/threads entering tcp_write_xmit() to grab it, allowing tcp_tasklet_func() to skip sockets that already did an xmit cycle. In the future, we might give to ACK processing an increased budget to reduce even more tcp_tasklet_func() amount of work. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05tcp: tsq: remove one locked operation in tcp_wfree()Eric Dumazet1-3/+10
Instead of atomically clear TSQ_THROTTLED and atomically set TSQ_QUEUED bits, use one cmpxchg() to perform a single locked operation. Since the following patch will also set TCP_TSQ_DEFERRED here, this cmpxchg() will make this addition free. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05tcp: tsq: add tsq_flags / tsq_enumEric Dumazet1-8/+8
This is a cleanup, to ease code review of following patches. Old 'enum tsq_flags' is renamed, and a new enumeration is added with the flags used in cmpxchg() operations as opposed to single bit operations. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05net: bridge: set error code on failurePan Bian1-0/+1
Function br_sysfs_addbr() does not set error code when the call kobject_create_and_add() returns a NULL pointer. It may be better to return "-ENOMEM" when kobject_create_and_add() fails. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188781 Signed-off-by: Pan Bian <[email protected]> Acked-by: Stephen Hemminger <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05net: af_mpls.c add space before open parenthesisSuraj Deshmukh1-1/+1
Adding space after switch keyword before open parenthesis for readability purpose. This patch fixes the checkpatch.pl warning: space required before the open parenthesis '(' Signed-off-by: Suraj Deshmukh <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05ipv4: Drop suffix update from resize codeAlexander Duyck1-21/+21
It has been reported that update_suffix can be expensive when it is called on a large node in which most of the suffix lengths are the same. The time required to add 200K entries had increased from around 3 seconds to almost 49 seconds. In order to address this we need to move the code for updating the suffix out of resize and instead just have it handled in the cases where we are pushing a node that increases the suffix length, or will decrease the suffix length. Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length") Reported-by: Robert Shearman <[email protected]> Signed-off-by: Alexander Duyck <[email protected]> Reviewed-by: Robert Shearman <[email protected]> Tested-by: Robert Shearman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-05ipv4: Drop leaf from suffix pull/push functionsAlexander Duyck1-12/+14
It wasn't necessary to pass a leaf in when doing the suffix updates so just drop it. Instead just pass the suffix and work with that. Since we dropped the leaf there is no need to include that in the name so the names are updated to node_push_suffix and node_pull_suffix. Finally I noticed that the logic for pulling the suffix length back actually had some issues. Specifically it would stop prematurely if there was a longer suffix, but it was not as long as the original suffix. I updated the code to address that in node_pull_suffix. Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length") Suggested-by: Robert Shearman <[email protected]> Signed-off-by: Alexander Duyck <[email protected]> Reviewed-by: Robert Shearman <[email protected]> Tested-by: Robert Shearman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-04netfilter: conntrack: add nf_conntrack_default_on sysctlFlorian Westphal2-1/+28
This switch (default on) can be used to disable automatic registration of connection tracking functionality in newly created network namespaces. This means that when net namespace goes down (or the tracker protocol module is unloaded) we *might* have to unregister the hooks. We can either add another per-netns variable that tells if the hooks got registered by default, or, alternatively, just call the protocol _put() function and have the callee deal with a possible 'extra' put() operation that doesn't pair with a get() one. This uses the latter approach, i.e. a put() without a get has no effect. Conntrack is still enabled automatically regardless of the new sysctl setting if the new net namespace requires connection tracking, e.g. when NAT rules are created. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-12-04netfilter: conntrack: register hooks in netns when needed by rulesetFlorian Westphal3-24/+123
This makes use of nf_ct_netns_get/put added in previous patch. We add get/put functions to nf_conntrack_l3proto structure, ipv4 and ipv6 then implement use-count to track how many users (nft or xtables modules) have a dependency on ipv4 and/or ipv6 connection tracking functionality. When count reaches zero, the hooks are unregistered. This delays activation of connection tracking inside a namespace until stateful firewall rule or nat rule gets added. This patch breaks backwards compatibility in the sense that connection tracking won't be active anymore when the protocol tracker module is loaded. This breaks e.g. setups that ctnetlink for flow accounting and the like, without any '-m conntrack' packet filter rules. Followup patch restores old behavour and makes new delayed scheme optional via sysctl. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-12-04netfilter: nf_tables: add conntrack dependencies for nat/masq/redir expressionsFlorian Westphal7-3/+40
so that conntrack core will add the needed hooks in this namespace. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-12-04netfilter: nat: add dependencies on conntrack moduleFlorian Westphal4-6/+43
MASQUERADE, S/DNAT and REDIRECT already call functions that depend on the conntrack module. However, since the conntrack hooks are now registered in a lazy fashion (i.e., only when needed) a symbol reference is not enough. Thus, when something is added to a nat table, make sure that it will see packets by calling nf_ct_netns_get() which will register the conntrack hooks in the current netns. An alternative would be to add these dependencies to the NAT table. However, that has problems when using non-modular builds -- we might register e.g. ipv6 conntrack before its initcall has run, leading to NULL deref crashes since its per-netns storage has not yet been allocated. Adding the dependency in the modules instead has the advantage that nat table also does not register its hooks until rules are added. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-12-04netfilter: add and use nf_ct_netns_get/putFlorian Westphal14-42/+54
currently aliased to try_module_get/_put. Will be changed in next patch when we add functions to make use of ->net argument to store usercount per l3proto tracker. This is needed to avoid registering the conntrack hooks in all netns and later only enable connection tracking in those that need conntrack. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-12-04netfilter: conntrack: remove unused init_net hookFlorian Westphal2-14/+0
since adf0516845bcd0 ("netfilter: remove ip_conntrack* sysctl compat code") the only user (ipv4 tracker) sets this to an empty stub function. After this change nf_ct_l3proto_pernet_register() is also empty, but this will change in a followup patch to add conditional register of the hooks. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-12-04netfilter: conntrack: built-in support for UDPliteDavide Caratti5-73/+19
CONFIG_NF_CT_PROTO_UDPLITE is no more a tristate. When set to y, connection tracking support for UDPlite protocol is built-in into nf_conntrack.ko. footprint test: $ ls -l net/netfilter/nf_conntrack{_proto_udplite,}.ko \ net/ipv4/netfilter/nf_conntrack_ipv4.ko \ net/ipv6/netfilter/nf_conntrack_ipv6.ko (builtin)|| udplite| ipv4 | ipv6 |nf_conntrack ---------++--------+--------+--------+-------------- none || 432538 | 828755 | 828676 | 6141434 UDPlite || - | 829649 | 829362 | 6498204 Signed-off-by: Davide Caratti <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-12-04netfilter: conntrack: built-in support for SCTPDavide Caratti5-72/+19
CONFIG_NF_CT_PROTO_SCTP is no more a tristate. When set to y, connection tracking support for SCTP protocol is built-in into nf_conntrack.ko. footprint test: $ ls -l net/netfilter/nf_conntrack{_proto_sctp,}.ko \ net/ipv4/netfilter/nf_conntrack_ipv4.ko \ net/ipv6/netfilter/nf_conntrack_ipv6.ko (builtin)|| sctp | ipv4 | ipv6 | nf_conntrack ---------++--------+--------+--------+-------------- none || 498243 | 828755 | 828676 | 6141434 SCTP || - | 829254 | 829175 | 6547872 Signed-off-by: Davide Caratti <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-12-04netfilter: conntrack: built-in support for DCCPDavide Caratti5-74/+20
CONFIG_NF_CT_PROTO_DCCP is no more a tristate. When set to y, connection tracking support for DCCP protocol is built-in into nf_conntrack.ko. footprint test: $ ls -l net/netfilter/nf_conntrack{_proto_dccp,}.ko \ net/ipv4/netfilter/nf_conntrack_ipv4.ko \ net/ipv6/netfilter/nf_conntrack_ipv6.ko (builtin)|| dccp | ipv4 | ipv6 | nf_conntrack ---------++--------+--------+--------+-------------- none || 469140 | 828755 | 828676 | 6141434 DCCP || - | 830566 | 829935 | 6533526 Signed-off-by: Davide Caratti <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-12-04Merge tag 'ipvs-for-v4.10' of ↵Pablo Neira Ayuso2-1/+55
https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next Simon Horman says: ==================== IPVS Updates for v4.10 please consider these enhancements to the IPVS for v4.10. * Decrement the IP ttl in all the modes in order to prevent infinite route loops. Thanks to Dwip Banerjee. * Use IS_ERR_OR_NULL macro. Clean-up from Gao Feng. ==================== Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-12-04netfilter: nfnetlink_log: add "nf-logger-5-1" module alias nameLiping Zhang1-0/+1
So we can autoload nfnetlink_log.ko when the user adding nft log group X rule in netdev family. Signed-off-by: Liping Zhang <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-12-04netfilter: nf_log: do not assume ethernet header in netdev familyLiping Zhang3-3/+6
In netdev family, we will handle non ethernet packets, so using eth_hdr(skb)->h_proto is incorrect. Meanwhile, we can use socket(AF_PACKET...) to sending packets, so skb->protocol is not always set in bridge family. Add an extra parameter into nf_log_l2packet to solve this issue. Fixes: 1fddf4bad0ac ("netfilter: nf_log: add packet logging for netdev family") Signed-off-by: Liping Zhang <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-12-04netfilter: built-in NAT support for UDPliteDavide Caratti4-38/+8
CONFIG_NF_NAT_PROTO_UDPLITE is no more a tristate. When set to y, NAT support for UDPlite protocol is built-in into nf_nat.ko. footprint test: (nf_nat_proto_) |udplite || nf_nat --------------------------+--------++-------- no builtin | 408048 || 2241312 UDPLITE builtin | - || 2577256 Signed-off-by: Davide Caratti <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-12-04netfilter: built-in NAT support for SCTPDavide Caratti4-36/+7
CONFIG_NF_NAT_PROTO_SCTP is no more a tristate. When set to y, NAT support for SCTP protocol is built-in into nf_nat.ko. footprint test: (nf_nat_proto_) | sctp || nf_nat --------------------------+--------++-------- no builtin | 428344 || 2241312 SCTP builtin | - || 2597032 Signed-off-by: Davide Caratti <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-12-04netfilter: built-in NAT support for DCCPDavide Caratti4-37/+8
CONFIG_NF_NAT_PROTO_DCCP is no more a tristate. When set to y, NAT support for DCCP protocol is built-in into nf_nat.ko. footprint test: (nf_nat_proto_) | dccp || nf_nat --------------------------+--------++-------- no builtin | 409800 || 2241312 DCCP builtin | - || 2578968 Signed-off-by: Davide Caratti <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-12-04netfilter: update Arturo Borrero Gonzalez email addressArturo Borrero Gonzalez6-12/+12
The email address has changed, let's update the copyright statements. Signed-off-by: Arturo Borrero Gonzalez <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2016-12-03net: dcb: set error code on failuresPan Bian1-0/+1
In function dcbnl_cee_fill(), returns the value of variable err on errors. However, on some error paths (e.g. nla put fails), its value may be 0. It may be better to explicitly set a negative errno to variable err before returning. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188881 Signed-off-by: Pan Bian <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-03ipv6 addrconf: Implemented enhanced DAD (RFC7527)Erik Nordmark3-5/+48
Implemented RFC7527 Enhanced DAD. IPv6 duplicate address detection can fail if there is some temporary loopback of Ethernet frames. RFC7527 solves this by including a random nonce in the NS messages used for DAD, and if an NS is received with the same nonce it is assumed to be a looped back DAD probe and is ignored. RFC7527 is enabled by default. Can be disabled by setting both of conf/{all,interface}/enhanced_dad to zero. Signed-off-by: Erik Nordmark <[email protected]> Signed-off-by: Bob Gilligan <[email protected]> Reviewed-by: Hannes Frederic Sowa <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-03ipv4: fib: Replay events when registering FIB notifierIdo Schimmel1-2/+146
Commit b90eb7549499 ("fib: introduce FIB notification infrastructure") introduced a new notification chain to notify listeners (f.e., switchdev drivers) about addition and deletion of routes. However, upon registration to the chain the FIB tables can already be populated, which means potential listeners will have an incomplete view of the tables. Solve that by dumping the FIB tables and replaying the events to the passed notification block. The dump itself is done using RCU in order not to starve consumers that need RTNL to make progress. The integrity of the dump is ensured by reading the FIB change sequence counter before and after the dump under RTNL. This allows us to avoid the problematic situation in which the dumping process sends a ENTRY_ADD notification following ENTRY_DEL generated by another process holding RTNL. Callers of the registration function may pass a callback that is executed in case the dump was inconsistent with current FIB tables. The number of retries until a consistent dump is achieved is set to a fixed number to prevent callers from looping for long periods of time. In case current limit proves to be problematic in the future, it can be easily converted to be configurable using a sysctl. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-03ipv4: fib: Allow for consistent FIB dumpingIdo Schimmel2-0/+3
The next patch will enable listeners of the FIB notification chain to request a dump of the FIB tables. However, since RTNL isn't taken during the dump, it's possible for the FIB tables to change mid-dump, which will result in inconsistency between the listener's table and the kernel's. Allow listeners to know about changes that occurred mid-dump, by adding a change sequence counter to each net namespace. The counter is incremented just before a notification is sent in the FIB chain. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-03ipv4: fib: Convert FIB notification chain to be atomicIdo Schimmel1-4/+4
In order not to hold RTNL for long periods of time we're going to dump the FIB tables using RCU. Convert the FIB notification chain to be atomic, as we can't block in RCU critical sections. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-03ipv4: fib: Export free_fib_info()Ido Schimmel1-0/+1
The FIB notification chain is going to be converted to an atomic chain, which means switchdev drivers will have to offload FIB entries in deferred work, as hardware operations entail sleeping. However, while the work is queued fib info might be freed, so a reference must be taken. To release the reference (and potentially free the fib info) fib_info_put() will be called, which in turn calls free_fib_info(). Export free_fib_info() so that modules will be able to invoke fib_info_put(). Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-03act_mirred: fix a typo in get_devWANG Cong1-1/+1
Fixes: 255cb30425c0 ("net/sched: act_mirred: Add new tc_action_ops get_dev()") Cc: Hadar Hen Zion <[email protected]> Cc: Jiri Pirko <[email protected]> Signed-off-by: Cong Wang <[email protected]> Acked-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-03udp: be less conservative with sock rmem accountingPaolo Abeni1-2/+2
Before commit 850cbaddb52d ("udp: use it's own memory accounting schema"), the udp protocol allowed sk_rmem_alloc to grow beyond the rcvbuf by the whole current packet's truesize. After said commit we allow sk_rmem_alloc to exceed the rcvbuf only if the receive queue is empty. As reported by Jesper this cause a performance regression for some (small) values of rcvbuf. This commit is intended to fix the regression restoring the old handling of the rcvbuf limit. Reported-by: Jesper Dangaard Brouer <[email protected]> Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema") Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-03Merge tag 'batadv-net-for-davem-20161202' of git://git.open-mesh.org/linux-mergeDavid S. Miller1-2/+2
Simon Wunderlich says: ==================== Here is another batman-adv bugfix: - fix checking for failed allocation of TVLV blocks in TT local data, by Sven Eckelmann ==================== Signed-off-by: David S. Miller <[email protected]>